  • Stars: 101
  • Rank 338,166 (Top 7 %)
  • Language: Python
  • License: MIT License
  • Created about 13 years ago
  • Updated about 3 years ago

Repository Details

feedparser but faster and worse

speedparser

Speedparser is a black-box-style reimplementation of the Universal Feed Parser. It uses some feedparser code for date and author handling, but mostly re-implements feedparser's data normalization algorithms based on feedparser output. It uses lxml for feed parsing and for optional HTML cleaning. Its compatibility with feedparser is very good for a strict subset of fields, but poor for fields outside that subset. See tests/speedparsertests.py for more information on which fields are more or less compatible and which are not.

On an Intel(R) Core(TM) i5 750, running on a single core, feedparser managed 2.5 feeds/sec on the test feed set (roughly 4200 "feeds" in tests/feeds.tar.bz2), while speedparser manages around 65 feeds/sec with HTML cleaning on and 200 feeds/sec with cleaning off (roughly 26x and 80x faster, respectively).

installing

pip install speedparser

usage

Usage is similar to feedparser:

>>> import speedparser
>>> result = speedparser.parse(feed)
>>> result = speedparser.parse(feed, clean_html=False)

differences

There are a few interface differences and many result differences between speedparser and feedparser. The biggest similarities are that both return a FeedParserDict() object (with keys accessible as attributes), both set the bozo key when an error is encountered, and various aspects of the feed and entries keys are likely to be identical or very similar.

speedparser uses different (and in some cases fewer or no; buyer beware) data cleaning algorithms than feedparser. When cleaning is enabled, lxml's html.clean module is used to clean HTML and gives similar but not identical protection against various attributes and elements. If you supply your own Cleaner instance to the clean_html kwarg, speedparser will use it to clean the various attributes of the feed and entries.

speedparser does not attempt to fix character encoding by default because this processing can take a long time for large feeds. If the encoding value of the feed is wrong, or if you want this extra level of error tolerance, you can either use the chardet module to detect the encoding based on the document or pass encoding=True to speedparser.parse, which makes it fall back to encoding detection if it encounters encoding errors.
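
The "detect only on failure" idea described above can be sketched roughly as follows. This is not speedparser's internal code: decode_with_fallback and its detect parameter are hypothetical names used only for illustration, and detect is assumed to behave like chardet.detect (it takes bytes and returns a dict with an "encoding" key).

```python
def decode_with_fallback(raw, declared="utf-8", detect=None):
    """Decode feed bytes, running detection only if the declared encoding fails.

    Hypothetical helper for illustration; not part of speedparser.
    `detect` is assumed to behave like chardet.detect: it takes bytes and
    returns a dict with an "encoding" key (which may be None).
    """
    try:
        # Fast path: trust the declared encoding and skip detection entirely.
        return raw.decode(declared), declared
    except (UnicodeDecodeError, LookupError):
        # LookupError covers an unknown encoding name declared by the feed.
        if detect is None:
            raise
    guess = detect(raw).get("encoding")
    if not guess:
        raise ValueError("could not detect an encoding")
    # Replace undecodable bytes rather than failing on a near-miss guess.
    return raw.decode(guess, errors="replace"), guess
```

With chardet installed, passing detect=chardet.detect should plug the real detector into this sketch. Well-formed feeds never pay the detection cost, which is the trade-off the paragraph above describes.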

If your application is using feedparser to consume many feeds at once and CPU is becoming a bottleneck, you might want to try out speedparser as an alternative (using feedparser as a backup). If you are writing an application that does not ingest many feeds, or where CPU is not a problem, you should use feedparser as it is flexible with bad or malformed data and has a much better test suite.
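
One possible way to arrange the "feedparser as a backup" setup mentioned above is sketched below. parse_with_fallback is a hypothetical helper, not part of either library; it only assumes that both parsers accept the feed data and return a dict-like result carrying a bozo key, as described in the differences section.

```python
def parse_with_fallback(data, fast_parse, slow_parse):
    """Try the fast parser first; use the slow one if it fails or sets bozo.

    Hypothetical helper for illustration. Both callables are assumed to
    take the feed data and return a dict-like result with a "bozo" key.
    """
    try:
        result = fast_parse(data)
    except Exception:
        # Any crash in the fast parser sends the feed to the backup.
        return slow_parse(data)
    if result.get("bozo"):
        # The fast parser flagged a problem; let the tolerant parser retry.
        return slow_parse(data)
    return result
```

In an application this would presumably be called as parse_with_fallback(feed, speedparser.parse, feedparser.parse), so that only the feeds speedparser cannot handle cleanly pay feedparser's CPU cost.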
