  • Stars: 187
  • Rank: 206,464 (Top 5%)
  • Language: Python
  • License: MIT License
  • Created: almost 11 years ago
  • Updated: almost 6 years ago

Repository Details

Python parser for Adblock Plus filters

adblockparser

adblockparser is a package for working with Adblock Plus filter rules. It can parse Adblock Plus filters and match URLs against them.

Installation

pip install adblockparser

Python 2.7 and Python 3.3+ are supported.

If you plan to use this library with a large number of filters, installing the pyre2 library is highly recommended: the speedup for the default EasyList filter list can be greater than 1000x.

pip install 're2 >= 0.2.21'

Note that the pyre2 library requires the C++ re2 library to be installed. On OS X you can get it using Homebrew (brew install re2).
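
If you want to decide at runtime whether to enable re2, here is a minimal detection sketch (pyre2 exposes itself as the re2 module; HAVE_RE2 is a name introduced here for illustration, not part of adblockparser):

try:
    import re2  # provided by the pyre2 package
    HAVE_RE2 = True
except ImportError:
    HAVE_RE2 = False

You could then pass use_re2=HAVE_RE2 when creating AdblockRules (the use_re2 argument is shown in the Performance section below).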

Usage

To learn about Adblock Plus filter syntax, consult the Adblock Plus filter documentation. Then:

  1. Get filter rules somewhere: write them manually, read lines from a file downloaded from EasyList, etc. (see the loading sketch after these steps):

    >>> raw_rules = [
    ...     "||ads.example.com^",
    ...     "@@||ads.example.com/notbanner^$~script",
    ... ]
    
  2. Create an AdblockRules instance from the rule strings:

    >>> from adblockparser import AdblockRules
    >>> rules = AdblockRules(raw_rules)
    
  3. Use this instance to check whether a URL should be blocked:

    >>> rules.should_block("http://ads.example.com")
    True
    

    Rules with options are ignored unless you pass a dict with option values:

    >>> rules.should_block("http://ads.example.com/notbanner")
    True
    >>> rules.should_block("http://ads.example.com/notbanner", {'script': False})
    False
    >>> rules.should_block("http://ads.example.com/notbanner", {'script': True})
    True
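
For step 1, here is a minimal sketch of loading rules from a local copy of EasyList. The filename is a placeholder; in the EasyList format, lines starting with "!" are comments and lines starting with "[" are section headers:

with open("easylist.txt") as f:
    raw_rules = [
        line.strip() for line in f
        if line.strip() and not line.startswith(("!", "["))
    ]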
    

Consult the Adblock Plus docs for a description of the options. These options allow writing filters that depend on external information that is not available in the URL itself.
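
These option values typically come from your own request context. A hedged sketch (Python 3; the URLs are placeholder data, and the 'script' guess based on the file extension is deliberately crude):

from urllib.parse import urlparse

# Hypothetical request context; in practice this comes from your crawler.
request_url = "http://ads.example.com/tracker.js"
page_url = "http://www.mystartpage.com/index.html"

options = {
    'script': request_url.endswith('.js'),  # is the requested resource a script?
    'domain': urlparse(page_url).netloc,    # domain of the embedding page
}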

Performance

Regex engines

The AdblockRules class creates a huge regex to match filters that don't use options. The pyre2 library handles such regexes much better than the stdlib re module. If pyre2 is installed, AdblockRules should work faster, and the speedup can be dramatic: more than 1000x in some cases.

Sometimes pyre2 prints something like "re2/dfa.cc:459: DFA out of memory: prog size 270515 mem 1713850" to stderr. Give the re2 library more memory to fix that:

>>> rules = AdblockRules(raw_rules, use_re2=True, max_mem=512*1024*1024)  # doctest: +SKIP

Make sure you are not using re2 0.2.20 installed from PyPI: it doesn't work.
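
If you are unsure which version is installed, a quick hedged check (this assumes the PyPI distribution name is "re2", matching the pip command above):

import pkg_resources
print(pkg_resources.get_distribution("re2").version)  # e.g. 0.2.21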

Parsing rules with options

Rules that have options are currently matched in a loop, one by one. They are also checked for compatibility with the options passed by the user: for example, if the user didn't pass the 'script' option (with a True or False value), all rules involving 'script' are discarded.

This is slow if you have thousands of such rules. To make it faster, explicitly list all options you want to support in the AdblockRules constructor, disable skipping of unsupported rules, and always pass a dict with all options to the should_block method:

>>> rules = AdblockRules(
...    raw_rules,
...    supported_options=['script', 'domain'],
...    skip_unsupported_rules=False
... )
>>> options = {'script': False, 'domain': 'www.mystartpage.com'}
>>> rules.should_block("http://ads.example.com/notbanner", options)
False

This way, rules with unsupported options are filtered out once, when the AdblockRules instance is created.
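
To measure what these settings buy you on your own rule set, a small stdlib-only timing sketch (reusing the rules and options objects created above; the output is illustrative, not from the library's benchmarks):

import timeit

n = 1000
t = timeit.timeit(
    lambda: rules.should_block("http://ads.example.com/notbanner", options),
    number=n,
)
print("avg per call: %.3f ms" % (t / n * 1000))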

Limitations

There are some known limitations of the current implementation:

  • element hiding rules are ignored;
  • matching URLs against a large number of filters can be slow-ish, especially if pyre2 is not installed and many filter options are enabled;
  • match-case filter option is not properly supported (it is ignored);
  • document filter option is not properly supported;
  • rules are not validated before parsing, so invalid rules may raise inconsistent exceptions or silently work incorrectly.

It is possible to remove all these limitations. Pull requests are welcome if you want to make it happen sooner!

Contributing

To run the tests, install tox and type

tox

from the source checkout.

The license is MIT.
