Robots Exclusion Protocol Parser for Python

Robots.txt parsing in Python.

Goals

  • Fetching -- helper utilities for fetching and parsing robots.txt files, including checking the cache-control and expires headers
  • Support for newer features -- like Crawl-Delay and Sitemaps
  • Wildcard matching -- without using regexes, no less
  • Performance -- with >100k parses per second, >1M URL checks per second once parsed
  • Caching -- utilities to help with the caching of robots.txt responses

Installation

reppy is available on PyPI:

pip install reppy

When installing from source, there are submodule dependencies that must also be fetched:

git submodule update --init --recursive
make install

Usage

Checking when pages are allowed

Two classes answer questions about whether a URL is allowed: Robots and Agent:

from reppy.robots import Robots

# This utility uses `requests` to fetch the content
robots = Robots.fetch('http://example.com/robots.txt')
robots.allowed('http://example.com/some/path/', 'my-user-agent')

# Get the rules for a specific agent
agent = robots.agent('my-user-agent')
agent.allowed('http://example.com/some/path/')

The Robots class also exposes properties expired and ttl to describe how long the response should be considered valid. A reppy.ttl policy is used to determine what that should be:

from reppy.ttl import HeaderWithDefaultPolicy

# Use the `cache-control` or `expires` headers, defaulting to 30 minutes and
# ensuring it's at least 10 minutes
policy = HeaderWithDefaultPolicy(default=1800, minimum=600)

robots = Robots.fetch('http://example.com/robots.txt', ttl_policy=policy)
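
For instance, after a fetch you can inspect how long the response remains valid via the expired and ttl properties mentioned above (a minimal sketch; the values depend on the headers the server returned and the policy in effect):

# How long the response should be considered valid, per the policy
robots.ttl

# Whether the response has outlived its TTL
robots.expired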

Customizing fetch

The fetch method accepts *args and **kwargs that are passed on to requests.get, allowing you to customize the way the fetch is executed:

robots = Robots.fetch('http://example.com/robots.txt', headers={...})
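
Any other keyword argument that requests.get understands can be forwarded the same way; for example (a sketch, with an arbitrary timeout):

robots = Robots.fetch(
    'http://example.com/robots.txt',
    headers={'User-Agent': 'my-user-agent'},
    timeout=5.0)  # both passed through to `requests.get`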

Matching Rules and Wildcards

Both * and $ are supported for wildcard matching.

This library follows the matching that the 1996 RFC describes. In the case where multiple rules match a query, the longest rule wins, as it is presumed to be the most specific.
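
As a sketch of how this plays out (assuming Robots.parse, which builds a Robots object from a string instead of fetching one):

from reppy.robots import Robots

robots = Robots.parse('http://example.com/robots.txt', '''
User-agent: *
Disallow: /tmp/*
Allow: /tmp/public$
''')

# `Allow: /tmp/public$` is longer than `Disallow: /tmp/*`, so it wins
robots.allowed('http://example.com/tmp/public', 'my-user-agent')  # True
robots.allowed('http://example.com/tmp/secret', 'my-user-agent')  # False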

Checking sitemaps

The Robots class also exposes the sitemaps listed in a robots.txt:

# This property holds a list of URL strings of all the sitemaps listed
robots.sitemaps

Delay

The Crawl-Delay directive is per agent and can be accessed through the Agent class. If no delay was specified, it's None:

# What's the delay my-user-agent should use
robots.agent('my-user-agent').delay

Determining the robots.txt URL

Given a URL, there's a utility to determine the URL of the corresponding robots.txt. It preserves the scheme, hostname, and port (if the port is not the default for the scheme).

# Get robots.txt URL for http://[email protected]:8080/path;params?query#fragment
# It's http://example.com:8080/robots.txt
Robots.robots_url('http://[email protected]:8080/path;params?query#fragment')

Caching

There are two cache classes provided -- RobotsCache, which caches entire reppy.Robots objects, and AgentCache, which only caches the reppy.Agent relevant to a client. These caches duck-type the class that they cache for the purposes of checking if a URL is allowed:

from reppy.cache import RobotsCache
cache = RobotsCache(capacity=100)
cache.allowed('http://example.com/foo/bar', 'my-user-agent')

from reppy.cache import AgentCache
cache = AgentCache(agent='my-user-agent', capacity=100)
cache.allowed('http://example.com/foo/bar')

Like reppy.Robots.fetch, the cache constructors accept a ttl_policy to inform the expiration of the fetched Robots objects, as well as *args and **kwargs that are passed to reppy.Robots.fetch.
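
For example (a sketch reusing the HeaderWithDefaultPolicy shown earlier; the timeout keyword is simply forwarded to requests.get):

from reppy.cache import RobotsCache
from reppy.ttl import HeaderWithDefaultPolicy

policy = HeaderWithDefaultPolicy(default=1800, minimum=600)
cache = RobotsCache(capacity=100, ttl_policy=policy, timeout=5.0)
cache.allowed('http://example.com/foo/bar', 'my-user-agent')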

Caching Failures

There's a piece of classic caching advice: "don't cache failures." However, it is not always apt. For example, if the failure is a timeout, clients may want to cache the result so that subsequent checks don't each have to wait out another timeout.

To this end, the cache module provides a notion of a cache policy, which determines what to do in the case of an exception. The default is to cache a disallowed response for 10 minutes, but you can configure it as you see fit:

# Do not cache failures (note the `ttl=0`):
from reppy.cache.policy import ReraiseExceptionPolicy
cache = AgentCache('my-user-agent', cache_policy=ReraiseExceptionPolicy(ttl=0))

# Cache and reraise failures for 10 minutes (note the `ttl=600`):
cache = AgentCache('my-user-agent', cache_policy=ReraiseExceptionPolicy(ttl=600))

# Treat failures as being disallowed
from reppy.cache.policy import DefaultObjectPolicy
from reppy.robots import Agent
cache = AgentCache(
    'my-user-agent',
    cache_policy=DefaultObjectPolicy(ttl=600, factory=lambda _: Agent().disallow('/')))

Development

A Vagrantfile is provided to bootstrap a development environment (you may have to provide a --provider flag to vagrant up):

vagrant up

Alternatively, development can be conducted using a virtualenv:

virtualenv venv
source venv/bin/activate
pip install -r requirements.txt

Tests

Tests are run with the top-level Makefile. In vagrant, log in first:

vagrant ssh
make test

PRs

These are not all hard-and-fast rules, but in general PRs have the following expectations:

  • pass Travis -- or more generally, whatever CI is used for the particular project
  • be a complete unit -- whether a bug fix or feature, it should appear as a complete unit before consideration.
  • maintain code coverage -- some projects may include code coverage requirements as part of the build as well
  • maintain the established style -- this means the existing style of established projects, the established conventions of the team for a given language on new projects, and the guidelines of the community of the relevant languages and frameworks.
  • include failing tests -- in the case of bugs, failing tests demonstrating the bug should be included as one commit, followed by a commit making the test succeed. This allows us to jump to a world with a bug included, and prove that our test in fact exercises the bug.
  • be reviewed by one or more developers -- not all feedback has to be accepted, but it should all be considered.
  • avoid 'addressed PR feedback' commits -- in general, PR feedback should be rebased back into the appropriate commits that introduced the change. In cases where this is burdensome, PR feedback commits may be used, but should still describe the changes contained therein.

PR reviews consider the design, organization, and functionality of the submitted code.

Commits

Certain types of changes should be made in their own commits to improve readability. When too many different types of changes happen simultaneously in a single commit, the purpose of each change is muddled. By giving each commit a single logical purpose, it is implicitly clear why the changes in that commit took place.

  • updating / upgrading dependencies -- this is especially true for invocations like bundle update or berks update.
  • introducing a new dependency -- often preceded by a commit updating existing dependencies, this should only include the changes for the new dependency.
  • refactoring -- these commits should preserve all the existing functionality and merely update how it's done.
  • utility components to be used by a new feature -- if introducing an auxiliary class in support of a subsequent commit, add this new class (and its tests) in its own commit.
  • config changes -- when adjusting configuration in isolation
  • formatting / whitespace commits -- when adjusting code only for stylistic purposes.

New Features

Small new features (where small refers to the size and complexity of the change, not the impact) are often introduced in a single commit. Larger features or components might be built up piecewise, with each commit containing a single part of it (and its corresponding tests).

Bug Fixes

In general, bug fixes should come in two-commit pairs: a commit adding a failing test demonstrating the bug, and a commit making that failing test pass.

Tagging and Versioning

Whenever the version included in setup.py is changed (and it should be changed when appropriate, following http://semver.org/), a corresponding tag should be created with the same version number (formatted v<version>).

git tag -a v0.1.0 -m 'Version 0.1.0

This release contains an initial working version of the `crawl` and `parse`
utilities.'

git push --tags origin
