
URL Transformation, Sanitization

URL

URL parsing done reasonably.

Status: Production, Team: Big Data, Scope: External, Open Source: MIT, Critical: Yes

Moz crawls. We crawl lots. In fact, you might say that crawling is our business.

The internet's also a messy place. We've encountered some pretty crazy implementations and servers and URLs and HTML. Over the course of this discovery, we've found ourselves repeating certain URL sanitization tasks over and over, so we've put them in a repo to share with the world.

At the heart of the url package is the URL object. You can get one by passing a unicode or string object to the top-level parse method. If the string is encoded, you can provide that encoding (otherwise it's assumed to be utf-8):

import url

# It knows about unicode
myurl = url.parse(u'http://foo.com')

# It knows about other encodings that Python supports
myurl = url.parse(..., 'some encoding')

Internally, everything is stored as UTF-8 until you ask for a string back. The workflow is that you'll chain a number of permutations together to get the type of URL you're after, and then call a final method to give you a string.

# Defrag, remove some parameters and give me a unicode string
url.parse(...).defrag().deparam(['utm_source']).unicode

# Escape the path, and punycode the host, and give me a UTF-8 string
url.parse(...).escape().punycode().utf8

# Give me the absolute path url as some encoding
url.parse(...).abspath().encode('some encoding')

URL Equivalence

URL objects compared with == are interpreted very strictly, but for a more lax interpretation, consider using equiv to test if two urls are functionally equivalent:

a = url.parse('https://föo.com:443/a/../b/.?b=2&&&&&&a=1')
b = url.parse('https://xn--fo-fka.COM/b/?a=1&b=2')

# These urls are not equal
assert(a != b)
# But they are equivalent
assert(a.equiv(b))
assert(b.equiv(a))

This equivalence test takes into account default ports for common schemes (so two urls with the same scheme are still equivalent if only one of them explicitly specifies the default port), punycoding, case of the host name, and parameter order.
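The package's own equiv implementation is not reproduced here. As a rough illustration of the normalizations listed above, the following standard-library sketch (Python 3) compares two urls after lowercasing and punycoding the host, dropping default ports, resolving dot segments, and sorting parameters. The names DEFAULT_PORTS, normalize and equivalent are hypothetical, and the port table is only an illustrative subset.

```python
import posixpath
from urllib.parse import urlsplit, parse_qsl

# Default ports for a few common schemes (an illustrative subset).
DEFAULT_PORTS = {'http': 80, 'https': 443, 'ftp': 21}


def normalize(raw):
    """Reduce a url to a tuple that ignores the differences listed above."""
    parts = urlsplit(raw)
    scheme = parts.scheme.lower()
    # .hostname is already lowercased; punycode it so both spellings match.
    host = (parts.hostname or '').encode('idna').decode('ascii')
    # An explicit default port is treated the same as no port at all.
    port = parts.port
    if port == DEFAULT_PORTS.get(scheme):
        port = None
    # Resolve '.' and '..' segments, keeping the trailing slash they imply.
    path = posixpath.normpath(parts.path or '/')
    if parts.path.endswith(('/', '/.', '/..')) and path != '/':
        path += '/'
    # Parameter order is ignored; empty pairs produced by '&&' are dropped.
    query = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return (scheme, host, port, path, query)


def equivalent(a, b):
    """Return True if both urls normalize to the same tuple."""
    return normalize(a) == normalize(b)
```

With the two urls from the example above, equivalent(...) returns True even though the raw strings differ.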

Absolute URLs

You can perform many operations on relative urls (those without a hostname), but punycoding and unpunycoding are not among them. You can also tell whether or not a url is absolute:

a = url.parse('foo/bar.html')
assert(not a.absolute)

Chaining

Many of the methods on the URL class can be chained to produce a number of effects in sequence:

import url

# Create a url object
myurl = url.parse('http://www.FOO.com/bar?utm_source=foo#what')
# Remove some parameters and the fragment, spit out utf-8
print(myurl.defrag().deparam(['utm_source']).utf8)

In fact, any method that does not explicitly return a string may be chained:

strip

Removes semantically meaningless excess '?', '&', and ';' characters from query and params:

>>> url.parse('http://example.com/????query=param&&&&foo=bar').strip().utf8
'http://example.com/?query=param&foo=bar'
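For readers who want to see the idea without the package, here is a minimal standard-library sketch (Python 3) of the same cleanup. It only handles the query, not ';' params, and the helper name strip_query is hypothetical rather than part of this library.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def strip_query(raw):
    """Drop excess '?' and '&' characters from the query of a url."""
    parts = urlsplit(raw)
    # urlsplit consumes the first '?', so any leading ones left are excess.
    query = parts.query.lstrip('?&')
    # Collapse runs of '&' into one and drop a trailing separator.
    query = re.sub(r'&+', '&', query).rstrip('&')
    return urlunsplit(parts._replace(query=query))
```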

canonical

According to the RFC, the order of parameters is not supposed to matter. In practice, it can (depending on how the server matches URL routes), but it's also helpful to be able to put parameters in a canonical ordering. This ordering happens to be alphabetical order:

>>> url.parse('http://foo.com/?b=2&a=1&d=3').canonical().utf8
'http://foo.com/?a=1&b=2&d=3'
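The ordering step can be approximated with the standard library alone. This is a sketch under the assumption that sorting the raw name=value pairs as strings is acceptable; canonical_query is a hypothetical name, not this package's API.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_query(raw):
    """Return the url with its query pairs in alphabetical order."""
    parts = urlsplit(raw)
    # Sort the raw 'name=value' pairs without decoding, so escapes survive.
    pairs = sorted(pair for pair in parts.query.split('&') if pair)
    return urlunsplit(parts._replace(query='&'.join(pairs)))
```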

defrag

Remove any fragment identifier from the url. This isn't part of the request that gets sent to an HTTP server, and so it's often useful to remove the fragment when doing url comparisons.

>>> url.parse('http://foo.com/#foo').defrag().utf8
'http://foo.com/'
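For comparison, Python 3's standard library offers a similar operation through urllib.parse.urldefrag, which returns both the url without its fragment and the fragment itself. This shows the standard-library call, not this package's internals.

```python
from urllib.parse import urldefrag

result = urldefrag('http://foo.com/#foo')
print(result.url)       # the url with the fragment removed
print(result.fragment)  # the fragment that was split off
```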

deparam

Some parameters are commonly added to urls that we may not be interested in. Or they may be misleading. Common examples include referring pages, utm_source and session ids. To strip out all such parameters from your url:

>>> url.parse('http://foo.com/?do=1&not=2&want=3&this=4').deparam(['do', 'not', 'want']).utf8
'http://foo.com/?this=4'
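A minimal sketch of the same filtering with only the standard library (Python 3) might look like the following. The function name remove_params is hypothetical, and exact, case-sensitive name matching is an assumption of this sketch rather than a statement about how deparam behaves.

```python
from urllib.parse import urlsplit, urlunsplit


def remove_params(raw, names):
    """Remove every query pair whose name appears in `names`."""
    unwanted = set(names)
    parts = urlsplit(raw)
    kept = [
        pair for pair in parts.query.split('&')
        # Compare only the part before the first '=' against the blocklist.
        if pair and pair.split('=', 1)[0] not in unwanted
    ]
    return urlunsplit(parts._replace(query='&'.join(kept)))
```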

abspath

Like its os.path namesake, this makes sure that the path of the url is absolute. This includes removing redundant forward slashes, as well as '.' and '..' segments.

>>> url.parse('http://foo.com/foo/./bar/../a/b/c/../../d').abspath().utf8
'http://foo.com/foo/a/d'
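Since the method is named after os.path, one way to sketch the idea is with posixpath.normpath from the standard library (Python 3). The helper name normalize_path is hypothetical; the two adjustments after normpath are assumptions about what is desirable for urls, not a description of this package's code.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def normalize_path(raw):
    """Resolve '.', '..' and repeated slashes in the path of a url."""
    parts = urlsplit(raw)
    path = posixpath.normpath(parts.path or '/')
    # normpath keeps a leading '//' (POSIX permits it); collapse it here.
    if path.startswith('//'):
        path = '/' + path.lstrip('/')
    # normpath drops a trailing slash, which is significant in urls.
    if parts.path.endswith('/') and path != '/':
        path += '/'
    return urlunsplit(parts._replace(path=path))
```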

escape

Non-ASCII characters in the path are typically encoded as UTF-8 and then escaped as %HH, where each H is a hexadecimal digit. It's important to note that the escape function is idempotent, and can be called repeatedly:

>>> url.parse(u'http://foo.com/ümlaut').escape().utf8
'http://foo.com/%C3%BCmlaut'
>>> url.parse(u'http://foo.com/ümlaut').escape().escape().utf8
'http://foo.com/%C3%BCmlaut'
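One simple way to get idempotent escaping with the standard library (Python 3) is to treat '%' as a safe character, so that existing %HH sequences are never escaped a second time. This is a sketch of that idea only; escape_path is a hypothetical name, and a stray '%' that is not part of an escape is left untouched here.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def escape_path(raw):
    """Percent-escape the path of a url without double-escaping."""
    parts = urlsplit(raw)
    # '%' is marked safe so that already-escaped sequences pass through.
    path = quote(parts.path, safe='/%')
    return urlunsplit(parts._replace(path=path))
```

Calling escape_path on its own output returns the same string, which is the property the text above describes.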

unescape

If you have a URL that might have been escaped before it was given to you, but you'd like to display something a little more meaningful than %C3%BCmlaut, you can unescape the path:

>>> print(url.parse('http://foo.com/%C3%BCmlaut').unescape().unicode)
http://foo.com/ümlaut
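The reverse direction can be sketched with urllib.parse.unquote from the standard library (Python 3), applied to the path only so that escapes in the query are left alone. The helper name unescape_path is hypothetical.

```python
from urllib.parse import unquote, urlsplit, urlunsplit


def unescape_path(raw):
    """Decode %HH escapes in the path of a url (UTF-8 is assumed)."""
    parts = urlsplit(raw)
    return urlunsplit(parts._replace(path=unquote(parts.path)))
```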

relative

Evaluate a relative path given a base url:

>>> url.parse('http://foo.com/a/b/c').relative('../foo').utf8
'http://foo.com/a/foo'
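The standard library performs the same kind of resolution with urllib.parse.urljoin (Python 3), which may be a useful point of comparison. The variable names below are illustrative only.

```python
from urllib.parse import urljoin

base = 'http://foo.com/a/b/c'

# '..' steps up from the directory containing 'c'.
parent = urljoin(base, '../foo')
# A bare name replaces the last path segment.
sibling = urljoin(base, 'd')
# A leading slash resolves against the host root.
rooted = urljoin(base, '/x')
```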

punycode

Non-ASCII hostnames must be punycoded before a DNS request is made for them. To this end, there's the punycode function:

>>> url.parse('http://ümlaut.com').punycode().utf8
'http://xn--mlaut-jva.com/'

unpunycode

If a url may have been punycoded before it's been handed to you, and you'd like to be able to display something nicer than http://xn--mlaut-jva.com/:

>>> print(url.parse('http://xn--mlaut-jva.com/').unpunycode().utf8)
http://ümlaut.com/
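Both directions can be tried on a bare hostname with Python 3's built-in 'idna' codec, which implements the older IDNA 2003 rules. The helper names punycode_host and unpunycode_host are hypothetical, and this sketch says nothing about which rules this package itself applies.

```python
def punycode_host(host):
    """Encode a unicode hostname to its ASCII (punycode) form."""
    return host.encode('idna').decode('ascii')


def unpunycode_host(host):
    """Decode a punycoded hostname back to unicode."""
    return host.encode('ascii').decode('idna')
```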

Other Functions

Not all functions are chainable -- some return a value other than a URL object:

  • encode(...) -- return a version of the url in an arbitrary encoding

Public Suffix List

This library comes bundled with a version of the public suffix list. However, it may not suit your needs (whether you need to stay pinned to an old list, or need to update to a new list). As such, you can provide the PSL you'd like to use, as a UTF-8 string:

import url

# Read it from a file
with open('path/to/my/psl') as fin:
    url.set_psl(fin.read())

# Grab it from the PSL site
import requests
url.set_psl(requests.get('https://publicsuffix.org/list/public_suffix_list.dat').content)
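To make the role of the list more concrete, here is a deliberately simplified sketch (Python 3) of how a suffix list can be used to split a hostname. It assumes that the public suffix plays the role of the top-level domain and that the pay-level domain is the suffix plus one more label. It ignores the wildcard ('*') and exception ('!') rules of the real list format, and the names load_suffixes and tld_and_pld are hypothetical, not part of this package.

```python
def load_suffixes(psl_text):
    """Collect the plain rules from PSL-formatted text."""
    suffixes = set()
    for line in psl_text.splitlines():
        rule = line.strip()
        # Skip blank lines and '//' comments.
        if not rule or rule.startswith('//'):
            continue
        # Simplification: wildcard and exception rules are ignored.
        if rule.startswith(('*', '!')):
            continue
        suffixes.add(rule.lower())
    return suffixes


def tld_and_pld(hostname, suffixes):
    """Return (suffix, suffix plus one label) for a hostname."""
    labels = hostname.lower().split('.')
    # Candidates are tried longest first, so 'co.uk' wins over 'uk'.
    for i in range(len(labels)):
        candidate = '.'.join(labels[i:])
        if candidate in suffixes:
            return candidate, '.'.join(labels[max(i - 1, 0):])
    # No rule matched: fall back to treating the last label as the suffix.
    return labels[-1], '.'.join(labels[-2:])
```

For example, with a list containing 'com', 'uk' and 'co.uk', the hostname 'www.example.co.uk' splits into 'co.uk' and 'example.co.uk'.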

Properties

Many attributes are available on URL objects:

  • scheme -- empty string if URL is relative
  • host -- None if URL is relative
  • hostname -- like host, but empty string if URL is relative
  • pld -- the pay-level domain, or an empty string if URL is relative
  • tld -- the top-level domain, or an empty string if URL is relative
  • port -- None if absent (or removed)
  • path -- always with a leading /
  • params -- string of params following the ; (with extra ;'s removed)
  • query -- string of queries following the ? (with extra ?'s and &'s removed)
  • fragment -- empty string if absent
  • absolute -- a bool indicating whether the URL is absolute
  • unicode -- a unicode version of the URL
  • utf8 -- a utf-8 version of the URL
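Several of these attributes have close counterparts on the result of urllib.parse.urlparse in Python 3's standard library, which can help when reading the list above. The snippet shows standard-library behavior only; pld and tld have no standard-library equivalent, and details such as the leading slash on path may differ from this package.

```python
from urllib.parse import urlparse

absolute = urlparse('http://foo.com:8080/a/b;p=1?q=2#frag')
relative = urlparse('foo/bar.html')

# Components of the absolute url, gathered for inspection.
pieces = (absolute.scheme, absolute.hostname, absolute.port,
          absolute.path, absolute.params, absolute.query, absolute.fragment)

# A relative url has an empty scheme and no hostname or port.
relative_pieces = (relative.scheme, relative.hostname, relative.port)
```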

Contentious Issues

Some questions that I still have outstanding:

Strip ?'s From Query Names?

If I have a query string ?a=1&?b=2, and I sanitize the params, should the resulting query string be ?a=1&?b=2 or ?a=1&b=2 (note the missing ? before the b in the second version)?

If not in the above example, what about in ?????a=1? Should the resulting query string be a mere ?a=1?

Properties

I'd like to support lazily-evaluated properties like hostname, netloc, etc.

Dictionary Access

I'd like to support dictionary-style access to parameters and query arguments, though I'm not sure how best to do it. My current thinking is that there will be one way of getting params, one for queries, and then one for either.
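Purely as an illustration of the three-accessor idea, and not as a proposal for the actual interface, the following Python 3 sketch builds dictionaries from the params, from the query, and from both, using urllib.parse. All three function names are hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def path_params(raw):
    """Dictionary of the ';'-separated params on the last path segment."""
    params = urlparse(raw).params
    # Rewrite ';' as '&' so parse_qs splits on it in every Python 3 version.
    return parse_qs(params.replace(';', '&'), keep_blank_values=True)


def query_args(raw):
    """Dictionary of the query arguments."""
    return parse_qs(urlparse(raw).query, keep_blank_values=True)


def either(raw):
    """Params and query arguments merged, params first."""
    merged = path_params(raw)
    for name, values in query_args(raw).items():
        merged.setdefault(name, []).extend(values)
    return merged
```

Each value is a list because a name may appear more than once, which is one of the questions any real design would have to settle.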

Authors

This represents code samples, unit tests and functions from Mozzers, including:

  • David Barts
  • Brandon Forehand
  • Dan Lecocq
