monperrus/crawler-user-agents

Language
Python
License
MIT License
Created over 10 years ago
Updated about 1 year ago


Syntactic patterns of HTTP user-agents used by bots / robots / crawlers / scrapers / spiders. Pull requests welcome ⭐

crawler-user-agents

This repository contains a list of HTTP user-agents used by robots, crawlers, and spiders, in a single JSON file.

Install

Direct download

Download the crawler-user-agents.json file from this repository directly.

Npm / Yarn

crawler-user-agents is deployed on npmjs.com: https://www.npmjs.com/package/crawler-user-agents

To install it with npm or yarn:

npm install --save crawler-user-agents
# OR
yarn add crawler-user-agents

In Node.js, you can require the package to get an array of crawler user agents.

const crawlers = require('crawler-user-agents');
console.log(crawlers);

Usage

Each pattern is a regular expression. It should work out of the box with your favorite regex library:

JavaScript: if (RegExp(entry.pattern).test(req.headers['user-agent'])) { ... }
PHP: add a slash before and after the pattern: if (preg_match('/'.$entry['pattern'].'/', $_SERVER['HTTP_USER_AGENT'])): ...
Python: if re.search(entry['pattern'], ua): ...
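
The Python one-liner above can be expanded into a small lookup. This is only a minimal sketch: it assumes a local copy of crawler-user-agents.json, and the helper names load_crawlers and find_crawler are hypothetical, not something this repository ships. The sample entry is illustrative and mirrors the shape of the JSON file.

```python
import json
import re


def load_crawlers(path="crawler-user-agents.json"):
    """Load the crawler entries from a local copy of the JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def find_crawler(user_agent, entries):
    """Return the first entry whose pattern matches user_agent, else None."""
    for entry in entries:
        if re.search(entry["pattern"], user_agent):
            return entry
    return None


# Illustrative entry in the same shape as the JSON file (hypothetical data set of one).
sample_entries = [
    {"pattern": "rogerbot", "url": "http://moz.com/help/pro/what-is-rogerbot-"}
]

match = find_crawler("rogerbot/2.3 example UA", sample_entries)
print(match["pattern"] if match else "not a known crawler")
```

In real use, the result of load_crawlers() would be passed in place of sample_entries. Scanning every pattern on each request is the simplest approach; precompiling the patterns once is a possible optimization if the check sits on a hot path.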

Contributing

I do welcome additions contributed as pull requests.

The pull requests should:

contain a single addition
specify a relevant, discriminating syntactic fragment (for example "totobot" and not "Mozilla/5 totobot v20131212.alpha1")
contain the pattern (generic regular expression), the discovery date (year/month/day) and the official URL of the robot
result in a valid JSON file (don't forget the comma between items)

Example:

{
  "pattern": "rogerbot",
  "addition_date": "2014/02/28",
  "url": "http://moz.com/help/pro/what-is-rogerbot-",
  "instances" : ["rogerbot/2.3 example UA"]
}
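
Before opening a pull request, an entry like the one above could be sanity-checked locally. The following is a rough sketch, not the repository's own validation: the check_entry helper and the REQUIRED_KEYS tuple are hypothetical, and the rules (required keys, a compilable pattern, a year/month/day date, instances matching the pattern) are assumptions drawn from the checklist above.

```python
import json
import re
from datetime import datetime

# Assumed from the contribution checklist; not an official schema.
REQUIRED_KEYS = ("pattern", "addition_date", "url")


def check_entry(entry):
    """Return a list of problems found in one entry (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in entry]
    if problems:
        return problems
    try:
        regex = re.compile(entry["pattern"])
    except re.error as exc:
        return [f"invalid regular expression: {exc}"]
    try:
        datetime.strptime(entry["addition_date"], "%Y/%m/%d")
    except ValueError:
        problems.append("addition_date is not in year/month/day form")
    # Every example user agent should be caught by the pattern.
    for instance in entry.get("instances", []):
        if not regex.search(instance):
            problems.append(f"instance does not match pattern: {instance}")
    return problems


# json.loads also confirms the snippet is valid JSON.
entry = json.loads("""
{
  "pattern": "rogerbot",
  "addition_date": "2014/02/28",
  "url": "http://moz.com/help/pro/what-is-rogerbot-",
  "instances" : ["rogerbot/2.3 example UA"]
}
""")
print(check_entry(entry))  # prints []
```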

License

The list is under the MIT License. The versions prior to Nov 7, 2016 were under a CC-SA license.

Related work

There are a few wrapper libraries that use this data to detect bots:

Voight-Kampff (Ruby)
isbot (Ruby)
crawlers (Clojure)
crawlerflagger (Go)
isBot (Node.JS)

Other systems for spotting robots, crawlers, and spiders that you may want to consider are:

Crawler-Detect (PHP)
BrowserDetector (PHP)
browscap (JSON files)
