Python Diffbot API Client
Preface
Identify and extract the important parts of any web page in Python! This client currently supports calls to Diffbot's Automatic APIs and Crawlbot.
Installation
To install, activate a new virtual environment and run the following command:
$ pip install -r requirements.txt
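If you have not yet created a virtual environment, one common approach is the built-in venv module (virtualenv works just as well):
$ python -m venv venv
$ source venv/bin/activate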
Configuration
To run the example, you must first configure a working API token in config.py:
$ cp config.py.example config.py; vim config.py;
Then replace the string "SOME_TOKEN" with your API token. Finally, to run the example:
$ python example.py
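config.py itself essentially just holds your token. A minimal sketch, assuming the template stores it in a single variable (use whatever variable name config.py.example actually defines):
# config.py -- sketch only; start from config.py.example
token = "SOME_TOKEN"  # replace with your Diffbot API token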
Usage
Article API
An example call to the Article API:
diffbot = DiffbotClient()
token = "SOME_TOKEN"
version = 2
url = "http://shichuan.github.io/javascript-patterns/"
api = "article"
response = diffbot.request(url, token, api, version=version)
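The call returns the parsed API response. To inspect the extracted fields (a sketch, assuming the response is a JSON-compatible Python object):
import json
# Pretty-print the full response to see what was extracted
# (the Article API returns fields such as title and text).
print(json.dumps(response, indent=2))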
Product API
An example call to the Product API:
diffbot = DiffbotClient()
token = "SOME_TOKEN"
version = 2
url = "http://www.overstock.com/Home-Garden/iRobot-650-Roomba-Vacuuming-Robot/7886009/product.html"
api = "product"
response = diffbot.request(url, token, api, version=version)
Image API
An example call to the Image API:
diffbot = DiffbotClient()
token = "SOME_TOKEN"
version = 2
url = "http://www.google.com/"
api = "image"
response = diffbot.request(url, token, api, version=version)
Analyze API
An example call to the Analyze API:
diffbot = DiffbotClient()
token = "SOME_TOKEN"
version = 2
url = "http://www.twitter.com/"
api = "analyze"
response = diffbot.request(url, token, api, version=version)
Crawlbot API
To start a new crawl, specify a crawl name, seed URLs, and the API via which URLs should be processed. An example call to the Crawlbot API:
token = "SOME_TOKEN"
name = "sampleCrawlName"
seeds = "http://www.twitter.com/"
api = "analyze"
sampleCrawl = DiffbotCrawl(token, name, seeds=seeds, api=api)
Omit "seeds" and "api" to load an existing crawl, or create a crawl as a placeholder.
To check the status of a crawl:
sampleCrawl.status()
To update a crawl:
maxToCrawl = 100
upp = "diffbot"
sampleCrawl.update(maxToCrawl=maxToCrawl, urlProcessPattern=upp)
To delete or restart a crawl:
sampleCrawl.delete()
sampleCrawl.restart()
To download crawl data:
sampleCrawl.download() # returns JSON by default
sampleCrawl.download(data_format="csv")
To pass additional arguments to a crawl:
sampleCrawl = DiffbotCrawl(token, name, seeds=seeds, api=api, maxToCrawl=100, maxToProcess=50, notifyEmail="[email protected]")
Testing
First install the test requirements with the following command:
$ pip install -r test_requirements.txt
Currently there are some simple unit tests that mock the API calls and return data from fixtures in the filesystem. From the project directory, simply run:
$ nosetests