scrapinghub/portia

Stars: 8,991
Rank: 4,026 (Top 0.08%)
Language: Python
License: BSD 3-Clause "New...
Created: over 10 years ago
Updated: about 1 year ago

Visual scraping for Scrapy

Portia

Portia is a tool that lets you scrape websites visually, with no programming knowledge required. You annotate a web page to identify the data you want to extract, and Portia uses those annotations to work out how to scrape data from similar pages.

Running Portia

The easiest way to run Portia is with Docker and the official Portia image:

docker run -v ~/portia_projects:/app/data/projects:rw -p 9001:9001 scrapinghub/portia
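
To make the two flags in that command easier to adapt, here is a minimal sketch that rebuilds the same command with a configurable host directory and host port. The helper name `portia_cmd` and its two positional arguments are hypothetical and are not part of Portia or Docker; the container path `/app/data/projects`, the container port `9001`, and the image name are taken from the command above. The script only prints the command and does not start a container.

```shell
#!/bin/sh
# Sketch only: `portia_cmd` is a hypothetical helper, not something Portia ships.
# It prints the `docker run` command shown above with two values made configurable.
portia_cmd() {
  projects_dir="${1:-$HOME/portia_projects}"  # host directory where projects are stored
  host_port="${2:-9001}"                      # host port to publish

  # -v host_dir:container_dir:rw  bind-mounts the host directory read-write
  # -p host_port:container_port   publishes the container's port 9001 on the host
  printf 'docker run -v %s:/app/data/projects:rw -p %s:9001 scrapinghub/portia\n' \
    "$projects_dir" "$host_port"
}

# Print the command for inspection before running it yourself.
portia_cmd "$HOME/portia_projects" 9001
```

With the default mapping, the Portia UI should then be reachable at http://localhost:9001, and projects should persist in the mounted host directory after the container stops.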

You can also set up a local instance with Docker Compose by cloning this repo and running the following from the repository root:

docker-compose up

For more detailed instructions, and alternatives to using Docker, see the Installation docs.

Documentation

Documentation is available on Read the Docs. Source files are in the docs directory.

splash

Lightweight, scriptable browser as a service with an HTTP API

dateparser

python parser for human readable dates

frontera

A scalable frontier for web crawlers

slackbot

A chat bot for Slack (https://slack.com).

extruct

Extract embedded metadata from HTML markup

scrapyrt

HTTP API for Scrapy spiders

python-crfsuite

A python binding for crfsuite

spidermon

Scrapy Extension for monitoring spiders execution.

price-parser

Extract price amount and currency symbol from a raw text string

article-extraction-benchmark

Article extraction benchmark: dataset and evaluation scripts

webstruct

NER toolkit for HTML data

python-scrapinghub

A client interface for Scrapinghub's API

adblockparser

Python parser for Adblock Plus filters

js2xml

Convert Javascript code to an XML document

testspiders

Useful test spiders for Scrapy

scrapy-training

Scrapy Training companion code

skinfer

Skinfer is a tool for inferring and merging JSON schemas

sample-projects

Sample projects showcasing Scrapinghub tech

shub

Scrapinghub Command Line Client

python-simhash

An efficient simhash implementation for python

scrapy-poet

Page Object pattern for Scrapy

number-parser

Parse numbers written in natural language

mdr

A Python library to detect and extract listing data from HTML pages.

web-poet

Web scraping Page Objects core library

aile

Automatic Item List Extraction

wappalyzer-python

UNMAINTAINED Python wrapper for Wappalyzer (utility that uncovers the technologies used on websites)

pydepta

A python implementation of DEPTA

scrapinghub-stack-scrapy

Software stack with latest Scrapy and updated deps

aduana

Frontera backend to guide a crawl using PageRank, HITS or other ranking algorithms based on the link structure of the web graph, even when making big crawls (one billion pages).

scrapy-autoextract

Zyte Automatic Extraction integration for Scrapy

scrapy-autounit

Automatic unit test generation for Scrapy.

learn.scrapinghub.com

Scrapinghub Learning Center. Report issues in Jira: https://scrapinghub.atlassian.net/projects/WEB

portia2code

arche

Analyze scraped data

scmongo

MongoDB extensions for Scrapy

exporters

Exporters is an extensible export pipeline library that supports filter, transform and several sources and destinations

webpager

Paginating the web

scrapy-frontera

More flexible and featured Frontera scheduler for Scrapy

page_clustering

A simple algorithm for clustering web pages, suitable for crawlers

flatson

Tool to flatten stream of JSON-like objects, configured via schema

scaws

Extensions for using Scrapy on Amazon AWS

docker-images

scrapylib

Collection of Scrapy utilities (extensions, middlewares, pipelines, etc)

pycon-speakers

Speakers Spider (PyCon 2014 sprint)

docker-devpi

pypi caching service using devpi and docker

crawlera-tools

scrapinghub-entrypoint-scrapy

Scrapy entrypoint for Scrapinghub job runner

scrapy-mosquitera

Restrict crawl and scraping scope using matchers.

andi

Library for annotation-based dependency injection

kafka-scanner

High Level Kafka Scanner

autoextract-spiders

Pre-built Scrapy spiders for AutoExtract

python-cld2

Python bindings for CLD2.

product-extraction-benchmark

Jupyter Notebook

python-hubstorage

Deprecated HubStorage client library - please use python-scrapinghub>=1.9.0 instead

shublang

Pluggable DSL that uses pipes to perform a series of linear transformations to extract data

shub-workflow

shubc

Go bindings for Scrapinghub HTTP API and a sweet command line tool for Scrapy Cloud

scrapinghub-stack-portia

Software stack used to run Portia spiders in Scrapinghub cloud

tutorials

pastebin

navscraper

Vanguard ETF NAV scraper

varanus

A command line spider monitoring tool

hcf-backend

Crawl Frontier HCF backend

pydatanyc

autoextract-poet

web-poet definitions for AutoExtract

collection-scanner

HubStorage collection scanner library

locode

adblockgoparser

Golang parser for Adblock Plus filters

autoextract-examples

Jupyter Notebook

webstruct-demo

HTTP demo for https://github.com/scrapinghub/webstruct

shub-image

Deprecated client side tool to prepare docker images to run crawlers in Scrapinghub - please use shub>=2.5.0 instead

docker-cloudera-manager

Run Cloudera Manager in docker

custom-images-examples

Examples of custom images running on Scrapinghub platform

hubstorage-frontera

Hubstorage crawl frontier backend for Frontera

httpation

xpathcsstutorial

[Work in progress] XPath & CSS for web scraping tutorial

epmdless_dist

egraylog

scrapinghub-conda-recipes

Conda packages for scrapinghub channel

pydaybot

Demo bot for Python Day Uruguay 2011

erl-iputils

jupyterhub-stacks

Docker images for a jhub cluster

cld2

Compact Language Detector 2

scrapinghub-stack-hworker

[DEPRECATED] Software stack fully compatible with Scrapy Cloud 1.0

crawlera.com

crawlera.com website

discourse-sso-google

Use Google as Single-Sign-On provider for Discourse

pkg-opengrok

Ubuntu packaging for OpenGrok