Take the hassle out of web scraping

morph.io: A scraping platform

  • A Heroku for Scrapers
  • All code and collaboration through GitHub
  • Write your scrapers in Ruby, Python, PHP, Perl or JavaScript (NodeJS, PhantomJS)
  • Simple API to grab data
  • Schedule scrapers or run manually
  • Process isolation via Docker
  • Email alerts for broken scrapers

Dependencies

Ruby, Docker, MySQL, SQLite 3, Redis, mitmproxy. (See below for more details about installing Docker)

Development is supported on Linux (Ubuntu 20.04) and Mac OS X.

Repositories

User-facing:

Docker images:

Installing Docker

On Linux

Just follow the instructions on the Docker site.

Your user account should be able to manipulate Docker (just add your user to the docker group).

On Mac OS X

Install Docker for Mac.

Starting up Elasticsearch

Morph needs Elasticsearch to run. To make development easier, we use Docker to run Elasticsearch:

docker-compose up
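
Before moving on, it can help to confirm that Elasticsearch is accepting connections. The Ruby sketch below only tests whether a TCP port is open. The helper name port_open? is hypothetical, and port 9200 is Elasticsearch's usual default, assumed here rather than taken from this project's docker-compose configuration.

```ruby
require "socket"

# Returns true if something accepts TCP connections on host:port.
# Hypothetical helper for illustration; not part of morph.
def port_open?(host, port, timeout: 1)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError, IOError
  false
end

# Assumption: Elasticsearch is published on its default port, 9200.
# puts port_open?("127.0.0.1", 9200)
```

An open port only shows that a process is listening; it does not prove that Elasticsearch has finished starting up.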

To Install Morph

bundle install
cp config/database.yml.example config/database.yml
cp env-example .env

Edit config/database.yml with your database settings

Tunnel GitHub webhook traffic back to your local development machine

We use ngrok, a tool that makes it easy to tunnel internet traffic to a local development machine. First download ngrok if you don't have it already, then run:

ngrok http 5100

Make note of the http://*.ngrok.io forwarding URL.
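
If you would rather not copy the URL by hand, the ngrok agent normally exposes a local inspection API that lists the active tunnels as JSON. The sketch below parses that response. The helper name forwarding_urls is hypothetical, and the address http://127.0.0.1:4040/api/tunnels is ngrok's usual default, assumed here rather than stated by this README.

```ruby
require "json"

# Extract the public forwarding URLs from the JSON body returned by the
# ngrok agent's local tunnels endpoint.
# Hypothetical helper for illustration; not part of morph.
def forwarding_urls(tunnels_json)
  data = JSON.parse(tunnels_json)
  data.fetch("tunnels", []).map { |tunnel| tunnel["public_url"] }.compact
end

# Assumption: ngrok is running with its default inspection address.
# require "net/http"
# body = Net::HTTP.get(URI("http://127.0.0.1:4040/api/tunnels"))
# puts forwarding_urls(body)
```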

Creating a GitHub Application

You'll need to create an application on GitHub so that morph.io can talk to GitHub. We've pre-filled most of the important fields for a few different configurations below:

You will need to add and change a few values manually:

  • Disable "Expire user authorization tokens"
  • Add an image - you can use the standard logo at app/assets/images/logo.png (you can add this after the app is created)
  • If the webhooks are active and being used in production (currently not the case) then you'll also need to add a "Webhook secret" for security.

Next you'll need to fill in some values in the .env file which come from the GitHub App that you've just created.

  • GITHUB_APP_ID - Look for "App ID" near the top of the page. This should be an integer
  • GITHUB_APP_NAME - Look for "Public link". The name is what appears after "https://github.com/apps/". It's essentially a URL-friendly version of the name you gave the app.
  • GITHUB_APP_CLIENT_ID - Look for "Client ID" near the top of the page.
  • GITHUB_APP_CLIENT_SECRET - Go to "Generate a new client secret".

Also, a private key for the GitHub app is needed. This can be generated by clicking the "Generate a private key" button and will be automatically downloaded. Move and rename it to config/morph-github-app.private-key.pem.
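
A quick way to catch a forgotten value is to scan the .env file for the four GitHub settings listed above. The sketch below is a hypothetical helper, not part of morph. It assumes simple KEY=value lines and does not handle quoting or the other syntax that dotenv supports.

```ruby
# The GitHub App settings named in this README.
def required_github_keys
  %w[GITHUB_APP_ID GITHUB_APP_NAME GITHUB_APP_CLIENT_ID GITHUB_APP_CLIENT_SECRET]
end

# Parse simple KEY=value lines, skipping blank lines and comments.
def parse_env(text)
  text.each_line.with_object({}) do |line, settings|
    line = line.strip
    next if line.empty? || line.start_with?("#")

    key, value = line.split("=", 2)
    next if value.nil?

    settings[key.strip] = value.strip
  end
end

# Return the required keys that are absent or have an empty value.
def missing_github_settings(env_text)
  settings = parse_env(env_text)
  required_github_keys.select { |key| settings[key].to_s.empty? }
end

# The README says the App ID should be an integer.
def app_id_valid?(env_text)
  parse_env(env_text)["GITHUB_APP_ID"].to_s.match?(/\A\d+\z/)
end

# Example: puts missing_github_settings(File.read(".env"))
```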

Now set up the databases:

bundle exec dotenv rake db:setup

Now you can start the server

bundle exec dotenv foreman start

and point your browser at http://127.0.0.1:3000

To get started, log in with GitHub. There is a simple admin interface accessible at http://127.0.0.1:3000/admin. To access this, run the following to give your account admin rights:

bundle exec rake app:promote_to_admin

Running tests

If you're running Guard (see below), the tests will also run automatically when you change a file.

By default, RSpec will skip tests that have been tagged as being slow. To change this behaviour, add the following to your .env:

RUN_SLOW_TESTS=1

By default, RSpec will run certain tests against a running Docker server. These tests are quite slow, but have not been tagged as slow. To stop RSpec from running these tests, add the following to your .env:

DONT_RUN_DOCKER_TESTS=1
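
To make the interaction between the two variables concrete, here is a minimal sketch of how such flags could map onto RSpec exclusion filters. It is an illustration only, not morph's actual spec configuration, and the tag names :slow and :docker are assumptions.

```ruby
# Decide which tagged groups of tests to exclude, based on the environment.
# Hypothetical helper; the tag names :slow and :docker are assumptions.
def excluded_tags(env)
  tags = {}
  tags[:slow] = true unless env["RUN_SLOW_TESTS"]
  tags[:docker] = true if env["DONT_RUN_DOCKER_TESTS"]
  tags
end

# In a spec helper this could feed RSpec's exclusion filter:
# RSpec.configure { |config| config.filter_run_excluding(excluded_tags(ENV)) }
```

With neither variable set, only the slow tests are skipped. With both set, the slow tests run and the Docker tests are skipped.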

Guard Livereload

We use Guard and Livereload so that whenever you edit a view in development the web page gets automatically reloaded. It's a massive time saver when you're doing design or lots of work in the view. To make it work, run:

bundle exec guard

Guard will also run tests when needed. Some of these are integration tests that run against a running Docker server, and they are very slow. To disable them, run:

DONT_RUN_DOCKER_TESTS=1 bundle exec guard

Mail in development

By default, mail sent in development goes to Mailcatcher. To install it:

gem install mailcatcher

Deploying to production

This section is only relevant if you're deploying to a production server, so most people can skip it.

Ansible Vault

We're using Ansible Vault to encrypt certain files, like the private key for the SSL certificate.

To make this work you will need to put the password in a file at ~/.infrastructure_ansible_vault_pass.txt. This is the same password as used in the openaustralia/infrastructure GitHub repository.
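
A missing or empty password file tends to surface later as a confusing decryption error, so it can be worth checking up front. The sketch below is a hypothetical helper, not part of morph, and only checks that the file exists and is not blank.

```ruby
# Returns nil when the password file looks usable, otherwise a short
# description of the problem. Hypothetical helper; not part of morph.
def vault_password_problem(path)
  full_path = File.expand_path(path)
  return "#{full_path} does not exist" unless File.file?(full_path)
  return "#{full_path} is empty" if File.read(full_path).strip.empty?

  nil
end

# problem = vault_password_problem("~/.infrastructure_ansible_vault_pass.txt")
# warn problem if problem
```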

Restarting Discourse

Discourse runs in a container and should usually be restarted automatically by Docker.

However, if the container goes away for some reason, it can be restarted:

root@morph:/var/discourse# ./launcher rebuild app

This will pull down the latest Docker image, rebuild, and restart the container.

Production devops development

This method defaults to creating a 4 GB VirtualBox VM, which can strain an 8 GB Mac. We suggest tweaking the Vagrantfile to restrict RAM usage to 2 GB at first, or using a machine with at least 12 GB of RAM.
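
For reference, a memory limit for the VirtualBox provider is usually set in a Vagrantfile along the following lines. This is a configuration fragment under assumptions, not a copy of this project's Vagrantfile: the machine name "local" is taken from the vagrant up local command used below, and the surrounding structure of the real file may differ.

```ruby
# Sketch of a Vagrantfile fragment; adapt it to the existing file
# rather than pasting it in wholesale.
Vagrant.configure("2") do |config|
  config.vm.define "local" do |local|
    local.vm.provider "virtualbox" do |vb|
      # Memory is given in megabytes: 2048 MB = 2 GB.
      vb.memory = 2048
    end
  end
end
```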

Install Vagrant, VirtualBox and Ansible.

Install a couple of Vagrant plugins: vagrant plugin install vagrant-hostsupdater vagrant-disksize

Install rbenv and ruby-build.

If on Ubuntu, install libreadline-dev and libsqlite3-dev: sudo apt install libreadline-dev libsqlite3-dev

Install the required ruby version: rbenv install

Install capistrano: gem install capistrano

Run make roles to install some required ansible roles.

Run vagrant up local. This will build and provision a box that looks and acts like production at dev.morph.io.

Once the box is created and provisioned, deploy the application to your Vagrant box:

cap local deploy

Now visit https://dev.morph.io/

Production provisioning and deployment

To deploy morph.io to production, normally you'll just want to deploy using Capistrano:

cap production deploy

When you've changed the Ansible playbooks to modify the infrastructure you'll want to run:

make ansible

SSL certificates

We're using Let's Encrypt for SSL certificates. It's not 100% automated. On a completely fresh install (with a new domain) as root:

certbot --nginx certonly -m [email protected] --agree-tos

It should show something like this:

Which names would you like to activate HTTPS for?
-------------------------------------------------------------------------------
1: morph.io
2: api.morph.io
3: faye.morph.io
4: help.morph.io

Leave your answer blank, which will install the certificate for all of them.

Installing certificates for local vagrant build

sudo certbot certonly --manual -d dev.morph.io --preferred-challenges dns -d api.dev.morph.io -d faye.dev.morph.io -d help.dev.morph.io

Scraper<->mitmdump SSL

Scrapers talk out to the internet by being routed through the mitmdump2 proxy container. The default container you'll get on a devops install has no SSL certificates. This makes it easy for traffic to get out, but means we can't replicate some problems that occur when the SSL validation fails.

To work around this, you'll have to rebuild the mitmdump container. Look in /var/www/current/docker_images/morph-mitmdump; there's a Makefile that will aid in building the new image.

Once that's done, you'll need to build a new version of the openaustralia/buildstep:

  • cd
  • git clone https://github.com/openaustralia/buildstep.git
  • cd buildstep
  • cp /var/www/current/docker_images/morph-mitmdump/mitmproxy/mitmproxy-ca-cert.pem .
  • docker image build -t openaustralia/buildstep:latest .

You should now be able to see in docker image list --all that your new image is ready. The next time you run a scraper it will be rebuilt using the new buildstep image.

How to contribute

If you find what looks like a bug:

  • Check the GitHub issue tracker to see if anyone else has reported the issue.
  • If you don't see anything, create an issue with information on how to reproduce it.

If you want to contribute an enhancement or a fix:

  • Fork the project on GitHub.
  • Make your changes with tests.
  • Commit the changes without making changes to any files that aren't related to your enhancement or fix.
  • Send a pull request.

We maintain a list of issues that are easy fixes. Fixing one of these is a great way to get started while you get familiar with the codebase.

Copyright & License

Copyright OpenAustralia Foundation Limited. Licensed under the Affero GPL. See LICENSE file for more details.
