  • Stars: 1,262
  • Rank 37,159 (Top 0.8 %)
  • Language: Python
  • Created about 4 years ago
  • Updated 7 months ago

Repository Details

🤖 Scrape data from HTML websites automatically by just providing examples

mlscraper: Scrape data from HTML pages automatically

mlscraper extracts structured data from HTML automatically, so you don't have to specify nodes or CSS selectors by hand. You train it by providing a few examples of your desired output. It then derives the extraction rules for you, and afterwards you can extract data from any new page you provide.

Image showing how mlscraper turns html into data objects

Background Story

Many services for crawling and scraping automation allow you to select data in a browser and get JSON results in return. No need to specify CSS selectors or anything else.

I've been wondering for a long time why there's no open-source solution that does something like this. So here's my attempt at creating a Python library to enable automatic scraping.

All you have to do is define some examples of scraped data. mlscraper will figure out everything else and return clean data.

How it works

After you've defined the data you want to scrape, mlscraper will:

  • find your samples inside the HTML DOM
  • determine which rules/methods to apply for extraction
  • extract the data for you and return it in a dictionary
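
To make the three steps above concrete, here is a minimal toy sketch using only the Python standard library. It is not mlscraper's actual algorithm or API: the helper names (TextIndex, index, learn_rules, extract) and the two HTML snippets are made up for illustration, and the "rule" is simply the tag and class of the element whose text equals the sample value. mlscraper itself searches for more robust rules.

```python
from html.parser import HTMLParser


class TextIndex(HTMLParser):
    """Toy helper (hypothetical): record ((tag, class), text) for each text node."""

    def __init__(self):
        super().__init__()
        self._stack = []  # currently open elements as (tag, class) pairs
        self.nodes = []   # ((tag, class), text) in document order

    def handle_starttag(self, tag, attrs):
        self._stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # close the innermost open element with this tag name
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            self.nodes.append((self._stack[-1], text))


def index(html):
    parser = TextIndex()
    parser.feed(html)
    parser.close()
    return parser.nodes


def learn_rules(html, sample):
    """Steps 1 and 2: find each sample value and remember where it lives."""
    nodes = index(html)
    rules = {}
    for field, value in sample.items():
        matches = [key for key, text in nodes if text == value]
        if not matches:
            raise ValueError(f"sample value for {field!r} not found in page")
        rules[field] = matches[0]  # simplistic: take the first match
    return rules


def extract(html, rules):
    """Step 3: apply the learned rules to a new page and return a dictionary."""
    nodes = index(html)
    return {
        field: next((text for key, text in nodes if key == rule), None)
        for field, rule in rules.items()
    }


# Made-up training and target pages (assumed markup, for illustration only)
TRAIN = ('<div><h3 class="author-title">Albert Einstein</h3>'
         '<span class="author-born-date">March 14, 1879</span></div>')
NEW = ('<div><h3 class="author-title">J.K. Rowling</h3>'
       '<span class="author-born-date">July 31, 1965</span></div>')

rules = learn_rules(TRAIN, {"name": "Albert Einstein", "born": "March 14, 1879"})
result = extract(NEW, rules)
print(result)
# prints {'name': 'J.K. Rowling', 'born': 'July 31, 1965'}
```

With a single sample, a rule like this can easily overfit to one page, which is why providing several samples is the better practice.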

Getting started

mlscraper is close to its 1.0 release. To try the release candidate, use pip install --pre mlscraper. You can also install the latest (unstable) development version of mlscraper via pip install git+https://github.com/lorey/mlscraper#egg=mlscraper, e.g. to check new features or to see if a bug has been fixed already. Please note that until the 1.0 release, pip install mlscraper will install an outdated 0.* version.

To get started with a simple scraper, check out the basic sample below.

import requests
from mlscraper.html import Page
from mlscraper.samples import Sample, TrainingSet
from mlscraper.training import train_scraper

# fetch the page to train
einstein_url = 'http://quotes.toscrape.com/author/Albert-Einstein/'
resp = requests.get(einstein_url)
assert resp.status_code == 200

# create a sample for Albert Einstein
# please add at least two samples in practice to get meaningful rules!
training_set = TrainingSet()
page = Page(resp.content)
sample = Sample(page, {'name': 'Albert Einstein', 'born': 'March 14, 1879'})
training_set.add_sample(sample)

# train the scraper with the created training set
scraper = train_scraper(training_set)

# scrape another page
resp = requests.get('http://quotes.toscrape.com/author/J-K-Rowling')
result = scraper.get(Page(resp.content))
print(result)
# returns {'name': 'J.K. Rowling', 'born': 'July 31, 1965'}

Check the examples directory for usage examples until further documentation arrives.

Development

See CONTRIBUTING.rst

Related work

I originally called this autoscraper, but while I was working on it someone else released a library with exactly the same name. Check it out here: autoscraper. Also, while the project was initially driven by machine learning, using statistics to search for heuristics turned out to be faster and to require less training data. But since the name is memorable, I'll keep it.

More Repositories

1

social-media-profiles-regexs

📇 Extract social media profiles and more with regular expressions
Python
602
star
2

github-stars-by-topic

⭐ Generate a list of your GitHub stars by topic - automatically!
Python
68
star
3

socials

πŸ‘¨β€πŸ‘©β€πŸ‘¦ Social account detection and extraction in Python, e.g. for crawling/scraping.
Python
47
star
4

personal-crm

🗂 Minimalist personal CRM to keep in touch with contacts
Python
38
star
5

list-of-countries

List of all countries in different formats (ISO, tld, capital, language, population)
PHP
22
star
6

totally-not-jarvis

🤖 My personal assistant
Python
21
star
7

top-regional-repositories

🌍 The most-relevant repositories for all countries and many cities worldwide.
20
star
8

obsi

💎 supercharge your note-taking with index pages, Anki decks, calendar pages, and more.
Python
18
star
9

socials-api

πŸ‘¨β€πŸ‘©β€πŸ‘§β€πŸ‘¦ (Rest) API to extract social media profiles from websites or specific URLs
Python
17
star
10

resume

📄 Karl Lorey's resume
TeX
7
star
11

envato-cli

◼️ command line interface for envato market (e.g. themeforest)
Python
7
star
12

programmermap

Find programmers and interesting projects near you and worldwide.
HTML
7
star
13

hubspot-contact-import

👥 Import Xing contacts and vCards into Hubspot CRM
Python
6
star
14

awesome-hubspot

Awesome list of HubSpot tools and libraries
6
star
15

hubspot-reporting

📈 Creating diagrams from HubSpot automatically.
Python
6
star
16

meeting-bot

📔 Telegram bot that reminds you to create meeting notes in your Hubspot CRM
Python
3
star
17

data-intensive-latex-documents

Python framework for data-intensive LaTeX documents.
Python
2
star
18

pflichtenheft

A requirements specification (Pflichtenheft) in LaTeX
2
star
19

karllorey.com

👀 My personal website built with Next.js
JavaScript
2
star
20

laravel-latex

A Laravel package for handling LaTeX input and output
PHP
2
star
21

lorey

About me
2
star
22

dotfiles

⚫ dotfiles for awesomewm, zsh, vimperator
Lua
1
star
23

roadgenius.de

CSS
1
star
24

screeps

My Screeps bot or "me writing JavaScript is like trying to write poems as a first-grader"
JavaScript
1
star
25

mlscraper-experiments

HTML
1
star
26

pretendtobeworking.com

4 hour venture - a project completed within four hours at PionierGarage, KIT, Karlsruhe
JavaScript
1
star
27

find-underpriced-cars

🚙 Python tool that uses a scraper and Machine Learning to find underpriced cars online.
Python
1
star