  • Stars: 328
  • Rank 128,352 (Top 3 %)
  • Language: Python
  • License: BSD 3-Clause "New...
  • Created over 9 years ago
  • Updated over 1 year ago

Repository Details

🌅 next generation web crawling using machine intelligence

sky is a web scraping framework, implemented with the latest Python versions in mind (3.5+). It uses the asynchronous asyncio framework, as well as many popular modules and extensions.

Most importantly, it aims for next generation web crawling, where machine intelligence is used to speed up development and to improve the maintenance and reliability of crawling.

It mainly does this by considering the user to be interested in content from domains, not just a collection of single pages (templating approach).
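To make the asyncio-based approach concrete, here is a minimal sketch of bounded concurrent fetching. It is not sky's actual API: `fetch`, `crawl` and `max_concurrency` are hypothetical names, and `fetch` is a stand-in that simulates a request instead of calling a real HTTP client such as aiohttp. It also uses `asyncio.run`, which requires Python 3.7+.

```python
import asyncio


async def fetch(url: str) -> str:
    """Hypothetical stand-in for a real HTTP request (e.g. via aiohttp)."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"<html><title>{url}</title></html>"


async def crawl(urls, max_concurrency: int = 5) -> dict:
    """Fetch all urls concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url):
        # The semaphore caps how many fetches run at the same time.
        async with semaphore:
            return url, await fetch(url)

    pairs = await asyncio.gather(*(bounded_fetch(u) for u in urls))
    return dict(pairs)


if __name__ == "__main__":
    urls = [f"http://example.com/{i}" for i in range(20)]
    pages = asyncio.run(crawl(urls))
    print(len(pages), "pages fetched")
```

Because the requests overlap while waiting on the network, total time is governed by the concurrency limit rather than by the number of pages fetched one after another.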

See it live in action with a news website YOU propose:

  • Locally (view demo)
  • Remotely (needs online hosting)

Demo

Note that the following is only meant as a demo of some kind of app that could be built upon the scraping framework.

Make no mistake: the goal is to provide a smart-scraper, not some ugly UI.

Run:

  • Install using pip: pip3 install -U sky
  • Run sky view at the command line (use -port PORT to change port)
  • Visit localhost:7900
  • Enter a Domain/URL and see the result after clicking [>>>].

The demo uses a standard configuration that can easily be improved on when setting up a project.



Similar data (title, body, publish_date, images etc) will be very easy to use in your own applications.

Features/Goals

These are the features/goals of sky. Items marked with a checkmark have been accomplished:

  • Really fast, due to Python 3.5+ new asyncio/aiohttp libraries, based on 500lines/crawler
  • Smart, due to considering crawling of websites instead of single pages
  • Boilerplate FREE, removes crappy content (images, text, etc) that does not belong on pages
  • Nice API, carefully crafted, easily extendible
  • Open-source, democracy driven, with actual support
  • Free, versus enormous costs for even medium scale projects using (worse) online services
  • Link-graph-analysis, find out what a domain "looks" like
  • Batteries included, crawl any news website without any configuration
  • Automatic Natural Language Processing, detecting keywords in text automatically
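As an illustration of the link-graph-analysis goal listed above, the following sketch builds a same-domain link graph from already downloaded pages using only the standard library. The names `LinkExtractor`, `build_link_graph` and `in_degree` are assumptions made for this example and are not taken from sky.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def build_link_graph(pages: dict) -> dict:
    """Map each page URL to the set of same-domain URLs it links to."""
    graph = defaultdict(set)
    for url, html in pages.items():
        parser = LinkExtractor()
        parser.feed(html)
        domain = urlparse(url).netloc
        for href in parser.hrefs:
            target = urljoin(url, href)  # resolve relative links
            if urlparse(target).netloc == domain:
                graph[url].add(target)
    return dict(graph)


def in_degree(graph: dict) -> dict:
    """Count how many pages link to each target URL."""
    counts = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return dict(counts)
```

Pages with a high in-degree tend to be hubs such as the front page or section indexes, which gives a first impression of how a domain is structured.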

Installation

Use pip to install sky:

pip3 install -U sky

This will install only the required packages. Storing data on the local system does not require any other packages.

To store data, the following optional backends are currently available: elasticsearch, cloudant and ZODB.

Using the package

To set up a project/crawling service, visit this readme for a "Getting started" guide.

Contribute

It is very much appreciated if you'd like to contribute in one or more of the following areas:

  • More Backends
  • Documentation/tests
  • Improvement of detection
  • NLP

Templating approach

By considering crawl content to originate from a domain, rather than from individual pages, the following will be possible:

  • ✓ Drop duplicate content (menus, texts, images)
  • ✓ Provide error checking tools (making sure no bad documents slip by)
  • Detect whether a website changed the layout (causing non-sky scrapers to fail)
  • Understand sections of a website, such as comments, forum posts, related links etc
  • Consider which pages are linked to which (star graph)
  • Figure out the content pages by just pointing at the domain
  • Relate pages (page A is related by content to page B)
  • Consider an optimal re-crawling path
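The first item, dropping duplicate content, can be sketched as follows. This is a simplified illustration of the domain-level idea, not sky's implementation: it assumes each page has already been split into text blocks, and the function name `drop_repeated_blocks` and the `max_share` threshold are hypothetical.

```python
from collections import Counter


def drop_repeated_blocks(pages: dict, max_share: float = 0.5) -> dict:
    """Remove text blocks that occur on more than max_share of the pages.

    pages maps a URL to the list of text blocks found on that page.
    """
    # Count on how many pages each block appears (at most once per page).
    page_counts = Counter()
    for blocks in pages.values():
        page_counts.update(set(blocks))

    limit = max_share * len(pages)
    return {
        url: [block for block in blocks if page_counts[block] <= limit]
        for url, blocks in pages.items()
    }


if __name__ == "__main__":
    pages = {
        "/a": ["Home | About", "Story A", "Copyright Example"],
        "/b": ["Home | About", "Story B", "Copyright Example"],
        "/c": ["Home | About", "Story C"],
    }
    print(drop_repeated_blocks(pages))
```

Blocks such as menus and footers recur across most pages of a domain and are filtered out, while article text that appears on a single page is kept. A scraper that only sees one page at a time has no such signal to work with.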
