• Stars: 1,511
• Rank: 31,023 (Top 0.7 %)
• Language: Go
• License: BSD 2-Clause "Sim...
• Created: over 9 years ago
• Updated: almost 8 years ago

Repository Details

A simple, higher level interface for Go web scraping.

scrape

When scraping with Go, I find myself redefining tree traversal and other utility functions.

This package is a place to put some simple tools which build on top of the Go HTML parsing library.

For the full interface, check out the godoc.

Sample

Scrape defines traversal functions like Find and FindAll while attempting to be generic. It also defines convenience functions such as Attr and Text.

// Parse the page
root, err := html.Parse(resp.Body)
if err != nil {
    // handle error
}
// Search for the title
title, ok := scrape.Find(root, scrape.ByTag(atom.Title))
if ok {
    // Print the title
    fmt.Println(scrape.Text(title))
}
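
To make the idea of generic, matcher-driven traversal concrete, here is a minimal sketch that uses only the standard library. It is not the package's implementation: `Node`, `Matcher`, `ByTag`, `Find`, `FindAll`, and `sampleTree` below are hypothetical stand-ins defined for illustration, and the real functions operate on `*html.Node` with the exact semantics described in the godoc.

```go
package main

import "fmt"

// Node is a hypothetical, stripped-down stand-in for html.Node.
type Node struct {
	Tag      string
	Text     string
	Children []*Node
}

// Matcher reports whether a node is one the caller is looking for.
type Matcher func(*Node) bool

// ByTag returns a Matcher that accepts nodes with the given tag name.
func ByTag(tag string) Matcher {
	return func(n *Node) bool { return n.Tag == tag }
}

// Find returns the first match in depth-first (pre-order) order.
func Find(root *Node, m Matcher) (*Node, bool) {
	if root == nil {
		return nil, false
	}
	if m(root) {
		return root, true
	}
	for _, c := range root.Children {
		if n, ok := Find(c, m); ok {
			return n, true
		}
	}
	return nil, false
}

// FindAll collects every match in depth-first (pre-order) order.
func FindAll(root *Node, m Matcher) []*Node {
	var out []*Node
	if root == nil {
		return out
	}
	if m(root) {
		out = append(out, root)
	}
	for _, c := range root.Children {
		out = append(out, FindAll(c, m)...)
	}
	return out
}

// sampleTree builds a tiny document with one title and two links.
func sampleTree() *Node {
	return &Node{Tag: "html", Children: []*Node{
		{Tag: "head", Children: []*Node{{Tag: "title", Text: "Example"}}},
		{Tag: "body", Children: []*Node{
			{Tag: "a", Text: "first"},
			{Tag: "p", Children: []*Node{{Tag: "a", Text: "second"}}},
		}},
	}}
}

func main() {
	root := sampleTree()
	if title, ok := Find(root, ByTag("title")); ok {
		fmt.Println(title.Text)
	}
	for _, a := range FindAll(root, ByTag("a")) {
		fmt.Println(a.Text)
	}
}
```

Because a matcher is just a function, the same traversal code can serve any search criterion, which is the sense in which the interface tries to stay generic.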

A full example: Scraping Hacker News

package main

import (
	"fmt"
	"net/http"

	"github.com/yhat/scrape"
	"golang.org/x/net/html"
	"golang.org/x/net/html/atom"
)

func main() {
	// request and parse the front page
	resp, err := http.Get("https://news.ycombinator.com/")
	if err != nil {
		panic(err)
	}
	// close the response body once parsing is done
	defer resp.Body.Close()
	root, err := html.Parse(resp.Body)
	if err != nil {
		panic(err)
	}

	// define a matcher
	matcher := func(n *html.Node) bool {
		// must check for nil values
		if n.DataAtom == atom.A && n.Parent != nil && n.Parent.Parent != nil {
			return scrape.Attr(n.Parent.Parent, "class") == "athing"
		}
		return false
	}
	// grab all articles and print them
	articles := scrape.FindAll(root, matcher)
	for i, article := range articles {
		fmt.Printf("%2d %s (%s)\n", i, scrape.Text(article), scrape.Attr(article, "href"))
	}
}
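
The nil checks in the matcher matter because a matcher is called on nodes at any depth, including ones with no parent or no grandparent. The following is a small standard-library sketch of that guard, assuming a hypothetical `node` type with just a tag, a class, and a parent pointer; `node`, `newChild`, and `isArticleLink` are illustrative names, not part of the package.

```go
package main

import "fmt"

// node is a hypothetical stand-in for html.Node with only the
// fields this sketch needs.
type node struct {
	tag    string
	class  string
	parent *node
}

// newChild creates a node and links it to its parent (which may be nil).
func newChild(parent *node, tag, class string) *node {
	return &node{tag: tag, class: class, parent: parent}
}

// isArticleLink has the same shape as the matcher in the example:
// an "a" node whose grandparent carries the class "athing".
// Both parent pointers are checked before being dereferenced.
func isArticleLink(n *node) bool {
	if n.tag == "a" && n.parent != nil && n.parent.parent != nil {
		return n.parent.parent.class == "athing"
	}
	return false
}

func main() {
	row := newChild(nil, "tr", "athing")
	cell := newChild(row, "td", "title")
	link := newChild(cell, "a", "")    // grandparent is the "athing" row
	orphan := newChild(nil, "a", "")   // no parent at all
	shallow := newChild(row, "a", "")  // parent, but no grandparent

	fmt.Println(isArticleLink(link), isArticleLink(orphan), isArticleLink(shallow))
}
```

Without the two nil checks, the orphan and shallow cases would dereference a nil pointer and panic instead of returning false.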

More Repositories

1. rodeo (JavaScript, 3,925 stars): A data science IDE for Python
2. ggpy (Python, 3,695 stars): ggplot port for python
3. pandasql (Python, 1,321 stars): sqldf for pandas
4. db.py (Python, 1,221 stars): db.py is an easier way to interact with your databases
5. DataGotham2013 (Python, 211 stars)
6. python-naive-bayes (Python, 85 stars): Naive Bayes in Python
7. wsutil (Go, 64 stars): Go WebSocket reverse proxy
8. ws (Python, 43 stars): A WebSocket cli tool.
9. benchdb (Go, 30 stars): Store go test bench data in a database
10. yhat-client (Python, 29 stars): Python client for ScienceOps
11. db.r (R, 28 stars): db.r provides a way to interactively explore databases
12. currency-portfolio-optimization (Python, 25 stars): Currency Portfolio Optimization - IPython notebook and data
13. beer-bandit (CSS, 25 stars): Flask app to run a bandit algorithm testing different beer recommenders
14. yhat-examples (R, 23 stars): Some examples of Yhat
15. vim-docstring (Vim Script, 18 stars): Fold your Python docstrings
16. yhatr (R, 17 stars): wrapper for the yhat API
17. electron-release-manager (CSS, 16 stars): For managing updates and releases to Rodeo
18. semi-autonomous-drone (CSS, 15 stars)
19. go-docker (Go, 10 stars): Golang Docker remote API client
20. housing-predictor (JavaScript, 10 stars)
21. demo-image-recognizer (Jupyter Notebook, 9 stars)
22. bash-nb (Shell, 8 stars): Naive Bayes in bash
23. demo-lending-club (CSS, 8 stars)
24. bandit (Python, 7 stars)
25. urlquery (Go, 7 stars): A Go package (two functions) for marshalling and unmarshalling url query values
26. demo-housing-predictor (HTML, 6 stars)
27. logjam (JavaScript, 6 stars): Jam all of your logs into an event-stream
28. terragon (Python, 6 stars): A better pickle (fork of the python cloud package)
29. Beer-Rec-Flask (CSS, 6 stars)
30. yhat-ruby (Ruby, 5 stars): A ruby wrapper for the Yhat API.
31. demo-churn-pred (CSS, 5 stars)
32. flask-beer (CSS, 5 stars): Python flask app pulling data from a beer recommender.
33. gooper (Go, 4 stars): Simple dependency management for Go Github packages.
34. demo-lead-scoring (Python, 4 stars): Lead scoring with ScienceOps Batch.
35. pandaslite (Python, 4 stars)
36. osx-excel (Visual Basic, 4 stars)
37. hova (Shell, 4 stars): Docker based release script for go binaries & node apps.
38. yhat-node (JavaScript, 4 stars): A node js client for the yhat API
39. Yhat.js (JavaScript, 3 stars): Javascript Library for connecting to yhat API
40. jsonviews (Go, 3 stars): Streaming JSON filters
41. demo-beer-rec (CSS, 3 stars)
42. chatbot (CSS, 3 stars): Yhat ChatBot using NLTK
43. phash (Go, 3 stars): Simple password hashing in Go
44. demo-handwriting (JavaScript, 2 stars)
45. busby (Python, 2 stars): Parse a csv file and send through a websocket.
46. ops-photo-tagger-web (HTML, 2 stars)
47. yhat-java (Java, 1 star): ScienceOps java client
48. bandit-demos (HTML, 1 star)
49. wesanderson.py (Python, 1 star)
50. filestr (Go, 1 star): Convert files to string or byte slice variables.
51. dummipy (Python, 1 star): Categorical variables for pandas DataFrames and scikit-learn
52. demo-twitter-tagger (1 star)
53. giveupthefunc (Go, 1 star): A Golang function profiler
54. digit-recognizer (JavaScript, 1 star)
55. longpoll (Go, 1 star): A Go package for long polling
56. filedb (Python, 1 star): Create a file that pretends to be a database query
57. donkey_kong (Python, 1 star): Send mandrill templates from the command line.
58. use-cases (1 star): More in depth discussions of yhat use cases
59. resize (Go, 1 star): An app for resizing EC2 instances
60. sshutil (Go, 1 star): Utility functions for Go's ssh library
61. sb-magic (Python, 1 star): An IPython notebook magic for running sciencebox commands
62. certdump (Go, 1 star): Dump information about SSL certificate files.
63. banditr (R, 1 star)
64. ignore (Go, 1 star)
65. s3sync (Go, 1 star)
66. structr (R, 1 star): python-like lists and dicts in R