  • Stars: 2,036
  • Rank: 22,556 (Top 0.5%)
  • Language: Go
  • License: MIT License
  • Created over 7 years ago
  • Updated over 1 year ago

Repository Details

Web Scraper in Go, similar to BeautifulSoup

soup

soup is a small web scraper package for Go whose interface closely resembles that of BeautifulSoup.

Exported variables and functions implemented so far:

var Headers map[string]string // Set headers as a map of key-value pairs, an alternative to calling Header() individually
var Cookies map[string]string // Set cookies as a map of key-value pairs, an alternative to calling Cookie() individually
func Get(string) (string, error) {} // Takes the URL as an argument, returns the HTML string
func GetWithClient(string, *http.Client) (string, error) {} // Takes the URL and a custom HTTP client as arguments, returns the HTML string
func Post(string, string, interface{}) (string, error) {} // Takes the URL, bodyType, and payload as arguments, returns the HTML string
func PostForm(string, url.Values) (string, error) {} // Takes the URL and body; bodyType is set to "application/x-www-form-urlencoded"
func Header(string, string) {} // Takes a key-value pair to set as a header for the HTTP request made in Get()
func Cookie(string, string) {} // Takes a key-value pair to set as a cookie sent with the HTTP request in Get()
func HTMLParse(string) Root {} // Takes the HTML string as an argument, returns a pointer to the constructed DOM
func Find(...string) Root {} // Takes an element tag and an optional attribute key-value pair, returns a pointer to the first occurrence
func FindAll(...string) []Root {} // Same as Find(), but returns pointers to all occurrences
func FindStrict(...string) Root {} // Same arguments as Find(), but returns a pointer to the first occurrence whose values match exactly
func FindAllStrict(...string) []Root {} // Same as FindStrict(), but returns pointers to all occurrences
func FindNextSibling() Root {} // Returns a pointer to the next sibling of the element in the DOM
func FindNextElementSibling() Root {} // Returns a pointer to the next element sibling of the element in the DOM
func FindPrevSibling() Root {} // Returns a pointer to the previous sibling of the element in the DOM
func FindPrevElementSibling() Root {} // Returns a pointer to the previous element sibling of the element in the DOM
func Children() []Root {} // Returns all direct children of this DOM element
func Attrs() map[string]string {} // Returns a map from each attribute of the element to its value
func Text() string {} // Returns the full text inside a non-nested tag; in a nested tag, only the first half is returned
func FullText() string {} // Returns the full text inside a nested or non-nested tag
func SetDebug(bool) {} // Sets the debug mode to true or false; false by default
func HTML() string {} // Returns the HTML code for the specific element
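
As a rough illustration of what the Headers and Cookies maps amount to, the sketch below applies two such maps to a GET request using only net/http from the standard library. It does not use soup and is not the package's implementation; the helper newGetRequest, the URL, and the header and cookie values are all hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
)

// newGetRequest is a hypothetical helper (not part of soup). It builds a GET
// request and applies every entry of the two maps to it.
func newGetRequest(url string, headers, cookies map[string]string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	// Each map entry becomes one request header.
	for key, value := range headers {
		req.Header.Set(key, value)
	}
	// Each map entry becomes one cookie attached to the request.
	for name, value := range cookies {
		req.AddCookie(&http.Cookie{Name: name, Value: value})
	}
	return req, nil
}

func main() {
	req, err := newGetRequest(
		"https://example.com", // placeholder URL; no request is actually sent
		map[string]string{"User-Agent": "soup-example/0.1"},
		map[string]string{"session": "abc123"},
	)
	if err != nil {
		fmt.Println("building request:", err)
		return
	}
	fmt.Println(req.Header.Get("User-Agent")) // soup-example/0.1
	fmt.Println(req.Header.Get("Cookie"))     // session=abc123
}
```

Setting a whole map at once is convenient when the same headers or cookies accompany every request, while the single-pair functions suit one-off additions.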

Root is a struct containing three fields:

  • Pointer, a pointer to the current HTML node
  • NodeValue, the current HTML node's value, i.e. the tag name for an ElementNode, or the text in the case of a TextNode
  • Error, which holds an error struct if one occurs and is nil otherwise. A detailed text explanation of the error can be accessed using the Error() function. The struct's Type field, of type ErrorType, denotes the kind of error that took place and is one of the following:
    • ErrUnableToParse
    • ErrElementNotFound
    • ErrNoNextSibling
    • ErrNoPreviousSibling
    • ErrNoNextElementSibling
    • ErrNoPreviousElementSibling
    • ErrCreatingGetRequest
    • ErrInGetRequest
    • ErrReadingResponse

Installation

Install the package using the command

go get github.com/anaskhan96/soup

Example

The example below scrapes the "Comics I Enjoy" section (link text and URLs) from xkcd.

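package main

import (
	"fmt"
	"os"

	"github.com/anaskhan96/soup"
)

func main() {
	// Fetch the xkcd front page as an HTML string.
	resp, err := soup.Get("https://xkcd.com")
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetching page:", err)
		os.Exit(1)
	}
	// Parse the HTML and collect every link inside the "comicLinks" div.
	doc := soup.HTMLParse(resp)
	links := doc.Find("div", "id", "comicLinks").FindAll("a")
	// Print each link's text alongside its href attribute.
	for _, link := range links {
		fmt.Println(link.Text(), "| Link :", link.Attrs()["href"])
	}
}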

Contributions

This package was developed in my free time, and contributions from everybody in the community are welcome to make it a better web scraper. If you think a particular feature or function should be included in the package, feel free to open a new issue or pull request.

More Repositories

  1. litfs (Go, 125 stars): A FUSE file system in Go extended with persistent file storage
  2. go-password-encoder (Go, 100 stars): Go package to encode (with random generated salt) and verify passwords
  3. sesh (Go, 33 stars): A (very) simple elegant shell in Go
  4. Crash-Alert (Java, 23 stars): Android app that uses accelerometer in mobiles to detect collisions while driving and sends location with nearby hospital's number to emergency contacts.
  5. base58check (Go, 11 stars): Go implementation of base58check to encode Bitcoin addresses
  6. Image-Compression-Huffman (C, 9 stars): Converts images to text, then compresses the text file using Huffman compression
  7. ngo-project (JavaScript, 8 stars): Web portal for the NGO Sri Vidyaniketan School built on Node/Express/Mongoose
  8. newtonmath (Rust, 5 stars): Rust wrapper for Newton API
  9. r2ic (Python, 3 stars): A front end compiler to compile basic constructs in Rust to an intermediate code (quadruples) with optimizations
  10. github-stats-bot (Go, 3 stars): Reddit bot giving short description of github repos
  11. go-vapidkeys (Go, 2 stars): Go Package to generate VAPID public and private keys
  12. ODAssignments (Java, 2 stars): Assignments given by OneDirect.
  13. HealthLedger (HTML, 2 stars): A virtual ledger built with NodeJS + Mongo to hold all patient records.
  14. consolia-api (JavaScript, 2 stars): npm module to fetch consolia comics
  15. check-base-encoding (JavaScript, 2 stars): npm module to check base encoding of a particular string
  16. ezlisp (Python, 1 star): Basic lisp interpreter (Scheme dialect)
  17. Go-AlGo (Go, 1 star): Basic algorithms implemented in Golang
  18. gitSlack (Python, 1 star): New received user events on GitHub (user activity feed) posted on Slack
  19. cloud-complete (HTML, 1 star): Unified testing suite for testing applications for Cloud Compliance
  20. ElectroLight-Explorer (Java, 1 star): File explorer
  21. config (Shell, 1 star): Personal config on Mac and Linux systems
  22. anaskhan96.github.io (HTML, 1 star)
  23. SelfieLessActs (JavaScript, 1 star): Cross Platform Mobile App for SelfieLessActs
  24. DA-Project (Python, 1 star): Data Analytics project repository, PES University '17
  25. XKCD-Comic-Extension (JavaScript, 1 star): A Chrome extension that lets you view the latest as well as any random comic from xkcd.