  • Stars
    256
  • Rank 159,219 (Top 4 %)
  • Language
    Go
  • License
    MIT License
  • Created almost 8 years ago
  • Updated about 3 years ago

Repository Details

A declarative, struct-tag-based HTML unmarshaling and scraping package for Go, built on top of the goquery library.

goq

Example

package main

import (
	"log"
	"net/http"

	"astuart.co/goq"
)

// Structured representation of GitHub's file name table
type example struct {
	Title string `goquery:"h1"`
	Files []string `goquery:"table.files tbody tr.js-navigation-item td.content,text"`
}

func main() {
	res, err := http.Get("https://github.com/andrewstuart/goq")
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()

	var ex example
	
	err = goq.NewDecoder(res.Body).Decode(&ex)
	if err != nil {
		log.Fatal(err)
	}

	log.Println(ex.Title, ex.Files)
}

Details

goq

-- import "astuart.co/goq"

Package goq was built to allow users to declaratively unmarshal HTML into Go structs using struct tags composed of CSS selectors.

I've made a best effort to behave very similarly to JSON and XML decoding, and to expose as much information as possible in the event of an error to help you debug your unmarshaling issues.

When creating struct types to be unmarshaled into, the following general rules apply:

  • Any type that implements the Unmarshaler interface will be passed a slice of *html.Node so that manual unmarshaling may be done. This takes the highest precedence.

  • Any struct fields may be annotated with goquery metadata, which takes the form of an element selector followed by arbitrary comma-separated "value selectors."

  • A value selector may be one of html, text, or [someAttrName]. html and text will result in the methods of the same name being called on the *goquery.Selection to obtain the value. [someAttrName] will result in *goquery.Selection.Attr("someAttrName") being called for the value.

  • A primitive value type will default to the text value of the resulting nodes if no value selector is given.

  • At least one value selector is required for maps, to determine the map key. The key type must follow both the rules applicable to Go map indexing and these unmarshaling rules. The value of each key will be unmarshaled in the same way the element value is unmarshaled.

  • For maps, keys will be retrieved from the same level of the DOM. The key selector may be arbitrarily nested, but the first level of children with any number of matching elements will be used.

  • For maps, any values must be nested below the level of the key selector. Parents or siblings of the element matched by the key selector will not be considered.

  • Once used, a "value selector" will be shifted off of the comma-separated list. This allows you to nest arbitrary levels of value selectors. For example, the type []map[string][]string would require one selector for the map key, and take an optional second selector for the values of the string slice.

  • Any struct type encountered in nested types (e.g. map[string]SomeStruct) will override any remaining "value selectors" that had not been used. For example, given:

    type S struct { F string `goquery:",[bang]"` }

    struct { T map[string]S `goquery:"#someId,[foo],[bar],[baz]"` }

[foo] will be used to determine the string map key, but [bar] and [baz] will be ignored, with the [bang] tag present on the S struct type taking precedence.
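
The "shifting" of value selectors described above can be pictured with a small standalone sketch. This is not goq's actual tag parser: parsedTag, splitTag, and shift are hypothetical names introduced here for illustration, and the sketch naively assumes the element selector itself contains no commas.

```go
package main

import (
	"fmt"
	"strings"
)

// parsedTag is a hypothetical helper type for this sketch; it is not part of goq.
type parsedTag struct {
	Element string   // CSS selector used to find the elements
	Values  []string // value selectors that have not been used yet, in order
}

// splitTag naively splits a goquery struct tag on commas. It assumes the
// element selector contains no commas of its own.
func splitTag(tag string) parsedTag {
	parts := strings.Split(tag, ",")
	return parsedTag{Element: parts[0], Values: parts[1:]}
}

// shift returns the next value selector together with the tag that would be
// handed to the next nesting level. If no value selectors remain, it returns
// an empty string and the tag unchanged.
func (p parsedTag) shift() (string, parsedTag) {
	if len(p.Values) == 0 {
		return "", p
	}
	return p.Values[0], parsedTag{Element: p.Element, Values: p.Values[1:]}
}

func main() {
	tag := splitTag("#someId,[foo],[bar],[baz]")
	key, rest := tag.shift()

	fmt.Println(tag.Element) // #someId
	fmt.Println(key)         // [foo]
	fmt.Println(rest.Values) // [[bar] [baz]]
}
```

In the map[string]S example above, [foo] would be consumed for the map key, and the leftover [bar] and [baz] would then be discarded in favor of the tags declared on S.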

Usage

func NodeSelector

func NodeSelector(nodes []*html.Node) *goquery.Selection

NodeSelector is a quick utility function to get a goquery.Selection from a slice of *html.Node. It is useful when performing manual unmarshaling, since the decision was made to use []*html.Node for maximum flexibility.

func Unmarshal

func Unmarshal(bs []byte, v interface{}) error

Unmarshal takes a byte slice and a destination pointer to any interface{}, and unmarshals the document into the destination based on the rules above. Any error returned here will likely be of type CannotUnmarshalError, though an initial goquery error will pass through directly.

func UnmarshalSelection

func UnmarshalSelection(s *goquery.Selection, iface interface{}) error

UnmarshalSelection will unmarshal a goquery.Selection into an interface appropriately annotated with goquery tags.

type CannotUnmarshalError

type CannotUnmarshalError struct {
	Err      error
	Val      string
	FldOrIdx interface{}
}

CannotUnmarshalError represents an error returned by the goquery Unmarshaler and helps consumers in programmatically diagnosing the cause of their error.

func (*CannotUnmarshalError) Error

func (e *CannotUnmarshalError) Error() string

type Decoder

type Decoder struct {
}

Decoder implements the same API you will see in encoding/xml and encoding/json, except that it does not currently support proper streaming decoding, as this is not supported by goquery upstream.

func NewDecoder

func NewDecoder(r io.Reader) *Decoder

NewDecoder returns a new decoder given an io.Reader.

func (*Decoder) Decode

func (d *Decoder) Decode(dest interface{}) error

Decode will unmarshal the contents of the decoder when given an instance of an annotated type as its argument. It will return any errors encountered during either parsing the document or unmarshaling into the given object.

type Unmarshaler

type Unmarshaler interface {
	UnmarshalHTML([]*html.Node) error
}

Unmarshaler allows for custom implementations of unmarshaling logic.

TODO

  • Callable goquery methods with args, via reflection
