  • Stars: 214
  • Rank: 178,032 (Top 4%)
  • Language: Go
  • License: MIT License
  • Created almost 7 years ago
  • Updated 11 months ago

Repository Details

Pluck text in a fast and intuitive way 🐓

pluck

pluck makes text extraction intuitive and fast. You can specify an extraction in nearly the same way you'd tell a person trying to extract the text by hand: "OK Bob, every time you find X and then Y, copy down everything you see until you encounter Z."

In pluck, X and Y are called activators and Z is called the deactivator. The file/URL being plucked is parsed (or streamed) byte-by-byte into a finite state machine. Once all activators are found, the following bytes are saved to a buffer, which is added to a list of results once the deactivator is found. Multiple queries are extracted simultaneously, and there is no requirement on the file format (e.g. XML/HTML), as long as it's text.

Why?

pluck was made as a simple alternative to xpath and regexp. Through simple declarations, pluck allows complex procedures like extracting text in nested HTML tags, or extracting the content of an attribute of an HTML tag. pluck may not work in all scenarios, so do not consider it a replacement for xpath or regexp.

Doesn't regex already do this?

Yes, basically. Here is a (simple) example:

(?:(?:X.*Y)|(?:Y.*X))(.*)(?:Z)

Basically, this should try to match everything before a Z and after we've seen both X and Y, in any order. This is not a complete example, but it shows the similarity.

The benefit of pluck is simplicity. You don't have to worry about escaping the right characters, nor do you need to know any regex syntax (which is not simple). Also, pluck is hard-coded for matching this specific kind of pattern simultaneously, so there is no cost for generating a new deterministic finite automaton from multiple regexes.
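
For comparison, here is roughly what the regex route looks like in Go for the ordered case (X, then Y, then capture until Z). It is a sketch using only the standard regexp package. The helper name pluckRegexp is invented for this example, and it is not how pluck works internally. Note the escaping step and the non-greedy quantifiers that a user would otherwise need to get right by hand.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// pluckRegexp (hypothetical name) builds a pattern that finds the
// activators in order and captures the shortest text up to the deactivator.
func pluckRegexp(text string, activators []string, deactivator string) []string {
	quoted := make([]string, len(activators))
	for i, a := range activators {
		quoted[i] = regexp.QuoteMeta(a) // escape regex metacharacters
	}
	// (?s) lets "." match newlines; ".*?" is a non-greedy match.
	pattern := "(?s)" + strings.Join(quoted, ".*?") + "(.*?)" + regexp.QuoteMeta(deactivator)
	re := regexp.MustCompile(pattern)

	var out []string
	for _, m := range re.FindAllStringSubmatch(text, -1) {
		out = append(out, m[1]) // m[1] is the captured group
	}
	return out
}

func main() {
	html := `<a href="link1">1</a><a href="link2">2</a>`
	fmt.Println(pluckRegexp(html, []string{"<", "href", `"`}, `"`)) // [link1 link2]
}
```

Unlike the byte-by-byte approach, this version needs the whole text in memory as a string before matching starts.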

Doesn't cascadia already do this?

Yes, there is already a command-line tool to extract structured information from XML/HTML. There are many benefits to cascadia, namely you can do a lot more complex things with structured data. If you don't have highly structured data, pluck is advantageous (it extracts from any file). Also, with pluck you don't need to learn CSS selection.

Getting Started

Install

If you have Go 1.7+:

go get github.com/schollz/pluck

or just download from the latest releases.

Basic usage

Let's say you want to find URLs in an HTML file.

$ wget nytimes.com -O nytimes.html
$ pluck -a '<' -a 'href' -a '"' -d '"' -l 10 -f nytimes.html
{
    "0": [
        "https://static01.nyt.com/favicon.ico",
        "https://static01.nyt.com/images/icons/ios-ipad-144x144.png",
        "https://static01.nyt.com/images/icons/ios-iphone-114x144.png",
        "https://static01.nyt.com/images/icons/ios-default-homescreen-57x57.png",
        "https://www.nytimes.com",
        "http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml",
        "http://mobile.nytimes.com",
        "http://mobile.nytimes.com",
        "https://typeface.nyt.com/css/zam5nzz.css",
        "https://a1.nyt.com/assets/homepage/20170731-135831/css/homepage/styles.css"
    ]
}

The -a flag specifies an activator and can be given multiple times. Once all activators are found, in order, the bytes are captured. The -d flag specifies the deactivator. Once the deactivator is found, capturing stops, the state resets, and the search begins again. The optional -l flag specifies a limit; after reaching the limit (10 in this example), pluck stops searching.

Advanced usage

Parse URLs or Files

Files can be parsed with -f FILE and URLs can be parsed by instead using -u URL.

$ pluck -a '<' -a 'href' -a '"' -d '"' -l 10 -u https://nytimes.com

Use Config file

You can also specify multiple things to pluck, simultaneously, by listing the activators and the deactivator in a TOML file. For example, let's say we want to parse ingredients and the title of a recipe. Make a file config.toml:

[[pluck]]
name = "title"
activators = ["<title>"]
deactivator = "</title>"

[[pluck]]
name = "ingredients"
activators = ["<label","Ingredient",">"]
deactivator = "<"
limit = -1

The title follows normal HTML and the ingredients were determined by quickly inspecting the HTML source code of the target site. Then pluck it with:

$ pluck -c config.toml -u https://goo.gl/DHmqmv
{
    "ingredients": [
        "1 pound medium (26/30) peeled and deveined shrimp, tails removed",
        "2 teaspoons chili powder",
        "Kosher salt",
        "2 tablespoons canola oil",
        "4 scallions, thinly sliced",
        "One 15-ounce can black beans, drained and rinsed well",
        "1/3 cup prepared chipotle mayonnaise ",
        "2 limes, 1 zested and juiced and 1 cut into wedges ",
        "One 14-ounce bag store-bought coleslaw mix (about 6 cups)",
        "1 bunch fresh cilantro, leaves and soft stems roughly chopped",
        "Sour cream or Mexican crema, for serving",
        "8 corn tortillas, warmed "
    ],
    "title": "15-Minute Shrimp Tacos with Spicy Chipotle Slaw Recipe | Food Network Kitchen | Food Network"
}

Extract structured data

Let's say you want to tell Bob "OK Bob, first look for W. Then, every time you find X and then Y, copy down everything you see until you encounter Z. Also, stop if you see U, even if you are not at the end." In this case, W, X, and Y are activators but W is a "Permanent" activator. Once W is found, Bob no longer needs to look for it. U is a "Finisher" which tells Bob to stop looking for anything and return whatever result was found.

You can extract information from blocks in pluck by using these two keywords: "permanent" and "finisher". The permanent number determines how many of the activators (from the left to right) will stay activated forever, once activated. The finisher keyword is a new string that will retire the current plucker when found and not capture anything in the buffer.

For example, suppose you want to only extract link3 and link4 from the following:

<h1>Section 1</h1>
<a href="link1">1</a>
<a href="link2">2</a>
<h1>Section 2</h1>
<a href="link3">3</a>
<a href="link4">4</a>
<h1>Section 3</h1>
<a href="link5">5</a>
<a href="link6">6</a>

You can add "Section 2" as the first activator and set permanent to 1, so that only this first activator stays activated after the deactivator is found. To finish the plucker when it reaches "Section 3", set the finisher keyword to "Section 3". Then config.toml is

[[pluck]]
activators = ["Section 2","a","href",'"']
permanent = 1     # designates that the first 1 activators will persist
deactivator = '"'
finisher = "Section 3"

This will result in the following:

{
    "0": [
        "link3",
        "link4"
    ]
}
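
The effect of permanent and finisher can be sketched in Go as two small changes to a basic activator/deactivator loop: after a capture, the search restarts at activator index permanent instead of 0, and every byte is also checked against the finisher. The function pluckBlock is a hypothetical illustration of the behaviour described above, not pluck's real code.

```go
package main

import (
	"bytes"
	"fmt"
)

// pluckBlock (hypothetical name) extracts text between the last activator
// and the deactivator. The first `permanent` activators stay activated
// after each capture, and the finisher retires the plucker when seen.
func pluckBlock(text string, activators []string, permanent int, deactivator, finisher string) []string {
	var results []string
	var buf []byte  // bytes seen since the last state change
	var tail []byte // last len(finisher) bytes, used to spot the finisher
	next := 0       // index of the activator currently being searched for

	for i := 0; i < len(text); i++ {
		b := text[i]

		if finisher != "" {
			tail = append(tail, b)
			if len(tail) > len(finisher) {
				tail = tail[1:]
			}
			if string(tail) == finisher {
				break // retire the plucker; a partial capture is dropped
			}
		}

		buf = append(buf, b)
		if next < len(activators) {
			if bytes.HasSuffix(buf, []byte(activators[next])) {
				next++
				buf = buf[:0]
			}
			continue
		}
		if bytes.HasSuffix(buf, []byte(deactivator)) {
			results = append(results, string(buf[:len(buf)-len(deactivator)]))
			next = permanent // permanent activators are not searched for again
			buf = buf[:0]
		}
	}
	return results
}

func main() {
	html := `<h1>Section 1</h1>
<a href="link1">1</a>
<a href="link2">2</a>
<h1>Section 2</h1>
<a href="link3">3</a>
<a href="link4">4</a>
<h1>Section 3</h1>
<a href="link5">5</a>
<a href="link6">6</a>`
	links := pluckBlock(html, []string{"Section 2", "a", "href", `"`}, 1, `"`, "Section 3")
	fmt.Println(links) // [link3 link4]
}
```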

More examples

See EXAMPLES.md for more examples.

Use as a Go package

Import pluck as "github.com/schollz/pluck/pluck" and you can use it in your own project. See the tests for more info.

Development

$ go get -u github.com/schollz/pluck/...
$ cd $GOPATH/src/github.com/schollz/pluck/pluck
$ go test -cover

Current benchmark

The state of the art for xpath is lxml, based on libxml2. Here is a comparison for plucking the same data from the same file, run on an Intel i5-4310U CPU @ 2.00GHz × 4. (To run the Python benchmark: cd pluck/test && python3 main.py)

Language            Rate
lxml (Python3.5)    300 / s
pluck               1270 / s

A real-world example I use pluck for is processing 1,200 HTML files in parallel, compared to running lxml in parallel:

Language            Rate
lxml (Python3.6)    25 / s
pluck               430 / s

I'd like to benchmark a Perl regex, although I don't know how to write this kind of regex! Send a PR if you do :)

To Do

  • Allow OR statements (e.g. '|").
  • Quotes match to quotes (single or double)?
  • Allow piping from standard in?
  • API to handle strings, e.g. PluckString(s string)
  • Add parallelism

License

MIT

Acknowledgements

Graphics by: www.vecteezy.com
