
Command line OAI harvester and client with built-in cache.

metha

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a low-barrier mechanism for repository interoperability. Data Providers are repositories that expose structured metadata via OAI-PMH. Service Providers then make OAI-PMH service requests to harvest that metadata. -- https://www.openarchives.org/pmh/

The metha command line tools can gather information on OAI-PMH endpoints and harvest data incrementally. The goal of metha is to make it simple to get access to data; its focus is not to manage it.

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

The metha tool has been developed for Project finc at Leipzig University Library.

Why yet another OAI harvester?

  • I wanted to crawl arXiv but found that existing tools would time out.
  • Some harvesters would start to download all records anew if I interrupted a running harvest.
  • There are many OAI endpoints out there. It is a widely used protocol and somewhat worth knowing.
  • I wanted something simple for the command line that is also fast and robust. As implemented now, metha is relatively robust and more efficient than requesting all records one by one (there is one annoyance, which will hopefully be fixed soon).

How it works

The functionality is spread across a few different executables:

  • metha-sync for harvesting
  • metha-cat for viewing
  • metha-id for gathering data about endpoints
  • metha-ls for inspecting the local cache
  • metha-files for listing the associated files for a harvest

To harvest an endpoint in the default oai_dc format:

$ metha-sync http://export.arxiv.org/oai2
...

All downloaded files are written to a directory below a base directory. The base directory is ~/.cache/metha by default and can be adjusted with the METHA_DIR environment variable.

When the -dir flag is set, only the directory corresponding to a harvest is printed.

$ metha-sync -dir http://export.arxiv.org/oai2
/home/miku/.metha/I29haV9kYyNodHRwOi8vZXhwb3J0LmFyeGl2Lm9yZy9vYWky
$ METHA_DIR=/tmp/harvest metha-sync -dir http://export.arxiv.org/oai2
/tmp/harvest/I29haV9kYyNodHRwOi8vZXhwb3J0LmFyeGl2Lm9yZy9vYWky

Harvesting can be interrupted at any time, and the HTTP client automatically retries failed requests a few times before giving up.

Currently, there is a limitation: data can only be harvested up to the end of the previous day. Example: if the current date is Thu Apr 21 14:28:10 CEST 2016, the harvester requests all data between the repository's earliest date and 2016-04-20 23:59:59.

To stream the harvested XML data to stdout run:

$ metha-cat http://export.arxiv.org/oai2

You can emit records based on datestamp as well:

$ metha-cat -from 2016-01-01 http://export.arxiv.org/oai2

This will only stream records with a datestamp on or after 2016-01-01.
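The comparison behind such a filter might look like the following sketch. It assumes the two datestamp granularities defined by OAI-PMH (YYYY-MM-DD and YYYY-MM-DDThh:mm:ssZ); the helper names parseDatestamp and onOrAfter are hypothetical and do not describe how metha-cat is implemented.

```go
package main

import (
	"fmt"
	"time"
)

// parseDatestamp accepts day granularity (2016-01-01) or
// seconds granularity (2016-01-01T12:00:00Z).
func parseDatestamp(s string) (time.Time, error) {
	if len(s) == len("2006-01-02") {
		return time.Parse("2006-01-02", s)
	}
	return time.Parse(time.RFC3339, s)
}

// onOrAfter reports whether datestamp is equal to or later than from.
func onOrAfter(datestamp, from string) (bool, error) {
	d, err := parseDatestamp(datestamp)
	if err != nil {
		return false, err
	}
	f, err := parseDatestamp(from)
	if err != nil {
		return false, err
	}
	return !d.Before(f), nil
}

func main() {
	for _, ds := range []string{"2015-12-31T23:59:59Z", "2016-01-01", "2016-03-04T10:00:00Z"} {
		ok, err := onOrAfter(ds, "2016-01-01")
		fmt.Println(ds, ok, err)
	}
}
```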

To just stream all data really fast, use find and a gzip decompressor such as unpigz (or zcat) over the harvesting directory.

$ find $(metha-sync -dir http://export.arxiv.org/oai2) -name "*gz" | xargs unpigz -c

To display basic repository information:

$ metha-id http://export.arxiv.org/oai2

To list all harvested endpoints:

$ metha-ls

Further examples can be found in the metha man page:

$ man metha

Installation

Use a deb, rpm release, PKGBUILD or the go tool:

$ go install -v github.com/miku/metha/cmd/...@latest

Limitations

Currently, the set, the format and the endpoint URL are concatenated (separated by #) and base64 encoded to form the name of the target directory, e.g.:

$ echo "U291bmRzI29haV9kYyNodHRwOi8vY29wYWMuamlzYy5hYy51ay9vYWktcG1o" | base64 -d
Sounds#oai_dc#http://copac.jisc.ac.uk/oai-pmh

If you have very long set names or a very long URL and the name of the target directory exceeds the file system limit (e.g. 255 bytes on ext4), the harvest won't work.
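The naming scheme and the length check can be sketched in Go as follows. The helper names dirName and fitsFileSystem are hypothetical, and the choice of the unpadded URL-safe base64 variant is an assumption; for the two directory names shown in this document, the standard and URL-safe alphabets yield the same result.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// maxNameLen is the ext4 limit for a single file name, in bytes.
const maxNameLen = 255

// dirName joins set, format and endpoint with "#" and base64 encodes
// the result. The encoding variant is an assumption of this sketch.
func dirName(set, format, endpoint string) string {
	return base64.RawURLEncoding.EncodeToString([]byte(set + "#" + format + "#" + endpoint))
}

// fitsFileSystem reports whether name is short enough to be used
// as a directory name on a file system with a 255 byte limit.
func fitsFileSystem(name string) bool {
	return len(name) <= maxNameLen
}

func main() {
	name := dirName("", "oai_dc", "http://export.arxiv.org/oai2")
	fmt.Println(name, fitsFileSystem(name))
	// prints: I29haV9kYyNodHRwOi8vZXhwb3J0LmFyeGl2Lm9yZy9vYWky true
}
```

Since base64 turns every 3 input bytes into 4 output characters, a combined set, format and URL of more than roughly 190 bytes would already exceed the 255 byte limit.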

Harvesting Roulette

$ URL=$(shuf -n 1 <(curl -Lsf https://git.io/vKXFv)); metha-sync "$URL" && metha-cat "$URL"

In 0.1.27 a metha-fortune command was added, which fetches a random article description and displays it.

$ metha-fortune
Active Networking is concerned with the rapid definition and deployment of
innovative, but reliable and robust, networking services. Towards this end we
have developed a composite protocol and networking services architecture that
encourages re-use of protocol functions, is well defined, and facilitates
automatic checking of interfaces and protocol component properties. The
architecture has been used to implement common Internet protocols and services.
We will report on this work at the workshop.

    -- http://drops.dagstuhl.de/opus/phpoai/oai2.php

$ metha-fortune
In this paper we show that the Lempert property (i.e., the equality between the
Lempert function and the Carathéodory distance) holds in the tetrablock, a
bounded hyperconvex domain which is not biholomorphic to a convex domain. The
question whether such an equality holds was posed by Abouhajar et al. in J.
Geom. Anal. 17(4), 717–750 (2007).

    -- http://ruj.uj.edu.pl/oai/request

$ metha-fortune
I argue that Gödel's incompleteness theorem is much easier to understand when
thought of in terms of computers, and describe the writing of a computer
program which generates the undecidable Gödel sentence.

    -- http://quantropy.org/cgi/oai2

$ metha-fortune
Nigeria, a country in West Africa, sits on the Atlantic coast with a land area
of approximately 90 million hectares and a population of more than 140 million
people. The southern part of the country falls within the tropical rainforest
which has now been largely depleted and is in dire need of reforestation. About
10 percent of the land area was constituted into forest reserves for purposes
of conservation but this has suffered perturbations over the years to the
extent that what remains of the constituted forest reserves currently is less
than 4 percent of the country land area. As at today about 382,000 ha have been
reforested with indigenous and exotic species representing about 4 percent of
the remaining forest estate. Regrettably, funding of the Forestry sector in
Nigeria has been critically low, rendering reforestation programme near
impossible, especially in the last two decades. To revive the forestry sector
government at all levels must re-strategize and involve the local communities
as co-managers of the forest estates in order to create mutual dependence and
interaction in resource conservation.

    -- http://journal.reforestationchallenges.org/index.php/REFOR/oai

Scrape all metadata in a best-effort way

Use an endless loop with a timeout to get out of any hanging connection (which happens). Example scrape, converted to JSON (40+ GB: 2023-06-15-metha-oai.ndjson.zst).

$ while true; do \
    timeout 180 bash -c "metha-sync -list | \
    shuf | parallel -j 96 -I {} 'metha-sync -T 10s {}'"; \
done

Errors this harvester can somewhat handle

  • responses with resumption tokens that lead to empty responses
  • gzipped responses that are not advertised as such
  • funny (illegal) control characters in XML responses
  • repositories that won't respond unless the dates are given with the exact granularity
  • repositories with endless token loops
  • repositories that do not support selective harvesting (use the -no-intervals flag)
  • limited repositories; metha will try a few times with an exponential backoff
  • repositories that throw occasional HTTP errors although most of the responses look good (use the -ignore-http-errors flag)

Authors

Misc

Show formats of random repository:

$ shuf -n 1 <(curl -Lsf https://git.io/vKXFv) | xargs -I {} metha-id {} | jq .formats

A snippet from a 2010 publication:

The Open Archives Protocol for Metadata Harvesting (OAI-PMH) (Lagoze and van de Sompel, 2002) is currently implemented by more than 1,700 digital library repositories world-wide and enables the exchange of metadata via HTTP. -- Interweaving OAI-PMH Data Sources with the Linked Data Cloud

Metha elsewhere

Asciicast

