
biblio-glutton

A high performance bibliographic information service: https://biblio-glutton.readthedocs.io

Demo: cloud.science-miner.com/glutton

Framework dedicated to bibliographic information. It includes:

  • a bibliographical reference matching service: from an input such as a raw bibliographical reference and/or a combination of key metadata, the service returns the disambiguated bibliographical object, in particular with its DOI and a set of metadata aggregated from Crossref and other sources,
  • a fast metadata look-up service: from a "strong" identifier such as a DOI, PMID, etc., the service returns a set of metadata aggregated from Crossref and other sources,
  • various mappings between DOI, PMID, PMC, ISTEX ID and ark, integrated in the bibliographical service,
  • an Open Access resolver: integration of Open Access links via the Unpaywall dataset from Impactstory,
  • gap coverage and daily updates for Crossref resources (via the Crossref REST API), so that your glutton data service always stays in sync with Crossref,
  • MeSH class mapping for PubMed articles.

The framework is designed both for speed (several thousand requests per second for look-up) and matching accuracy. It can be scaled horizontally as needed and can provide high availability.

Benchmarking against the Crossref REST API is presented below.

In the Glutton family, the following complementary tools are available for taking advantage of Open Access resources:

  • biblio-glutton-harvester: A robust, fault tolerant, multi-threaded Python utility for efficiently harvesting large Open Access collections of PDFs (Unpaywall, PubMed Central), with the possibility to upload content to Amazon S3,

  • biblio-glutton-extension: A browser extension (Firefox & Chrome) providing bibliographical services, like dynamically identifying Open Access resources on web pages and providing contextual citation services.

Current stable version of biblio-glutton is 0.2. Working version is 0.3-SNAPSHOT.

The bibliographical look-up and matching REST API

Once the databases and index are built, the bibliographical REST API can be started. For building the databases and index, see the next sections below.

Build the lookup service

You need Java JDK 1.8 installed for building and running the tool.

cd lookup
./gradlew clean build

Start the server

cd lookup/
./gradlew clean build
java -jar build/libs/lookup-service-0.2-onejar.jar server

The service will use the default project configuration located under biblio-glutton/config/glutton.yml. If you want to use a configuration file in another location, you can specify it as an additional parameter:

cd lookup/
./gradlew clean build
java -jar build/libs/lookup-service-0.2-onejar.jar server /some/where/glutton.yml

To check if it works, you can view a report of the data used by the service at host:port/service/data. For instance:

curl localhost:8080/service/data

{
    "Metadata Lookup Crossref size":"{crossref_Jsondoc=127887559}",
    "ISTEX size":"{istex_doi2ids=21325261, istex_istex2ids=21401494, istex_pii2ids=6954799}",
    "Metadata Matching Crossref size":"127970581",
    "PMID lookup size":"{pmid_doi2ids=25661624, pmid_pmc2ids=7561377, pmid_pmid2ids=33761382}",
    "DOI OA size":"{unpayWall_doiOAUrl=30635446}"
}
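The report above can also be checked programmatically. Here is a small helper (not part of biblio-glutton, just a convenience sketch) that flattens the report shown above into per-database record counts:

```python
import json

def parse_data_report(payload: str) -> dict:
    """Flatten the /service/data report into a {database: size} dict.

    Values in the report are either plain numbers or strings like
    "{crossref_Jsondoc=127887559}" listing one or more sub-databases.
    """
    sizes = {}
    for key, value in json.loads(payload).items():
        value = value.strip()
        if value.startswith("{") and value.endswith("}"):
            # Split "{name1=count1, name2=count2}" into individual entries.
            for part in value[1:-1].split(","):
                name, _, count = part.strip().partition("=")
                sizes[name] = int(count)
        else:
            sizes[key] = int(value)
    return sizes

# Example with a subset of the sample report shown above:
sample = '''{
    "Metadata Lookup Crossref size":"{crossref_Jsondoc=127887559}",
    "Metadata Matching Crossref size":"127970581",
    "DOI OA size":"{unpayWall_doiOAUrl=30635446}"
}'''
print(parse_data_report(sample))
```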

Start optional additional GROBID service

biblio-glutton takes advantage of GROBID for parsing raw bibliographical references. This permits faster and more accurate bibliographical record matching. To use the GROBID service:

  • first download and install GROBID as indicated in the documentation

  • start the service as documented here. You can change the port used by GROBID by updating the service config file under grobid/grobid-home/config/grobid.yaml

  • update if necessary the host and port information of GROBID in the biblio-glutton config file under biblio-glutton/config/glutton.yml (parameter grobidPath).

While GROBID is not required for running biblio-glutton, in particular if it is used only for bibliographical look-up, it is recommended for performing bibliographical record matching.

REST API

The service can be queried based on a strong identifier, like DOI, PMID, etc., as follows:

  • match record by DOI

    • GET host:port/service/lookup?doi=DOI
    • GET host:port/service/lookup/doi/{DOI}
  • match record by PMID

    • GET host:port/service/lookup?pmid=PMID
    • GET host:port/service/lookup/pmid/{PMID}
  • match record by PMC ID

    • GET host:port/service/lookup?pmc=PMC
    • GET host:port/service/lookup/pmc/{PMC}
  • match record by ISTEX ID

    • GET host:port/service/lookup?istexid=ISTEXID
    • GET host:port/service/lookup/istexid/{ISTEXID}
  • match record by PII ID

    • GET host:port/service/lookup?pii=PII
    • GET host:port/service/lookup/pii/{PII}

The service can be queried with various metadata like article title (atitle), first author last name (firstAuthor), journal title (jtitle), volume (volume), first page (firstPage) and publication year (year):

  • match record by article title and first author lastname

    • GET host:port/service/lookup?atitle=ARTICLE_TITLE&firstAuthor=FIRST_AUTHOR_SURNAME
  • match record by journal title or abbreviated title, volume and first page

    • GET host:port/service/lookup?jtitle=JOURNAL_TITLE&volume=VOLUME&firstPage=FIRST_PAGE
  • match record by journal title or abbreviated title, volume, first page, and first author lastname

    • GET host:port/service/lookup?jtitle=JOURNAL_TITLE&volume=VOLUME&firstPage=FIRST_PAGE&firstAuthor=FIRST_AUTHOR_SURNAME

It's possible to query the service based on a raw citation string (biblio):

  • match record by raw citation string
    • GET host:port/service/lookup?biblio=BIBLIO_STRING
    • POST host:port/service/lookup/biblio with ContentType=text/plain
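As an illustration of the POST variant, here is a minimal Python sketch (the helper name is ours, and we assume the server runs at localhost:8080 as above) that builds the text/plain request; actually sending it with urllib.request.urlopen requires a running glutton server:

```python
from urllib import request

def biblio_post_request(biblio: str, host: str = "localhost", port: int = 8080) -> request.Request:
    """Build the POST request for /service/lookup/biblio.

    The raw citation string goes in the request body with
    Content-Type text/plain, as described above. Sending is left to the
    caller, e.g. request.urlopen(req) when a glutton server is running.
    """
    return request.Request(
        url=f"http://{host}:{port}/service/lookup/biblio",
        data=biblio.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )

req = biblio_post_request(
    "Baltz, R., Domon, C., Pillay, D.T.N. and Steinmetz, A. (1992) "
    "Characterization of a pollen-specific cDNA from sunflower encoding "
    "a zinc finger protein. Plant J. 2: 713-721"
)
```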

Any combination of these metadata and the full raw citation string is possible, for instance:

- `GET host:port/service/lookup?biblio=BIBLIO_STRING&atitle=ARTICLE_TITLE&firstAuthor=FIRST_AUTHOR_SURNAME`

or:

- `GET host:port/service/lookup?jtitle=JOURNAL_TITLE&volume=VOLUME&firstPage=FIRST_PAGE&firstAuthor=FIRST_AUTHOR_SURNAME&atitle=ARTICLE_TITLE`

or:

- `GET host:port/service/lookup?biblio=BIBLIO_STRING&atitle=ARTICLE_TITLE&firstAuthor=FIRST_AUTHOR_SURNAME&year=YYYY`

It is also possible to combine a strong identifier with validation metadata. In this case, if the DOI appears to conflict with the provided metadata, no result will be returned, as a way to detect invalid DOIs via post-validation:

- `GET host:port/service/lookup?doi=DOI&atitle=ARTICLE_TITLE&firstAuthor=FIRST_AUTHOR_SURNAME`

biblio-glutton will make the best use of all the parameters sent, retrieving a record as fast as possible and applying matching thresholds to avoid false positives. It is advised to send as much metadata as possible, and when available a full raw bibliographical string, to optimize the DOI matching in terms of both speed and accuracy.

The more metadata available in the query, the better. When available, the original raw bibliographical string is also exploited to control the bibliographical record matching.
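A hypothetical client-side helper (not part of biblio-glutton) can assemble such combined queries; the parameter names are those listed above:

```python
from urllib.parse import urlencode

def lookup_url(host: str = "localhost", port: int = 8080, **params) -> str:
    """Assemble a /service/lookup query URL from any combination of parameters.

    Accepted keys include doi, pmid, pmc, istexid, pii, atitle, firstAuthor,
    jtitle, volume, firstPage, year and biblio, as documented above.
    """
    # Drop empty values and URL-encode the rest.
    query = urlencode({k: v for k, v in params.items() if v})
    return f"http://{host}:{port}/service/lookup?{query}"

print(lookup_url(jtitle="Plant J.", volume="2", firstPage="713", firstAuthor="Baltz"))
```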

For convenience, in case you are only interested in the Open Access URL for a bibliographical object, the Open Access resolver API returns only the OA PDF link (URL) via an identifier:

  • return the best Open Access URL if available

    • GET host:port/service/oa?doi=DOI returns the best Open Access PDF URL for a given DOI
    • GET host:port/service/oa?pmid=PMID returns the best Open Access PDF URL for a given PMID
    • GET host:port/service/oa?pmc=PMC returns the best Open Access PDF URL for a given PMC ID
    • GET host:port/service/oa?pii=PII returns the best Open Access PDF URL for a given PII ID
  • return the best Open Access URL and ISTEX PDF URL if available

    • GET host:port/service/oa_istex?doi=DOI returns the best Open Access PDF URL and ISTEX PDF URL for a given DOI
    • GET host:port/service/oa_istex?pmid=PMID returns the best Open Access PDF URL and ISTEX PDF URL for a given PMID
    • GET host:port/service/oa_istex?pmc=PMC returns the best Open Access PDF URL and ISTEX PDF URL for a given PMC ID
    • GET host:port/service/oa_istex?pii=PII returns the best Open Access PDF URL and ISTEX PDF URL for a given PII ID

cURL examples

To illustrate the usage of the API, we provide some cURL example queries:

Bibliographical metadata lookup by DOI:

curl http://localhost:8080/service/lookup?doi=10.1484/J.QUAESTIO.1.103624

Matching with title and first author lastname:

curl "http://localhost:8080/service/lookup?atitle=Naturalizing+Intentionality+between+Philosophy+and+Brain+Science.+A+Survey+of+Methodological+and+Metaphysical+Issues&firstAuthor=Pecere"

curl "http://localhost:8080/service/lookup?atitle=Naturalizing+Intentionality+between+Philosophy+and+Brain+Science&firstAuthor=Pecere"

Matching with raw bibliographical reference string:

curl "http://localhost:8080/service/lookup?biblio=Baltz,+R.,+Domon,+C.,+Pillay,+D.T.N.+and+Steinmetz,+A.+(1992)+Characterization+of+a+pollen-specific+cDNA+from+sunflower+encoding+a+zinc+finger+protein.+Plant+J.+2:+713-721"

Bibliographical metadata lookup by PMID (note that only the number is expected):

curl http://localhost:8080/service/lookup?pmid=1605817

Bibliographical metadata lookup by PMC ID (note that the PMC prefix in the identifier is expected):

curl http://localhost:8080/service/lookup?pmc=PMC1017419

Bibliographical metadata lookup by PII ID:

curl http://localhost:8080/service/lookup?pii=

Bibliographical metadata lookup by ISTEX ID:

curl http://localhost:8080/service/lookup?istexid=E6CF7ECC9B002E3EA3EC590E7CC8DDBF38655723

Open Access resolver by DOI:

curl "http://localhost:8080/service/oa?doi=10.1038/nature12373"

Combination of Open Access resolver and ISTEX identifier by DOI:

curl "http://localhost:8080/service/oa_istex?doi=10.1038/nature12373"

Building the bibliographical data look-up and matching databases

Architecture

Below is an overview of the biblio-glutton architecture. The biblio-glutton server manages high performance LMDB databases locally for all metadata look-up tasks (several thousand requests per second with multiple threads). For the costly metadata matching tasks, an Elasticsearch cluster is used. To scale this sort of query, simply add more nodes to the Elasticsearch cluster, keeping a single biblio-glutton server instance.

Glutton architecture

Scaling evaluation

  1. Metadata lookup

One glutton instance: 19,792,280 DOI lookups in 3156 seconds, ~6270 queries per second.

  2. Bibliographical reference matching

(to be completed with more nodes!)

Processing time for matching 17,015 raw bibliographical reference strings to DOI:

number of ES cluster nodes | comment                                                             | total runtime (s) | runtime per bib. ref. (s) | queries per second
1                          | glutton and Elasticsearch node share the same machine               | 2625              | 0.154                     | 6.5
1                          | glutton and Elasticsearch node on two separate machines             | 1990              | 0.117                     | 8.5
2                          | glutton and one of the Elasticsearch nodes sharing the same machine | 1347              | 0.079                     | 12.6

All machines have the same configuration: Intel i7 with 4 cores and 8 threads, 16GB memory, SSD, running Ubuntu 16.04.

Loading resources

To set up a functional biblio-glutton server, the resources need to be loaded in the following steps:

  1. Loading of a Crossref full metadata dump as embedded LMDB

  2. Loading the coverage gap between the Crossref dump and the current day into the embedded LMDB (updates then happen automatically every day while the service is up and running)

  3. Loading the DOI to PMID and PMC ID mapping as embedded LMDB

  4. (Optional) Loading the Open Access information from an Unpaywall dataset snapshot as embedded LMDB

  5. (Very optional) Loading the ISTEX ID mapping as embedded LMDB

  6. Creating the Elasticsearch index

Resources

For building the database and index used by the service, you will need these resources:

We recommend using a Crossref Metadata Plus snapshot in order to have a version of the Crossref metadata without a large coverage gap. With a Crossref-Plus-API-Token, the following command, for instance, downloads the full snapshot for the indicated year/month:

wget -c --header='Crossref-Plus-API-Token: Bearer __Crossref-Plus-API-Token-Here_' https://api.crossref.org/snapshots/monthly/YYYY/MM/all.json.tar.gz

Without a Metadata Plus subscription, it's possible to use the Academic Torrents Crossref dumps. For instance, with the Linux command line aria2 and a high speed internet connection (e.g. 500Mb/s), the dump can be downloaded in a few minutes. However, the coverage gap will be large, and updating these older snapshots via the normal Crossref web API will take an enormous amount of time.

  • DOI to PMID and PMC mapping: available at Europe PMC and regularly updated at ftp://ftp.ebi.ac.uk/pub/databases/pmc/DOI/PMID_PMCID_DOI.csv.gz,

  • optionally, but recommended, the Unpaywall dataset to get Open Access links aggregated with the bibliographical metadata, see here to get the latest database snapshot.

  • optionally, usually not required, for getting ISTEX identifier information, you need to build the ISTEX ID mapping, see below.

The bibliographical matching service uses a combination of high-performance embedded databases (LMDB), for fast look-up and cache, and Elasticsearch for blocking via text-based search. As Elasticsearch is much slower than embedded databases, it is used only when absolutely required.
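The split between the embedded databases and Elasticsearch can be illustrated with a deliberately simplified sketch: a plain dict stands in for LMDB, a linear similarity scan stands in for the Elasticsearch blocking query, and the record and threshold below are invented for the example:

```python
from difflib import SequenceMatcher

# Toy stand-ins, for illustration only: a dict plays the role of the LMDB
# key-value store, and a list scan plays the role of the Elasticsearch query.
LMDB = {"10.1234/example": {"doi": "10.1234/example",
                            "title": "An example article title"}}
INDEX = list(LMDB.values())

def match(doi=None, atitle=None, threshold=0.85):
    # Fast path: a strong identifier is a direct key-value look-up,
    # no search engine involved.
    if doi is not None:
        return LMDB.get(doi)
    # Slow path: retrieve candidates via text search ("blocking"), then
    # accept the best candidate only above a similarity threshold, to
    # avoid false positives.
    best, best_score = None, 0.0
    for record in INDEX:
        score = SequenceMatcher(None, atitle.lower(),
                                record["title"].lower()).ratio()
        if score > best_score:
            best, best_score = record, score
    return best if best_score >= threshold else None
```

In the real service the slow path is only taken when no strong identifier resolves, which is why Elasticsearch load stays low relative to the LMDB look-ups.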

The databases and Elasticsearch index must first be built from the resource files. The full service needs around 300 GB of space for building this index and it is highly recommended to use SSD for best performance.

Build the embedded LMDB databases

Resource dumps will be compiled into high performance LMDB databases. The system can read compressed (gzip or .xz) or plain text (json) files, so in practice you do not need to uncompress anything.
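The loaders' transparent handling of compressed input can be mimicked as follows (an illustrative Python sketch, not the project's Java code):

```python
import gzip
import lzma

def open_dump(path: str):
    """Open a resource dump as text, whatever its compression.

    Mirrors the loader behaviour described above: gzip (.gz), xz (.xz)
    and plain json files are all accepted without manual decompression.
    """
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    if path.endswith(".xz"):
        return lzma.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```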

Build the data loader

cd lookup
./gradlew clean build

All the following commands need to be launched under the subdirectory lookup/. The loading of the following databases can be done in parallel. The default configuration file under biblio-glutton/config/glutton.yml will be used if none is indicated. To use a configuration file in another location, just add its full path as an additional parameter, as when running the service.

Crossref metadata

General command line pattern:

java -jar build/libs/lookup-service-0.2-onejar.jar crossref --input /path/to/crossref/json/file path/to/config/file/glutton.yml

The last parameter is the project config file, normally under biblio-glutton/config/glutton.yml.

Example with a Crossref Metadata Plus snapshot (path to a .tar.gz file which archives many json files):

java -jar build/libs/lookup-service-0.2-onejar.jar crossref --input ~/tmp/crossref_metadata_plus.tar.gz ../config/glutton.yml

Example with Crossref dump Academic Torrent file (path to a repository of *.json.gz files):

java -jar build/libs/lookup-service-0.2-onejar.jar crossref --input ~/tmp/crossref_public_data_file_2021_01 ../config/glutton.yml

Example with xz-compressed file (e.g. GreeneLab dump):

java -jar build/libs/lookup-service-0.2-onejar.jar crossref --input crossref-works.2019-09-09.json.xz ../config/glutton.yml

Note: By default the abstract, the reference and the original indexed fields included in Crossref records are ignored to save some disk space. The reference field is particularly large as it lists all the citations for almost half of the DOI records. You can change the list of fields to be filtered out in the config file under biblio-glutton/config/glutton.yml, by editing the lines:

ignoreCrossRefFields:                                                   
  - reference
  - abstract
  - indexed
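What this filtering amounts to can be sketched as follows (illustrative Python, not the actual loader code; the sample record is invented):

```python
# Fields dropped before storage, matching the ignoreCrossRefFields config above.
IGNORED_FIELDS = ["reference", "abstract", "indexed"]

def strip_fields(record: dict, ignored=IGNORED_FIELDS) -> dict:
    """Return a copy of a Crossref record without the heavy fields."""
    return {k: v for k, v in record.items() if k not in ignored}

record = {
    "DOI": "10.1234/example",
    "title": ["An example article title"],
    # The reference field lists all cited works and dominates record size.
    "reference": [{"DOI": "10.1234/cited"}],
}
print(strip_fields(record))
```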

Example: loading the Crossref Metadata Plus snapshot of March 2022 took around 4 hours (dump files on a slow hard drive):

-- Counters --------------------------------------------------------------------
crossrefLookup_rejectedRecords
             count = 5472493

-- Meters ----------------------------------------------------------------------
crossrefLookup
             count = 126812507
         mean rate = 11368.21 events/second
     1-minute rate = 6520.71 events/second
     5-minute rate = 6403.19 events/second
    15-minute rate = 7240.26 events/second

The 5,472,493 rejected records correspond to the DOI "components" (DOIs assigned to figures, tables, etc. that are part of a document), which are filtered out. As of March 2022, we thus have 121,340,014 Crossref article records.

Crossref metadata gap coverage

Once the main Crossref metadata snapshot has been loaded, the metadata and index will be updated automatically every day via the Crossref web API. However, there is always a coverage gap between the last day covered by the snapshot and the start of the daily updates.

Currently, new Crossref Metadata Plus snapshots are available on the 5th of every month, covering all metadata updates up to the previous month. In the best case, there will thus be a coverage gap of 5 days to be recovered. More generally, users of Crossref Metadata Plus can first load the snapshot of the last month, then an additional mid-month snapshot covering the registered content that changed in the first half of the month. This usually reduces the coverage gap to a few days.

Using the Crossref web API to cover the remaining gap (from the latest update day in the full snapshot to the current day) is done with the following command (still under biblio-glutton/lookup/):

java -jar build/libs/lookup-service-0.2-onejar.jar gap_crossref path/to/config/file/glutton.yml

For instance:

java -jar build/libs/lookup-service-0.2-onejar.jar gap_crossref ../config/glutton.yml

Be sure to indicate in the configuration file glutton.yml your polite-usage email and/or Crossref Metadata Plus token for using the Crossref web API.

This command should thus be launched only once, after loading a full Crossref snapshot. It resyncs the metadata and index to the current day, and the daily update then ensures that everything remains in sync with the reference Crossref metadata as long as the service is up and running.

Warning: If an older snapshot is used, like the Crossref dump Academic Torrent file, the coverage gap is not a few days, but usually several months or more than a year (Crossref has not updated the Academic Torrent dump in 2022). Using the Crossref web API to cover such a long gap will unfortunately take an enormous amount of time (more than a week) due to API rate limits and is likely not an acceptable solution. In addition, the Crossref web API is not always reliable, which might cause further delays.

PMID and PMC ID

java -jar build/libs/lookup-service-0.2-onejar.jar pmid --input /path/to/pmid/csv/file path/to/config/file/glutton.yml

Example:

java -jar build/libs/lookup-service-0.2-onejar.jar pmid --input PMID_PMCID_DOI.csv.gz ../config/glutton.yml

As of March 2022, the latest mapping covers 34,310,000 PMID, with 25,661,624 having a DOI (which means 8,648,376 PMID are not represented in Crossref).

OA via Unpaywall

java -jar build/libs/lookup-service-0.2-onejar.jar unpaywall --input /path/to/unpaywall/json/file path/to/config/file/glutton.yml

Example:

java -jar build/libs/lookup-service-0.2-onejar.jar unpaywall --input unpaywall_snapshot_2022-03-09T083001.jsonl.gz ../config/glutton.yml

As of March 2022, the Unpaywall snapshot should provide at least one Open Access link to 30,618,764 Crossref entries.

ISTEX

Note that the ISTEX mapping is only relevant for users of the ISTEX full text resources, i.e. public research institutions in France, so you can generally skip this step.

java -jar build/libs/lookup-service-0.2-onejar.jar istex --input /path/to/istex/json/file path/to/config/file/glutton.yml

Example:

java -jar build/libs/lookup-service-0.2-onejar.jar istex --input istexIds.all.gz ../config/glutton.yml

Note: see below how to create this mapping file istexIds.all.gz.

Build the Elasticsearch index

Elasticsearch 7.* is required. node.js version 10 or higher should work fine.

A node.js utility under the subdirectory indexing/ is used to build the Elasticsearch index. Indexing will take a few hours. For 116M crossref entries, the indexing takes around 6 hours (with SSD, 16GB RAM) and around 22GB of index space (per ES node if you plan to use several ES nodes).

Install and configure

You first need to install and start Elasticsearch, version 7.*. Edit the project configuration file biblio-glutton/config/glutton.yml to indicate the host name and port of the Elasticsearch server. In this configuration file, it is possible to specify the name of the index (default: crossref) and the batch size of the bulk indexing.

Install the node.js module:

cd indexing/
npm install

Build the index

Usage information for indexing:

cd biblio-glutton/indexing/
node main -dump *PATH_TO_THE_CROSSREF_DUMP* index

Example with Crossref dump Academic Torrent file (path to a repository of *.json.gz files):

node main -dump ~/tmp/crossref_public_data_file_2021_01 index

Example with Crossref Metadata Plus snapshot (path to a file .tar.gz which archives many json files):

node main -dump ~/tmp/crossref_sample.tar.gz index

Example with GreeneLab/Internet Archive dump file:

node main -dump ~/tmp/crossref-works.2019-09-09.json.xz index

Note that launching the above command will fully re-index the data, deleting any existing index. The default name of the index is crossref, but this can be changed via the global config file biblio-glutton/config/glutton.yml.

For getting health check information about the selected ElasticSearch cluster:

node main health

Example loading the public Crossref dump available via Academic Torrents (2021-01-07), index on SSD, dump files on hard drive, Ubuntu 18.04, 4 cores, 16GB RAM, 6-year-old machine:

  • 115,972,356 indexed records (perfect match with the metadata db)
  • around 6.5 hours for indexing (while also using the computer for other work), 4797 records/s
  • 25.94GB index size

As for the metadata database, the Crossref objects of type component are skipped.

Matching accuracy

Here is an evaluation of the bibliographical reference matching.

Dataset

We created a dataset of 17,015 bibliographical reference/DOI pairs with GROBID and the PMC 1943 sample (a set of 1943 PubMed Central articles from 1943 different journals with both PDF and XML NLM files available, see below). For the bibliographical references present in the NLM file with a DOI, we try to align the raw reference string extracted from the PDF by GROBID with the parsed XML present in the NLM file. Raw reference strings thus come from the PDF, and we include additional metadata as extracted by GROBID from the PDF.

As an example, the first two of the 17,015 entries:

{"reference": "Classen M, Demling L. Endoskopishe shinkterotomie der papilla \nVateri und Stein extraction aus dem Duktus Choledochus [Ger-\nman]. Dtsch Med Wochenschr. 1974;99:496-7.", "doi": "10.1055/s-0028-1107790", "pmid": "4835515", "atitle": "Endoskopishe shinkterotomie der papilla Vateri und Stein extraction aus dem Duktus Choledochus [German]", "firstAuthor": "Classen", "jtitle": "Dtsch Med Wochenschr", "volume": "99", "firstPage": "496"},
{"reference": "Kawai K, Akasaka Y, Murakami K. Endoscopic sphincterotomy \nof the ampulla of Vater. Gastrointest Endosc. 1974;20:148-51.", "doi": "10.1016/S0016-5107(74)73914-1", "pmid": "4825160", "atitle": "Endoscopic sphincterotomy of the ampulla of Vater", "firstAuthor": "Kawai", "jtitle": "Gastrointest Endosc", "volume": "20", "firstPage": "148"},

The goal of Glutton matching is to identify the right DOI from raw metadata. We compare the results with the Crossref REST API, using the query.bibliographic field for raw reference string matching, and author/title field queries for first author lastname (query.author) plus title matching (query.title).
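For reference, the baseline Crossref REST API queries used in this comparison can be assembled like this (a sketch using the field queries named above; the mailto address is a placeholder following Crossref's polite-usage convention):

```python
from urllib.parse import urlencode

def crossref_query(biblio=None, author=None, title=None, rows=1) -> str:
    """Build a Crossref REST API works query.

    query.bibliographic is used for raw reference string matching,
    query.author + query.title for first-author / title matching,
    mirroring the comparison described above.
    """
    params = {"rows": rows, "mailto": "you@example.com"}  # placeholder address
    if biblio:
        params["query.bibliographic"] = biblio
    if author:
        params["query.author"] = author
    if title:
        params["query.title"] = title
    return "https://api.crossref.org/works?" + urlencode(params)

# Metadata query built from the second dataset entry shown above:
print(crossref_query(author="Kawai",
                     title="Endoscopic sphincterotomy of the ampulla of Vater"))
```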

Limits:

  • The DOIs present in the NLM files are not always reliable (e.g. a DOI may no longer be valid following updates in Crossref). A large share of the matching errors are actually due not to the matching service but to the quality of the NLM reference DOI data. However, these errors are the same for all matching services, so the comparison remains valid, although the resulting accuracy is clearly lower than it should be.

  • GROBID extraction is not always reliable, nor is the alignment mechanism with NLM (based on soft matching), and some raw reference strings might be incomplete or include unexpected extra material from the PDF. However, this can be viewed as part of the matching challenge in real-world conditions!

  • the NLM references with a DOI are usually simpler than references in general: there are far fewer abbreviated references (without title or authors) and title-less references than in publications from non-medicine publishers.

How to run the evaluation

You can use the DOI matching evaluation set (with 17,015 bibliographical reference/DOI pairs) from the address indicated above, or recreate this dataset with GROBID as follows:

./gradlew PrepareDOIMatching -Pp2t=ABS_PATH_TO_PMC/PMC_sample_1943

The evaluation dataset will be saved under ABS_PATH_TO_PMC/PMC_sample_1943 with the name references-doi-matching.json.

For launching an evaluation:

  1. Select the matching method (crossref or glutton) in the grobid-home/config/grobid.yaml file:

consolidation:
    # define the bibliographical data consolidation service to be used, either "crossref" for Crossref REST API or 
    # "glutton" for https://github.com/kermitt2/biblio-glutton
    #service: "crossref"
    service: "glutton"

  2. If Glutton is selected, start the Glutton server as indicated above (we assume it is running at localhost:8080).

  3. Launch the evaluation from GROBID, indicating the path where the evaluation dataset has been created - here we suppose that the file references-doi-matching.json has been saved under ABS_PATH_TO_PMC/PMC_sample_1943:

./gradlew EvaluateDOIMatching -Pp2t=ABS_PATH_TO_PMC/PMC_sample_1943

Full raw bibliographical reference matching

Runtimes correspond to processing on a single machine running the Glutton REST API server, Elasticsearch, and the GROBID evaluation, with a CRF for the citation model and a Crossref index dated Sept. 2021.

======= GLUTTON API ======= 
17015 bibliographical references processed in 1145.593 seconds, 0.06732841610343815 seconds per bibliographical reference.
Found 16699 DOI

precision:      97.33
recall:         95.52
F1-score:       96.42

With BiLSTM-CRF model instead of CRF for parsing the raw references prior to matching:

======= GLUTTON API ======= 

precision:      97.34
recall:         95.83
f-score:        96.58

In the case of the Crossref API, we use as many concurrent queries as the service allows (usually 50), via the GROBID Crossref multithreaded client.

======= CROSSREF API ======= 

17015 bibliographical references processed in 3057.464 seconds, 0.1797 seconds per bibliographical reference.
Found 16502 DOI

======= CROSSREF API ======= 

precision:      97.19
recall:         94.26
F1-score:       95.69

ISTEX mapping

If you don't know what ISTEX is, you can safely skip this section.

ISTEX identifier mapping

For creating a dump of all ISTEX identifiers associated with existing identifiers (DOI, ark, PII), use the node.js script as follows:

  • install:

cd scripts
npm install requestretry

  • generate the json dump:

node dump-istexid-and-other-ids.js > istexIds.all

Be sure to have a good internet bandwidth for ensuring a high rate usage of the ISTEX REST API.

You can then move the json dump (e.g. istexIds.all) to the ISTEX data path indicated in the file config/glutton.yml (by default data/istex/).

ISTEX to PubMed mapping

The mapping adds PubMed information (in particular MeSH classes) to ISTEX entries. See the instructions here.

Main authors and contact

License

Distributed under Apache 2.0 license.
