• Stars: 683
• Rank: 66,158 (Top 2%)
• Language: Python
• License: MIT License
• Created: almost 9 years ago
• Updated: almost 2 years ago


Repository Details

Deep neural network framework for multi-label text classification


Magpie is a deep learning tool for multi-label text classification. It learns from a training corpus to assign labels to arbitrary text and can then be used to predict those labels on unseen data. It was developed at CERN to assign subject categories to High Energy Physics abstracts and to extract keywords from them.

Very short introduction

>>> magpie = Magpie()
>>> magpie.init_word_vectors('/path/to/corpus', vec_dim=100)
>>> magpie.train('/path/to/corpus', ['label1', 'label2', 'label3'], epochs=3)
Training...
>>> magpie.predict_from_text('Well, that was quick!')
[('label1', 0.96), ('label3', 0.65), ('label2', 0.21)]

Short introduction

To train the model you need a large corpus of labeled data in a text format encoded as UTF-8. An example corpus can be found under the data/hep-categories directory. Magpie looks for .txt files containing the text to predict on and corresponding .lab files with the assigned labels, one per line. A pair of files containing the labels and the text should share the same name and differ only in their extensions, e.g.:

$ ls data/hep-categories
1000222.lab 1000222.txt 1000362.lab 1000362.txt 1001810.lab 1001810.txt ...
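If you want to build your own corpus, the same layout can be reproduced with a few lines of Python. This is a minimal sketch only; the directory name, document id, text and labels below are made up for illustration:

import os

# Create a hypothetical corpus directory with one .txt/.lab pair
os.makedirs('data/my-corpus', exist_ok=True)
with open('data/my-corpus/1000001.txt', 'w', encoding='utf-8') as f:
    f.write('We study black hole entropy in anti-de Sitter space.')
with open('data/my-corpus/1000001.lab', 'w', encoding='utf-8') as f:
    f.write('Gravitation and Cosmology\nTheory-HEP\n')  # one label per line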

Before you train the model, you need to build appropriate word vector representations for your corpus. In theory, you can train them on a different corpus or reuse already trained ones (tutorial), but Magpie can also build them for you:

from magpie import Magpie

magpie = Magpie()
magpie.train_word2vec('data/hep-categories', vec_dim=100)

Then you need to fit a scaling matrix to normalize the input data; it is specific to the trained word2vec representation. Here's the one-liner:

magpie.fit_scaler('data/hep-categories')

You would usually want to combine those two steps by simply running:

magpie.init_word_vectors('data/hep-categories', vec_dim=100)

If you plan to reuse the trained word representations, you might want to save them and pass them to the Magpie constructor next time. To train the model, just type:

labels = ['Gravitation and Cosmology', 'Experiment-HEP', 'Theory-HEP']
magpie.train('data/hep-categories', labels, test_ratio=0.2, epochs=30)

If you provide the test_ratio argument, the data is split into train and test sets (here in an 80/20 ratio) and the model evaluates itself after every epoch, displaying its current loss and accuracy. The default value of test_ratio is 0, meaning that all the data is used for training.

If your data doesn't fit into memory, you can also run magpie.batch_train(), which has a similar API but is more memory efficient, as sketched below.
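For example, assuming batch_train() accepts the same directory, label list and epochs arguments as train() (the text above only says the API is "similar", so treat this as a sketch):

magpie.batch_train('data/hep-categories', labels, epochs=30)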

A trained model can be used for prediction with the following methods:

>>> magpie.predict_from_file('data/hep-categories/1002413.txt')
[('Experiment-HEP', 0.47593361),
 ('Gravitation and Cosmology', 0.055745006),
 ('Theory-HEP', 0.02692855)]

>>> magpie.predict_from_text('Stephen Hawking studies black holes')
[('Gravitation and Cosmology', 0.96627593),
 ('Experiment-HEP', 0.64958507),
 ('Theory-HEP', 0.20917746)]

Saving & loading the model

A Magpie object consists of three components: the word2vec mappings, a scaler and a Keras model. In order to train Magpie you can either provide the word2vec mappings and a scaler in advance or let the program compute them for you on the training data. Usually you would want to train them yourself on the full dataset and reuse them afterwards. You can use the provided functions for that purpose:

magpie.save_word2vec_model('/save/my/embeddings/here')
magpie.save_scaler('/save/my/scaler/here', overwrite=True)
magpie.save_model('/save/my/model/here.h5')

When you want to reinitialize your trained model, you can run:

magpie = Magpie(
    keras_model='/save/my/model/here.h5',
    word2vec_model='/save/my/embeddings/here',
    scaler='/save/my/scaler/here',
    labels=['cat', 'dog', 'cow']
)

or just pass the objects directly!
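As a sketch of that last option, assuming a trained Magpie instance exposes its components as the attributes keras_model, word2vec_model and scaler (attribute names assumed here), you could reuse them in memory instead of going through the filesystem:

# Reuse in-memory components of an already trained instance (attribute
# names are an assumption, not confirmed by the text above)
new_magpie = Magpie(
    keras_model=magpie.keras_model,
    word2vec_model=magpie.word2vec_model,
    scaler=magpie.scaler,
    labels=labels,
)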

Installation

The package is not on PyPI, but you can get it directly from GitHub:

$ pip install git+https://github.com/inspirehep/[email protected]

If you encounter any problems with the installation, make sure to install the correct versions of the dependencies listed in the setup.py file.

Disclaimer & citation

The neural network models used within Magpie are based on work done by Yoon Kim and subsequently Mark Berger.

Contact

If you have any problems, feel free to open an issue. We'll do our best to help 👍

More Repositories

1. refextract: Extract bibliographic references from (High-Energy Physics) articles. (Python, 129 stars)
2. beard: Bibliographic Entity Automatic Recognition and Disambiguation (Python, 66 stars)
3. inspire-next: The INSPIRE repo. (Python, 59 stars)
4. rest-api-doc: Documentation of the INSPIRE REST API (40 stars)
5. hepcrawl: Scrapy project for feeds into INSPIRE-HEP (Python, 17 stars)
6. inspire: Official repo of the legacy INSPIRE-HEP overlay (Python, 17 stars)
7. impact-graphs: Creates graphs to show a publication's impact, and the impact of cited publications, and papers who've cited a publication of interest. (JavaScript, 16 stars)
8. inspirehep: Documentation: http://inspire.docs.cern.ch (Python, 13 stars)
9. inspire-schemas: Inspire JSON schemas and utilities to use them. (Python, 8 stars)
10. jsonschema2rst: (Python, 7 stars)
11. record-editor: Record editing tool used in http://inspirehep.net (TypeScript, 6 stars)
12. inspire-query-parser: A PEG-based query parser for INSPIRE. (Python, 5 stars)
13. author.xml: Documentation of the author.xml format to describe author lists (XSLT, 5 stars)
14. invenio-grobid: Invenio package for integration of the Grobid metadata extraction service (Python, 4 stars)
15. inspire-classifier: INSPIRE text classification microservice (Python, 4 stars)
16. inspire-crawler: Crawler integration with INSPIRE-HEP. (Python, 4 stars)
17. inspire-docker: Dockerfiles for the inspirehep/inspire-next application (Shell, 4 stars)
18. invenio-matcher-benchmark: Test data for invenio-matcher (Python, 4 stars)
19. inspire-json-merger: INSPIRE-specific configuration of the JSON Merger. (Python, 3 stars)
20. inspire-dojson: INSPIRE-specific rules to transform from MARCXML to JSON and back. (Python, 3 stars)
21. inspire-matcher: Find the records in INSPIRE most similar to a given record or reference. (Python, 3 stars)
22. plotextractor: Extract images and captions from TeX files in a tar archive. (Python, 3 stars)
23. inspirehep-ui: UI for INSPIREHEP (JavaScript, 2 stars)
24. curation-scripts: Scripts for automated large-scale curation (Python, 2 stars)
25. inspire-utils: INSPIRE-specific utils. (Python, 2 stars)
26. inspirehep-search-js: Angular JS application used in the search results page (JavaScript, 2 stars)
27. inspire-citesummary-js: INSPIRE HEP Citation Summary JS code (JavaScript, 2 stars)
28. beard-server: Application providing a REST API over Beard (Python, 2 stars)
29. inspire-relations: Invenio module to integrate the Neo4J graph database into INSPIRE and handle relations across records. (Python, 1 star)
30. python-rt: Temporary clone of https://gitlab.labs.nic.cz/labs/python-rt (Python, 1 star)
31. es-cli: Small CLI tool to play with Elasticsearch indices (dump, load, reindex...) (Python, 1 star)
32. isbnid: Python ISBN identifier library (Python, 1 star)
33. images: Docker images for the inspire project (Python, 1 star)
34. inspire-mitmproxy: (Python, 1 star)
35. inspire-magpie: Wrapper around magpie for InspireHEP (Python, 1 star)
36. relations: Neo4J-based module to handle INSPIRE-specific relations (Python, 1 star)
37. invenio-trends: Trends Dashboard API for Invenio Installations (Jupyter Notebook, 1 star)