Enriching BERT with Knowledge Graph Embedding for Document Classification (PyTorch)

PyTorch BERT Document Classification

This repository contains the implementation and pre-trained models for the paper Enriching BERT with Knowledge Graph Embedding for Document Classification (PDF), a submission to the GermEval 2019 shared task on hierarchical text classification. If you encounter any problems, feel free to contact us or submit a GitHub issue.

Model architecture

BERT + Knowledge Graph Embeddings

Installation

Requirements:

  • Python 3.6
  • CUDA GPU
  • Jupyter Notebook

Install dependencies:

pip install -r requirements.txt

Prepare data

GermEval data

Author Embeddings

python wikidata_for_authors.py run ~/datasets/wikidata/index_enwiki-20190420.db \
    ~/datasets/wikidata/index_dewiki-20190420.db \
    ~/datasets/wikidata/torchbiggraph/wikidata_translation_v1.tsv.gz \
    ~/notebooks/bert-text-classification/authors.pickle \
    ~/notebooks/bert-text-classification/author2embedding.pickle

# OPTIONAL: Projector format
python wikidata_for_authors.py convert_for_projector \
    ~/notebooks/bert-text-classification/author2embedding.pickle \
    extras/author2embedding.projector.tsv \
    extras/author2embedding.projector_meta.tsv
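
Once the script has run, the resulting pickle can be inspected from Python. The sketch below assumes that author2embedding.pickle holds a plain dict mapping author names to numeric vectors; that layout is an assumption, not something this README confirms. `load_author_embeddings` and `summarize` are hypothetical helpers and are not part of the repository.

```python
import pickle
from pathlib import Path


def load_author_embeddings(path):
    """Load a pickled mapping of author name -> embedding vector.

    Assumption: the file contains a dict; the real file layout may differ.
    """
    with Path(path).expanduser().open("rb") as f:
        data = pickle.load(f)
    if not isinstance(data, dict):
        raise TypeError(f"expected a dict, got {type(data).__name__}")
    return data


def summarize(embeddings):
    """Return (number of authors, sorted list of distinct vector lengths)."""
    dims = sorted({len(vec) for vec in embeddings.values()})
    return len(embeddings), dims
```

If the assumption holds, `summarize(load_author_embeddings("~/notebooks/bert-text-classification/author2embedding.pickle"))` gives a quick sanity check: more than one distinct vector length would point to a problem in the extraction step.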

Reproduce paper results

Download pre-trained models: GitHub releases

Available experiment settings

Detailed settings for each experiment can be found in cli.py.

task-a__bert-german_full
task-a__bert-german_manual_no-embedding
task-a__bert-german_no-manual_embedding
task-a__bert-german_text-only
task-a__author-only
task-a__bert-multilingual_text-only

task-b__bert-german_full
task-b__bert-german_manual_no-embedding
task-b__bert-german_no-manual_embedding
task-b__bert-german_text-only
task-b__author-only
task-b__bert-multilingual_text-only
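
The names above appear to follow the pattern `task-<a|b>__<setting>`. The sketch below relies on that inferred pattern to enumerate and parse them, for example to loop over all twelve experiments. `parse_experiment_name` is a hypothetical helper, and cli.py remains the authoritative definition of each setting.

```python
from typing import NamedTuple


class ExperimentName(NamedTuple):
    task: str     # "a" or "b"
    setting: str  # e.g. "bert-german_full"


# The six settings listed above; each exists for task A and task B.
SETTINGS = [
    "bert-german_full",
    "bert-german_manual_no-embedding",
    "bert-german_no-manual_embedding",
    "bert-german_text-only",
    "author-only",
    "bert-multilingual_text-only",
]
ALL_EXPERIMENTS = [f"task-{t}__{s}" for t in ("a", "b") for s in SETTINGS]


def parse_experiment_name(name):
    """Split 'task-a__bert-german_full' into its task and setting parts."""
    prefix, sep, setting = name.partition("__")
    if not sep or not prefix.startswith("task-") or not setting:
        raise ValueError(f"unexpected experiment name: {name!r}")
    return ExperimentName(task=prefix[len("task-"):], setting=setting)
```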

Environment variables

  • TRAIN_DF_PATH: Path to the pickled pandas DataFrame with the training data
  • GPU_ID: Run experiments on this GPU (used for CUDA_VISIBLE_DEVICES)
  • OUTPUT_DIR: Directory to store experiment output
  • EXTRAS_DIR: Directory where author embeddings and gender data are located
  • BERT_MODELS_DIR: Directory where pre-trained BERT models are located
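
Before launching a long run it can help to check that these variables are set. The following is a minimal sketch, not code from the repository: `read_settings` is a hypothetical helper that only checks for presence. The commands in this README also use VAL_DF_PATH, FULL_DF_PATH and TEST_DF_PATH, which are not in the list above; add them to `REQUIRED_VARS` if your workflow needs them.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARS = (
    "TRAIN_DF_PATH",
    "GPU_ID",
    "OUTPUT_DIR",
    "EXTRAS_DIR",
    "BERT_MODELS_DIR",
)


def read_settings(environ=None):
    """Collect the required variables, raising if any is missing or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```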

Validation set

python cli.py run_on_val <name> $GPU_ID $EXTRAS_DIR $TRAIN_DF_PATH $VAL_DF_PATH $OUTPUT_DIR --epochs 5

Test set

python cli.py run_on_test <name> $GPU_ID $EXTRAS_DIR $FULL_DF_PATH $TEST_DF_PATH $OUTPUT_DIR --epochs 5
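
To run several experiments in a row, the two commands above can be assembled programmatically. The sketch below only mirrors the argument order shown in this README; `build_command` is a hypothetical helper, and the file names in the example call are placeholders rather than files shipped with the repository.

```python
import shlex


def build_command(mode, name, gpu_id, extras_dir, train_df, eval_df,
                  output_dir, epochs=5):
    """Return the argv list for `cli.py run_on_val` or `cli.py run_on_test`."""
    if mode not in ("run_on_val", "run_on_test"):
        raise ValueError(f"unknown mode: {mode!r}")
    return [
        "python", "cli.py", mode, name,
        str(gpu_id), extras_dir, train_df, eval_df, output_dir,
        "--epochs", str(epochs),
    ]


# Placeholder paths for illustration only.
cmd = build_command("run_on_val", "task-a__bert-german_full", 0,
                    "extras", "train.pickle", "val.pickle", "output")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Building a list instead of a shell string avoids quoting problems with paths that contain spaces.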

Evaluation

The scores from the result table can be reproduced with the evaluation.ipynb notebook.

How to cite

If you are using our code, please cite our paper:

@inproceedings{Ostendorff2019,
    address = {Erlangen, Germany},
    author = {Ostendorff, Malte and Bourgonje, Peter and Berger, Maria and Moreno-Schneider, Julian and Rehm, Georg},
    booktitle = {Proceedings of the GermEval 2019 Workshop},
    title = {{Enriching BERT with Knowledge Graph Embedding for Document Classification}},
    year = {2019}
}

License

MIT
