• This repository has been archived on 24/Jul/2020
  • Language
    C++
  • License
    MIT License
  • Created about 8 years ago
  • Updated over 4 years ago

Repository Details

Distantly Supervised Relation Extraction

USC Distantly-supervised Relation Extraction System

This repository puts together recent models and data sets for sentence-level relation extraction using knowledge bases (i.e., distant supervision). In particular, it contains the source code for the WWW'17 paper CoType: Joint Extraction of Typed Entities and Relations with Knowledge Bases.

Please also check out our new repository on handling shifted label distribution in distant supervision

Task: Given a text corpus with entity mentions detected and heuristically labeled using distant supervision, the task aims to identify relation types/labels between a pair of entity mentions based on the sentence context where they co-occur.

Quick Start

Blog Posts

Data

To evaluate sentence-level extraction, we converted three public datasets to our JSON format using our data pipeline. We ran Stanford NER on the training set to detect entity mentions, mapped entity names to Freebase entities using DBpediaSpotlight, aligned Freebase facts to sentences, and assigned the entity types of the Freebase entities to their mapped names in the sentences:

  • PubMed-BioInfer: 100k PubMed paper abstracts as training data and 1,530 manually labeled biomedical paper abstracts from BioInfer (Pyysalo et al., 2007) as test data. It consists of 94 relation types (protein-protein interactions) and over 2,000 entity types (from MESH ontology). (Download)

  • NYT-manual: 1.18M sentences sampled from 294K New York Times news articles which were then aligned with Freebase facts by (Riedel et al., ECML'10) (link to Riedel's data). For test set, 395 sentences are manually annotated with 24 relation types and 47 entity types (Hoffmann et al., ACL'11) (link to Hoffmann's data). (Download)

  • Wiki-KBP: the training corpus contains 1.5M sentences sampled from 780k Wikipedia articles (Ling & Weld, 2012) plus ~7,000 sentences from 2013 KBP corpus. Test data consists of 14k system-labeled sentences from 2013 KBP slot filling assessment results. It has 7 relation types and 126 entity types after filtering of numeric value relations. (Download)

Please put the data files in corresponding subdirectories under data/source

Benchmark

Performance comparison with several relation extraction systems on the KBP 2013 dataset (sentence-level extraction).

Method                                              Precision  Recall  F1
Mintz (our implementation, Mintz et al., 2009)      0.296      0.387   0.335
LINE + Dist Sup (Tang et al., 2015)                 0.360      0.257   0.299
MultiR (Hoffmann et al., 2011)                      0.325      0.278   0.301
FCM + Dist Sup (Gormley et al., 2015)               0.151      0.498   0.300
HypeNet (our implementation, Shwartz et al., 2016)  0.210      0.315   0.252
CNN (our implementation, Zeng et al., 2014)         0.198      0.334   0.242
PCNN (our implementation, Zeng et al., 2015)        0.220      0.452   0.295
LSTM (our implementation)                           0.274      0.500   0.350
Bi-GRU (our implementation)                         0.301      0.465   0.362
SDP-LSTM (our implementation, Xu et al., 2015)      0.300      0.436   0.356
Position-Aware LSTM (Zhang et al., 2017)            0.265      0.598   0.367
CoType-RM (Ren et al., 2017)                        0.303      0.407   0.347
CoType (Ren et al., 2017)                           0.348      0.406   0.369

Note: for models trained on sentences annotated with a single label (HypeNet, CNN/PCNN, LSTM, SDP/PA-LSTMs, Bi-GRU), we form one training instance for each sentence-label pair based on their DS-annotated data.
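The expansion described in the note can be sketched as follows. This is a minimal illustration, not the code used in this repository: the record fields `sentence`, `pair`, and `labels`, and the label names `rel_a` and `rel_b`, are hypothetical and do not reflect the actual JSON schema.

```python
def expand_to_single_label(records):
    """Return one training instance per (sentence, label) pair.

    Each input record is assumed (hypothetically) to look like:
        {"sentence": str, "pair": (str, str), "labels": [str, ...]}
    """
    instances = []
    for record in records:
        for label in record["labels"]:
            # Copy the sentence and entity pair once for every DS label.
            instances.append({
                "sentence": record["sentence"],
                "pair": record["pair"],
                "label": label,
            })
    return instances


if __name__ == "__main__":
    records = [
        {"sentence": "A works with B.", "pair": ("A", "B"),
         "labels": ["rel_a", "rel_b"]},
        {"sentence": "C met D.", "pair": ("C", "D"), "labels": ["rel_a"]},
    ]
    for instance in expand_to_single_label(records):
        print(instance)
```

A sentence whose entity pair carries two distantly supervised labels therefore contributes two instances that share the same text and differ only in the label.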

Usage

Dependencies

We use Ubuntu as the example platform.

  • python 2.7
  • Python library dependencies
$ pip install pexpect ujson tqdm
$ cd code/DataProcessor/
$ git clone git@github.com:stanfordnlp/stanza.git
$ cd stanza
$ pip install -e .
$ wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
$ unzip stanford-corenlp-full-2016-10-31.zip

We have included compiled binaries. If you need to recompile retype.cpp in your own g++ environment, run:

$ cd code/Model/retype; make

Default Run

As an example, we show how to run CoType on the Wiki-KBP dataset

Start the Stanford CoreNLP server for the Python wrapper.

$ java -mx4g -cp "code/DataProcessor/stanford-corenlp-full-2016-10-31/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer

Feature extraction, embedding learning on training data, and evaluation on test data.

$ ./run.sh  

For relation classification, the "none"-labeled instances must first be removed from the train/test JSON files. The hyperparameters for embedding learning are included in the run.sh script.
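One way to remove the "none"-labeled instances is sketched below. It rests on assumptions that are not confirmed by this README: that the files hold one JSON object per line, that each object stores its relation label as a string under a key called `label`, and that the negative label is spelled `None` (compared case-insensitively). Adjust the key and the label to the actual schema before using it.

```python
import json


def drop_none_instances(lines, none_label="None", label_key="label"):
    """Return the JSON lines whose label is not the "none" label.

    `label_key` and `none_label` are hypothetical names; records that
    lack the key are kept unchanged.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get(label_key, "").lower() != none_label.lower():
            kept.append(json.dumps(record))
    return kept


if __name__ == "__main__":
    sample = [
        '{"id": 1, "label": "None"}',
        '{"id": 2, "label": "rel_a"}',
    ]
    for line in drop_none_instances(sample):
        print(line)
```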

Parameters

Dataset to run on.

Data="KBP"
  • Hyperparameters for relation extraction:
- KBP: -negative 3 -iters 400 -lr 0.02 -transWeight 1.0
- NYT: -negative 5 -iters 700 -lr 0.02 -transWeight 7.0
- BioInfer: -negative 5 -iters 700 -lr 0.02 -transWeight 7.0

Hyperparameters for relation classification are included in the run.sh script.
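For scripting runs over several datasets, the relation extraction settings listed above can be kept in one table. The sketch below only assembles the flag list; the flag names and values are copied from the list above, while the dictionary, the function name, and the idea of passing the flags to a command yourself are illustrative assumptions and are not part of run.sh.

```python
# Values copied from the hyperparameter list above.
HYPERPARAMS = {
    "KBP": {"negative": 3, "iters": 400, "lr": 0.02, "transWeight": 1.0},
    "NYT": {"negative": 5, "iters": 700, "lr": 0.02, "transWeight": 7.0},
    "BioInfer": {"negative": 5, "iters": 700, "lr": 0.02, "transWeight": 7.0},
}

# Emit flags in the same order as they are listed above.
FLAG_ORDER = ("negative", "iters", "lr", "transWeight")


def build_flags(dataset):
    """Return the flags for `dataset` as a flat list of strings."""
    params = HYPERPARAMS[dataset]
    flags = []
    for name in FLAG_ORDER:
        flags.extend(["-" + name, str(params[name])])
    return flags


if __name__ == "__main__":
    print(" ".join(build_flags("KBP")))
```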

Evaluation

Evaluate relation extraction performance (precision, recall, F1): produce predictions along with their confidence scores, then filter the predicted instances by tuning the threshold.

$ python code/Evaluation/emb_test.py extract KBP retype cosine 0.0
$ python code/Evaluation/tune_threshold.py extract KBP emb retype cosine
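The idea behind threshold-based filtering can be illustrated with a short sketch. It is not the logic of emb_test.py or tune_threshold.py: the data layout (dictionaries keyed by an instance id), the function names, and the convention that the gold set contains only instances with a true relation are all assumptions made for the example.

```python
def precision_recall_f1(predictions, gold, threshold):
    """Score predictions whose confidence is at least `threshold`.

    predictions: {instance_id: (label, score)}  (hypothetical layout)
    gold:        {instance_id: label}, true relation instances only
    """
    kept = {i: label for i, (label, score) in predictions.items()
            if score >= threshold}
    correct = sum(1 for i, label in kept.items() if gold.get(i) == label)
    precision = correct / float(len(kept)) if kept else 0.0
    recall = correct / float(len(gold)) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


def tune_threshold(predictions, gold, candidates):
    """Return the candidate threshold with the highest F1."""
    return max(candidates,
               key=lambda t: precision_recall_f1(predictions, gold, t)[2])


if __name__ == "__main__":
    predictions = {1: ("rel_a", 0.9), 2: ("rel_b", 0.4), 3: ("rel_a", 0.2)}
    gold = {1: "rel_a", 2: "rel_a", 4: "rel_b"}
    best = tune_threshold(predictions, gold, [0.0, 0.5, 0.95])
    print(best, precision_recall_f1(predictions, gold, best))
```

Raising the threshold discards low-confidence predictions, which tends to raise precision and lower recall. Selecting a threshold on the same data that is then reported gives optimistic numbers, so a held-out split is the safer choice where one is available.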

In-text Prediction

The last command in run.sh generates a JSON file of predicted results in the same format as test.json in data/source/$DATANAME, except that it outputs only the predicted relation mention labels. Replace the second parameter with the threshold you would like to use.

$ python code/Evaluation/convertPredictionToJson.py $Data 0.0

Customized Run

Code for producing the JSON files from a raw corpus for running CoType and baseline models is here.

Baselines

You can find our implementations of some recent relation extraction models under the code/Model/ directory.

References

Contributors

  • Ellen Wu
  • Meng Qu
  • Frank Xu
  • Wenqi He
  • Maosen Zhang
  • Qinyuan Ye
  • Xiang Ren
