Chinese Named Entity Recognition for Social Media

This repository contains:

  1. Data: Named Entity Recognition (NER) for Chinese Social Media (Weibo). This dataset contains messages selected from Weibo and annotated according to the DEFT ERE annotation guidelines. Annotations include both name and nominal mentions. The corpus contains 1,890 messages sampled from Weibo between November 2013 and December 2014.

  2. golden-horse: a neural-network-based NER tool for Chinese social media.

Important update of the data

We fixed some inconsistencies in the data, especially in the annotations of nominal mentions. We thank Hangfeng He for his contribution to the major cleanup and revision of the annotations.

The original and revised annotated data are both made available in the data/ directory, with prefixes weiboNER.conll and weiboNER_2nd_conll, respectively.

We include updated results of our models on the revised version of the data in the supplementary material golden_horse_supplement.pdf. If you want to compare with our models on the revised data, please refer to that document.

Please note that results on the revised data are not directly comparable to the numbers reported in our original papers.

If you use the revised dataset, please cite the following BibTeX entry in addition to our papers:

@article{HeS16,
  author  = {Hangfeng He and Xu Sun},
  title   = {F-Score Driven Max Margin Neural Network for Named Entity Recognition in Chinese Social Media},
  journal = {CoRR},
  volume  = {abs/1611.04234},
  year    = {2016}
}

golden-horse

This is the implementation of the following papers:

Named Entity Recognition for Chinese Social Media with Jointly Trained Embeddings
Nanyun Peng and Mark Dredze
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015

and

Improving Named Entity Recognition for Chinese Social Media with Word Segmentation Representation Learning
Nanyun Peng and Mark Dredze
Annual Meeting of the Association for Computational Linguistics (ACL), 2016

If you use the code, please cite the following BibTeX entries:

@inproceedings{peng2015ner,
  title     = {Named Entity Recognition for Chinese Social Media with Jointly Trained Embeddings},
  author    = {Peng, Nanyun and Dredze, Mark},
  booktitle = {Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  pages     = {548--554},
  year      = {2015},
  url       = {https://www.aclweb.org/anthology/D15-1064/}
}

@inproceedings{peng2016improving,
  title     = {Improving Named Entity Recognition for Chinese Social Media with Word Segmentation Representation Learning},
  author    = {Peng, Nanyun and Dredze, Mark},
  booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL)},
  volume    = {2},
  pages     = {149--155},
  year      = {2016},
  url       = {https://www.aclweb.org/anthology/P16-2025/}
}

Dependencies:

This is a Theano implementation; it requires the following Python modules:
Theano
jieba (a Chinese word segmenter)
Both can be installed with pip install <module name>.
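For example, assuming a standard pip setup, one command installs both:

pip install Theano jieba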

The LSTM layer was adapted from http://deeplearning.net/tutorial/lstm.html, and the feature extraction part was adapted from CRFsuite: http://www.chokkan.org/software/crfsuite/

Running the EMNLP_15 experiments:

Sample commands for the training:

python theano_src/crf_ner.py --nepochs 30 --neval_epochs 1 --training_data data/weiboNER.conll.train --valid_data data/weiboNER.conll.dev --test_data data/weiboNER.conll.test --emb_file embeddings/weibo_charpos_vectors --emb_type charpos --save_model_param weibo_best_parameters --emb_init true --eval_test False

python theano_src/crf_ner.py --nepochs 30 --neval_epochs 1 --training_data data/weiboNER_2nd_conll.train --valid_data data/weiboNER_2nd_conll.dev --test_data data/weiboNER_2nd_conll.test --emb_file embeddings/weibo_charpos_vectors --emb_type char --save_model_param weibo_best_parameters --emb_init true --eval_test False

In the examples above, the predictions will be written to <output_dir>/weiboNER.conll.test.prediction, where the output directory is set by the --output_dir flag. If you also want to see the evaluation (this requires labeled test data), add the flag --eval_test True, as in the example below.
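For instance, the first training command above with evaluation enabled (only the final flag changes):

python theano_src/crf_ner.py --nepochs 30 --neval_epochs 1 --training_data data/weiboNER.conll.train --valid_data data/weiboNER.conll.dev --test_data data/weiboNER.conll.test --emb_file embeddings/weibo_charpos_vectors --emb_type charpos --save_model_param weibo_best_parameters --emb_init true --eval_test True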

Sample commands for running the test:

python theano_src/crf_ner.py --test_data data/weiboNER.conll.test --only_test true --output_dir data/ --save_model_param weibo_best_parameters

Running the ACL_16 experiments:

python theano_src/jointSegNER.py --cws_train_path data/pku_training.utf8 --cws_valid_path data/pku_test_gold.utf8 --cws_test_path data/pku_test_gold.utf8 --ner_train_path data/weiboNER_2nd_conll.train --ner_valid_path data/weiboNER_2nd_conll.dev --ner_test_path data/weiboNER_2nd_conll.test --emb_init file --emb_file embeddings/weibo_charpos_vectors --lr 0.05 --nepochs 30 --train_mode joint --cws_joint_weight 0.7 --m1_wemb1_dropout_rate 0.1

The last three parameters (--train_mode, --cws_joint_weight, and --m1_wemb1_dropout_rate) and the learning rate can be tuned. In our experiments, we found that for named mentions the best combination is (joint, 0.7, 0.1); for nominal mentions, it is (alternative, 1.0, 0.1).
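For instance, a hypothetical command for the nominal-mention setting would change only those three flags, assuming --train_mode accepts the value alternative for the paper's alternating training scheme:

python theano_src/jointSegNER.py --cws_train_path data/pku_training.utf8 --cws_valid_path data/pku_test_gold.utf8 --cws_test_path data/pku_test_gold.utf8 --ner_train_path data/weiboNER_2nd_conll.train --ner_valid_path data/weiboNER_2nd_conll.dev --ner_test_path data/weiboNER_2nd_conll.test --emb_init file --emb_file embeddings/weibo_charpos_vectors --lr 0.05 --nepochs 30 --train_mode alternative --cws_joint_weight 1.0 --m1_wemb1_dropout_rate 0.1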

Data

We noticed that several factors could affect the replicability of the experiments:

  1. the segmenter used for preprocessing: we used jieba 0.37.
  2. the random number generator: although we fixed the random seed, we noticed that it yields slightly different numbers on different machines.
  3. the traditional lexical features used.
  4. the pre-trained embeddings.

To enhance the replicability of our experiments, we provide the original data in CoNLL format at data/weiboNER.conll.(train/dev/test). In addition, we provide files with all the features and the char-positional transformation used in our experiments at data/crfsuite.weiboNER.charpos.conll.(train/dev/test), as well as the pre-trained char and char-positional embeddings. A short excerpt illustrating the file format follows.
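For reference, a hypothetical excerpt of the CoNLL-format files, assuming one character per line with a tab-separated tag, where the .NAM suffix marks named mentions and .NOM marks nominal mentions (the characters and tags shown here are illustrative only):

中	B-GPE.NAM
国	I-GPE.NAM
的	O
朋	B-PER.NOM
友	I-PER.NOM

In the charpos variant, each character is additionally augmented with its position within its segmented word (e.g., 中0, 国1); see the EMNLP 2015 paper for details of this transformation.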

Note: the data we provide contains both named and nominal mentions; you can obtain a dataset with only named entities by simply filtering out the nominal mentions, as in the sketch below.
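A minimal sketch of such a filter, assuming the tab-separated format illustrated above (filter_nominal is a hypothetical helper, not part of this repository):

def filter_nominal(in_path, out_path):
    # Copy a CoNLL-style file, relabeling nominal (.NOM) mentions as O
    # so that only named (.NAM) mentions remain annotated.
    with open(in_path, encoding='utf-8') as fin, \
         open(out_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            line = line.rstrip('\n')
            if not line:                 # blank line = sentence boundary
                fout.write('\n')
                continue
            token, tag = line.rsplit('\t', 1)
            if tag.endswith('.NOM'):     # discard nominal-mention labels
                tag = 'O'
            fout.write(token + '\t' + tag + '\n')

filter_nominal('data/weiboNER_2nd_conll.train',
               'data/weiboNER_2nd_conll.train.named_only')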

Data License

The annotations in this repository are released under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0). The messages themselves are selected from Weibo and are subject to Weibo's terms of service.
