Repository Details

ALBERT + LSTM + CRF named entity recognition, implemented in PyTorch. The main entity types recognized are person names, place names, organization names, and times.

Named entity recognition with ALBERT + BiLSTM + CRF (PyTorch)

outline

Model structure of lstm_crf

lstm_crf

Model structure of albert_lstm

albert_embedding_lstm

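1. Each sentence is split into single-character tokens, each token is mapped to an integer id, and masks are added; the result is fed to albert to produce a matrix representation of the sentence. For example, with batch=10 and a maximum sentence length of 126, adding the [CLS] and [SEP] markers at the start and end gives max_length=128, and the albert_base_zh model outputs data of shape (batch, max_length, hidden_states) = (10, 128, 768).

2. The representation produced by albert is used as the embedding layer of the LSTM.

3. albert is not fine-tuned.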
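The input preparation described in point 1 can be sketched in plain Python. This is only an illustration: the toy vocabulary, the special-token ids, and the helper names `build_vocab` and `encode` are assumptions, and the real ids would come from the albert_base_zh vocabulary file.

```python
# Hypothetical special tokens and ids; the real ones come from the albert vocab file.
PAD, UNK, CLS, SEP = "[PAD]", "[UNK]", "[CLS]", "[SEP]"


def build_vocab(sentences):
    """Build a toy character-level vocabulary (assumption, for illustration only)."""
    vocab = {PAD: 0, UNK: 1, CLS: 2, SEP: 3}
    for sentence in sentences:
        for ch in sentence:
            vocab.setdefault(ch, len(vocab))
    return vocab


def encode(sentence, vocab, max_length=128):
    """Return (input_ids, attention_mask), both padded to max_length."""
    # Reserve two positions for [CLS] and [SEP], so at most 126 characters remain.
    chars = list(sentence)[: max_length - 2]
    tokens = [CLS] + chars + [SEP]
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens]
    mask = [1] * len(ids)          # 1 marks a real token
    pad = max_length - len(ids)    # 0 marks padding
    return ids + [vocab[PAD]] * pad, mask + [0] * pad


sentences = ["刘老根大舞台", "人民日报"]
vocab = build_vocab(sentences)
batch = [encode(s, vocab) for s in sentences]
input_ids = [ids for ids, _ in batch]
masks = [mask for _, mask in batch]
```

Both lists have shape (batch, max_length); feeding them to albert is what would yield the (batch, max_length, 768) representation.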

train

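step 1: run albert/tfmodel_2_pymodel.py

1. Convert the TensorFlow pretrained model into a model usable by PyTorch.

2. This program uses albert_base_zh (the small trial model): 12M parameters, 12 layers, 40M in size.

3. After conversion, place the PyTorch model in the albert/pretrain/pytorch directory.

4. The model parameters are in the albert/configs/ directory.

step 2: set some parameters in models/config.yml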

embedding_size: 768
hidden_size: 128
model_path: models/
batch_size: 64
max_length: 128
dropout: 0.5
tags:
  - ORG
  - PER
  - LOC
  - T
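A CRF layer needs a fixed label set, and one common way to derive it from the `tags` list is a BIO scheme. The sketch below assumes BIO labels and a hypothetical helper `build_tag_map`; the README does not say which tagging scheme the project actually uses.

```python
# Entity types as listed under `tags` in models/config.yml.
tags = ["ORG", "PER", "LOC", "T"]


def build_tag_map(entity_types):
    """Map BIO labels to integer ids (the BIO scheme is an assumption)."""
    tag_map = {"O": 0}  # "O" marks tokens outside any entity
    for entity_type in entity_types:
        for prefix in ("B", "I"):  # B = first token of an entity, I = continuation
            tag_map[f"{prefix}-{entity_type}"] = len(tag_map)
    return tag_map


tag_map = build_tag_map(tags)
```

With four entity types this gives nine labels: `O` plus a `B-` and an `I-` label for each type.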

step 3: train

python main.py train
The training data is annotated data from the People's Daily.

evaluate

> epoch [0] |██                       | 395/4473
  loss 0.07
  epoch [0] |██                       | 396/4473
  loss 0.06
  epoch [0] |██                       | 397/4473
  loss 0.06
  epoch [0] |██                       | 398/4473
  loss 0.06
  epoch [0] |██                       | 399/4473
  loss 0.06
  epoch [0] |██                       | 400/4473
  loss 0.05
  eval
        ORG	recall 1.00	precision 1.00	f1 1.00
        PER	recall 0.97	precision 0.96	f1 0.96
        LOC	recall 1.00	precision 1.00	f1 1.00
        T	recall 0.84	precision 0.80	f1 0.82
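The per-type recall, precision, and f1 columns above are related in the usual way. The sketch below assumes exact-match scoring over entity spans; the project's own evaluation code is not shown here, and the function names `f1_score` and `span_scores` are hypothetical.

```python
def f1_score(recall, precision):
    """Harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


def span_scores(gold, pred):
    """Score predicted spans against gold spans.

    gold and pred are sets of (start, end, entity_type) tuples; a prediction
    counts as correct only if it matches a gold span exactly (assumption).
    """
    true_positives = len(gold & pred)
    recall = true_positives / len(gold) if gold else 0.0
    precision = true_positives / len(pred) if pred else 0.0
    return recall, precision, f1_score(recall, precision)


# Toy spans for illustration only, not taken from the People's Daily data.
gold = {(0, 2, "T"), (5, 7, "T"), (9, 10, "T")}
pred = {(0, 2, "T"), (5, 6, "T")}
recall, precision, f1 = span_scores(gold, pred)

# Consistency check against the T row of the log: recall 0.84, precision 0.80.
t_row_f1 = round(f1_score(0.84, 0.80), 2)
```

Here `t_row_f1` comes out as 0.82, matching the f1 reported for T.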

predict

python main.py predict
input text:“刘老根大舞台”被文化部、国家旅游局联合评为首批“国家文化旅游重点项目”

note

In model.py under src/lstm_crf:

a. The pretrained albert model serves as the embedding layer

> bert_config = BertConfig.from_pretrained(str(config['albert_config_path']), share_type='all')
  self.word_embeddings = BertModel.from_pretrained(config['bert_dir'], config=bert_config)
  self.word_embeddings.to(DEVICE)
  self.word_embeddings.eval()

b. The embedding output has shape (batch_size, seq_len, embedding_dim)

> with torch.no_grad():
        embeddings = self.word_embeddings(input_ids=sentence, attention_mask=mask)
        # The albert config sets "output_hidden_states":"True" and "output_attentions":"True", so every layer is returned
        # It is also possible to return only the last layer
        all_hidden_states, all_attentions = embeddings[-2:]  # hidden_states and attentions of all layers
        embeddings = all_hidden_states[-2]  # hidden_states of the second-to-last layer
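The two indexing steps can be traced with plain Python stand-ins instead of tensors. This sketch assumes the model returns a tuple whose last two entries are the hidden states and the attentions, and that the hidden-state tuple holds the embedding output followed by one entry per layer (the layout used by Hugging Face transformers); the `BertModel` in this project may be laid out differently.

```python
NUM_LAYERS = 12  # albert_base_zh has 12 layers

# String labels stand in for tensors of shape (batch_size, seq_len, 768).
all_hidden_states = tuple(
    ["embedding_output"] + [f"layer_{i}" for i in range(1, NUM_LAYERS + 1)]
)
all_attentions = tuple(f"attention_{i}" for i in range(1, NUM_LAYERS + 1))

# Assumed output layout: the last two entries are hidden states and attentions.
outputs = ("sequence_output", "pooled_output", all_hidden_states, all_attentions)

hidden_states, attentions = outputs[-2:]
selected = hidden_states[-2]  # second-to-last entry, i.e. the output of layer 11
```

Under these assumptions `selected` is the output of layer 11, and `hidden_states[-1]` would give the last layer instead.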

contact

For questions or collaboration on search, recommendation, NLP, or big data mining, contact me:

1. My GitHub projects: https://github.com/jiangnanboy

2. My cnblogs technical blog: https://www.cnblogs.com/little-horse/

3. My QQ number: 2229029156
