Use LightGBM for learning to rank, including data processing, model training, model decision visualization, model interpretability, and prediction.

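This project uses LightGBM for learning to rank. It mainly covers:

  • Data preprocessing
  • Model training
  • Model decision visualization
  • Prediction
  • NDCG evaluation
  • Feature importance
  • SHAP feature contribution explanations
  • Leaf-node output for samples

(Requires lightgbm, graphviz, shap, etc.)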

I. Data format (raw data -> (feats.txt, group.txt))

python lgb_ltr.py -process
1.raw_train.txt

0 qid:10002 1:0.007477 2:0.000000 ... 45:0.000000 46:0.007042 #docid = GX008-86-4444840 inc = 1 prob = 0.086622

0 qid:10002 1:0.603738 2:0.000000 ... 45:0.333333 46:1.000000 #docid = GX037-06-11625428 inc = 0.0031586555555558 prob = 0.0897452 ...

2.feats.txt:

0 1:0.007477 2:0.000000 ... 45:0.000000 46:0.007042

0 1:0.603738 2:0.000000 ... 45:0.333333 46:1.000000 ...

3.group.txt:

8

8

8

8

8

16

8

118

16

8

...
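
The conversion from raw lines to feats.txt and group.txt can be sketched as follows. This is a minimal illustration, not the code in lgb_ltr.py: the function name `split_raw_lines` is hypothetical, the sample rows are shortened to two features, and the sketch assumes that rows sharing a qid are contiguous in the raw file.

```python
def split_raw_lines(raw_lines):
    """Split raw LTR lines into feature lines, group sizes, and comments."""
    feats, groups, comments = [], [], []
    last_qid = None
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue
        # Everything after '#' is a free-text comment (docid, inc, prob).
        body, _, comment = line.partition('#')
        tokens = body.split()
        label, qid, features = tokens[0], tokens[1], tokens[2:]
        # feats.txt keeps the label and features but drops the qid.
        feats.append(' '.join([label] + features))
        comments.append(comment.strip())
        # A new qid starts a new group; otherwise the current group grows.
        if qid != last_qid:
            groups.append(0)
            last_qid = qid
        groups[-1] += 1
    return feats, groups, comments


# Rows shortened to two features; the third row is made up for illustration.
raw = [
    '0 qid:10002 1:0.007477 2:0.000000 #docid = GX008-86-4444840 inc = 1 prob = 0.086622',
    '0 qid:10002 1:0.603738 2:0.000000 #docid = GX037-06-11625428 inc = 0.0031586555555558 prob = 0.0897452',
    '1 qid:10003 1:0.500000 2:0.250000 #docid = hypothetical-doc inc = 1 prob = 0.5',
]
feats, groups, comments = split_raw_lines(raw)
print(feats[0])  # 0 1:0.007477 2:0.000000
print(groups)    # [2, 1]
```

Each entry in `groups` is the number of consecutive rows in feats.txt that belong to one query, which matches the one-integer-per-line layout of group.txt.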

II. Model training (feats.txt, group.txt) -> train -> model.mod

python lgb_ltr.py -train
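train params = {
        'task': 'train',  # type of task to run
        'boosting_type': 'gbrt',  # base learner
        'objective': 'lambdarank',  # ranking task (objective function)
        'metric': 'ndcg',  # evaluation metric
        'max_position': 10,  # position cutoff for NDCG optimization
        'metric_freq': 1,  # how often (in iterations) to print the metric
        'train_metric': True,  # print the metric on the training data
        'ndcg_at': [10],
        'max_bin': 255,  # integer, maximum number of bins (default 255); LightGBM uses it to compress memory automatically, e.g. with max_bin=255 each feature value is stored as uint8
        'num_iterations': 200,  # number of iterations, i.e. number of trees built
        'learning_rate': 0.01,  # learning rate
        'num_leaves': 31,  # number of leaves
        'max_depth': 6,
        'tree_learner': 'serial',  # used for parallel learning; 'serial' is the single-machine tree learner
        'min_data_in_leaf': 30,  # minimum number of samples in a leaf
        'verbose': 2  # show information during training
    }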
  • docs:7796
  • groups:380
  • consume time : 4 seconds
  • training's ndcg@10: 0.940891
1.model.mod (the model format is shown in data/model/mode.mod)

Training output:

  • [LightGBM] [Info] Total Bins 9171
  • [LightGBM] [Info] Number of data: 7796, number of used features: 40
  • [LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 9
  • [1] training's ndcg@10: 0.791427
  • [LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 12
  • [2] training's ndcg@10: 0.828608
  • [LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 10
  • ...
  • ...
  • ...
  • [198] training's ndcg@10: 0.941018
  • [LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 11
  • [199] training's ndcg@10: 0.941038
  • [LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 11
  • [200] training's ndcg@10: 0.940891
  • consume time : 4 seconds
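
A training run of this kind can be sketched with the LightGBM Python API as below. This is a hedged sketch rather than the code in lgb_ltr.py: the function names `read_group_sizes` and `train_ranker` are hypothetical, only a subset of the parameters listed above is passed, and it assumes feats.txt is in LibSVM format so that `lgb.Dataset` can read it directly.

```python
from pathlib import Path


def read_group_sizes(group_path):
    """Read group.txt: one integer per line, the number of docs per query."""
    return [int(tok) for tok in Path(group_path).read_text().split()]


def train_ranker(feats_path, group_path, model_path, num_rounds=200):
    """Train a LambdaRank model and save it (requires lightgbm)."""
    import lightgbm as lgb  # imported here so read_group_sizes works without it

    params = {
        'objective': 'lambdarank',
        'metric': 'ndcg',
        'eval_at': [10],
        'learning_rate': 0.01,
        'num_leaves': 31,
        'max_depth': 6,
        'min_data_in_leaf': 30,
    }
    # The group sizes tell LightGBM which consecutive rows share a query.
    train_set = lgb.Dataset(feats_path, group=read_group_sizes(group_path))
    booster = lgb.train(params, train_set, num_boost_round=num_rounds,
                        valid_sets=[train_set])
    booster.save_model(model_path)
    return booster
```

Under these assumptions, `train_ranker('feats.txt', 'group.txt', 'model.mod')` would train on the processed files and write the model file. The group sizes must sum to the number of rows in feats.txt.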

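III. Visualizing the model's decision process

You can specify the index of a tree to visualize, which makes the decision process easier to analyze.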

python lgb_ltr.py -plottree

image

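IV. Predict

The data format is the same as feats.txt. You can also append an identifier (such as a document ID or product code) to each line to serve as the ranking output. Here I take the feats and comment directly from test.txt as the input to predict.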

python lgb_ltr.py -predict
1.predict results
  • ['docid = GX252-32-5579630 inc = 1 prob = 0.190849'
  • 'docid = GX108-43-5342284 inc = 0.188670948386237 prob = 0.103576'
  • 'docid = GX039-85-6430259 inc = 1 prob = 0.300191' ...,
  • 'docid = GX009-50-15026058 inc = 1 prob = 0.082903'
  • 'docid = GX065-08-0661325 inc = 0.012907717401617 prob = 0.0312699'
  • 'docid = GX012-13-5603768 inc = 1 prob = 0.0961297']
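
One way to turn predicted scores into a ranked list of identifiers is sketched below. It assumes the scores come from the trained model, one per row, and that rows are grouped by query as in group.txt. The function name `rank_within_groups`, the scores, and the identifiers are all made up for illustration.

```python
def rank_within_groups(scores, comments, group_sizes):
    """Return, per query group, the comments sorted by descending score."""
    ranked, start = [], 0
    for size in group_sizes:
        rows = list(zip(scores[start:start + size],
                        comments[start:start + size]))
        rows.sort(key=lambda row: row[0], reverse=True)
        ranked.append([comment for _, comment in rows])
        start += size
    return ranked


scores = [0.2, 1.5, -0.3, 0.9, 0.1]                        # made-up model outputs
comments = ['doc_a', 'doc_b', 'doc_c', 'doc_d', 'doc_e']   # placeholder identifiers
print(rank_within_groups(scores, comments, [3, 2]))
# [['doc_b', 'doc_a', 'doc_c'], ['doc_d', 'doc_e']]
```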

V. Validate NDCG (data from test.txt)

python lgb_ltr.py -ndcg

all qids average ndcg: 0.761044123343
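
An average NDCG over all queries can be computed along the following lines. This sketch assumes the common gain `2^rel - 1` with a log2 position discount; the exact formula used by lgb_ltr.py is not shown here, so treat the function names and the zero-relevance convention as assumptions.

```python
import math


def dcg_at_k(relevances, k):
    """DCG with gain 2^rel - 1 and a log2 position discount."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(labels, scores, k=10):
    """NDCG@k for one query: DCG of the predicted order over the ideal DCG."""
    order = sorted(range(len(labels)), key=lambda i: scores[i], reverse=True)
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    if ideal == 0:
        return 0.0  # convention chosen here for queries with no relevant docs
    return dcg_at_k([labels[i] for i in order], k) / ideal


def average_ndcg(labels, scores, group_sizes, k=10):
    """Mean NDCG@k over all query groups."""
    values, start = [], 0
    for size in group_sizes:
        values.append(ndcg_at_k(labels[start:start + size],
                                scores[start:start + size], k))
        start += size
    return sum(values) / len(values)


# Two made-up queries: the first is ranked perfectly, the second is not.
labels = [2, 0, 1, 0, 1]
scores = [0.9, 0.1, 0.5, 0.8, 0.2]
print(round(average_ndcg(labels, scores, [3, 2]), 4))  # 0.8155
```

Implementations differ on how they score queries whose labels are all zero, so averages are only comparable when that convention matches.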

VI. Features: print feature importance

python lgb_ltr.py -feature

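The features in the model are named "Column_number". When printing importance, these can be mapped to real feature names. For example, this test case has 46 features.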

1.features importance
  • feat0name : 228 : 0.038
  • feat1name : 22 : 0.0036666666666666666
  • feat2name : 27 : 0.0045
  • feat3name : 11 : 0.0018333333333333333
  • feat4name : 198 : 0.033
  • feat10name : 160 : 0.02666666666666667
  • ...
  • ...
  • ...
  • feat37name : 188 : 0.03133333333333333
  • feat38name : 434 : 0.07233333333333333
  • feat39name : 286 : 0.04766666666666667
  • feat40name : 169 : 0.028166666666666666
  • feat41name : 348 : 0.058
  • feat43name : 304 : 0.050666666666666665
  • feat44name : 283 : 0.04716666666666667
  • feat45name : 220 : 0.03666666666666667
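
The third column in the listing is consistent with each count divided by 6000 total splits (200 trees with 30 splits each, for example 228 / 6000 = 0.038). A sketch of that mapping and normalization follows. It assumes the counts are split counts, such as those returned by LightGBM's `Booster.feature_importance(importance_type='split')`; the function name `format_importance` and the third count are hypothetical.

```python
def format_importance(split_counts, feature_names):
    """Pair each feature name with its split count and its share of all splits."""
    total = sum(split_counts)
    lines = []
    for name, count in zip(feature_names, split_counts):
        share = count / total if total else 0.0
        lines.append(f'{name} : {count} : {share}')
    return lines


# The first two counts come from the listing; the third is filler that
# brings the total to 6000 so the shares match the listing.
counts = [228, 22, 5750]
names = [f'feat{i}name' for i in range(len(counts))]  # placeholder names
for line in format_importance(counts, names):
    print(line)
```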

VII. Interpreting feature importance with SHAP values

python lgb_ltr.py -shap

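Unlike the feature importance computed in the previous section, this step uses a game-theoretic method, SHAP (SHapley Additive exPlanations), to interpret the model. SHAP supports overall feature analysis, multi-feature interaction analysis, and single-feature analysis.

1.Overall analysis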

image

image

2.Multi-feature interaction analysis

image

3.Single-feature analysis

image

VIII. Use the model to get a one-hot representation of each sample's leaf nodes, which can be used to train models such as GBDT+LR

python lgb_ltr.py -leaf

The test case here is test/leaf.txt, which contains 5 samples.

[

  • [ 0. 1. 0. ..., 0. 0. 1.]
  • [ 1. 0. 0. ..., 0. 0. 0.]
  • [ 0. 0. 1. ..., 0. 0. 1.]
  • [ 0. 1. 0. ..., 0. 1. 0.]
  • [ 0. 0. 0. ..., 1. 0. 0.] ]
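
The one-hot encoding itself can be sketched in plain Python. The sketch assumes the input is a per-sample list of leaf indices, one per tree, such as LightGBM returns from `Booster.predict(..., pred_leaf=True)`. The function name `one_hot_leaves` and the tiny input are made up for illustration.

```python
def one_hot_leaves(leaf_indices, num_leaves):
    """Turn per-tree leaf indices into one concatenated one-hot vector per sample."""
    encoded = []
    for sample in leaf_indices:
        # Each tree owns a block of num_leaves positions in the output row.
        row = [0.0] * (len(sample) * num_leaves)
        for tree, leaf in enumerate(sample):
            row[tree * num_leaves + leaf] = 1.0
        encoded.append(row)
    return encoded


leaves = [[1, 3], [0, 2]]  # two samples, two trees, made-up leaf indices
print(one_hot_leaves(leaves, num_leaves=4))
# [[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]]
```

Each row has exactly one 1 per tree, so with 200 trees of 31 leaves a sample would become a sparse vector of length 6200 that a linear model such as logistic regression can consume.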

IX. REFERENCES

https://github.com/microsoft/LightGBM

https://github.com/jma127/pyltr

https://github.com/slundberg/shap

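Contact

For questions or collaboration on search, recommendation, NLP, or big data mining, you can reach me at:

1. My GitHub projects: https://github.com/jiangnanboy

2. My technical blog on cnblogs: https://www.cnblogs.com/little-horse/

3. My QQ number: 2229029156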
