charlesXu86/char_featurizer

Language: Python

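A feature extraction tool for Chinese characters. It extracts pronunciation features (initial, final, tone), glyph features (components and radicals), four-corner codes, and similar features, which can also be fed to a model as tensors.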

char_featurizer

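char_featurizer is a feature extraction tool for Chinese characters. It extracts a character's pronunciation (initial, final, and tone), glyph structure (components and radicals), four-corner code, and related information. It can also convert these features into tensors for use as model input. The project optimizes and consolidates the character extraction tool written by 安德森 (Anderson).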

char_featurizer currently supports:

1. Glyph feature extraction

2. Pronunciation feature extraction

3. Four-corner code extraction

4. Tensor conversion

II. Installation and Usage

1. Installation

pip install char_featurizer

2. Usage

1. Character feature extraction

from char_featurizer import Featurizer

# Create a featurizer instance.
featurizer = Featurizer()

data = '明天去你家玩'

# featurize() returns a tuple of per-character feature sequences.
result = featurizer.featurize(data)
print(result)

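Result:
([['m'], ['t'], ['q'], ['n'], ['j'], ['w']],      # initials
[['ing'], ['ian'], ['u'], ['i'], ['ia'], ['an']], # finals
[['2'], ['1'], ['4'], ['3'], ['1'], ['2']],       # tones
('6', '1', '4', '2', '3', '1'),                   # four-corner code, digit 1
('7', '0', '0', '7', '0', '1'),                   # four-corner code, digit 2
('0', '8', '7', '2', '2', '1'),                   # four-corner code, digit 3
('2', '0', '3', '9', '3', '1'),                   # four-corner code, digit 4
('0', '4', '2', '2', '2', '2'))                   # four-corner code, digit 5
The last five tuples each hold one digit position of the four-corner code. Reading the i-th value of each of these tuples in order gives the four-corner code of the i-th character, e.g. 明 -> 67020, 天 -> 10804.

Note: characters and four-corner codes are not in one-to-one correspondence. One four-corner code can map to several characters, but each character has exactly one four-corner code.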
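The layout described above can be made concrete by regrouping the result by character. The following is a minimal sketch, not part of the library: the `result` literal is copied from the sample output so the code runs without char_featurizer installed, and the helper name `per_character` is hypothetical.

```python
# Sample output copied from the example above.
result = (
    [['m'], ['t'], ['q'], ['n'], ['j'], ['w']],         # initials
    [['ing'], ['ian'], ['u'], ['i'], ['ia'], ['an']],   # finals
    [['2'], ['1'], ['4'], ['3'], ['1'], ['2']],         # tones
    ('6', '1', '4', '2', '3', '1'),
    ('7', '0', '0', '7', '0', '1'),
    ('0', '8', '7', '2', '2', '1'),
    ('2', '0', '3', '9', '3', '1'),
    ('0', '4', '2', '2', '2', '2'),
)


def per_character(text, features):
    """Hypothetical helper: regroup the feature tuple into one dict per character."""
    initials, finals, tones = features[:3]
    digit_positions = features[3:]  # five tuples, one per code digit
    # zip(*digit_positions) yields the five digits of each character in turn.
    codes = [''.join(digits) for digits in zip(*digit_positions)]
    return [
        {
            'char': ch,
            # Each inner list has a single entry in the sample; take the first.
            'initial': ini[0],
            'final': fin[0],
            'tone': tone[0],
            'four_corner': code,
        }
        for ch, ini, fin, tone, code in zip(text, initials, finals, tones, codes)
    ]


records = per_character('明天去你家玩', result)
for record in records:
    print(record)
```

The first two records carry the codes 67020 and 10804, matching the examples given for 明 and 天.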

2. Using the features as model input
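The source gives no example for this step. One common approach is to map each feature symbol to an integer ID and pad every sequence to a fixed length, giving one row of IDs per feature type. The sketch below illustrates that idea in plain Python under those assumptions. `build_vocab` and `encode` are hypothetical helpers, not the char_featurizer API, and the resulting nested list would still need to be wrapped in a tensor by whichever framework is used.

```python
PAD = '<pad>'  # ID 0, used to fill short sequences
UNK = '<unk>'  # ID 1, used for symbols missing from the vocabulary


def build_vocab(symbols):
    """Hypothetical helper: assign an integer ID to each distinct symbol."""
    vocab = {PAD: 0, UNK: 1}
    for sym in symbols:
        vocab.setdefault(sym, len(vocab))
    return vocab


def encode(sequence, vocab, max_len):
    """Map symbols to IDs, then truncate or pad to max_len."""
    ids = [vocab.get(sym, vocab[UNK]) for sym in sequence][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


# Initials and tones flattened from the sample output for '明天去你家玩'.
initials = ['m', 't', 'q', 'n', 'j', 'w']
tones = ['2', '1', '4', '3', '1', '2']

initial_vocab = build_vocab(initials)
tone_vocab = build_vocab(tones)

# One row per feature type, one column per character position.
feature_ids = [
    encode(initials, initial_vocab, max_len=8),
    encode(tones, tone_vocab, max_len=8),
]
print(feature_ids)
```

Keeping a separate vocabulary per feature type lets each feature get its own embedding table, which is a design choice of this sketch rather than something the source specifies.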

3. Related resources

1. Online lookup tool for Chinese four-corner codes

III. Update News

2020.5.4: V1 completed

IV. TO DO LIST

1. Character similarity computation (pronunciation similarity, glyph similarity)

2. Support tf2

V. Resources

Chatbot_CN

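A chatbot for the finance and legal domains (with some chit-chat capability). Its main modules include information extraction, NLU, NLG, and a knowledge graph. Django is used to integrate the front-end display, and RESTful interfaces for the nlp and kg components have been packaged.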

PAPER-In-CODE

Code reproductions of NLP papers, mainly from top conferences such as ACL, AAAI, and EMNLP.

Chatbot_Retrieval

Retrieval-based, task-oriented multi-turn dialogue

Chatbot_RASA

The Chatbot_rasa module of the Chatbot_CN project

TIANCHI_Project

Summary of Tianchi big data competitions

Jupyter Notebook

Chatbot_Utils

This project is no longer updated. The upgraded project is at https://github.com/we-chatter/chatbot_utils

Time_Convert

Time conversion tool

Text_Classification_TF

Various text classification models implemented in tf, wrapped with RESTful interfaces and ready for direct production use

Chatbot_Doc

The Chatbot_Doc module of the Chatbot_CN project

Chatbot_Data

Collection and curation of nlp datasets, including dialogue datasets

Chatbot_Help

Chatbot_Help: third-party integration tools for chatbots, e.g. connecting to DingTalk groups, WeChat official accounts, and QQ

Chatbot_KG

The knowledge graph module of the Chatbot_CN project

Chatbot_DM

Chatbot_Web

Chatbot_Web: demo page for the dialogue bot

Chatbot_NLU

Chatbot_S2S

Trains end-to-end dialogue models, uses ddq to learn dialogue policies, and provides a dialogue REST service

Chatbot_Analytics

Bot analytics module

TTS-Clone-Chinese

Chatbot_Skills

Dialogue skill management

svoice

Recommendation

Recommendation Research

Bert4tf

Chatbot_Crawler

Data crawling for Chatbot_CN, based on the scrapy framework

weibo_spider

Weibo crawler; updates and improvements in progress

AutoDL

AutoDL + model acceleration + MLflow: a one-stop algorithm solution

HelloWorld

KnowledgeDistillation

Knowledge distillation based on bert

Chatbot_Evaluate

A closed loop of dialogue diagnosis, dialogue quality evaluation, badcase analysis, data feedback, and dialogue model iteration

TensorFlow-2.x-Tutorials

TF2.x study notes

Jupyter Notebook

Chatbot_Recommendation

Uses user features to combine a recommender system with a dialogue system, building recommendation-style task-oriented dialogue

PSpider

A crawler tool, for personal learning use

TF2-Examples

TF2 study notes

Jupyter Notebook

xlnet-tutorial

FAQ

Django server helloworld

QuantitativeTrade

Quantitative trading program

Neural_Coreference

MachineLearning

Jupyter Notebook