  • Stars: 2,021
  • Rank 22,898 (Top 0.5 %)
  • Language: Python
  • Created over 7 years ago
  • Updated about 4 years ago

Repository Details

Datasets for Training Chatbot System: corpora for training Chinese and English dialogue systems

Chinese and English corpora for dialogue systems

Datasets for Training Chatbot System
This project collects dialogue corpora found on the web for training Chinese (and English) chatbots.

Public Corpora

The datasets collected so far are listed below; click a link to go to the original source.

  1. dgk_shooter_min.conv.zip
    Chinese movie dialogue corpus. It is quite noisy, and many question-answer pairs are misaligned.

  2. The NUS SMS Corpus
    Contains Chinese and English SMS messages; reportedly the world's largest publicly available SMS corpus.

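  3. ChatterBot basic Chinese chat corpus
    A small set of basic Chinese chat data provided by the ChatterBot chat engine. It is small in size but relatively high in quality.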

  4. Datasets for Natural Language Processing
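    NLP-related datasets collected by others, in three main parts: Question Answering, Dialogue Systems, and Goal-Oriented Dialogue Systems. All are English text; they can be machine-translated into Chinese for use in Chinese dialogue systems.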

  5. Xiaohuangji (小黄鸡)
    Reportedly the Xiaohuangji corpus: xiaohuangji50w_fenciA.conv.zip (word-segmented) and xiaohuangji50w_nofenci.conv.zip (unsegmented)

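  6. Egret (白鹭时代) Chinese Q&A corpus
    Compiled from the Q&A board of the official Egret forum: out of 10,000+ questions, the records marked with a "best answer" were selected. The raw data was manually reviewed so that each question has one acceptable answer. The corpus currently contains only 2,907 Q&A pairs. (Backup)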

  7. Chat corpus repository
    chat corpus collection from various open sources
    Includes: open subtitles, English movie subtitles, Chinese song lyrics, and English tweets

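  8. Insurance-industry QA corpus
    A dataset produced by translating insuranceQA. train_data has 12,889 questions and 141,779 entries, positive:negative = 1:10; test_data has 2,000 questions and 22,000 entries, positive:negative = 1:10; valid_data has 2,000 questions and 22,000 entries, positive:negative = 1:10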
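
The split sizes quoted for the insurance QA corpus (item 8) can be sanity-checked with a little arithmetic. The sketch below is not taken from the repository. It assumes each question has exactly one positive answer and ten negatives, which is one reading of the 1:10 ratio, and the helper name `pair_counts` is hypothetical.

```python
# Illustrative check only; `pair_counts` is a hypothetical helper, and
# "one positive answer per question" is an assumption, not a documented fact.
def pair_counts(num_questions, negatives_per_positive=10):
    """Return (positives, negatives, total) for a split."""
    positives = num_questions
    negatives = num_questions * negatives_per_positive
    return positives, negatives, positives + negatives

# Question counts as quoted in the corpus description.
splits = {"train_data": 12889, "test_data": 2000, "valid_data": 2000}
totals = {name: pair_counts(n)[2] for name, n in splits.items()}

for name, total in totals.items():
    print(name, total)
```

Under that assumption the totals come out to 141,779, 22,000, and 22,000, which agrees with the entry counts quoted for the three splits.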

Unpublished Corpora

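These corpora circulate on the web, but we have not obtained them yet, either because of our own limitations or because the original authors have not released them. They are listed here for future searching.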

  1. Microsoft Xiaoice (微软小冰)

Copyright

All original corpora belong to their original authors.

Contact

Yunchao He (何云超)
weibo: @Yunchao_He
