A 10000+ hours dataset for Chinese speech recognition

WenetSpeech

Official website | Paper

A 10000+ Hours Multi-domain Chinese Corpus for Speech Recognition

Download

Please visit the official website, read the license, and follow the instructions to apply for the PASSWORD required to download the data.

echo 'PASSWORD' > SAFEBOX/password
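
Since the password is stored in plain text, it may be worth restricting access to the file. The following is a minimal sketch, not part of the official instructions. It assumes the command is run from the repository root, that SAFEBOX is a directory there, and that 'PASSWORD' is replaced with the value you received. The mkdir and chmod steps are additions for illustration.

```shell
# Assumption: run from the repository root; create SAFEBOX/ if it is missing.
mkdir -p SAFEBOX
# Store the password (replace 'PASSWORD' with the real value).
echo 'PASSWORD' > SAFEBOX/password
# Make the file readable and writable by the current user only.
chmod 600 SAFEBOX/password
```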

From Tencent Meeting (default)

Download WenetSpeech:

bash utils/download_wenetspeech.sh DOWNLOAD_DIR UNTAR_DIR

From ModelScope

Install modelscope (depends on torch) before downloading:

conda create -n modelscope python=3.7
conda activate modelscope
pip install torch
pip install modelscope -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

Download WenetSpeech from modelscope:

sed -i 's/modelscope=false/modelscope=true/g' utils/download_wenetspeech.sh
bash utils/download_wenetspeech.sh DOWNLOAD_DIR UNTAR_DIR
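
The sed command above edits the download script in place and reports nothing if the pattern is not found. As a hedged sketch, the hypothetical helper below (enable_modelscope is not part of the repository) keeps a backup copy and confirms that the flag was actually switched. It assumes only what the sed command implies: that the script contains the text modelscope=false.

```shell
# Hypothetical helper: switch a download script to ModelScope mode and confirm the change.
enable_modelscope() {
    # Edit in place, keeping the original as "<script>.bak".
    sed -i.bak 's/modelscope=false/modelscope=true/g' "$1"
    # Return success only if the script now contains modelscope=true.
    grep -q 'modelscope=true' "$1"
}
```

It could be used as: enable_modelscope utils/download_wenetspeech.sh && bash utils/download_wenetspeech.sh DOWNLOAD_DIR UNTAR_DIR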

Discussion & Communication

Please scan the QR code on the left to follow the official WeNet account. We have created a WeChat group for better discussion and quicker responses. To join, scan the personal QR code on the right; that person will invite you to the chat group.

Benchmark

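| Toolkit | Dev | Test_Net | Test_Meeting | AIShell-1 |
|---------|------|----------|--------------|-----------|
| Kaldi | 9.07 | 12.83 | 24.72 | 5.41 |
| ESPNet | 9.70 | 8.90 | 15.90 | 3.90 |
| WeNet | 8.88 | 9.70 | 15.59 | 4.61 |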

Description

Creation

All the data are collected from YouTube and Podcast. Optical character recognition (OCR) and automatic speech recognition (ASR) techniques are adopted to label each YouTube and Podcast recording, respectively. To improve the quality of the corpus, we use a novel end-to-end label error detection method to further validate and filter the data.

Categories

In summary, WenetSpeech groups all data into 3 categories, as the following table shows:

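| Set | Hours | Confidence | Usage |
|------------|-------|-------------|--------------------------------------|
| High Label | 10005 | >=0.95 | Supervised Training |
| Weak Label | 2478 | [0.6, 0.95] | Semi-supervised or noise training |
| Unlabel | 9952 | / | Unsupervised training or Pre-training |
| In Total | 22435 | / | All above |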
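
To make the confidence thresholds concrete, here is a small sketch of how a score could be mapped to a category. The helper name label_category is hypothetical. The table lists 0.95 in both the High Label and Weak Label ranges, so treating exactly 0.95 as High Label is an assumption, as is the "Unclassified" result for scores below 0.6, which the table does not cover.

```shell
# Hypothetical helper: map a confidence score to a label category from the table.
# Assumption: a score of exactly 0.95 counts as High Label.
label_category() {
    awk -v c="$1" 'BEGIN {
        c = c + 0                      # force numeric comparison
        if (c >= 0.95)     print "High Label"
        else if (c >= 0.6) print "Weak Label"
        else               print "Unclassified"   # not covered by the table
    }'
}
```

For example, label_category 0.97 prints "High Label" and label_category 0.7 prints "Weak Label".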

High Label Data

We classify the high-label data into 10 groups according to domain, speaking style, and scenario.

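| Domain | YouTube | Podcast | Total |
|-------------|---------|---------|--------|
| audiobook | 0 | 250.9 | 250.9 |
| commentary | 112.6 | 135.7 | 248.3 |
| documentary | 386.7 | 90.5 | 477.2 |
| drama | 4338.2 | 0 | 4338.2 |
| interview | 324.2 | 614 | 938.2 |
| news | 0 | 868 | 868 |
| reading | 0 | 1110.2 | 1110.2 |
| talk | 204 | 90.7 | 294.7 |
| variety | 603.3 | 224.5 | 827.8 |
| others | 144 | 507.5 | 651.5 |
| Total | 6113 | 3892 | 10005 |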
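
The totals in the table can be checked with a few lines of awk. This is only an arithmetic sanity check on the per-domain hours copied from the table; the variable and function names are made up for illustration.

```shell
# Per-domain hours copied from the table: domain, YouTube, Podcast.
domain_hours='audiobook 0 250.9
commentary 112.6 135.7
documentary 386.7 90.5
drama 4338.2 0
interview 324.2 614
news 0 868
reading 0 1110.2
talk 204 90.7
variety 603.3 224.5
others 144 507.5'

# Print the YouTube total, the Podcast total, and the grand total.
sum_hours() {
    printf '%s\n' "$domain_hours" |
        awk '{ yt += $2; pod += $3 } END { printf "%.1f %.1f %.1f\n", yt, pod, yt + pod }'
}
sum_hours   # prints: 6113.0 3892.0 10005.0
```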

As shown in the following table, we provide 3 training subsets, namely S, M and L for building ASR systems on different data scales.

| Training Subsets | Confidence | Hours |
|------------------|-------------|-------|
| L | [0.95, 1.0] | 10005 |
| M | 1.0 | 1000 |
| S | 1.0 | 100 |

Evaluation Sets

| Evaluation Sets | Hours | Source | Description |
|-----------------|-------|--------------|-------------|
| DEV | 20 | Internet | Designed for speech tools that require a cross-validation set during training |
| TEST_NET | 23 | Internet | Matched test set |
| TEST_MEETING | 15 | Real meeting | Mismatched test set: far-field, conversational, spontaneous meeting speech |

Contributors

ACKNOWLEDGEMENTS

  • WenetSpeech draws heavily on the work of GigaSpeech, and we thank Jiayu Du and Guoguo Chen for their suggestions on this work.
  • We thank Tencent Ethereal Audio Lab and Xi'an Future AI Innovation Center for providing hosting services for WenetSpeech. We also thank MindSpore, a new deep learning computing framework, for supporting this work.
  • Our gratitude goes to Lianhui Zhang and Yu Mao for collecting some of the YouTube data.
