• Stars
    star
    166
  • Rank 227,748 (Top 5 %)
  • Language
    Python
  • License
    Apache License 2.0
  • Created over 1 year ago
  • Updated 11 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context

Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context

This is the official repository of the Libriheavy dataset. Libriheavy is a labeled version of Librilight. Please refer to our paper: Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context for more details. Preprint available on arxiv.

How to download the dataset

The audio files of Libriheavy is the same as those in Librilight, the audio files is available here, you can download it by:

bash run.sh --stage -1 --stop-stage -1

The manifests of Libriheavy is hosted in huggingface and modelscope(for users in the Chinese mainland). You can download the manifests via:

from huggingface:

bash run.sh --stage 1 --stop-stage 1

or from modelscope:

bash run.sh --stage 0 --stop-stage 0

The manifest downloaded above looks like follows, we have two version of texts and pre_texts, the first item is the transcript from original book (with casing and punctuation), the second item is the decoding result from a asr model. The second item was used to align the transcript in the original book, we decide to keep it.

{
  "id": "small/100/sea_fairies_0812_librivox_64kb_mp3/01_baum_sea_fairies_64kb_0",
  "start": 243.919,
  "duration": 7.36,
  "channel": 0,
  "supervisions": [
    {
      "id": "small/100/sea_fairies_0812_librivox_64kb_mp3/01_baum_sea_fairies_64kb_0",
      "recording_id": "small/100/sea_fairies_0812_librivox_64kb_mp3/01_baum_sea_fairies_64kb",
      "start": 0,
      "duration": 7.36,
      "channel": 0,
      "language": "English",
      "speaker": "100",
      "custom": {
        "texts": [
          "The little girl was thoughtful for a moment. \"But why do folks dive in the water when the mermaids smile an' wink?\" she asked.",
          "THE LITTLE GIRL WAS THOUGHTFUL FOR A MOMENT BUT WHY DO FOLKS DIVE IN THE WATER WHEN THE MERMAIDS SMILE AND WINK SHE ASKED"
        ],
        "pre_texts": [                                                                                                                      
          "...us mortal folk,\" replied Cap'n Bill. \"But if anyone happens to see 'em, what then, Cap'n?\" \"Then,\" he answered, slowly wagging his head, \"the mermais give 'em a smile an' a wink, an' they dive into the water an' gets drownded.\" \"S'pose they knew how to swim, Cap'n Bill?\" \"That don't make any diff'rence, Trot. The mermaids live deep down, an' the poor mortals never come up again.",
          "...US MORTAL FOLK REPLIED CAP'N BILL BUT IF ANYONE HAPPENS TO SEE EM WHAT THEN CAP'N THEN HE ANSWERED SLOWLY WAGGING HIS HEAD THE MERMAIDS GIVE EM A SMILE AND A WINK AND THEY DIVES INTO THE WATER AND GETS DROWNDED S'POSE THEY KNOW HOW TO SWIM CAP'N BILL THAT DON'T MAKE ANY DIFFERENCE TROT THE MERMAIDS LIVE DEEP DOWN AND THE POOR MORTALS NEVER COME UP AGAIN"
        ],
        "begin_byte": 4993,
        "end_byte": 5120
      }
    }
  ],
  "recording": {
    "id": "small/100/sea_fairies_0812_librivox_64kb_mp3/01_baum_sea_fairies_64kb",
    "sources": [
      {
        "type": "file",
        "channels": [
          0
        ],
        "source": "download/librilight/small/100/sea_fairies_0812_librivox_64kb_mp3/01_baum_sea_fairies_64kb.flac"
      }
    ],
    "sampling_rate": 16000,
    "num_samples": 9567080,
    "duration": 597.942,
    "channel_ids": [
      0
    ]
  },
  "custom": {
    "text_path": "download/librilight_text/output_text_small_cleaned/Sea Fairies/text.txt"
  },
  "type": "MonoCut"
}

This is the full version of Libriheavy which can be use for various speech tasks. You can further extract the manifests for pure ASR training purpose by:

bash run.sh --stage 2 --stop-stage 2

Now, you have k2 format (lhotse cuts) and kaldi format corpus for both normalized version (upper case without punctuation) and full formated version (casing with punctuation):

├── cases_and_punc
│   ├── kaldi
│   │   ├── large
│   │   │   ├── segments
│   │   │   ├── text
│   │   │   └── wav.scp
......
│   │   ├── test_clean
│   │   │   ├── segments
│   │   │   ├── text
│   │   │   └── wav.scp
│   └── lhotse
│       ├── libriheavy_cuts_dev.jsonl.gz
│       ├── libriheavy_cuts_large.jsonl.gz
│       ├── libriheavy_cuts_medium.jsonl.gz
│       ├── libriheavy_cuts_small.jsonl.gz
│       ├── libriheavy_cuts_test_clean.jsonl.gz
│       ├── libriheavy_cuts_test_clean_large.jsonl.gz
│       ├── libriheavy_cuts_test_other.jsonl.gz
│       └──  libriheavy_cuts_test_other_large.jsonl.gz
└── upper_no_punc
    ├── kaldi
    │   ├── large
    │   │   ├── segments
    │   │   ├── text
    │   │   └── wav.scp
    ......
    │   ├── test_other
    │   │   ├── segments
    │   │   ├── text
    │   │   └── wav.scp
    └── lhotse
        ├── libriheavy_cuts_dev.jsonl.gz
        ├── libriheavy_cuts_large.jsonl.gz
        ├── libriheavy_cuts_medium.jsonl.gz
        ├── libriheavy_cuts_small.jsonl.gz
        ├── libriheavy_cuts_test_clean.jsonl.gz
        ├── libriheavy_cuts_test_clean_large.jsonl.gz
        ├── libriheavy_cuts_test_other.jsonl.gz
        └── libriheavy_cuts_test_other_large.jsonl.gz

For how to use the pre_texts, we have a paper: PromptASR for contextualized ASR with controllable style Preprint available on arxiv

Note The directory of audio files is hard-coded to download/librilight in the manifests.

Leaderboard

Note: large subset=large + medium + small; medium subset = medium + small (i.e. large subset includes the large, medium, small manifests above, medium subset includes the medium and small manifests above).

Models trained on normalized text

Note: The models trained with Wenet might not be tuned well.

large subset

contributor toolkit LibriSpeech WER (clean & other) Libriheavy WER (clean & other) recipe model
baseline Wenet 2.02 & 5.22 2.74 & 6.68 CTC + Attention model
baseline icefall 1.62 & 3.36 2.20 & 5.57 Transducer model

medium subset

contributor toolkit LibriSpeech WER (clean & other) Libriheavy WER (clean & other) recipe model
baseline Wenet 3.15 & 7.88 3.80 & 8.80 CTC + Attention model
baseline icefall 2.35 & 4.82 2.90 & 6.57 Transducer model

small subset

contributor toolkit LibriSpeech WER (clean & other) Libriheavy WER (clean & other) recipe model
baseline Wenet 5.76 & 15.60 6.94 & 15.17 CTC + Attention model
baseline icefall 4.05 & 9.89 4.68 & 10.01 Transducer model

Models trained on text with casing and punctuation

large subset

contributor toolkit Libriheavy normalized WER (clean & other) Libriheavy WER (clean & other) recipe model
baseline icefall 2.28 & 5.68 7.76 & 11.32 Transducer model

medium subset

contributor toolkit Libriheavy normalized WER (clean & other) Libriheavy WER (clean & other) recipe model
baseline icefall 3.05 & 6.78 9.84 & 13.39 Transducer model

small subset

contributor toolkit Libriheavy normalized WER (clean & other) Libriheavy WER (clean & other) recipe model
baseline icefall 5.16 & 11.12 13.04 & 19.54 Transducer model

Statistics

You can find the detail description of the corpus in Librilight paper, here are some statistics of Libriheavy. The last 7 columns are the distribution of durations (in seconds).

subset #hours #books per-spk hrs total spks mean std min 25% 50% 75% 99%
small 509 173 1.22 417 14.9 6.5 2.0 10 14.4 18.6 30.8
medium 5042 960 3.29 1531 14.8 6.4 2.0 9.9 14.3 18.5 30.8
large 50794 8592 7.54 6736 14.8 6.4 2.0 9.8 14.2 18.4 30.7
dev 22.3 180 0.16 141 15.0 6.5 2.1 10.1 14.5 18.6 30.8
test-clean 10.5 87 0.15 70 14.7 6.5 2.3 9.6 14.2 18.5 30.8
test-other 11.5 112 0.16 72 14.6 6.4 2.2 9.7 14.0 18.2 30.6
test-clean-large 107.5 95 1.49 72 14.8 6.4 2.0 9.9 14.3 18.4 30.8
test-other-large 100.3 136 1.37 73 14.6 6.5 2.0 9.7 14.0 18.4 30.8

Creation pipeline

You can find the documentation of creation pipeline here.

Citation

@misc{kang2023libriheavy,
      title={Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context}, 
      author={Wei Kang and Xiaoyu Yang and Zengwei Yao and Fangjun Kuang and Yifan Yang and Liyong Guo and Long Lin and Daniel Povey},
      year={2023},
      eprint={2309.08105},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

More Repositories

1

sherpa-onnx

Speech-to-text, text-to-speech, speaker recognition, and VAD using next-gen Kaldi with onnxruntime without Internet connection. Support embedded systems, Android, iOS, Raspberry Pi, RISC-V, x86_64 servers, websocket server/client, C/C++, Python, Kotlin, C#, Go, NodeJS, Java, Swift, Dart, JavaScript, Flutter, Object Pascal, Lazarus, Rust
C++
3,307
star
2

k2

FSA/FST algorithms, differentiable, with PyTorch compatibility.
Cuda
1,117
star
3

sherpa-ncnn

Real-time speech recognition and voice activity detection (VAD) using next-gen Kaldi with ncnn without Internet connection. Support iOS, Android, Linux, macOS, Windows, Raspberry Pi, VisionFive2, LicheePi4A etc.
C++
1,001
star
4

icefall

Python
912
star
5

sherpa

Speech-to-text server framework with next-gen Kaldi
C++
539
star
6

snowfall

Moved to https://github.com/k2-fsa/icefall
Python
143
star
7

fast_rnnt

A torch implementation of a recursion which turns out to be useful for RNN-T.
Python
136
star
8

text_search

Some fast-ish algorithms for batch text search in moderate-sized collections, intended for data cleanup
Python
53
star
9

kaldifst

Python wrapper for OpenFST and its extensions from Kaldi. Also support reading/writing ark/scp files
C++
41
star
10

multi_quantization

Python
40
star
11

next-gen-kaldi-wechat

32
star
12

kaldi-decoder

Decoders from Kaldi using OpenFst
C++
18
star
13

colab

Colab notebooks for Next-gen Kaldi
Jupyter Notebook
14
star
14

analyze_diagnostics

Scripts for analyzing the output of icefall's "diagnostics.py" code (--print-diagnostics=True option)
Perl
5
star
15

k2-fsa-www

Source for next-gen Kaldi home page.
JavaScript
4
star
16

sherpa-torch-cpp-makefile-example

C++
4
star
17

divide_lm

Python
4
star
18

sherpa-onnx-go

sherpa-onnx Go package for speech recognition without network access, supporting Linux, macOS, Windows
Go
2
star
19

sherpa-onnx-go-windows

sherpa-onnx Go package for Windows
C
1
star
20

sherpa-ncnn-go-linux

sherpa-ncnn Go package for Linux
C
1
star