SOVA Dataset
SOVA Dataset is a free public STT/ASR dataset.
Key facts:
- Russian, English and Chinese languages
- ~ 32 328 hours
- ~ 3,21 TB in .wav format
Dataset composition
Name | | Lang | Hours | Size | Source | Equipment | Annotation | Speech type | Augmentation | Quality |
---|---|---|---|---|---|---|---|---|---|---|
EngAudiobooksOriginal | Download | EN | 7 130 | 743 Gb | audiobook | professional | forced alignment | reading | none | 95% |
EngAudiobooksNoisy | Download | EN | 3 873 | 310 Gb | audiobook | professional | forced alignment | reading | phone calls | 95% |
RuAudiobooksDevices | Download | RU | 298 | 30,24 Gb | audiobook | unprofessional | manual | reading | none | 99% |
RuDevices | Download | RU | 101 | 10,42 Gb | audio records | unprofessional | manual | live speech | none | 98% |
RuYoutube | Download | RU | 17 451 | 1 873 Gb | audio records | unprofessional | asr | live speech | none | 95% |
ZhYoutube | Download | CN | 3 475,1 | 321 Gb | audio records | unprofessional | asr | live speech | none | 97.83% |
TOTAL | - | - | 32 328,1 | 3 287,66 Gb (3,21 TB) | - | - | - | - | - | - |
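The TOTAL row can be reproduced from the per-subset rows. The following is a minimal sketch, assuming that "Gb" in the table means gigabytes and that 1 TB = 1024 Gb (the conversion that yields the stated 3,21 TB). The `SUBSETS` mapping and the variable names are illustrative and not part of the dataset tooling.

```python
# Hours and size (in Gb) per subset, copied from the composition table.
SUBSETS = {
    "EngAudiobooksOriginal": (7130.0, 743.0),
    "EngAudiobooksNoisy": (3873.0, 310.0),
    "RuAudiobooksDevices": (298.0, 30.24),
    "RuDevices": (101.0, 10.42),
    "RuYoutube": (17451.0, 1873.0),
    "ZhYoutube": (3475.1, 321.0),
}

# Sum each column across all subsets.
total_hours = sum(hours for hours, _ in SUBSETS.values())
total_size_gb = sum(size for _, size in SUBSETS.values())

# Assumption: binary conversion, 1 TB = 1024 Gb.
total_size_tb = total_size_gb / 1024

print(f"hours: {total_hours:.1f}")
print(f"size:  {total_size_gb:.2f} Gb ({total_size_tb:.2f} TB)")
```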
Audio characteristics
- Bit rate mode: constant
- Bit rate: 256 kbps
- Channel(s): 1 channel
- Sample rate: 16.0 kHz
- Bit depth: 16 bit
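The listed bit rate follows from the other values: 16 000 samples/s × 16 bit × 1 channel = 256 000 bit/s = 256 kbps. The sketch below shows one way to check a downloaded file against these characteristics with Python's standard `wave` module, assuming the files are uncompressed PCM. The function names are hypothetical and not part of any SOVA tooling.

```python
import wave

# Expected values, taken from the "Audio characteristics" list.
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_RATE_HZ = 16_000
EXPECTED_BIT_DEPTH = 16


def pcm_bit_rate_kbps(sample_rate_hz: int, bit_depth: int, channels: int) -> float:
    """Bit rate of uncompressed PCM audio in kbps (1 kbps = 1000 bit/s)."""
    return sample_rate_hz * bit_depth * channels / 1000


def describe_wav(path: str) -> dict:
    """Read the header of a PCM .wav file and return its basic parameters."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate_hz = wav.getframerate()
        bit_depth = wav.getsampwidth() * 8  # sample width is reported in bytes
        duration_s = wav.getnframes() / sample_rate_hz
    return {
        "channels": channels,
        "sample_rate_hz": sample_rate_hz,
        "bit_depth": bit_depth,
        "duration_s": duration_s,
        "bit_rate_kbps": pcm_bit_rate_kbps(sample_rate_hz, bit_depth, channels),
    }


def matches_expected_format(path: str) -> bool:
    """Return True if the file is mono, 16 kHz, 16-bit PCM."""
    info = describe_wav(path)
    return (
        info["channels"] == EXPECTED_CHANNELS
        and info["sample_rate_hz"] == EXPECTED_SAMPLE_RATE_HZ
        and info["bit_depth"] == EXPECTED_BIT_DEPTH
    )
```

Calling `matches_expected_format("path/to/file.wav")` on a file from the dataset (the path here is a placeholder) should return `True` if the file matches the characteristics listed above.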
Updates
- 08/11/2022: Release v0.4.0
- 10/12/2021: Release v0.3.0
- 22/12/2020: Release v0.2.0
- 24/12/2019: Published dataset with 116 hours.
Contacts
For any questions, please feel free to contact us at [email protected]
License
SOVA Dataset is licensed under the Creative Commons BY 4.0 license by Virtual Assistant, LLC.