Golos dataset

Golos is a Russian speech corpus suitable for speech research. The dataset consists mainly of recorded audio files manually annotated on a crowd-sourcing platform. The total duration of the audio is about 1240 hours. The corpus is freely available for download, along with an acoustic model trained on it. We also built a 3-gram KenLM language model using the open Common Crawl corpus.

Dataset structure

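| Domain   | Train files | Train hours | Test files | Test hours |
|----------|-------------|-------------|------------|------------|
| Crowd    | 979 796     | 1 095       | 9 994      | 11.2       |
| Farfield | 124 003     | 132.4       | 1 916      | 1.4        |
| Total    | 1 103 799   | 1 227.4     | 11 910     | 12.6       |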
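The table gives file counts and hours, so the mean recording length per subset follows directly from hours × 3600 / files. The sketch below only restates the table's numbers; the dictionary layout and the function name `mean_seconds` are illustrative choices, not part of the dataset tooling.

```python
# Figures copied from the dataset structure table: (number of files, hours).
SUBSETS = {
    ("Crowd", "train"): (979_796, 1095.0),
    ("Farfield", "train"): (124_003, 132.4),
    ("Crowd", "test"): (9_994, 11.2),
    ("Farfield", "test"): (1_916, 1.4),
}


def mean_seconds(files: int, hours: float) -> float:
    """Average recording length in seconds for one subset."""
    return hours * 3600.0 / files


if __name__ == "__main__":
    for (domain, split), (files, hours) in SUBSETS.items():
        print(f"{domain:8s} {split:5s} {mean_seconds(files, hours):.2f} s")
```

By this arithmetic the crowd recordings average about 4 seconds each, which is consistent with short voice-command style utterances.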

Downloads

MD5 Checksums
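Once the published checksum list is at hand, a downloaded archive can be checked against it before unpacking. The following is a minimal sketch using only the Python standard library; the function names `md5sum` and `verify` are illustrative, and the expected digest must be taken from the published list (it is not reproduced here).

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Compare a file's MD5 with an expected hex string.

    Surrounding whitespace and letter case in expected_md5 are ignored.
    """
    return md5sum(path) == expected_md5.strip().lower()
```

For example, `verify("golos_opus.tar", expected)` returns `True` only when the local file matches the digest passed in as `expected`.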

Audio files in opus format

| Archive        | Size    | Link                |
|----------------|---------|---------------------|
| golos_opus.tar | 20.5 GB | https://sc.link/JpD |

Audio files in wav format

Manifest files with all the training transcription texts are in the train_crowd9.tar archive listed in the table:

| Archive            | Size    | Link                |
|--------------------|---------|---------------------|
| train_farfield.tar | 15.4 GB | https://sc.link/1Z3 |
| train_crowd0.tar   | 11 GB   | https://sc.link/Lrg |
| train_crowd1.tar   | 14 GB   | https://sc.link/MvQ |
| train_crowd2.tar   | 13.2 GB | https://sc.link/NwL |
| train_crowd3.tar   | 11.6 GB | https://sc.link/Oxg |
| train_crowd4.tar   | 15.8 GB | https://sc.link/Pyz |
| train_crowd5.tar   | 13.1 GB | https://sc.link/Qz7 |
| train_crowd6.tar   | 15.7 GB | https://sc.link/RAL |
| train_crowd7.tar   | 12.7 GB | https://sc.link/VG5 |
| train_crowd8.tar   | 12.2 GB | https://sc.link/WJW |
| train_crowd9.tar   | 8.08 GB | https://sc.link/XKk |
| test.tar           | 1.3 GB  | https://sc.link/Kqr |
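After unpacking, the manifest files can be read to pair each recording with its transcription. This README does not describe the manifest format, so the sketch below rests on an assumption: that manifests follow the JSON-lines convention common in the NeMo toolkit, one JSON object per line with `audio_filepath`, `text`, and `duration` (seconds) fields. Check the actual files before relying on these field names; `read_manifest` and `total_hours` are illustrative helpers.

```python
import json


def read_manifest(path):
    """Read a JSON-lines manifest into a list of dicts.

    ASSUMPTION: one JSON object per line; blank lines are skipped.
    """
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries


def total_hours(entries):
    """Sum the assumed 'duration' field (seconds) and convert to hours.

    Entries without a 'duration' field contribute zero.
    """
    return sum(e.get("duration", 0.0) for e in entries) / 3600.0
```

If the assumption holds, comparing `total_hours(read_manifest(...))` with the hours listed in the dataset structure table is a quick way to confirm that a download is complete.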

Acoustic and language models

An acoustic model built on the QuartzNet15x5 architecture and trained with the NeMo toolkit

Three n-gram language models created with the KenLM Language Model Toolkit

| Archive                  | Size   | Link                |
|--------------------------|--------|---------------------|
| QuartzNet15x5_golos.nemo | 68 MB  | https://sc.link/ZMv |
| KenLMs.tar               | 4.8 GB | https://sc.link/YL0 |

Golos data and models are also available in DataHub ML Space, a hub of pre-trained models, datasets, and containers. You can train the model and deploy it on the high-performance SberCloud infrastructure in ML Space, a full-cycle machine learning development platform for data science team collaboration built on the Christofari supercomputer.

Evaluation

Word Error Rate (%) on different test sets

| Decoder \ Test set                          | Crowd test | Farfield test | MCV [1] dev | MCV [1] test |
|---------------------------------------------|------------|---------------|-------------|--------------|
| Greedy decoder                              | 4.389 %    | 14.949 %      | 9.314 %     | 11.278 %     |
| Beam Search with Common Crawl LM            | 4.709 %    | 12.503 %      | 6.341 %     | 7.976 %      |
| Beam Search with Golos train set LM         | 3.548 %    | 12.384 %      | -           | -            |
| Beam Search with Common Crawl and Golos LM  | 3.318 %    | 11.488 %      | 6.4 %       | 8.06 %       |

[1] Common Voice: Mozilla's initiative to help teach machines how real people speak.
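For readers unfamiliar with the metric, Word Error Rate is the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below shows this standard definition in plain Python. It is not the evaluation script behind the figures above, and any text normalization those figures involved is not specified here; `edit_distance` and `corpus_wer` are illustrative names.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance using a rolling row of the DP table."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        curr = [i]
        for j, h in enumerate(hyp_words, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """WER over (reference, hypothesis) string pairs.

    Total edits divided by total reference words, split on whitespace.
    """
    edits = words = 0
    for reference, hypothesis in pairs:
        ref = reference.split()
        edits += edit_distance(ref, hypothesis.split())
        words += len(ref)
    if words == 0:
        raise ValueError("reference transcripts contain no words")
    return edits / words
```

Multiplying the result by 100 gives a percentage comparable in form to the table entries.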

Resources

[arxiv.org] Golos: Russian Dataset for Speech Research

[habr.com] Golos: the largest manually annotated Russian speech dataset, now openly available (in Russian)

[habr.com] How to improve Russian speech recognition to 3% WER using open data (in Russian)

License

English Version

Russian Version

Contacts

Please create a GitHub issue!

Authors (in alphabetical order):

  • Alexander Denisenko
  • Angelina Kovalenko
  • Fedor Minkin
  • Nikolay Karpov
