Helsinki Prosody Corpus and A System for Predicting Prosodic Prominence from Text

Predicting Prosodic Prominence from Text with Pre-Trained Contextualized Word Representations

Update 30 October 2019:

  • Data files modified to include improved word boundary values.

Update 9 September 2019:

  • Data files have been modified to include information about the source file in LibriTTS: Instead of an empty line before each sentence, there is now a line with <file> file_name.txt.
  • The code in prosody_dataset.py has been updated accordingly.

This repository contains the Helsinki Prosody Corpus and the code for the paper:

Aarne Talman, Antti Suni, Hande Celikkanat, Sofoklis Kakouros, Jörg Tiedemann and Martti Vainio. 2019. Predicting Prosodic Prominence from Text with Pre-trained Contextualized Word Representations. Proceedings of NoDaLiDa.

Abstract: In this paper we introduce a new natural language processing dataset and benchmark for predicting prosodic prominence from written text. To our knowledge this will be the largest publicly available dataset with prosodic labels. We describe the dataset construction and the resulting benchmark dataset in detail and train a number of different models ranging from feature-based classifiers to neural network systems for the prediction of discretized prosodic prominence. We show that pre-trained contextualized word representations from BERT outperform the other models even with less than 10% of the training data. Finally we discuss the dataset in light of the results and point to future research and plans for further improving both the dataset and methods of predicting prosodic prominence from text. The dataset and the code for the models are publicly available.

If you find the corpus or the system useful, please cite:

@inproceedings{talman_etal2019prosody,
  author = {Aarne Talman and Antti Suni and Hande Celikkanat and Sofoklis Kakouros 
            and J\"org Tiedemann and Martti Vainio},
  title = {Predicting Prosodic Prominence from Text with Pre-trained Contextualized 
           Word Representations},
  booktitle = {Proceedings of NoDaLiDa},
  year = {2019}
}

The Helsinki Prosody Corpus

License: CC BY 4.0

This repository contains what is, to our knowledge, the largest publicly available English dataset annotated with labels for prosodic prominence.

  • Download: The corpus is available in the data folder. Clone this repository or download the files separately.

The prosody corpus contains automatically generated, high-quality prosodic annotations for the LibriTTS corpus (Zen et al. 2019). The annotations were produced with the Continuous Wavelet Transform Annotation method (Suni et al. 2017) and the Wavelet Prosody Analyzer toolkit.

[Figure: Continuous Wavelet Transform Annotation method]

Corpus statistics

Datasets   Speakers  Sentences  Words      Label: 0   Label: 1  Label: 2
train-100  247       33,041     570,592    274,184    155,849   140,559
train-360  904       116,262    2,076,289  1,003,454  569,769   503,066
dev        40        5,726      99,200     47,535     27,454    24,211
test       39        4,821      90,063     43,234     24,543    22,286
Total:     1,230     159,850    2,836,144  1,368,407  777,615   690,122

Format

The corpus contains data in text files with one word per line and sentences separated with a line <file> file_name.txt, where the filename refers to the source file in LibriTTS. Each line in a sentence has five items separated with tabs in the following order:

  • word
  • discrete prominence label: 0 (non-prominent), 1 (prominent), 2 (highly prominent) (NA for punctuation)
  • discrete word boundary label: 0, 1, 2 (NA for punctuation)
  • real-valued prominence label (NA for punctuation)
  • real-valued word boundary label (NA for punctuation)

Example:

commercial    2    1    1.679    0.715
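The format above can be read with a few lines of Python. The following is a minimal sketch, not the loader in prosody_dataset.py: the names `Token` and `read_sentences` are hypothetical, and it assumes every non-empty line is either a `<file>` line or exactly five tab-separated fields, with `NA` mapped to `None`. The punctuation line in the sample is made up for illustration.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    """One corpus line (hypothetical container, not part of the repository)."""
    word: str
    prominence: Optional[int]          # 0, 1, 2, or None for "NA"
    boundary: Optional[int]            # 0, 1, 2, or None for "NA"
    prominence_real: Optional[float]   # None for "NA"
    boundary_real: Optional[float]     # None for "NA"


def _parse(value: str, cast):
    """Convert a field with `cast`, mapping the marker "NA" to None."""
    return None if value == "NA" else cast(value)


def read_sentences(lines: Iterable[str]) -> Iterator[Tuple[Optional[str], List[Token]]]:
    """Yield (source_file, tokens) pairs, one per sentence."""
    source: Optional[str] = None
    tokens: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # tolerate stray blank lines
        if line.startswith("<file>"):
            # A <file> line closes the previous sentence and opens a new one.
            if tokens:
                yield source, tokens
            source, tokens = line[len("<file>"):].strip(), []
            continue
        word, prom, bound, prom_real, bound_real = line.split("\t")
        tokens.append(Token(word,
                            _parse(prom, int), _parse(bound, int),
                            _parse(prom_real, float), _parse(bound_real, float)))
    if tokens:
        yield source, tokens


sample = [
    "<file> file_name.txt\n",
    "commercial\t2\t1\t1.679\t0.715\n",
    ".\tNA\tNA\tNA\tNA\n",  # illustrative punctuation line (assumed)
]
for source_file, sentence in read_sentences(sample):
    print(source_file, [(t.word, t.prominence) for t in sentence])
```

To read an actual data file, pass an open file handle, which iterates over lines, instead of the `sample` list.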

Tasks

The dataset can be used for two different prosody prediction tasks: 2-way and 3-way prosody prediction. As the dataset is annotated with three labels, 3-way classification can be done directly with the data. To use the data for the 2-way classification task, map label 2 to label 1 to get two discrete classes: 0 (non-prominent) and 1 (prominent).
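The 2-way mapping can be sketched as a small helper. The function name `to_two_way` is hypothetical, and representing `NA` as `None` is an assumption made here for illustration.

```python
from typing import Optional


def to_two_way(label: Optional[int]) -> Optional[int]:
    """Collapse the 3-way labels {0, 1, 2} to {0, 1}; None (NA) passes through."""
    if label is None:
        return None
    if label not in (0, 1, 2):
        raise ValueError(f"unexpected prominence label: {label!r}")
    return min(label, 1)  # 0 stays 0; 1 and 2 both become 1


print([to_two_way(x) for x in (0, 1, 2, None)])  # [0, 1, 1, None]
```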

System

License: MIT

This repository contains the code for our BERT and BiLSTM models for predicting prosodic prominence from written English text. To use the system, the following dependencies need to be installed:

  • Python 3
  • PyTorch>=1.0
  • argparse
  • pytorch_transformers
  • numpy

To install the requirements run:

pip3 install -r requirements.txt

To download the word embeddings for the LSTM model run:

./download_embeddings.sh

Models included:

  • BERT
  • LSTM
  • Majority class per word
  • See model.py for the complete list

For the BERT model run training by executing:

# Train BERT-Uncased
python3 main.py \
    --model BertUncased \
    --train_set train_360 \
    --batch_size 32 \
    --epochs 2 \
    --save_path results_bert.txt \
    --log_every 50 \
    --learning_rate 0.00005 \
    --weight_decay 0 \
    --gpu 0 \
    --fraction_of_train_data 1 \
    --optimizer adam \
    --seed 1234

For the Bidirectional LSTM model run training by executing:

# Train 3-layer BiLSTM
python3 main.py \
    --model BiLSTM \
    --train_set train_360 \
    --layers 3 \
    --hidden_dim 600 \
    --batch_size 64 \
    --epochs 5 \
    --save_path results_bilstm.txt \
    --log_every 50 \
    --learning_rate 0.001 \
    --weight_decay 0 \
    --gpu 0 \
    --fraction_of_train_data 1 \
    --optimizer adam \
    --seed 1234

Output

Output of the system is a text file with the following structure:

<word> tab <label> tab <prediction>

Example output:

And    0     0
those  2     2
who    0     0
meet   1     2
in     0     0
the    0     0
great  1     1
hall   1     1
with   0     0
the    0     0
white  2     1
Atlas  2     2
?      NA    NA
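A results file in this format can be scored with a short script. This is a sketch under stated assumptions, not the evaluation code used in the repository: the function name `accuracy` is hypothetical, lines whose label or prediction is `NA` are skipped, and fields are split on any whitespace so that both tab-separated files and the space-aligned example above are accepted.

```python
from typing import Iterable


def accuracy(lines: Iterable[str], two_way: bool = False) -> float:
    """Fraction of scored words whose prediction equals the label."""
    correct = total = 0
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        _, gold, pred = fields
        if gold == "NA" or pred == "NA":
            continue  # punctuation is not scored (assumption)
        g, p = int(gold), int(pred)
        if two_way:
            g, p = min(g, 1), min(p, 1)  # map label 2 to label 1
        correct += int(g == p)
        total += 1
    return correct / total if total else 0.0


example = """\
And    0     0
those  2     2
who    0     0
meet   1     2
in     0     0
the    0     0
great  1     1
hall   1     1
with   0     0
the    0     0
white  2     1
Atlas  2     2
?      NA    NA
""".splitlines()

print(f"{accuracy(example):.3f}")                # 0.833 (10 of 12 words)
print(f"{accuracy(example, two_way=True):.3f}")  # 1.000
```

In the example sentence, the two 3-way errors (meet, white) both confuse labels 1 and 2, so they disappear under the 2-way mapping.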

Baseline Results

Main experimental results from the paper using the train-360 dataset.

Model                    Test accuracy (2-way)  Test accuracy (3-way)
BERT-base                83.2%                  68.6%
3-layer BiLSTM           82.1%                  66.4%
CRF (MarMoT)             81.8%                  66.4%
SVM+GloVe (Minitagger)   80.8%                  65.4%
Majority class per word  80.2%                  62.4%
Majority class           52.0%                  48.0%
Random                   49.0%                  39.5%
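As a consistency check, the "Majority class" row agrees with the test-set label counts in the corpus statistics. The sketch below only redoes that arithmetic; it is not necessarily how the baseline was computed for the paper.

```python
# Test-set label counts, taken from the corpus statistics table.
test_counts = {0: 43_234, 1: 24_543, 2: 22_286}
total = sum(test_counts.values())  # 90,063 words

# 3-way: always predict the most frequent label (0).
three_way = max(test_counts.values()) / total

# 2-way: labels 1 and 2 are merged, which makes "prominent" the majority class.
two_way_counts = {0: test_counts[0], 1: test_counts[1] + test_counts[2]}
two_way = max(two_way_counts.values()) / total

print(f"3-way majority accuracy: {three_way:.1%}")  # 48.0%
print(f"2-way majority accuracy: {two_way:.1%}")    # 52.0%
```

Note that the majority class flips between the two tasks: label 0 is the most frequent single label, but labels 1 and 2 together outnumber it.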

Contact

Aarne Talman: [email protected]

References

[1] Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J Weiss, Ye Jia, Zhifeng Chen and Yonghui Wu. 2019. LibriTTS: A corpus derived from LibriSpeech for text-to-speech. arXiv preprint arXiv:1904.02882.

[2] Antti Suni, Juraj Šimko, Daniel Aalto and Martti Vainio. 2017. Hierarchical representation and estimation of prosody using continuous wavelet transform. Computer Speech & Language. Volume 45. Pages 123-136. ISSN 0885-2308. https://doi.org/10.1016/j.csl.2016.11.001.
