
a pytorch implementation of auto-punctuation learned character by character

Learning Auto-Punctuation by Reading Engadget Articles

Links to my other work

🌟 Deep Learning Notes: A collection of my notes going from basic multi-layer perceptrons to ConvNets and LSTMs, and from TensorFlow to PyTorch.

💥 Deep Learning Papers TLDR: A growing collection of my notes on deep learning papers! So far it covers the top papers from this year's ICLR.

Overview

This project trains a bi-directional GRU to learn how to automatically punctuate a sentence by reading blog posts from Engadget.com character by character. The set of operations it learns includes:

capitalization: <cap>
         comma:  ,
        period:  .
   dollar sign:  $
     semicolon:  ;
         colon:  :
  single quote:  '
  double quote:  "
  no operation: <nop>
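
To make the op set concrete, here is a minimal sketch of how predicted ops could be decoded back into punctuated text. The function name and the convention of inserting punctuation before its character are my assumptions for illustration, not the repo's actual API:

```python
def apply_ops(chars, ops):
    """Apply one predicted op per input character: '<nop>' keeps the
    character as-is, '<cap>' capitalizes it, and any other op is a
    punctuation mark inserted before the character (an assumed convention)."""
    out = []
    for ch, op in zip(chars, ops):
        if op == "<nop>":
            out.append(ch)
        elif op == "<cap>":
            out.append(ch.upper())
        else:
            out.append(op + ch)
    return "".join(out)

print(apply_ops(list("hello world"), ["<cap>"] + ["<nop>"] * 10))
# prints "Hello world"
```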

Performance

After 24 epochs of training, the network achieves the following performance on the test-set:

    Test P/R After 24 Epochs 
    =================================
    Key: <nop>	Prec:  97.1%	Recall:  97.8%	F-Score:  97.4%
    Key: <cap>	Prec:  68.6%	Recall:  57.8%	F-Score:  62.7%
    Key:   ,	Prec:  30.8%	Recall:  30.9%	F-Score:  30.9%
    Key:   .	Prec:  43.7%	Recall:  38.3%	F-Score:  40.8%
    Key:   '	Prec:  76.9%	Recall:  80.2%	F-Score:  78.5%
    Key:   :	Prec:  10.3%	Recall:   6.1%	F-Score:   7.7%
    Key:   "	Prec:  26.9%	Recall:  45.1%	F-Score:  33.7%
    Key:   $	Prec:  64.3%	Recall:  61.6%	F-Score:  62.9%
    Key:   ;	Prec:   0.0%	Recall:   0.0%	F-Score:   N/A
    Key:   ?	Prec:   0.0%	Recall:   0.0%	F-Score:   N/A
    Key:   !	Prec:   0.0%	Recall:   0.0%	F-Score:   N/A

As a first attempt, the performance is pretty good! Especially since I did not fine-tune with a smaller step size afterward, and the Engadget dataset used here is small (4MB total).
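
The per-key precision, recall, and F-score in these tables follow the standard definitions; a minimal sketch (function and argument names are mine, not the repo's):

```python
def prf(tp, predicted, actual):
    """Standard per-key metrics from raw counts: tp = correct predictions
    for the key, predicted = times the key was predicted, actual = times
    the key occurs in the ground truth. When precision + recall is zero,
    the F-score is undefined (reported as N/A in the tables)."""
    prec = tp / predicted if predicted else 0.0
    rec = tp / actual if actual else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else float("nan")
    return prec, rec, f1

# e.g. 8 correct out of 10 predictions against 16 true occurrences:
prec, rec, f1 = prf(8, 10, 16)  # precision 0.8, recall 0.5
```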

Doubling the training gives a small improvement.

Table 2. After 48 epochs of training

    Test P/R  Epoch 48 Batch 380
    =================================
    Key: <nop>	Prec:  97.1%	Recall:  98.0%	F-Score:  97.6%
    Key: <cap>	Prec:  73.2%	Recall:  58.9%	F-Score:  65.3%
    Key:   ,	Prec:  35.7%	Recall:  32.2%	F-Score:  33.9%
    Key:   .	Prec:  45.0%	Recall:  39.7%	F-Score:  42.2%
    Key:   '	Prec:  81.7%	Recall:  83.4%	F-Score:  82.5%
    Key:   :	Prec:  12.1%	Recall:  10.8%	F-Score:  11.4%
    Key:   "	Prec:  25.2%	Recall:  44.8%	F-Score:  32.3%
    Key:   $	Prec:  51.4%	Recall:  87.8%	F-Score:  64.9%
    Key:   ;	Prec:   0.0%	Recall:   0.0%	F-Score:   N/A
    Key:   ?	Prec:   5.6%	Recall:   4.8%	F-Score:   5.1%
    Key:   !	Prec:   0.0%	Recall:   0.0%	F-Score:   N/A

Usage

If you feel like using some of the code, you can cite this project via

@article{deeppunc,
  title={Deep-Auto-Punctuation},
  author={Yang, Ge},
  journal={arXiv},
  year={2017},
  doi={10.5281/zenodo.438358},
  url={https://zenodo.org/record/438358;
       https://github.com/episodeyang/deep-auto-punctuation}
}

To run

First, unzip the Engadget data into the folder ./engadget_data by running

tar -xvzf engadget_data.tar.gz

and then open the notebook Learning Punctuations by reading Engadget.ipynb and execute the cells.

To view the training report, start a visdom server by running

python -m visdom.server

and then go to http://localhost:8097

Requirements

pytorch, numpy, matplotlib, tqdm, bs4

Model Setup and Considerations

The initial setup I began with was a single uni-directional GRU, with input domain [A-z0-9] and an output domain of the ops listed above. My hope at the time was simply to train the RNN to learn the corresponding operations. A few things jumped out during the experiment:

  1. Use a bi-directional GRU. With the uni-directional GRU, the network quickly learned capitalization of terms, but it had difficulty with single quotes. In words like "I'm" and "won't", there is simply too much ambiguity when reading only the forward part of the word. The network didn't have enough information to properly infer such punctuation.

    So I decided to change the uni-directional GRU to a bi-directional GRU. The result is much better prediction of single quotes in contractions.

    The network is still training, but the precision and recall of single quotes are now close to 80%.

    This use of bi-directional GRUs is standard in NLP. But it is nice to experience the difference in performance and training first-hand.

    A side effect of this switch is that the network now runs almost 2x slower. This leads to the next item on this list:

  2. Use the smallest model possible. At the very beginning, my input embedding was borrowed from the Shakespeare model, so the input space included both upper- and lower-case letters. What I didn't realize was that I didn't need the upper-case letters, because all inputs were lower-case.

    So when the training became painfully slow after I switched to the bi-directional GRU, I looked for ways to make it faster. A look at the input embedding made it obvious that half of the embedding space wasn't needed.

    Removing the upper-case characters made the training around 3x faster. This is a rough estimate, since I also decided to re-download the dataset at the same time on the same machine.

  3. Text formatting. Proper formatting of the input text crawled from Engadget.com was crucial, especially because many of the punctuation marks occur rarely and this is a character-level model. You can take a look at the crawled text inside ./engadget_data.tar.gz.

  4. Async and multi-process crawling is much, much faster. I initially wrote the Engadget crawler as a single-threaded class. Because the Python requests library is synchronous, the crawler spent virtually all of its time waiting on GET requests.

    This could be made a lot faster by parallelizing the crawling, or by using a proper async pattern.

    This thought came to me pretty late, during the second crawl, so I did not implement it. But for future work, a parallel, async crawler is on the todo list.

  5. Using precision/recall in a multi-class scenario. The setup makes the reasonable assumption that the operations are mutually exclusive. The accuracy metrics used here are precision/recall and the F-score, all commonly used in the literature [1, 2]. The P/R and F-score are implemented according to Wikipedia [3, 4].

    An example accuracy report:

    Epoch 0 Batch 400 Test P/R
    =================================
    Key: <nop>	Prec:  99.1%	Recall:  96.6%	F-Score:  97.9%
    Key:   ,	Prec:   0.0%	Recall:   0.0%	F-Score:   N/A
    Key: <cap>	Prec: 100.0%	Recall:  75.0%	F-Score:  85.7%
    Key:   .	Prec:   0.0%	Recall:   0.0%	F-Score:   N/A
    Key:   '	Prec:  66.7%	Recall: 100.0%	F-Score:  80.0%
    
    
    true_p:	{'<nop>': 114, '<cap>': 3, "'": 2}
    p:	{'<nop>': 118, '<cap>': 4, "'": 2}
    all_p:	{'<nop>': 115, ',': 2, '<cap>': 3, '.': 1, "'": 3}
    
    400it [06:07,  1.33s/it]
    
  6. Hidden-layer initialization. In the past I've found it easier for the neural network to generate good results when both training and generation start from a zero initial state. In this case, because we are compute-time limited, I zero the hidden layer at the beginning of each file.

  7. Mini-batches and padding. During training, I first sort the entire training set by file length (there are 45k files) and arrange them into batches, so that the files inside each batch are of roughly similar size and only minimal padding is needed. Sometimes a file is too long; in that case I use data.fuzzy_chunk_length() to calculate a good chunk length heuristically. The result is that little to no padding is needed during most of the training.

    Going from no mini-batches to a mini-batch size of 128, the time per batch hasn't changed much. The accuracy report above shows the training result after 24 epochs.
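
The model described in item 1 above can be sketched as a character-level bi-directional GRU that emits one op per input character. This is a minimal illustration; the layer sizes, class names, and vocabulary size are assumptions, not the repo's exact configuration:

```python
import torch
import torch.nn as nn

class BiGRUPunctuator(nn.Module):
    """Character-level bi-directional GRU: one punctuation-op logit
    vector per input character. Sizes are illustrative assumptions."""
    def __init__(self, n_chars, n_ops, embed_dim=32, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(n_chars, embed_dim)
        # bidirectional=True doubles the feature size of the GRU output
        self.gru = nn.GRU(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_ops)

    def forward(self, x, h0=None):            # x: (batch, seq) char indices
        out, h = self.gru(self.embed(x), h0)  # h0=None -> zero initial state
        return self.head(out), h              # logits: (batch, seq, n_ops)

model = BiGRUPunctuator(n_chars=40, n_ops=9)
logits, _ = model(torch.randint(0, 40, (2, 17)))
assert logits.shape == (2, 17, 9)
```

Passing h0=None gives the zero initial hidden state, matching the per-file zeroing described in item 6.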

Data and Cross-Validation

The entire dataset is composed of around 50k blog posts from Engadget. I randomly selected 49k of these as my training set, 50 files as my validation set, and around 0.5k as my test set. Training is a bit slow on an Intel i7 desktop, averaging 1.5 s/file depending on file length. As a result, it takes about a day to go through the entire training set.
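
The split described above can be sketched as a simple shuffle-and-slice; the file names, seed, and exact counts here are illustrative assumptions:

```python
import random

# hypothetical file list standing in for the ~50k crawled posts
files = [f"post_{i}.txt" for i in range(50_000)]

random.seed(0)            # fixed seed so the split is reproducible
random.shuffle(files)
test, validation, train = files[:500], files[500:550], files[550:]

assert (len(train), len(validation), len(test)) == (49_450, 50, 500)
```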

Todo:

All done.

Done:

  • execute demo test after training
  • add final performance metric
  • implement minibatch
  • a generative demo
  • add validation (once an hour or so)
  • add accuracy metric, use precision/recall.
  • change to bi-directional GRU
  • get data
  • Add temperature to generator
  • add self-feeding generator
  • get training to work
  • use optim and Adam

References

[1] https://www.aclweb.org/anthology/D/D16/D16-1111.pdf
[2] https://phon.ioc.ee/dokuwiki/lib/exe/fetch.php?media=people:tanel:interspeech2015-paper-punct.pdf
[3] https://en.wikipedia.org/wiki/Precision_and_recall
[4] https://en.wikipedia.org/wiki/F1_score
