GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors

Masato Hagiwara and Masato Mita

Introduction

[Figure: Overview of GitHub Typo Corpus]

Are you the kind of person who makes a lot of typos when writing code? Or are you the one who fixes them by making "fix typo" commits? Either way, thank you: you have contributed to the state of the art in the NLP field.

GitHub Typo Corpus is a large-scale dataset of misspellings and grammatical errors along with their corrections harvested from GitHub. It contains more than 350k edits and 65M characters in more than 15 languages, making it the largest dataset of misspellings to date.

See the paper for more information.

Dataset

Download the GitHub Typo Corpus (ver. 1.0.0)

The dataset is formatted in JSONL, with one commit object per line. Here is a sample commit object from the dataset:

{
  "repo": "https://github.com/user/repository",
  "commit": "08d8049...",
  "message": "Edit document.txt; fix a typo",
  "edits": [
    {
      "src": {
        "text": "check this dokument. On",
        "path": "document.txt",
        "lang": "eng",
        "ppl": 14.75...
      },
      "tgt": {
        "text": "check this document. On",
        "path": "document.txt",
        "lang": "eng",
        "ppl": 13.03...
      },
      "prob_typo": 0.9,
      "is_typo": true
    }
  ]
}

The commit object contains the following keys:

  • repo: URL of the repository
  • commit: hash of the commit
  • message: commit message
  • edits: list of edits extracted from this commit. An edit object contains the following keys:
    • src: text info before the edit
    • tgt: text info after the edit
    • prob_typo: probability that this edit is a typo edit (as opposed to an edit that changes the meaning of the text)
    • is_typo: true/false indicating if this edit is a typo edit (i.e., if prob_typo > 0.5)
    • src and tgt contain the following keys:
      • text: text of edit
      • path: path of the file in which the edit is made
      • lang: language of the text (automatically detected by NanigoNet)
      • ppl: perplexity of the text measured by a language model
    • Note: The prob_typo, is_typo, and ppl keys are only available for English (eng), Simplified Chinese (cmn-hans), and Japanese (jpn), the three largest languages in the dataset.

We recommend using tools like jq when browsing the file.
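
For programmatic access, the format described above can also be read with a few lines of Python. The following is a minimal sketch, not part of the original repository: the function name iter_typo_edits, its min_prob parameter, and the file name in the usage comment are hypothetical and chosen only for illustration. It relies solely on the keys documented above, and it skips edits without prob_typo, since that key is only available for eng, cmn-hans, and jpn.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_typo_edits(
    lines: Iterable[str], lang: str = "eng", min_prob: float = 0.5
) -> Iterator[Tuple[str, str]]:
    """Yield (before, after) text pairs for edits likely to be typo edits.

    `lines` is any iterable of JSONL lines (one commit object per line),
    for example an open file handle.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        commit = json.loads(line)
        for edit in commit.get("edits", []):
            # prob_typo is absent for most languages, so use .get().
            prob = edit.get("prob_typo")
            if prob is None or prob <= min_prob:
                continue
            if edit["src"].get("lang") != lang:
                continue
            yield edit["src"]["text"], edit["tgt"]["text"]


# Hypothetical usage (the file name is an assumption):
#
#     with open("github-typo-corpus.jsonl", encoding="utf-8") as f:
#         for before, after in iter_typo_edits(f, lang="eng"):
#             print(repr(before), "->", repr(after))
```

With the default min_prob of 0.5, the filter matches the is_typo flag as defined above; raising it keeps only higher-confidence typo edits.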

Source

See src/ for the source code used to collect repositories, commits, and edits. You need Python 3 and GitPython to run the code.

Terms

The copyright and license terms of the individual commits and texts contained in the dataset follow the terms of the repositories they belong to. We collect and publish the GitHub Typo Corpus under Section 5 (Scraping and API Usage Restrictions) of GitHub's Acceptable Use Policies. Let us know if you find any copyright issues regarding the dataset.
