g2pK: g2p module for Korean
g2p is the task of converting graphemes to phonemes. Hangul, the main script for Korean, is phonetic, but its pronunciation rules are notoriously complicated, so learning to read Korean text aloud is never easy. That is why g2p is necessary in various NLP tasks such as TTS. There is an open-source g2p library for Korean, KoG2P. It is simple and works well, but I think we need a better one. Please read through the following section (main features and usage) to understand the philosophy of g2pK and how to use it. We know it is not perfect at present; that is one of the reasons your contributions are more than welcome.
Requirements
- python >= 3.6
- jamo
- python-mecab-ko
- konlpy
- nltk
Installation
pip install g2pk
Main features & Usage
- Returns text as it is pronounced, keeping punctuation.
>>> from g2pk import G2p
>>> g2p = G2p()
>>> g2p("어제는 날씨가 맑았는데, 오늘은 흐리다.")
어제는 날씨가 말간는데, 오느른 흐리다.
- Determines pronunciation from context, thanks to Mecab, a morphological analyzer. In the following example, note that the first and second 신고 are pronounced differently.
>>> g2p("신을 신고 얼른 동사무소에 가서 혼인 신고 해라")
시늘 신꼬 얼른 동사무소에 가서 호닌 신고 해라
- Returns two types of results: prescriptive (default) and descriptive (with the option descriptive=True) pronunciation. For example, the josa (particle) 의 is pronounced 의 in principle, but in real life it is often pronounced 에. Also, 계 is much more often pronounced 게.
>>> sent = "나의 친구는 계산이 아주 빠르다"
>>> g2p(sent)
나의 친구는 계사니 아주 빠르다
>>> g2p(sent, descriptive=True)
나에 친구는 게사니 아주 빠르다
- This distinction becomes more obvious if you set group_vowels=True. In contemporary colloquial speech, some vowels are hard to distinguish from each other. In the example below, the vowel ㅒ is normalized to ㅖ.
>>> sent = "저는 예전에 그 얘기를 들은 적이 있습니다"
>>> g2p(sent)
저는 녜저네 그 얘기를 드른 저기 읻씀니다
>>> g2p(sent, group_vowels=True)
저는 녜저네 그 예기를 드른 저기 읻씀니다
- By default, it returns the standard Korean script, where letters are assembled to form syllables. If you set to_syl=False, however, it returns individual Hangul letters, or jamo. This can be useful for many applications such as speech synthesis. Depending on the font you are using, the two results below may look the same, but they are not.
>>> sent = "어제는 날씨가 맑았는데, 오늘은 흐리다."
>>> g2p(sent)
어제는 날씨가 말간는데, 오느른 흐리다.
>>> g2p(sent, to_syl=False)
어제는 날씨가 말간는데, 오느른 흐리다.
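The difference between assembled syllables and separate jamo can be inspected with Python's standard unicodedata module. The sketch below is not g2pK's implementation; it only assumes that jamo output consists of conjoining jamo code points, which is what Unicode NFD normalization of precomposed Hangul produces. The helper names to_jamo and to_syllables are hypothetical.

```python
import unicodedata


def to_jamo(text: str) -> str:
    """Decompose precomposed Hangul syllables into conjoining jamo (NFD)."""
    return unicodedata.normalize("NFD", text)


def to_syllables(text: str) -> str:
    """Recompose conjoining jamo into precomposed syllables (NFC)."""
    return unicodedata.normalize("NFC", text)


syllables = "한글"
jamo = to_jamo(syllables)

# The two strings may render identically, but their code points differ.
print(len(syllables), len(jamo))
print([f"U+{ord(ch):04X}" for ch in syllables])
print([f"U+{ord(ch):04X}" for ch in jamo])
print(to_syllables(jamo) == syllables)
```

Comparing lengths or code points, rather than the rendered glyphs, is a reliable way to tell the two forms apart.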
- English words written in the Latin alphabet are converted into Hangul. This is possible thanks to the CMU Pronouncing Dictionary.
>>> sent = "그 사람은 좀, old school 같아"
>>> g2p(sent)
그 사라믄 좀, 올드 스쿨 가타
- Arabic numerals are spelled out according to their context. Note that the first 12 is pronounced 열두, whereas the second 12 is pronounced 십이.
>>> sent = "지금 시각은 12시 12분입니다"
>>> g2p(sent)
지금 시가그 녈뚜시 시비부님니다
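The same digits need two readings because hours (시) take native Korean numerals while minutes (분) take Sino-Korean numerals. The following is a minimal sketch of that choice, covering only 1 to 12 o'clock and 0 to 59 minutes. The tables and the names sino and spell_time are hypothetical, and this is not how g2pK implements number conversion. It returns the spelled-out form before any pronunciation rules are applied.

```python
# Native Korean numerals in the form used before a counter such as 시 (o'clock).
NATIVE_HOURS = ["", "한", "두", "세", "네", "다섯", "여섯",
                "일곱", "여덟", "아홉", "열", "열한", "열두"]
# Sino-Korean digits, used here for minutes (분).
SINO_DIGITS = ["", "일", "이", "삼", "사", "오", "육", "칠", "팔", "구"]


def sino(n: int) -> str:
    """Spell an integer from 0 to 59 in Sino-Korean numerals."""
    if n == 0:
        return "영"
    tens, ones = divmod(n, 10)
    if tens == 0:
        head = ""
    elif tens == 1:
        head = "십"  # 10 is read 십, not 일십
    else:
        head = SINO_DIGITS[tens] + "십"
    return head + SINO_DIGITS[ones]


def spell_time(hour: int, minute: int) -> str:
    """Spell out a clock time: native numerals for hours, Sino-Korean for minutes."""
    if not (1 <= hour <= 12 and 0 <= minute <= 59):
        raise ValueError("this sketch only covers 1-12 o'clock and 0-59 minutes")
    return f"{NATIVE_HOURS[hour]}시 {sino(minute)}분"


print(spell_time(12, 12))  # 열두시 십이분
```

A real converter has to decide which numeral system applies from the counter word that follows the digits, which is why context matters.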
- Rules naturally cannot cover every single case. Add special idioms to idioms.txt.
- If you set verbose=True, you will see each conversion step along with relevant information.
>>> sent = "학교에 갔다 와서, 엄마가 해 주신 밥을 먹었다."
>>> g2p(sent, verbose=True)
학교에 갔다 와서, 엄마가 해 주신 밥을 먹었다. -> 학꾜에 갔다 와서, 엄마가 해 주신 밥을 먹었다.
제23항 받침 'ㄱ(ㄲ, ㅋ, ㄳ, ㄺ), ㄷ(ㅅ, ㅆ, ㅈ, ㅊ, ㅌ), ㅂ(ㅍ, ㄼ, ㄿ, ㅄ)' 뒤에 연결되는 'ㄱ, ㄷ, ㅂ, ㅅ, ㅈ'은 된소리로 발음한다.
-> 국밥[국빱], 깎다[깍따], 넋받이[넉빠지], 삯돈[삭똔]
-> 닭장[닥짱], 칡범[칙뻠], 뻗대다[뻗때다], 옷고름[옫꼬름]
-> 있던[읻떤], 꽂고[꼳꼬], 꽃다발[꼳따발], 낯설다[낟썰다]
-> 밭갈이[받까리], 솥전[솓쩐], 곱돌[곱똘], 덮개[덥깨]
-> 옆집[엽찝], 넓죽하다[넙쭈카다], 읊조리다[읍쪼리다], 값지다[갑찌다]
학꾜에 갔다 와서, 엄마가 해 주신 밥을 먹었다. -> 학꾜에 갇따 와서, 엄마가 해 주신 밥을 먹얻따.
제9항 받침 'ㄲ, ㅋ', 'ㅅ, ㅆ, ㅈ, ㅊ, ㅌ', 'ㅍ'은 어말 또는 자음 앞에서 각각 대표음 [ㄱ, ㄷ, ㅂ]으로 발음한다.
-> 닦다[닥따], 키읔[키윽], 키읔과[키윽꽈], 옷[옫]
-> 웃다[욷따], 있다[읻따], 젖[젇], 빚다[빋따]
-> 꽃[꼳], 쫓다[쫃따], 솥[솓], 뱉다[밷따]
-> 앞[압], 덮다[덥따]
제23항 받침 'ㄱ(ㄲ, ㅋ, ㄳ, ㄺ), ㄷ(ㅅ, ㅆ, ㅈ, ㅊ, ㅌ), ㅂ(ㅍ, ㄼ, ㄿ, ㅄ)' 뒤에 연결되는 'ㄱ, ㄷ, ㅂ, ㅅ, ㅈ'은 된소리로 발음한다.
-> 국밥[국빱], 깎다[깍따], 넋받이[넉빠지], 삯돈[삭똔]
-> 닭장[닥짱], 칡범[칙뻠], 뻗대다[뻗때다], 옷고름[옫꼬름]
-> 있던[읻떤], 꽂고[꼳꼬], 꽃다발[꼳따발], 낯설다[낟썰다]
-> 밭갈이[받까리], 솥전[솓쩐], 곱돌[곱똘], 덮개[덥깨]
-> 옆집[엽찝], 넓죽하다[넙쭈카다], 읊조리다[읍쪼리다], 값지다[갑찌다]
학꾜에 갇따 와서, 엄마가 해 주신 밥을 먹얻따. -> 학꾜에 갇따 와서, 엄마가 해 주신 바블 머걷따.
제13항 홑받침이나 쌍받침이 모음으로 시작된 조사나 어미, 접미사와 결합되는 경우에는, 제 음가대로 뒤 음절 첫소리로 옮겨 발음한다.
-> 깎아[까까], 옷이[오시], 있어[이써], 낮이[나지]
-> 꽂아[꼬자], 꽃을[꼬츨], 쫓아[쪼차], 밭에[바테]
-> 앞으로[아프로], 덮이다[더피다]
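The last rule in the verbose output (Rule 13) moves a single or double final consonant to the onset of the next syllable when that syllable begins with a vowel. The sketch below illustrates this with plain Unicode arithmetic on precomposed syllables. It is not g2pK's code: the names split_syllable, join_syllable, and liaison are hypothetical, and the sketch ignores morphology, so it does not check that the following syllable belongs to a particle, ending, or suffix, and it does not handle other rules such as palatalization.

```python
HANGUL_BASE = 0xAC00   # code point of the first precomposed syllable, 가
SYLLABLE_COUNT = 11172
ONSET_IEUNG = 11       # index of the silent initial ㅇ

# Final-consonant index -> initial-consonant index for the single and
# double finals (ㄱ ㄲ ㄴ ㄷ ㄹ ㅁ ㅂ ㅅ ㅆ ㅈ ㅊ ㅋ ㅌ ㅍ) carried over unchanged.
FINAL_TO_ONSET = {1: 0, 2: 1, 4: 2, 7: 3, 8: 5, 16: 6, 17: 7,
                  19: 9, 20: 10, 22: 12, 23: 14, 24: 15, 25: 16, 26: 17}


def split_syllable(ch):
    """Return (onset, vowel, final) indices, or None if ch is not a syllable."""
    offset = ord(ch) - HANGUL_BASE
    if not 0 <= offset < SYLLABLE_COUNT:
        return None
    return offset // 588, (offset % 588) // 28, offset % 28


def join_syllable(onset, vowel, final):
    """Build a precomposed syllable from its three indices."""
    return chr(HANGUL_BASE + (onset * 21 + vowel) * 28 + final)


def liaison(word):
    """Move each eligible final consonant onto a following vowel-initial syllable."""
    chars = list(word)
    for i in range(len(chars) - 1):
        cur, nxt = split_syllable(chars[i]), split_syllable(chars[i + 1])
        if cur is None or nxt is None:
            continue
        if nxt[0] == ONSET_IEUNG and cur[2] in FINAL_TO_ONSET:
            chars[i] = join_syllable(cur[0], cur[1], 0)
            chars[i + 1] = join_syllable(FINAL_TO_ONSET[cur[2]], nxt[1], nxt[2])
    return "".join(chars)


for word in ["옷이", "밭에", "앞으로"]:
    print(word, "->", liaison(word))
```

The three words come from the rule's own examples, which give 오시, 바테, and 아프로 as the pronunciations.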
References
If you use our software for research, please cite:
@misc{park2019g2pk,
author = {Park, Kyubyong},
title = {g2pK},
year = {2019},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/Kyubyong/g2pk}}
}