  • Stars
    170
  • Rank 223,357 (Top 5 %)
  • Language
    Python
  • License
    MIT License
  • Created almost 6 years ago
  • Updated almost 2 years ago

Repository Details

Tacotron, Korean, Wavenet-Vocoder, Korean TTS

Tacotron + Wavenet Vocoder + Korean TTS

This project implements Korean TTS by combining the Tacotron model with a Wavenet Vocoder.

Based on

Tacotron History

  • keithito was the first to implement and publish Tacotron; carpedm20's implementation builds on it and adds Korean support.
  • carpedm20's implementation also includes the multi-speaker model proposed in deep voice2.
  • The Tacotron model uses the Griffin Lim algorithm as its vocoder.

Wavenet History

  • The best-known Wavenet implementation is ibab's.
  • ibab did not implement local conditioning. As a result, audio generated after training is not intelligible speech, only babbling. To produce meaningful speech, the model has to be implemented with local conditioning.
  • The best-known wavenet-vocoder implementation with local conditioning is r9y9's.
  • The mel spectrogram is fed in as the local condition. Because the mel spectrogram is shorter than the raw audio, it has to be upsampled; the upsampling uses conv2d_transpose.
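
The length mismatch described above can be illustrated with a small pure-Python sketch. It is not the repository's code: the project uses a learned conv2d_transpose, whereas this stand-in applies a transposed 1-D convolution whose kernel size equals its stride (hop_size), with a fixed all-ones kernel. The function name upsample_frames and the toy values are assumptions made for illustration.

```python
def upsample_frames(frames, hop_size, kernel=None):
    """Transposed 1-D convolution over time with stride == kernel size == hop_size.

    Because stride and kernel size are equal, output windows do not overlap:
    each input frame contributes exactly hop_size output samples.
    """
    if kernel is None:
        kernel = [1.0] * hop_size  # all-ones kernel: plain repetition of each frame
    if len(kernel) != hop_size:
        raise ValueError("this sketch assumes kernel size == hop_size")
    out = []
    for value in frames:
        out.extend(value * k for k in kernel)
    return out


# One mel channel with 3 frames, upsampled to the raw-audio time resolution.
mel_channel = [0.1, 0.5, -0.2]
hop_size = 4  # toy value; real hop sizes are in the hundreds of samples
upsampled = upsample_frames(mel_channel, hop_size)
print(len(mel_channel), "frames ->", len(upsampled), "samples")
```

A learned kernel would replace the all-ones weights, but the length relation (frames times hop_size) stays the same.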

Tacotron 2

  • Tacotron2 changes the model architecture and proposes Location Sensitive Attention, a Stop Token, and Wavenet as the vocoder.
  • Rayhane-mamah has an implementation of Tacotron2, which likewise builds on the code of keithito and r9y9.

This Project

  • The primary goal is to apply a Wavenet Vocoder to the Tacotron model.
  • To use Tacotron and the Wavenet Vocoder together, the mel spectrogram has to be generated from the start so that it works for both models (the audio length must be a multiple of hop_size). This keeps upsampling straightforward during wavenet training.
  • Applying Tacotron2's stop token or Location Sensitive Attention to Tacotron1 was not very effective (in my experience).
  • Differences from carpedm20's implementation
    • carpedm20's implementation runs only on Tensorflow 1.3; it was modified to also run on tensorflow 1.8 and later. Newer Tensorflow versions added attention_state to AttentionWrapperState, and the code was updated accordingly.
    • Fixed a dropout bug.
    • Corrected the order of DecoderPrenetWrapper and AttentionWrapper. This matches keithito's implementation and the intent of the paper. DecoderPrenetWrapper must wrap AttentionWrapper so that the Prenet output becomes the input to AttentionWrapper.
    • Reverted mel spectrogram generation to keithito's method (itself a later revision by keithito). This speeds up training considerably: speech used to become audible only after 20k or more steps, but with this change it is audible from about 8k steps.
    • Prevented Attention from attending to padded positions.
    • Added Attention models: LocationSensitiveAttention, GmmAttention, and others.
  • Differences from ibab's wavenet implementation
    • ibab implemented it with tf.Variable for fast generation. This project uses the Tensorflow mid-level API tf.layers.conv1d to make the code easier to understand.
  • Many of the complicated options found in the reference code were removed.
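
The point about keeping Attention away from padded positions can be sketched as a masked softmax. This is a minimal illustration in plain Python, not the repository's TensorFlow code; the function name masked_softmax and the example scores are assumptions.

```python
import math


def masked_softmax(scores, valid_length):
    """Softmax over the first valid_length scores; padded positions get weight 0."""
    if not 0 < valid_length <= len(scores):
        raise ValueError("valid_length must be between 1 and len(scores)")
    valid = scores[:valid_length]
    peak = max(valid)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in valid]
    total = sum(exps)
    padding = [0.0] * (len(scores) - valid_length)
    return [e / total for e in exps] + padding


# Encoder output of length 5 where only the first 3 positions are real text;
# the last two are padding and must not receive any attention weight,
# even though their raw scores are the largest.
scores = [2.0, 1.0, 0.5, 3.0, 3.0]
weights = masked_softmax(scores, valid_length=3)
print(weights)
```

Without the mask, the two padded positions would take most of the attention weight in this example.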

Getting Good Results with Tacotron 1

  • Using BahdanauMonotonicAttention with normalize=True makes Attention train well.
  • In my experience, Location Sensitive Attention, GMM Attention, and similar mechanisms did not perform well.
  • In Tacotron2, Location Sensitive Attention combined with the Stop Token converges faster than Tacotron1.

Step-by-Step Execution

Execution Order

  • Prepare the data.
  • Train tacotron, then test with synthesizer.py.
  • Train wavenet, then test with generate.py (the test can use either a mel spectrogram that tacotron did not produce or one that tacotron produced).
  • After both models are trained, test by feeding the mel spectrogram generated by tacotron into wavenet as the local condition.

Preparing the Data

  • Download audio data (e.g. wave files) and cut it into clips of 1 to 3 seconds (12 seconds at most). Aligning the clipped audio with its text (script) is tedious work. Using the Google Speech API is one option.
  • The text produced by the Google Speech API is of poor quality, so it needs (a great deal of) manual correction.
  • If you have no other way to obtain data, follow the procedure described by carpedm20, which downloads the data, splits it at silent intervals, and then uses the Google Speech API to align it with the text.
  • For Korean there is the KSS Dataset; for English there are the LJ Speech Dataset, the VCTK corpus, and others.
  • The KSS Dataset and the LJ Speech Dataset are already split into clips of suitable length, so the data quality is high.
  • Collect each speaker's wav files in a directory, create a file that maps the text to the wav files, and then run preprocess.py. In the example below, the required information is gathered in 'son-recognition-All.json', as can be seen in son.py.
  • You need to write preprocessing code that suits your own situation. This project includes two examples, son and moon.

python preprocess.py --num_workers 8 --name son --in_dir .\datasets\son --out_dir .\data\son

  • Whether you follow the steps above or use another method, you are ready to train once npz files exist in each speaker's data directory. Each npz file holds dict-style data with the keys ['audio', 'mel', 'linear', 'time_steps', 'mel_frames', 'text', 'tokens', 'loss_coeff']. The important point is that the audio length must be hop_size times the length of mel and linear.
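
The length requirement above (audio length equal to hop_size times the number of frames) can be met by padding the audio before the spectrograms are computed. The sketch below shows one way to do that; whether the repository pads or trims is not stated here, so the helper name pad_to_hop_multiple, the zero padding, and hop_size = 256 are all assumptions.

```python
def pad_to_hop_multiple(audio, hop_size, pad_value=0.0):
    """Return audio extended with pad_value so that len(audio) % hop_size == 0."""
    remainder = len(audio) % hop_size
    if remainder:
        audio = audio + [pad_value] * (hop_size - remainder)
    return audio


hop_size = 256           # assumed value, for illustration only
audio = [0.0] * 1000     # stand-in for 1000 raw audio samples
padded = pad_to_hop_multiple(audio, hop_size)

# With one frame per hop, the frame count follows directly from the length.
mel_frames = len(padded) // hop_size
time_steps = len(padded)
print(time_steps, mel_frames, time_steps == mel_frames * hop_size)
```

Checking time_steps == mel_frames * hop_size on a few generated files is a cheap way to catch preprocessing mistakes before training the vocoder.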

Tacotron Training

  • Set '--data_paths' in train_tacotron.py, then train.
parser.add_argument('--data_paths', default='.\\data\\moon,.\\data\\son')
  • To resume training, set '--load_path'.
parser.add_argument('--load_path', default='logdir-tacotron/moon+son_2018-12-25_19-03-21')
  • For a single speaker, set model_type = 'single' in hparams and give only one path in '--data_paths' in train_tacotron.py.
parser.add_argument('--data_paths', default='D:\\Tacotron-Wavenet-Vocoder\\data\\moon')
  • Because the hyperparameters are set in hparams.py and the arguments in train_tacotron.py, starting training is as simple as the following.

python train_tacotron.py

  • After training, generate speech as follows. '--num_speakers' and '--speaker_id' must be set correctly.

python synthesizer.py --load_path logdir-tacotron/moon+son_2018-12-25_19-03-21 --num_speakers 2 --speaker_id 0 --text "오스트랄로피테쿠스 아파렌시스는 멸종된 사람족 종으로, 현재에는 뼈 화석이 발견되어 있다."

Wavenet Vocoder Training

  • Set '--data_dir' in train_vocoder.py, then train.
  • If training fails for lack of memory or is too slow, reduce the sample_size hyperparameter, but do not make it smaller than the receptive field. Reducing batch_size is also an option.
DATA_DIRECTORY =  'D:\\Tacotron-Wavenet-Vocoder\\data\\moon,D:\\Tacotron-Wavenet-Vocoder\\data\\son'
parser.add_argument('--data_dir', type=str, default=DATA_DIRECTORY, help='The directory containing the VCTK corpus.')
  • To resume training, set '--logdir'.
LOGDIR = './/logdir-wavenet//train//2018-12-21T22-58-10'
parser.add_argument('--logdir', type=str, default=LOGDIR)
  • After training wavenet, feed the mel spectrogram generated by tacotron (an npy file) in as the local condition to obtain the final TTS result.

python generate.py --mel ./logdir-wavenet/mel-moon.npy --gc_cardinality 2 --gc_id 0 ./logdir-wavenet/train/2018-12-21T22-58-10
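
On the sample_size limit mentioned in the training notes above: for a stack of dilated causal convolutions, the receptive field is commonly computed as (filter_width - 1) * sum(dilations) + 1. The sketch below uses that formula to check a candidate sample_size. The dilation schedule and filter width are hypothetical rather than this project's settings, and some implementations add an extra term for the initial causal layer, so treat the result as a lower bound.

```python
def receptive_field(filter_width, dilations):
    """Samples of past context seen by a stack of dilated causal convolutions."""
    return (filter_width - 1) * sum(dilations) + 1


def check_sample_size(sample_size, filter_width, dilations):
    """Raise ValueError if sample_size is smaller than the receptive field."""
    rf = receptive_field(filter_width, dilations)
    if sample_size < rf:
        raise ValueError(
            "sample_size %d is smaller than the receptive field %d" % (sample_size, rf)
        )
    return rf


# Hypothetical schedule: two stacks of dilations 1, 2, 4, ..., 512.
dilations = [2 ** i for i in range(10)] * 2
rf = check_sample_size(sample_size=15000, filter_width=2, dilations=dilations)
print("receptive field:", rf)
```

If memory forces a smaller sample_size, this check shows how far it can be reduced before reducing batch_size becomes the better choice.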

Result

  • The tacotron model produces audio samples through the griffin lim vocoder, and the sound quality is not bad.
  • The wavenet vocoder rarely gives good results when it has not been trained for enough steps. The following issues confirm this.
    • r9y9/wavenet_vocoder#110 : reports that more than 1000K training steps are needed to get noise-free results.
    • keithito/tacotron#64 : reports that training was slow and did not give good results.
    • r9y9/wavenet_vocoder#1 : results at steps 80K and 90K are attached, but they are not good.
    • https://r9y9.github.io/wavenet_vocoder/ : nevertheless, this page shows that more training steps do produce good results.
  • Results from this project: the wavenet vocoder output is not good because of too few training steps. Training on a more powerful GPU should give better results.

For Those New to Speech Processing

  • Tensorflow's Simple Audio Recognition is a good starting point for people who are new to speech processing.
  • It lets you study how wav audio is converted with the stft and then into a mel spectrogram.
  • You can move on to Tacotron after Simple Audio Recognition, but it is even better to study RNNs and Attention, the fundamentals of deep learning, beforehand.
  • This material is my own summary of speech recognition basics, Tacotron, Wavenet, and related topics (page 133).
  • Material on how the Attention Mechanism works in Tensorflow is also available (page 69).
  • See also the post I wrote on Facebook TFKR.
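
As a small companion to the stft-to-mel step mentioned above, the sketch below computes where the centers of mel bands fall in Hz. It uses the widely quoted HTK-style formula mel = 2595 * log10(1 + f / 700); libraries may use other variants, and the 80 bands and 8000 Hz upper limit are assumed values, not taken from this project.

```python
import math


def hz_to_mel(hz):
    """HTK-style conversion from frequency in Hz to the mel scale."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(num_mels, fmin, fmax):
    """Center frequencies (Hz) of num_mels bands spaced evenly on the mel scale."""
    low, high = hz_to_mel(fmin), hz_to_mel(fmax)
    step = (high - low) / (num_mels + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(num_mels)]


centers = mel_center_frequencies(num_mels=80, fmin=0.0, fmax=8000.0)
print("first band center: %.1f Hz, last band center: %.1f Hz" % (centers[0], centers[-1]))
```

The bands are packed closely at low frequencies and spread out at high ones, which is why a mel spectrogram has far fewer channels than the linear stft it is derived from.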