  • Stars: 318
  • Rank 128,650 (Top 3 %)
  • Language: Python
  • License: MIT License
  • Created almost 4 years ago
  • Updated over 1 year ago

Repository Details

VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network

Modified VocGAN


This repo implements a modified version of [VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network](https://arxiv.org/abs/2007.15256) in PyTorch. For the original VocGAN, check out the `baseline` branch. I slightly modified VocGAN's generator and replaced VocGAN's discriminator with Full-Band MelGAN's discriminator. In my experiments, MelGAN's discriminator trains very fast and is powerful enough to train the generator to produce high-fidelity voice, whereas VocGAN's hierarchically-nested JCU discriminator is quite large and slows training down considerably.

Tested on Python 3.6

pip install -r requirements.txt

Prepare Dataset

  • Download a dataset for training. Any set of wav files with a sample rate of 22050 Hz will work (e.g. LJSpeech, which was used in the paper).
  • Preprocess: python preprocess.py -c config/default.yaml -d [data's root path]
  • Edit the configuration yaml file.
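Since the dataset step expects 22050 Hz wav files, it can help to check a folder before preprocessing. The following is a minimal sketch, not part of this repo: the function name `find_bad_sample_rates` is hypothetical, and it relies on Python's standard `wave` module, which only reads PCM-encoded wav files (other encodings will raise `wave.Error`).

```python
import wave
from pathlib import Path

EXPECTED_SR = 22050  # sample rate stated in the dataset step above


def find_bad_sample_rates(root, expected_sr=EXPECTED_SR):
    """Return (path, sample_rate) pairs for wav files not at expected_sr.

    Hypothetical helper: walks `root` recursively and reads only the
    wav header, so it is cheap even for large datasets.
    """
    mismatched = []
    for wav_path in sorted(Path(root).rglob("*.wav")):
        with wave.open(str(wav_path), "rb") as wav_file:
            sample_rate = wav_file.getframerate()
        if sample_rate != expected_sr:
            mismatched.append((str(wav_path), sample_rate))
    return mismatched
```

Calling `find_bad_sample_rates("[data's root path]")` returns an empty list when every file matches; any entries it does return would need resampling before running preprocess.py.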

Train & Tensorboard

  • python trainer.py -c [config yaml file] -n [name of the run]

    • cp config/default.yaml config/config.yaml and then edit config.yaml
    • Write the root paths of the train/validation files on the 2nd/3rd lines.
  • tensorboard --logdir logs/
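The copy-and-edit step for the config can also be scripted. This is only a sketch under assumptions: `make_config` and `set_line_value` are hypothetical helpers, and the code assumes the 2nd and 3rd lines of the yaml file each have the form `key: value`, as the instructions above suggest. It also drops any trailing comment on those two lines and assumes the paths contain no single quotes.

```python
import shutil
from pathlib import Path


def set_line_value(line, new_value):
    """Keep everything up to the first colon and swap in a new quoted value."""
    key, sep, _old_value = line.partition(":")
    if not sep:
        raise ValueError("expected a 'key: value' line, got: %r" % line)
    return "%s: '%s'" % (key, new_value)


def make_config(default_cfg, new_cfg, train_root, valid_root):
    """Copy the default config, then rewrite its 2nd and 3rd lines.

    Assumption: line 2 holds the train root and line 3 the validation root.
    """
    shutil.copyfile(default_cfg, new_cfg)
    lines = Path(new_cfg).read_text().splitlines()
    lines[1] = set_line_value(lines[1], train_root)  # 2nd line: train root
    lines[2] = set_line_value(lines[2], valid_root)  # 3rd line: validation root
    Path(new_cfg).write_text("\n".join(lines) + "\n")
```

For example, `make_config("config/default.yaml", "config/config.yaml", "/data/train", "/data/valid")` leaves default.yaml untouched and writes the edited copy, which can then be passed to trainer.py with -c.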

Notes

  1. This repo implements a modified VocGAN for faster training. For a true VocGAN implementation, please check out the baseline branch. In my testing, I was able to generate high-fidelity audio in real time with the modified VocGAN.
  2. The training cost of the baseline VocGAN discriminator is too high (2.8 sec/it on a P100 with batch size 16) compared to the generator (7.2 it/sec on a P100 with batch size 16), so it is not feasible for me to train this model for a long time.
  3. We may be able to optimize the baseline VocGAN discriminator by downsampling the audio at the preprocessing stage instead of the training stage (currently I use torchaudio.transforms.Resample as a layer to downsample the audio). This might speed up overall discriminator training.
  4. I trained the baseline model for 300 epochs (with batch size 16) on LJSpeech, and the quality of the generated audio is similar to MelGAN at the same epoch on the same dataset. The authors recommend training the model for 3000 epochs, which is not feasible at the current training speed (2.80 sec/it).
  5. I am open to any suggestions and modifications to this repo.
  6. For a more complete, end-to-end voice cloning or Text to Speech (TTS) toolbox 🤖, please visit Deepsync Technologies.
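To put the speeds in notes 2 and 4 in perspective, here is a rough back-of-the-envelope estimate. It is a sketch under assumptions: it uses the full LJSpeech clip count (13,100) as the training set size, whereas the actual train split produced by preprocess.py may be smaller, and it treats the quoted per-iteration speeds as constant for the whole run.

```python
import math

NUM_CLIPS = 13100        # full LJSpeech clip count; actual train split is an assumption
BATCH_SIZE = 16          # from notes 2 and 4
EPOCHS = 3000            # epoch count recommended by the authors (note 4)
DISC_SEC_PER_IT = 2.8    # baseline discriminator speed (note 2)
GEN_IT_PER_SEC = 7.2     # generator speed (note 2)


def training_days(sec_per_it, num_clips=NUM_CLIPS,
                  batch_size=BATCH_SIZE, epochs=EPOCHS):
    """Estimate wall-clock days for a run at a fixed seconds-per-iteration."""
    its_per_epoch = math.ceil(num_clips / batch_size)
    return its_per_epoch * epochs * sec_per_it / 86400  # 86400 seconds per day


slow_days = training_days(DISC_SEC_PER_IT)
fast_days = training_days(1 / GEN_IT_PER_SEC)
print("at 2.8 sec/it: %.1f days" % slow_days)
print("at 7.2 it/sec: %.1f days" % fast_days)
print("slowdown factor: %.2f" % (slow_days / fast_days))
```

Under these assumptions, 3000 epochs take about 80 days at 2.8 sec/it versus about 4 days at 7.2 it/sec, a slowdown of roughly 20x.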

Inference

  • python inference.py -p [checkpoint path] -i [input mel path]

Pretrained models

Two pretrained models are provided. Both were trained using the modified VocGAN architecture.

Audio Samples

Using pretrained models, we can reconstruct audio samples. Visit here to listen.

Results

[WIP]
