  • Stars: 187
  • Rank: 206,464 (Top 5%)
  • Language: Python
  • License: MIT License
  • Created almost 5 years ago
  • Updated over 1 year ago

Repository Details

Fine-tuned pre-trained GPT2 for custom, topic-specific text generation. Such a system can be used for text augmentation.

TextAugmentation-GPT2

Getting Started

  1. git clone https://github.com/prakhar21/TextAugmentation-GPT2.git
  2. Move your data to the data/ directory.

* Refer to data/SMSSpamCollection for the expected file format.
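
As a rough illustration of the file format, the sketch below parses a labeled corpus into (label, text) pairs. It assumes that data/SMSSpamCollection follows the layout of the public SMS Spam Collection dataset, that is, one `label<TAB>message` pair per line; check the file in the repository to confirm. The function name `read_labeled_lines` is hypothetical and is not part of this repository.

```python
import io
from typing import Iterable, List, Tuple


def read_labeled_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse `label<TAB>text` rows, skipping blank or malformed lines.

    Assumption: one example per line, label and text separated by a tab.
    """
    rows = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # blank line
        label, sep, text = line.partition("\t")
        if not sep or not text.strip():
            continue  # no tab separator or empty message
        rows.append((label.strip().lower(), text.strip()))
    return rows


# In-memory stand-in for a data file; a real file would be opened with open(path).
sample = io.StringIO("ham\tSee you at 5?\nspam\tWIN a prize now!\n\nbroken line\n")
rows = read_labeled_lines(sample)
print(rows)
```

Running a check like this on your own file before training is a cheap way to catch rows that are missing a label or a separator.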

Tuning for own Corpus

  1. Make sure you have completed step 2 under Getting Started.
  2. Run python3 train.py --data_file <filename> --epoch <number_of_epochs> --warmup <warmup_steps> --model_name <model_name> --max_len <max_seq_length> --learning_rate <learning_rate> --batch <batch_size>
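
To make the flags above easier to read, here is a minimal sketch of a command-line parser that accepts the same options. It is not the repository's train.py: the flag names come from the command above, but the types, default values, and help texts are assumptions chosen only for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the train.py flags listed above.

    All defaults below are hypothetical placeholders, not the project's defaults.
    """
    parser = argparse.ArgumentParser(description="Fine-tune GPT2 (illustrative sketch)")
    parser.add_argument("--data_file", type=str, required=True, help="file name inside data/")
    parser.add_argument("--epoch", type=int, default=3, help="number of training epochs")
    parser.add_argument("--warmup", type=int, default=300, help="learning-rate warmup steps")
    parser.add_argument("--model_name", type=str, default="model.pt", help="name for the saved model")
    parser.add_argument("--max_len", type=int, default=200, help="maximum sequence length")
    parser.add_argument("--learning_rate", type=float, default=3e-5, help="optimizer learning rate")
    parser.add_argument("--batch", type=int, default=32, help="batch size")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(
    ["--data_file", "SMSSpamCollection", "--epoch", "5", "--learning_rate", "1e-4"]
)
print(args.data_file, args.epoch, args.learning_rate, args.batch)
```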

Generating Text

  1. Run python3 generate.py --model_name <model_name> --sentences <number_of_sentences> --label <class_of_training_data>

* It is recommended that you tune the parameters for your task. Otherwise the default parameters are used, which may give sub-optimal performance.

Quick Testing

I fine-tuned the model on the SPAM/HAM dataset. You can download it from here and follow the steps under the Generating Text section.

Sample Results

SPAM: You have 2 new messages. Please call 08719121161 now. £3.50. Limited time offer. Call 090516284580.<|endoftext|>
SPAM: Want to buy a car or just a drink? This week only 800p/text betta...<|endoftext|>
SPAM: FREE Call Todays top players, the No1 players and their opponents and get their opinions on www.todaysplay.co.uk Todays Top Club players are in the draw for a chance to be awarded the £1000 prize. TodaysClub.com<|endoftext|>
SPAM: you have been awarded a £2000 cash prize. call 090663644177 or call 090530663647<|endoftext|>

HAM: Do you remember me?<|endoftext|>
HAM: I don't think so. You got anything else?<|endoftext|>
HAM: Ugh I don't want to go to school.. Cuz I can't go to exam..<|endoftext|>
HAM: K.,k:)where is my laptop?<|endoftext|>
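
To reuse generated lines like the samples above as augmentation data, they first need to be turned back into (label, text) pairs. The sketch below is one possible post-processing step, assuming each generated line has the form `LABEL: text<|endoftext|>` as in the samples; the helper name `parse_sample` is hypothetical and not part of this repository.

```python
from typing import Optional, Tuple

EOS = "<|endoftext|>"  # end-of-text marker seen in the sample output


def parse_sample(line: str) -> Optional[Tuple[str, str]]:
    """Split 'LABEL: text<|endoftext|>' into (label, text).

    Returns None when the line has no label prefix or no text.
    """
    label, sep, text = line.partition(":")  # split on the first colon only
    if not sep:
        return None
    text = text.split(EOS, 1)[0].strip()  # drop the marker and anything after it
    if not text:
        return None
    return label.strip().lower(), text


generated = [
    "HAM: Do you remember me?<|endoftext|>",
    "HAM: K.,k:)where is my laptop?<|endoftext|>",
    "no label here",
]
pairs = [p for p in map(parse_sample, generated) if p is not None]
print(pairs)
```

Splitting on the first colon only keeps colons inside the message, such as the emoticon in the last HAM sample, intact.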

Important Points to Note

  • Top-k and top-p (nucleus) sampling are used while decoding the sequence token by token. You can read more about it here
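
The idea behind combining top-k and top-p sampling can be sketched in plain Python as follows. This is a simplified illustration over a small hand-written probability table, not the decoding code used in this repository: the function names and the toy distribution are assumptions, and real implementations typically operate on model logits rather than on a dictionary of probabilities.

```python
import random
from typing import Dict


def top_k_top_p_filter(probs: Dict[str, float], k: int, p: float) -> Dict[str, float]:
    """Keep the k most likely tokens, then the smallest prefix of them whose
    renormalized cumulative probability reaches p, and renormalize the result."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    top_k_mass = sum(prob for _, prob in ranked)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob / top_k_mass
        if cumulative >= p:
            break  # nucleus reached: drop the remaining low-probability tail
    kept_mass = sum(prob for _, prob in kept)
    return {token: prob / kept_mass for token, prob in kept}


def sample_token(probs: Dict[str, float], k: int, p: float, rng: random.Random) -> str:
    """Draw one token from the filtered, renormalized distribution."""
    filtered = top_k_top_p_filter(probs, k, p)
    return rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]


# Toy next-token distribution (made up for illustration).
next_token_probs = {"call": 0.5, "now": 0.3, "free": 0.15, "xyz": 0.05}
filtered = top_k_top_p_filter(next_token_probs, k=3, p=0.7)
token = sample_token(next_token_probs, k=3, p=0.7, rng=random.Random(0))
print(filtered, token)
```

With k=3 and p=0.7, the top-k step drops "xyz" and the top-p step then drops "free", so only "call" and "now" can be sampled. Lower values of k or p make the output more conservative; higher values allow more varied text.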

Note: The first run takes a considerable amount of time because it:

  1. Downloads the pre-trained gpt2-medium model (depends on your network speed).
  2. Fine-tunes GPT2 on your dataset (depends on the size of the data, number of epochs, hyperparameters, etc.).

All experiments were run on Intel DevCloud machines.
