

PLBART

Official code release of our NAACL 2021 work, Unified Pre-training for Program Understanding and Generation.

***** PLBART's performance on the downstream tasks is recorded in this spreadsheet. *****

News • Setup • Pre-training • Fine-tuning • FAQs • Acknowledgement • License • Citation


PLBART is a Transformer model

  • PLBART is a sequence-to-sequence model pre-trained on a large collection of Java and Python functions and natural language descriptions, collected from GitHub and Stack Overflow, respectively.
  • PLBART is pre-trained via denoising autoencoding (DAE) and uses three noising strategies: token masking, token deletion, and token infilling (shown below in the three examples).
| Noisy Input | Original Sequence |
| ----------- | ----------------- |
| Is 0 the [MASK] Fibonacci [MASK] ? `<En>` | `<En>` Is 0 the first Fibonacci number ? |
| public static main ( String args [ ] ) { date = Date ( ) ; System . out . ( String . format ( " Current Date : % tc " , ) ) ; } `<java>` | `<java>` public static void main ( String args [ ] ) { Date date = new Date ( ) ; System . out . printf ( String . format ( " Current Date : % tc " , date ) ) ; } |
| def addThreeNumbers ( x , y , z ) : NEW_LINE INDENT return [MASK] `<python>` | `<python>` def addThreeNumbers ( x , y , z ) : NEW_LINE INDENT return x + y + z |
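The three noising strategies can be sketched in plain Python. This is a simplified illustration over whitespace-separated tokens; the actual pre-training applies noise via fairseq's denoising task over subword units, and the 0.35 noise ratio here is only an assumed value for the sketch:

```python
import random

def mask_tokens(tokens, p=0.35, mask="[MASK]"):
    """Token masking: replace each sampled token with a mask symbol."""
    return [mask if random.random() < p else t for t in tokens]

def delete_tokens(tokens, p=0.35):
    """Token deletion: drop sampled tokens; the model must also infer where they were."""
    return [t for t in tokens if random.random() >= p]

def infill_span(tokens, span_len=3, mask="[MASK]"):
    """Token infilling: replace a contiguous span with a single mask symbol."""
    if len(tokens) <= span_len:
        return [mask]
    start = random.randrange(len(tokens) - span_len)
    return tokens[:start] + [mask] + tokens[start + span_len:]

# Example: corrupt an input; the model is trained to reconstruct the original.
noisy = mask_tokens("Is 0 the first Fibonacci number ?".split())
```

In each case, PLBART is trained to reconstruct the original sequence from the noisy input, with a language tag (`<En>`, `<java>`, `<python>`) indicating the modality.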

News


Setup

We can set up a conda environment to run PLBART experiments; the first step is to install the dependencies. We assume Anaconda is installed. The additional requirements (listed in requirements.txt) can be installed by running the following script:

bash install_env.sh

Pre-training

Step1. Download Github data

Go to data/github directory and follow instructions.

Step2. Download StackOverflow data

Go to data/stackoverflow directory and follow instructions.

Step3. Binarize the data and pre-train

cd pretrain
bash binarize.sh
bash pretrain.sh GPU_IDS

[Note] We pre-trained PLBART on 8 GeForce RTX 2080 (11GB) GPUs, which took ~11.5 days. If you want to pre-train PLBART using more GPUs or GPUs with more memory, adjust MAX_SENTENCES, MAX_TOKENS, and UPDATE_FREQ accordingly to maintain an effective batch size of 2048. According to fairseq, the effective batch size is equal to:

PER_GPU_TRAIN_BATCH_SIZE * NUM_GPU * UPDATE_FREQ

Note that MAX_TOKENS refers to the size of each mini-batch in number of tokens. During our experiments, we noticed that an 11GB GPU can accommodate at most 2048 tokens, which is equivalent to 4-5 examples. Therefore, we set UPDATE_FREQ to 60 so that the effective batch size is ~2048.
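Solving the fairseq formula above for UPDATE_FREQ gives a quick way to re-derive the setting for a different hardware budget. A minimal sketch, using the 8-GPU numbers from this README (the function name is our own, not part of the scripts):

```python
def update_freq(target_batch_size, per_gpu_batch_size, num_gpus):
    # EFFECTIVE_BATCH = PER_GPU_TRAIN_BATCH_SIZE * NUM_GPU * UPDATE_FREQ,
    # so UPDATE_FREQ = EFFECTIVE_BATCH / (PER_GPU_TRAIN_BATCH_SIZE * NUM_GPU).
    return max(1, round(target_batch_size / (per_gpu_batch_size * num_gpus)))

# 4-5 examples fit in an 11GB GPU, so UPDATE_FREQ in the 51-64 range keeps
# the effective batch size near 2048 on 8 GPUs; the scripts use 60.
print(update_freq(2048, 4, 8))  # 64
print(update_freq(2048, 5, 8))  # 51
```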


Fine-tuning

We fine-tune and evaluate PLBART on three types of downstream tasks.

| Type | Task | Language(s) | Data | Scripts | Checkpoints |
| ---- | ---- | ----------- | ---- | ------- | ----------- |
| Code to Text | Code summarization | Python, Java, Ruby, PHP, Javascript, Go | [LINK] | [LINK] | [LINK] |
| Text to Code | Code generation | Java | [LINK] | [LINK] | [LINK] |
| Code to Code | Code translation | Java, C# | [LINK] | [LINK] | [LINK] |
| Code to Code | Code refinement | Java | [LINK] | [LINK] | |
| Code to Code | Clone detection | Java | [LINK] | [LINK] | |
| Code to Code | Defect detection | C/C++ | [LINK] | [LINK] | |

Step1. Download PLBART checkpoint

cd pretrain
bash download.sh
cd ..

Step2. Download the data

cd data/codeXglue
bash download.sh
cd ../..

Step3. Build parser for CodeBLEU evaluation

cd evaluation/CodeBLEU/parser
bash build.sh
cd ../../..

Step4. Prepare the data, train and evaluate PLBART

For example, to fine-tune PLBART on the Text-to-Code task:

cd scripts/text_to_code
bash prepare.sh
bash run.sh GPU_IDS
cd ../..

[Note] We fine-tuned PLBART on 1 GeForce RTX 2080 (11GB) GPU.


FAQs

[NOTE] We present the file structure of this repository here.

How to download Github data from Google BigQuery?

We provided a detailed guide here.

Mismatch in performance reported in the paper and achieved using the released checkpoints.

There is a difference between the performance reported in the paper and the performance achieved with the released checkpoints; we noted the differences here. Note that the hyper-parameter settings are unchanged: the bash scripts contain the exact values we used. The performance difference is perhaps due to running the experiments at different points in time. Although we did not do so ourselves, we recommend fine-tuning PLBART with multiple different seeds and reporting the average scores.

mbart_base task is not present in fairseq==0.9.0 official release.

Although we used fairseq==0.9.0, we used a different commit that includes the mbart_base task. The following should work:

git clone https://github.com/pytorch/fairseq
cd fairseq
git checkout 698e3b91ffa832c286c48035bdff78238b0de8ae
pip install .

Otherwise, you may consider installing fairseq==0.10.0. Please refer to this issue to make other adjustments.

What can be the maximum input and output lengths for PLBART?

The maximum input and output length is 512 tokens.
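For longer examples, the input has to be truncated before being fed to the model. A minimal sketch of one way to do this on tokenized sequences; the `truncate` helper and the tag-preservation detail are our own illustration, assuming the language tag sits at the end of the sequence, not code from the released preprocessing scripts:

```python
MAX_LEN = 512  # PLBART's maximum input/output length

def truncate(tokens, max_len=MAX_LEN):
    """Keep at most max_len tokens, preserving the trailing language tag."""
    if len(tokens) <= max_len:
        return tokens
    # Reserve the last slot for the tag (e.g. <java>) and cut the middle.
    return tokens[: max_len - 1] + tokens[-1:]
```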


Acknowledgement

PLBART uses Fairseq, CodeXGLUE, and TransCoder; we thank the authors of these works for their contributions.


License

The contents of this repository are released under the MIT license, which also applies to the pre-trained and fine-tuned models.


Citation

@inproceedings{ahmad-etal-2021-unified,
    title = "Unified Pre-training for Program Understanding and Generation",
    author = "Ahmad, Wasi  and
      Chakraborty, Saikat  and
      Ray, Baishakhi  and
      Chang, Kai-Wei",
    booktitle = "Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jun,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.naacl-main.211",
    pages = "2655--2668"
}
