• This repository has been archived on 31/Jul/2024



A Transformer-based Approach for Source Code Summarization

Official implementation of our ACL 2020 paper on Source Code Summarization. [arxiv]

Installing C2NL

You may consider installing the C2NL package. C2NL requires Linux and Python 3.6 or higher, as well as PyTorch version 1.3. Its other dependencies are listed in requirements.txt. CUDA is strongly recommended for speed, but is not required.
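Before installing, the Python version requirement above can be checked programmatically (a minimal sketch; the meets_requirements helper is ours for illustration, not part of C2NL):

```python
import sys

def meets_requirements(version_info=None):
    """Return True if the interpreter satisfies C2NL's Python 3.6+
    requirement. Accepts an explicit (major, minor, ...) tuple for
    testing; defaults to the current interpreter's version."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= (3, 6)
```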

Run the following commands to clone the repository and install C2NL:

git clone https://github.com/wasiahmad/NeuralCodeSum.git
cd NeuralCodeSum; pip install -r requirements.txt; python setup.py develop

Training/Testing Models

We provide an RNN-based sequence-to-sequence (Seq2Seq) model implementation along with our Transformer model. To perform training and evaluation, first go to the scripts directory associated with the target dataset.

$ cd scripts/DATASET_NAME

where the choices for DATASET_NAME are ["java", "python"].

To train/evaluate a model, run:

$ bash script_name.sh GPU_ID MODEL_NAME

For example, to train/evaluate the transformer model, run:

$ bash transformer.sh 0,1 code2jdoc

Generated log files

While training and evaluating the models, a number of files are generated inside a tmp directory. The files are as follows.

  • MODEL_NAME.mdl
    • Model file containing the parameters of the best model.
  • MODEL_NAME.mdl.checkpoint
    • A model checkpoint, in case training needs to be restarted.
  • MODEL_NAME.txt
    • Log file for training.
  • MODEL_NAME.json
    • Predictions and gold references dumped during validation.
  • MODEL_NAME_test.txt
    • Log file for evaluation (greedy decoding).
  • MODEL_NAME_test.json
    • Predictions and gold references dumped during evaluation (greedy decoding).
  • MODEL_NAME_beam.txt
    • Log file for evaluation (beam search).
  • MODEL_NAME_beam.json
    • Predictions and gold references dumped during evaluation (beam search).

[Structure of the JSON files] Each line in a JSON file is a JSON object. An example is provided below.

{
    "id": 0,
    "code": "private int current Depth ( ) { try { Integer one Based = ( ( Integer ) DEPTH FIELD . get ( this ) ) ; return one Based - NUM ; } catch ( Illegal Access Exception e ) { throw new Assertion Error ( e ) ; } }",
    "predictions": [
        "returns a 0 - based depth within the object graph of the current object being serialized ."
    ],
    "references": [
        "returns a 0 - based depth within the object graph of the current object being serialized ."
    ],
    "bleu": 1,
    "rouge_l": 1
}

Generating Summaries for Source Codes

We may want to generate summaries for source code using a trained model. This can be done by running the generate.sh script. The input source code file must be placed under the java or python directory, and the value of the DATASET variable must be set manually in the bash script.

A sample Java and Python code file is provided at [data/java/sample.code] and [data/python/sample.code].

$ cd scripts
$ bash generate.sh 0 code2jdoc sample.code

The above command will generate a tmp/code2jdoc_beam.json file containing the predicted summaries.

Running experiments on CPU/GPU/Multi-GPU

  • If GPU_ID is set to -1, CPU will be used.
  • If GPU_ID is set to one specific number, only one GPU will be used.
  • If GPU_ID is set to multiple numbers (e.g., 0,1,2), then parallel computing will be used.
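The GPU_ID convention above can be sketched as a small parser (an illustrative helper of ours, not code from the repository):

```python
def parse_gpu_id(gpu_id):
    """Interpret the GPU_ID script argument.
    Returns an empty list for CPU (-1); otherwise the list of
    device indices to run on (one entry -> a single GPU, several
    comma-separated entries -> parallel computing)."""
    ids = [int(tok) for tok in str(gpu_id).split(",")]
    if ids == [-1]:
        return []  # run on CPU
    return ids
```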

Acknowledgement

We borrowed and modified code from DrQA and OpenNMT. We would like to express our gratitude to the authors of these repositories.

Citation

@inproceedings{ahmad2020summarization,
 author = {Ahmad, Wasi Uddin and Chakraborty, Saikat and Ray, Baishakhi and Chang, Kai-Wei},
 booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL)},
 title = {A Transformer-based Approach for Source Code Summarization},
 year = {2020}
}
