• Stars
    star
    111
  • Rank 314,510 (Top 7 %)
  • Language
    Jupyter Notebook
  • License
    MIT License
  • Created about 2 years ago
  • Updated over 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

CodeBERTScore: an automatic metric for code generation, based on BERTScore

CodeBERTScore

This is the official implementation of the paper:

Shuyan Zhou, Uri Alon, Sumit Agarwal, Graham Neubig, CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code

CodeBERTScore is an Automatic Evaluation Metric for Code, based on BERTScore. This repository is based on the code of BERTScore, and we are grateful to the authors for releasing their code.

April 2023 - CodeBERTScore is now available on pypi, which means that you can simply pip install code-bert-score!


Example:

Figure (a) shows a reference code snippet in Java. Figures (b) and (c) show two generated predictions. Among these two candidates and given the reference, BLEU prefers (scores higher) the code in (b), which is not functionally equivalent to the reference, while CodeBERTScore prefers the code in (c), which is functionaly equivalent to the reference.

How does it work?

As BERTScore, CodeBERTScore leverages the pre-trained contextual embeddings from a model such as CodeBERT and matches words in candidate and reference sentences by cosine similarity. Differently from BERTScore, CodeBERTScore also encodes natural language input or other context along with the generated code, but does not use that context to compute cosine similarities.

This example shows how CodeBERTScore can compute the similarity between the Python expressions x ** 0.5 and math.sqrt(x), which are functionally equivalent, even though they have very few overlapping tokens.

Usage

import code_bert_score
pred_results = code_bert_score.score(cands=predictions, refs=refs, lang='python')

Where pred_results is a 4-tuple of (precision, recall, F1, F3), where each is a 1-D tensor of scores for each prediction-reference pair. F3 is similar to the well-known F1 score, that considers recall 3 times as important as precision. See the definition on Wikipedia.

See our example.py script. Additional details are shown in the original BERTScore demo notebook.

Huggingface 🤗 Models

We fine-tuned the microsoft/codebert-base-mlm model for 1,000,000 steps (with batch_size=32) on several languages separately.

We released the following models to the Huggingface hub:

  • neulab/codebert-python (the default model for lang='python')
  • neulab/codebert-javascript (the default model for lang='javascript' or 'js')
  • neulab/codebert-c (the default model for lang='c')
  • neulab/codebert-cpp (the default model for lang='cpp' or 'c++')
  • neulab/codebert-java (the default model for lang='java')

The appropriate model will be loaded automatically when passing the lang argument to the score(..) function, for example: lang='python'. For other uses, these models can be loaded using (for example):

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("neulab/codebert-python")
model = AutoModelForMaskedLM.from_pretrained("neulab/codebert-python")

Additional Features

  • We found that in NL->Code problems, more accurate results are achieved by encoding the NL sources with the code prediction, but then measuring similarity only for the encoded code:
pred_results = code_bert_score.score(cands=predictions, refs=refs, lang='python', sources=sources)
  • We also found that using Inverse Document Frequencies improve the results, similarly to the original BERTScore. We included an example script that shows how to precompute them here compute_idf.py. Then, the resulting dictionary can be used with the argument idf=idf_dict. Our IDF dicts can be found in ./idf_dicts/.

  • Tuning the layer that the similarity is computed from is also helpful, using num_layers=N where N is between 5-10:

  • We found that more accurate results are achieved by encoding the entire inputs, but measures the similarity only between non-punctuation and non-whitespace tokens. To disable the removal of punctuation tokens, use no_punc=False.

See also our example.py script. Additional details are shown in the original BERTScore demo notebook.

Training

The run_mlm.py script can be used to fine-tune the base model microsoft/codebert-base-mlm on specific languages.

Evaluation

The code to reproduce the results in the paper can be found in the evaluation.

Human Evaluation

We find that CodeBERTScore is more correlated with human preference compared to a variety of common metrics. See more details in the paper.

Functional Correctness

We find that CodeBERTScore is more correlated with functional correctness compared to a variety of common metrics. See more details in the paper.

Citation

@article{zhou2023codebertscore,
  url = {https://arxiv.org/abs/2302.05527},
  author = {Zhou, Shuyan and Alon, Uri and Agarwal, Sumit and Neubig, Graham},
  title = {CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code},  
  publisher = {arXiv},
  year = {2023},
}

More Repositories

1

prompt2model

prompt2model - Generate Deployable Models from Natural Language Instructions
Python
1,953
star
2

Text-Summarization-Papers

An Exhaustive Paper List for Text Summarization
HTML
500
star
3

compare-mt

A tool for holistic analysis of language generations systems
Python
450
star
4

nn4nlp-concepts

A repository of concepts related to neural networks for NLP
Python
447
star
5

ExplainaBoard

Interpretable Evaluation for AI Systems
Python
361
star
6

awesome-align

A neural word aligner based on multilingual BERT
Python
319
star
7

BARTScore

BARTScore: Evaluating Generated Text as Text Generation
Python
317
star
8

knn-transformers

PyTorch + HuggingFace code for RetoMaton: "Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval" (ICML 2022), including an implementation of kNN-LM and kNN-MT
Python
249
star
9

InterpretEval

Interpretable Evaluation for (Almost) All NLP Tasks
HTML
193
star
10

ReviewAdvisor

Heavy Workload on Reviewing Papers? ReviewAdvisor Helps out
Python
192
star
11

xnmt

eXtensible Neural Machine Translation
Python
185
star
12

gemini-benchmark

Jupyter Notebook
150
star
13

RIPPLe

Code for the paper "Weight Poisoning Attacks on Pre-trained Models" (ACL 2020)
Jupyter Notebook
138
star
14

SpanNER

SpanNER: Named EntityRe-/Recognition as Span Prediction
Python
124
star
15

word-embeddings-for-nmt

Supplementary material for "When and Why Are Pre-trained Word Embeddings Useful for Neural Machine Translation?" at NAACL 2018
Python
119
star
16

guided_summarization

GSum: A General Framework for Guided Neural Abstractive Summarization
Python
112
star
17

external-knowledge-codegen

Code and data for ACL20 paper "Incorporating External Knowledge through Pre-training for Natural Language to Code Generation"
Python
96
star
18

cmu-multinlp

Generalizing Natural Language Analysis through Span-relation Representations
Python
88
star
19

REALSumm

REALSumm: Re-evaluating Evaluation in Text Summarization
Python
71
star
20

langrank

A program to choose transfer languages for cross-lingual learning
Python
66
star
21

retomaton

PyTorch code for the RetoMaton paper: "Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval" (ICML 2022)
Python
60
star
22

dynet-benchmark

Benchmarks for DyNet
Python
56
star
23

newlang-tech

A guide to building language technology in new languages.
56
star
24

ragged

Retrieval Augmented Generation Generalized Evaluation Dataset
Jupyter Notebook
51
star
25

contextual-mt

A repository with the code related to experiments around context-aware machine translation
Python
48
star
26

extreme-adaptation-for-personalized-translation

Code for the paper "Extreme Adaptation for Personalized Neural Machine Translation"
Python
43
star
27

lrlm

Code for the paper "Latent Relation Language Models" at AAAI-20.
Python
41
star
28

incremental_tree_edit

Code for "Learning Structural Edits via Incremental Tree Transformations" (ICLR'21)
Python
40
star
29

wikiasp

Code for WikiAsp: Multi-document aspect-based summarization.
Shell
39
star
30

tranX-plugin

A plugin for code generation in PyCharm/IntelliJ using tranX
Java
35
star
31

neural-lpcfg

The Return of Lexical Dependencies: Neural Lexicalized PCFGs (TACL)
Python
33
star
32

covid19-datashare

A repo for sharing language resources related to the outbreak (in machine readable format)
GLSL
27
star
33

ToM-Language-Acquisition

Code used to run experiments for the ICLR 2023 paper "Computational Language Acquisition with Theory of Mind".
Python
14
star
34

cmulab

CMU Linguistic Annotation Backend
Python
14
star
35

AfricanVoices

Hosts text-to-speech corpus and speech synthesizers for African languages.
Shell
12
star
36

cmu-ner

NER System Developed at CMU
Python
12
star
37

lti-llm-deployment

Python
12
star
38

explainaboard_web

Mustache
8
star
39

KGxBoard

Explainable and Interactive Leaderboard for Evaluation of Knowledge Graph Completion Models
6
star
40

DGT

WNGT 2019, DGT Task.
Python
6
star
41

tranx-study

HTML
5
star
42

Reliable-NLPPP

Jupyter Notebook
5
star
43

cord19

cord19 related stuff
Python
5
star
44

globalbench

GlobalBench: A Benchmark for Global Progress in Language Technology
Python
5
star
45

jsalt2019-informal

A repository for random things from the JSALT informal translation group
Python
5
star
46

cmu-edl

Python
3
star
47

code-mining

Stuff for code mining
OpenEdge ABL
2
star
48

ocr-web-interface

OCR web interface using CMULAB backend
JavaScript
1
star
49

explainaboard_client

Python
1
star