CodeSearchNet

Datasets, tools, and benchmarks for representation learning of code.


The CodeSearchNet challenge has been concluded

We would like to thank all participants for their submissions, and we hope that this challenge provided insights to practitioners and researchers about the challenges in semantic code search and motivated new research. We encourage everyone to continue using the dataset and the human evaluations, which we now provide publicly; see the Evaluation section below for details.

No new submissions to the challenge will be accepted.


Quickstart

If this is your first time reading this, we recommend skipping this section and reading the following sections. The below commands assume you have Docker and Nvidia-Docker, as well as a GPU that supports CUDA 9.0 or greater. Note: you should only have to run script/setup once to download the data.

# clone this repository
git clone https://github.com/github/CodeSearchNet.git
cd CodeSearchNet/
# download data (~3.5GB) from S3; build and run the Docker container
script/setup
# this will drop you into the shell inside a Docker container
script/console
# optional: log in to W&B to see your training metrics,
# track your experiments, and submit your models to the benchmark
wandb login

# verify your setup by training a tiny model
python train.py --testrun
# see other command line options, try a full training run with default values,
# and explore other model variants by extending this baseline script
python train.py --help
python train.py

# generate predictions for model evaluation
python predict.py -r github/CodeSearchNet/0123456 # this is the org/project_name/run_id

Finally, you can submit your run to the community benchmark by following these instructions.

Introduction

Project Overview

CodeSearchNet is a collection of datasets and benchmarks that explore the problem of code retrieval using natural language. This research is a continuation of some ideas presented in this blog post and is a collaboration between GitHub and the Deep Program Understanding group at Microsoft Research - Cambridge. We aim to provide a platform for community research on semantic code search via the following:

  1. Instructions for obtaining large corpora of relevant data
  2. Open source code for a range of baseline models, along with pre-trained weights
  3. Baseline evaluation metrics and utilities
  4. Mechanisms to track progress on a shared community benchmark hosted by Weights & Biases

We hope that CodeSearchNet is a step towards engaging with the broader machine learning and NLP community regarding the relationship between source code and natural language. We describe a specific task here, but we expect and welcome other uses of our dataset.

More context regarding the motivation for this problem is in this technical report. Please cite the dataset and the challenge as

@article{husain2019codesearchnet,
  title={{CodeSearchNet} challenge: Evaluating the state of semantic code search},
  author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  journal={arXiv preprint arXiv:1909.09436},
  year={2019}
}

Data

The primary dataset consists of 2 million (comment, code) pairs from open source libraries. Concretely, a comment is a top-level function or method comment (e.g. docstrings in Python), and code is an entire function or method. Currently, the dataset contains Python, JavaScript, Ruby, Go, Java, and PHP code. Throughout this repo, we refer to the terms docstring and query interchangeably. We partition the data into train, validation, and test splits such that code from the same repository can only exist in one partition. Currently, this is the only dataset on which we train our model. Summary statistics about this dataset can be found in this notebook.

For more information about how to obtain the data, see this section.
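
For a quick first look at the data, the compressed jsonlines files can be read with nothing but the Python standard library. A minimal sketch, assuming the default download location and the directory structure linked above (the exact file name below is illustrative):

import gzip
import json

# Illustrative file name; see the directory structure docs for the real layout.
path = "resources/data/python/final/jsonl/train/python_train_0.jsonl.gz"

with gzip.open(path, "rt", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

print(len(examples), "examples")
print(examples[0]["func_name"], "->", examples[0]["docstring"])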

Evaluation

The metric we use for evaluation is Normalized Discounted Cumulative Gain (NDCG). Please see this paper for further details regarding model evaluation. The evaluation script can be found here.
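
For intuition, the sketch below computes NDCG for a single query's ranked results; it is an illustration, not the official evaluation script linked above:

import math

def ndcg(ranked_relevances, k=None):
    # Relevance scores (e.g. 0-3) in the order the model ranked the results.
    rels = ranked_relevances[:k] if k else ranked_relevances
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = sorted(ranked_relevances, reverse=True)[:len(rels)]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# A highly relevant result ranked only third lowers the score.
print(ndcg([0, 1, 3, 0, 2]))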

Annotations

We manually annotated retrieval results for the six languages from 99 general queries. This dataset is used as ground-truth data for evaluation only. Please refer to this paper for further details on the annotation process. These annotations were used to compute the scores on the leaderboard. Now that the competition has concluded, you can find the annotations, along with the annotator comments, here.

Setup

You should only have to perform the setup steps once to download the data and prepare the environment.

  1. Due to the complexity of installing all dependencies, we prepared Docker containers to run this code. You can find instructions on how to install Docker in the official docs. Additionally, you must install Nvidia-Docker to satisfy GPU-compute related dependencies. For those who are new to Docker, this blog post provides a gentle introduction focused on data science.

  2. After installing Docker, you need to download the pre-processed datasets, which are hosted on S3. You can do this by running script/setup.

    script/setup
    

    This will build Docker containers and download the datasets. By default, the data is downloaded into the resources/data/ folder inside this repository, with the directory structure described here.

The datasets you will download (most of them compressed) have a combined size of only ~3.5 GB.

  3. To start the Docker container, run script/console:
    script/console
    
    This will land you inside the Docker container, starting in the /src directory. You can detach from/attach to this container to pause/continue your work.

For more about the data, see Data Details below, as well as this notebook.

Data Details

Data Acquisition

If you have run the setup steps above you will already have the data, and nothing more needs to be done. The data will be available in the /resources/data folder of this repository, with this directory structure.

Schema & Format

Data is stored in jsonlines format. Each line in the uncompressed file represents one example (usually a function with an associated comment). A prettified example of one row is illustrated below.

  • repo: the owner/repo
  • path: the full path to the original file
  • func_name: the function or method name
  • original_string: the raw string before tokenization or parsing
  • language: the programming language
  • code: the part of the original_string that is code
  • code_tokens: tokenized version of code
  • docstring: the top-level comment or docstring, if it exists in the original string
  • docstring_tokens: tokenized version of docstring
  • sha: the commit SHA of the file from which this example was extracted (it also appears in the url field); this field is not used by our models
  • partition: a flag indicating which partition ({train, valid, test}) this datum belongs to. This is not used by the model; instead, we rely on the directory structure to denote the partition of the data.
  • url: the url for the code snippet including the line numbers

Code, comments, and docstrings are extracted in a language-specific manner, removing artifacts of that language.

{
  'code': 'def get_vid_from_url(url):\n'
          '        """Extracts video ID from URL.\n'
          '        """\n'
          "        return match1(url, r'youtu\\.be/([^?/]+)') or \\\n"
          "          match1(url, r'youtube\\.com/embed/([^/?]+)') or \\\n"
          "          match1(url, r'youtube\\.com/v/([^/?]+)') or \\\n"
          "          match1(url, r'youtube\\.com/watch/([^/?]+)') or \\\n"
          "          parse_query_param(url, 'v') or \\\n"
          "          parse_query_param(parse_query_param(url, 'u'), 'v')",
  'code_tokens': ['def',
                  'get_vid_from_url',
                  '(',
                  'url',
                  ')',
                  ':',
                  'return',
                  'match1',
                  '(',
                  'url',
                  ',',
                  "r'youtu\\.be/([^?/]+)'",
                  ')',
                  'or',
                  'match1',
                  '(',
                  'url',
                  ',',
                  "r'youtube\\.com/embed/([^/?]+)'",
                  ')',
                  'or',
                  'match1',
                  '(',
                  'url',
                  ',',
                  "r'youtube\\.com/v/([^/?]+)'",
                  ')',
                  'or',
                  'match1',
                  '(',
                  'url',
                  ',',
                  "r'youtube\\.com/watch/([^/?]+)'",
                  ')',
                  'or',
                  'parse_query_param',
                  '(',
                  'url',
                  ',',
                  "'v'",
                  ')',
                  'or',
                  'parse_query_param',
                  '(',
                  'parse_query_param',
                  '(',
                  'url',
                  ',',
                  "'u'",
                  ')',
                  ',',
                  "'v'",
                  ')'],
  'docstring': 'Extracts video ID from URL.',
  'docstring_tokens': ['Extracts', 'video', 'ID', 'from', 'URL', '.'],
  'func_name': 'YouTube.get_vid_from_url',
  'language': 'python',
  'original_string': 'def get_vid_from_url(url):\n'
                      '        """Extracts video ID from URL.\n'
                      '        """\n'
                      "        return match1(url, r'youtu\\.be/([^?/]+)') or \\\n"
                      "          match1(url, r'youtube\\.com/embed/([^/?]+)') or "
                      '\\\n'
                      "          match1(url, r'youtube\\.com/v/([^/?]+)') or \\\n"
                      "          match1(url, r'youtube\\.com/watch/([^/?]+)') or "
                      '\\\n'
                      "          parse_query_param(url, 'v') or \\\n"
                      "          parse_query_param(parse_query_param(url, 'u'), "
                      "'v')",
  'partition': 'test',
  'path': 'src/you_get/extractors/youtube.py',
  'repo': 'soimort/you-get',
  'sha': 'b746ac01c9f39de94cac2d56f665285b0523b974',
  'url': 'https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/youtube.py#L135-L143'
}

Summary statistics such as row counts and token length histograms can be found in this notebook.

Downloading Data from S3

The shell script /script/setup will automatically download these files into the /resources/data directory. For visibility, the S3 links follow this pattern:

https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/{python,java,go,php,javascript,ruby}.zip

For example, the link for Java is:

https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/java.zip

The size of the dataset is approximately 20 GB when uncompressed. The various files and the directory structure are explained here.
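
If you want to fetch a single language archive without running script/setup, a minimal sketch using only the Python standard library and the URL pattern above:

import urllib.request
import zipfile

lang = "java"  # one of: python, java, go, php, javascript, ruby
url = f"https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/{lang}.zip"

urllib.request.urlretrieve(url, f"{lang}.zip")  # download the archive
with zipfile.ZipFile(f"{lang}.zip") as zf:
    zf.extractall("resources/data/")  # mirrors the default data folder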

Human Relevance Judgements

To train neural models with a large dataset, we use documentation comments (e.g. docstrings) as a proxy for natural language queries. For evaluation (and the leaderboard), we collected human relevance judgements on pairs of realistic-looking natural language queries and code snippets. Now that the challenge has concluded, we provide the data here as a .csv, with the following fields:

  • Language: the programming language of the snippet.
  • Query: the natural language query.
  • GitHubUrl: the URL of the target snippet. This matches the url key in the data (see here).
  • Relevance: the 0-3 human relevance judgement, where "3" is the highest score (very relevant) and "0" is the lowest (irrelevant).
  • Notes: a free-text field with notes that annotators optionally provided.
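
A minimal sketch for exploring these judgements with pandas, assuming you saved the .csv locally (the file name below is hypothetical):

import pandas as pd

df = pd.read_csv("annotations.csv")  # hypothetical local file name

# Average relevance per language and the most frequently annotated queries.
print(df.groupby("Language")["Relevance"].mean())
print(df["Query"].value_counts().head())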

Running Our Baseline Model

We encourage you to reproduce and extend these models, though most variants take several hours to train (and some take more than 24 hours on an AWS P3-V100 instance).

Model Architecture

Our baseline models ingest a parallel corpus of (comments, code) and learn to retrieve a code snippet given a natural language query. Specifically, comments are top-level function and method comments (e.g. docstrings in Python), and code is an entire function or method. Throughout this repo, we refer to the terms docstring and query interchangeably.

The query has a single encoder, whereas each programming language has its own encoder. The available encoders are Neural-Bag-Of-Words, RNN, 1D-CNN, Self-Attention (BERT), and a 1D-CNN+Self-Attention Hybrid.

The diagram below illustrates the general architecture of our baseline models:

[Architecture diagram: a single query encoder and per-language code encoders mapping into a joint embedding space]
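
At retrieval time, the query embedding is compared against pre-computed code embeddings, and snippets are ranked by similarity. A framework-agnostic sketch of that ranking step, with random vectors standing in for real encoder outputs:

import numpy as np

def rank_code_snippets(query_vec, code_vecs):
    # Cosine similarity: normalize, then take dot products.
    q = query_vec / np.linalg.norm(query_vec)
    c = code_vecs / np.linalg.norm(code_vecs, axis=1, keepdims=True)
    return np.argsort(-(c @ q))  # indices, best match first

# Stand-in embeddings: one query and three code snippets in a 4-d space.
query = np.random.rand(4)
codes = np.random.rand(3, 4)
print(rank_code_snippets(query, codes))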

Training

This step assumes that you have a suitable NVIDIA GPU with CUDA 9.0 installed. We used AWS P3-V100 instances (a p3.2xlarge is sufficient).

  1. Start the model run environment by running script/console:

    script/console
    

    This will drop you into the shell of a Docker container with all necessary dependencies installed, including the code in this repository, along with data that you downloaded earlier. By default, you will be placed in the src/ folder of this GitHub repository. From here you can execute commands to run the model.

  2. Set up W&B (free for open source projects) per the instructions below if you would like to share your results on the community benchmark. This is optional but highly recommended.

  3. The entry point to this model is src/train.py. You can see various options by executing the following command:

    python train.py --help
    

    To test if everything is working on a small dataset, you can run the following command:

    python train.py --testrun
    
  4. Now you are prepared for a full training run. Example commands to kick off training runs:

  • Training a neural-bag-of-words model on all languages

    python train.py --model neuralbow
    

    The above command will assume default values for the location(s) of the training data and a destination where you would like to save the output model. The default location for training data is specified in /src/data_dirs_{train,valid,test}.txt. These files each contain a list of paths where data for the corresponding partition exists. If more than one path is specified (one per line), the data from all the paths will be concatenated together. For example, this is the content of src/data_dirs_train.txt:

    $ cat data_dirs_train.txt
    ../resources/data/python/final/jsonl/train
    ../resources/data/javascript/final/jsonl/train
    ../resources/data/java/final/jsonl/train
    ../resources/data/php/final/jsonl/train
    ../resources/data/ruby/final/jsonl/train
    ../resources/data/go/final/jsonl/train
    

    By default, models are saved in the resources/saved_models folder of this repository.

  • Training a 1D-CNN model on Python data only:

    python train.py --model 1dcnn /trained_models ../resources/data/python/final/jsonl/train ../resources/data/python/final/jsonl/valid ../resources/data/python/final/jsonl/test
    

    The above command overrides the default locations for saving the model to trained_models and also overrides the source of the train, validation, and test sets.

Additional notes:

  • Options for --model are currently listed in src/model_restore_helper.get_model_class_from_name.

  • Hyperparameters are specific to the respective model/encoder classes. A simple trick to discover them is to kick off a run without specifying hyperparameter choices, as that will print a list of all used hyperparameters with their default values (in JSON format).

References

Benchmark

We are using a community benchmark for this project to encourage collaboration and improve reproducibility. It is hosted by Weights & Biases (W&B), which is free for open source projects. Our entries in the benchmark link to detailed logs of our training and evaluation metrics, as well as model artifacts, and we encourage other participants to provide as much detail as possible.

We invite the community to submit their runs to this benchmark to facilitate transparency by following these instructions.

How to Contribute

We anticipate that the community will design custom architectures and use frameworks other than Tensorflow. Furthermore, we anticipate that additional datasets will be useful. It is not our intention to integrate these models, approaches, and datasets into this repository as a superset of all available ideas. Rather, we intend to maintain the baseline models and links to the data in this repository as a central place of reference. We are accepting PRs that update the documentation, link to your project(s) with improved benchmarks, fix bugs, or make minor improvements to the code. Here are more specific guidelines for contributing to this repository; note particularly our Code of Conduct. Please open an issue if you are unsure of the best course of action.

Other READMEs

W&B Setup

To initialize W&B:

  1. Navigate to the /src directory in this repository.

  2. If it's your first time using W&B on a machine, you will need to log in:

    $ wandb login
    
  3. You will be asked for your API key, which appears on your W&B profile settings page.
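
Once you are logged in, runs launched via train.py report their metrics to W&B. If you want to log metrics from your own scripts, a minimal sketch of the W&B API (the project name and metric below are placeholders):

import wandb

run = wandb.init(project="codesearchnet-baselines")  # placeholder project name
for epoch in range(3):
    wandb.log({"epoch": epoch, "val_mrr": 0.1 * epoch})  # placeholder metric
run.finish()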

Licenses

The licenses for source code used as data for this project are provided with the data download for each language in _licenses.pkl files.

The code and documentation for this project are released under the MIT License.
