Repository Details

Cool links & research papers related to Machine Learning applied to source code (MLonCode)

Awesome Machine Learning On Source Code

Notice: This repository is no longer actively maintained. No further updates will be made, and issues and PRs will not be answered. An actively maintained alternative can be found at the ml4code.github.io repository.

A curated list of awesome research papers, datasets and software projects devoted to machine learning and source code. #MLonCode

Contents

  • Posts
  • Talks
  • Software
  • Datasets
  • Credits
  • Contributions
  • License
  • Digests

    Conferences

    Competitions

    • CodRep - competition on automatic program repair: given a source line, find the insertion point.

    Papers

    Program Synthesis and Induction

    Source Code Analysis and Language modeling

    Neural Network Architectures and Algorithms

    Embeddings in Software Engineering

    Program Translation

    Code Suggestion and Completion

    Program Repair and Bug Detection

    APIs and Code Mining

    Code Optimization

    Topic Modeling

    Sentiment Analysis

    Code Summarization

    Clone Detection

    Differentiable Interpreters

    Related research

    AST Differencing

    Binary Data Modeling

    Soft Clustering Using T-mixture Models

    Natural Language Parsing and Comprehension

    Posts

    Talks

    Software

    Machine Learning

    • Differentiable Neural Computer (DNC) - TensorFlow implementation of the Differentiable Neural Computer.
    • sourced.ml - Abstracts feature extraction from source code syntax trees and working with ML models.
    • vecino - Finds similar Git repositories.
    • apollo - Source code deduplication at scale, research.
    • gemini - Source code deduplication at scale, production.
    • enry - Insanely fast file based programming language detector.
    • hercules - Git repository mining framework with batteries on top of go-git.
    • DeepCS - Keras and Pytorch implementations of DeepCS (Deep Code Search).
    • Code Neuron - Recurrent neural network to detect code blocks in natural language text.
    • Naturalize - Language-agnostic framework for learning coding conventions from a codebase and then exploiting this information to suggest better identifier names and formatting changes in the code.
    • Extreme Source Code Summarization - Convolutional attention neural network that learns to summarize source code into a short method name-like summary by just looking at the source code tokens.
    • Summarizing Source Code using a Neural Attention Model - CODE-NN, uses LSTM networks with attention to produce sentences that describe C# code snippets and SQL queries from StackOverflow. Implemented in Torch, covering C# and SQL.
    • Probabilistic API Miner - Near parameter-free probabilistic algorithm for mining the most interesting API patterns from a list of API call sequences.
    • Interesting Sequence Miner - Novel algorithm that mines the most interesting sequences under a probabilistic model. It is able to efficiently infer interesting sequences directly from the database.
    • TASSAL - Tool for the automatic summarization of source code using autofolding. Autofolding automatically creates a summary of a source code file by folding non-essential code and comment blocks.
    • JNice2Predict - Efficient and scalable open-source framework for structured prediction, enabling one to build new statistical engines more quickly.
    • Clone Digger - clone detection for Python and Java.
    • Sensibility - Uses LSTMs to detect and correct syntax errors in Java source code.
    • DeepBugs - Framework for learning bug detectors from an existing code corpus.
    • DeepSim - a deep learning-based approach to measure code functional similarity.
    • rnn-autocomplete - Neural code autocompletion with RNN (bachelor's thesis).
    • MindsDB - Explainable AutoML framework for developers. Build, train and use state-of-the-art ML models in as little as one line of code.

    Utilities

    • go-git - Highly extensible Git implementation in pure Go which is friendly to data mining.
    • bblfsh - Self-hosted server for source code parsing.
    • engine - Scalable and distributed data retrieval pipeline for source code.
    • minhashcuda - Weighted MinHash implementation on CUDA to efficiently find duplicates.
    • kmcuda - k-means on CUDA to cluster and to search for nearest neighbors in dense space.
    • wmd-relax - Python package which finds nearest neighbors at Word Mover's Distance.
    • Tregex, Tsurgeon and Semgrex - Tregex is a utility for matching patterns in trees, based on tree relationships and regular expression matches on nodes (the name is short for "tree regular expressions").
    • source{d} models - Machine Learning models for MLonCode trained using the source{d} stack.

    Datasets

    Credits

    Contributions

    See CONTRIBUTING.md. TL;DR: create a pull request which is signed off.

    License

    License: CC BY-SA 4.0
