Gaining advanced insights from Git repository history.

Hercules

Fast, insightful and highly customizable Git history analysis.

Overview

Hercules is an amazingly fast and highly customizable Git repository analysis engine written in Go. Batteries are included. Powered by go-git.

Notice (November 2020): the main author is back from the limbo and is gradually resuming the development. See the roadmap.

There are two command-line tools: hercules and labours. The first is a program written in Go which takes a Git repository and executes a Directed Acyclic Graph (DAG) of analysis tasks over the full commit history. The second is a Python script which shows some predefined plots over the collected data. These two tools are normally used together through a pipe. It is possible to write custom analyses using the plugin system. It is also possible to merge several analysis results together - relevant for organizations. The analyzed commit history includes branches, merges, etc.

Hercules has been successfully used for several internal projects at source{d}. There are blog posts: 1, 2 and a presentation. Please contribute by testing, fixing bugs, adding new analyses, or coding swagger!

The DAG of burndown and couples analyses with UAST diff refining. Generated with hercules --burndown --burndown-people --couples --feature=uast --dry-run --dump-dag doc/dag.dot https://github.com/src-d/hercules

torvalds/linux line burndown (granularity 30, sampling 30, resampled by year). Generated with hercules --burndown --first-parent --pb https://github.com/torvalds/linux | labours -f pb -m burndown-project in 1h 40min.

Installation

Grab hercules binary from the Releases page. labours is installable from PyPi:

pip3 install labours

pip3 is the Python package manager.

Numpy and Scipy can be installed on Windows using http://www.lfd.uci.edu/~gohlke/pythonlibs/

Build from source

You are going to need Go (>= v1.11) and protoc.

git clone https://github.com/src-d/hercules && cd hercules
make
pip3 install -e ./python

GitHub Action

It is possible to run Hercules as a GitHub Action: Hercules on GitHub Marketplace. Please refer to the sample workflow, which demonstrates how to set it up.

Contributions

...are welcome! See CONTRIBUTING and code of conduct.

License

Apache 2.0

Usage

The most useful and reliably up-to-date command line reference:

hercules --help

Some examples:

# Use "memory" go-git backend and display the burndown plot. "memory" is the fastest but the repository's git data must fit into RAM.
hercules --burndown https://github.com/go-git/go-git | labours -m burndown-project --resample month
# Use "file system" go-git backend and print some basic information about the repository.
hercules /path/to/cloned/go-git
# Use "file system" go-git backend, cache the cloned repository to /tmp/repo-cache, use Protocol Buffers and display the burndown plot without resampling.
hercules --burndown --pb https://github.com/git/git /tmp/repo-cache | labours -m burndown-project -f pb --resample raw

# Now something fun
# Get the linear history from git rev-list, reverse it
# Pipe to hercules, produce burndown snapshots for every 30 days grouped by 30 days
# Save the raw data to cache.yaml, so that it is later possible to run labours -i cache.yaml
# Pipe the raw data to labours, set text font size to 16pt, use Agg matplotlib backend and save the plot to git.png
git rev-list HEAD | tac | hercules --commits - --burndown https://github.com/git/git | tee cache.yaml | labours -m burndown-project --font-size 16 --backend Agg --output git.png

labours -i /path/to/yaml reads the output from hercules which was saved on disk.

Caching

It is possible to store the cloned repository on disk. The subsequent analysis can run on the corresponding directory instead of cloning from scratch:

# First time - cache
hercules https://github.com/git/git /tmp/repo-cache

# Second time - use the cache
hercules --some-analysis /tmp/repo-cache

GitHub Action

The action produces the artifact named hercules_charts. Since it is currently impossible to pack several files in one artifact, all the charts and Tensorflow Projector files are packed in the inner tar archive. In order to view the embeddings, go to projector.tensorflow.org, click "Load" and choose the two TSVs. Then use UMAP or t-SNE.

Docker image

docker run --rm srcd/hercules hercules --burndown --pb https://github.com/git/git | docker run --rm -i -v $(pwd):/io srcd/hercules labours -f pb -m burndown-project -o /io/git_git.png

Built-in analyses

Project burndown

hercules --burndown
labours -m burndown-project

Line burndown statistics for the whole repository. It does exactly what git-of-theseus does, but much faster. Blaming is performed efficiently and incrementally using a custom RB tree tracking algorithm, and only the last modification date is recorded while running the analysis.

All burndown analyses depend on the values of granularity and sampling. Granularity is the number of days each band in the stack consists of. Sampling is the frequency with which the burndown state is snapshotted. The smaller the value, the smoother the plot, but the more work is done.
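To make the two parameters concrete, here is a minimal Python sketch. It is not the Hercules implementation: it assumes that days are counted from the project's first commit, and `band_index` and `snapshot_days` are hypothetical helper names introduced only for illustration.

```python
from typing import List


def band_index(line_day: int, granularity: int) -> int:
    """Return the band a line falls into, given the day it was last modified."""
    return line_day // granularity


def snapshot_days(last_day: int, sampling: int) -> List[int]:
    """Return the days on which the burndown state would be snapshotted."""
    return list(range(0, last_day + 1, sampling))


# With granularity 30, lines last modified on days 30..59 share band 1.
print(band_index(45, 30))      # prints 1
# With sampling 30, a 100-day history is snapshotted four times.
print(snapshot_days(100, 30))  # prints [0, 30, 60, 90]
```

Halving either value roughly doubles the number of bands or snapshots, which is where the extra work comes from.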

There is an option to resample the bands inside labours, so that you can define a very precise distribution and visualize it in different ways. Besides, resampling aligns the bands across periodic boundaries, e.g. months or years. Unresampled bands are not aligned and start from the project's birth date.

Files

hercules --burndown --burndown-files
labours -m burndown-file

Burndown statistics for every file in the repository which is alive in the latest revision.

Note: it will generate a separate graph for every file. You don't want to run it on a repository with many files.

People

hercules --burndown --burndown-people [--people-dict=/path/to/identities]
labours -m burndown-person

Burndown statistics for the repository's contributors. If --people-dict is not specified, the identities are discovered by the following algorithm:

  1. We start from the root commit towards the HEAD. Emails and names are converted to lower case.
  2. If we process an unknown email and name, record them as a new developer.
  3. If we process a known email but unknown name, match to the developer with the matching email, and add the unknown name to the list of that developer's names.
  4. If we process an unknown email but known name, match to the developer with the matching name, and add the unknown email to the list of that developer's emails.

If --people-dict is specified, it should point to a text file with the custom identities. The format is: every line is a single developer, it contains all the matching emails and names separated by |. The case is ignored.
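The automatic discovery steps listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code Hercules runs: `discover_identities` is a hypothetical name, and the tie-break when a known email and a known name point at different developers (the email wins here) is an assumption, since the steps do not cover that case.

```python
from typing import Dict, List, Set, Tuple


def discover_identities(signatures: List[Tuple[str, str]]) -> List[Dict[str, Set[str]]]:
    """Merge (name, email) pairs given in commit order, from the root to HEAD."""
    developers: List[Dict[str, Set[str]]] = []
    by_email: Dict[str, int] = {}
    by_name: Dict[str, int] = {}
    for name, email in signatures:
        name, email = name.lower(), email.lower()  # step 1: ignore case
        if email in by_email:
            index = by_email[email]  # step 3: known email
        elif name in by_name:
            index = by_name[name]  # step 4: known name
        else:
            index = len(developers)  # step 2: new developer
            developers.append({"names": set(), "emails": set()})
        developers[index]["names"].add(name)
        developers[index]["emails"].add(email)
        # Keep the first owner of each alias (assumption, see above).
        by_email.setdefault(email, index)
        by_name.setdefault(name, index)
    return developers


history = [("Alice", "a@x.com"), ("alice", "a@y.com"), ("Al", "a@x.com"), ("Bob", "b@x.com")]
print(len(discover_identities(history)))  # prints 2
```

Because matching is greedy and order-dependent, two people who share a name are merged into one developer; a custom --people-dict file avoids that.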

Overwrites matrix

Wireshark top 20 devs - overwrites matrix

hercules --burndown --burndown-people [--people-dict=/path/to/identities]
labours -m overwrites-matrix

Besides the burndown information, --burndown-people collects the added and deleted line statistics per developer. Thus it is possible to visualize how many lines written by developer A are removed by developer B. This indicates collaboration between people and defines expertise teams.

The output is a matrix with N rows and (N+2) columns, where N is the number of developers.

  1. First column is the number of lines the developer wrote.
  2. Second column is how many lines were written by the developer and deleted by unidentified developers (if --people-dict is not specified, it is always 0).
  3. The rest of the columns show how many lines were written by the developer and deleted by identified developers.

The sequence of developers is stored in people_sequence YAML node.
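As a reading aid, the column layout described above can be indexed like this in Python. The sample numbers are invented, `deleted_by` is a hypothetical helper, and two things are assumptions: that the deleter columns follow the same developer order as the rows, and that counts are stored as positive numbers (the sign convention is not stated here).

```python
from typing import List


def deleted_by(matrix: List[List[int]], author: int, deleter: int) -> int:
    """Lines written by `author` and later deleted by identified developer `deleter`."""
    # Column 0 is lines written, column 1 is deletions by unidentified developers,
    # so the identified developers start at column 2.
    return matrix[author][2 + deleter]


# Invented matrix for N = 2 developers: 2 rows, N + 2 = 4 columns.
matrix = [
    [100, 0, 10, 25],
    [60, 0, 5, 8],
]
print(deleted_by(matrix, author=0, deleter=1))  # prints 25
```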

Code ownership

Ember.js top 20 devs - code ownership

hercules --burndown --burndown-people [--people-dict=/path/to/identities]
labours -m ownership

--burndown-people also makes it possible to draw a stacked area plot of the code share through time, that is, how many lines are alive at the sampled moments in time for each identified developer.

Couples

torvalds/linux files' coupling in Tensorflow Projector

hercules --couples [--people-dict=/path/to/identities]
labours -m couples -o <name> [--couples-tmp-dir=/tmp]

Important: it requires Tensorflow to be installed, please follow official instructions.

The files are coupled if they are changed in the same commit. The developers are coupled if they change the same file. hercules records the number of couples throughout the whole commit history and outputs the two corresponding co-occurrence matrices. labours then trains Swivel embeddings - dense vectors which reflect the co-occurrence probability through the Euclidean distance. The training requires a working Tensorflow installation. The intermediate files are stored in the system temporary directory or --couples-tmp-dir if it is specified. The trained embeddings are written to the current working directory with the name depending on -o. The output format is TSV and matches Tensorflow Projector so that the files and people can be visualized with t-SNE implemented in TF Projector.
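The file-coupling rule (two files are coupled when they change in the same commit) can be illustrated with a small Python sketch. It only counts pairs; it is not how hercules stores its matrices, and `file_couples` and the sample commits are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


def file_couples(commits: Iterable[List[str]]) -> Dict[Tuple[str, str], int]:
    """Count how often each pair of files is changed in the same commit."""
    counts: Counter = Counter()
    for files in commits:
        # Sort so that (a, b) and (b, a) are counted as the same pair.
        for pair in combinations(sorted(set(files)), 2):
            counts[pair] += 1
    return dict(counts)


commits = [["a.go", "b.go"], ["a.go", "b.go", "c.go"], ["c.go"]]
print(file_couples(commits))
# prints {('a.go', 'b.go'): 2, ('a.go', 'c.go'): 1, ('b.go', 'c.go'): 1}
```

The developer matrix follows the same idea with the roles swapped: developers are the items and a shared file is the event that couples them.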

Structural hotness

      46  jinja2/compiler.py:visit_Template [FunctionDef]
      42  jinja2/compiler.py:visit_For [FunctionDef]
      34  jinja2/compiler.py:visit_Output [FunctionDef]
      29  jinja2/environment.py:compile [FunctionDef]
      27  jinja2/compiler.py:visit_Include [FunctionDef]
      22  jinja2/compiler.py:visit_Macro [FunctionDef]
      22  jinja2/compiler.py:visit_FromImport [FunctionDef]
      21  jinja2/compiler.py:visit_Filter [FunctionDef]
      21  jinja2/runtime.py:__call__ [FunctionDef]
      20  jinja2/compiler.py:visit_Block [FunctionDef]

Thanks to Babelfish, hercules is able to measure how many times each structural unit has been modified. By default, it looks at functions; refer to Semantic UAST XPath manual to switch to something else.

hercules --shotness [--shotness-xpath-*]
labours -m shotness

Couples analysis automatically loads "shotness" data if available.

Jinja2 functions grouped by structural hotness

hercules --shotness --pb https://github.com/pallets/jinja | labours -m couples -f pb

Aligned commit series

tensorflow/tensorflow aligned commit series of top 50 developers by commit number.

hercules --devs [--people-dict=/path/to/identities]
labours -m devs -o <name>

We record how many commits were made, as well as lines added, removed and changed per day for each developer. We plot the resulting commit time series using a few tricks to show the temporal grouping. In other words, two adjacent commit series should look similar after normalization.

  1. We compute the distance matrix of the commit series. Our distance metric is Dynamic Time Warping. We use the FastDTW algorithm, whose complexity is linear in the length of the time series. Thus the overall complexity of computing the matrix is quadratic.
  2. We compile the linear list of commit series with the seriation technique. In particular, we solve the Travelling Salesman Problem, which is NP-hard. However, given that the typical number of developers is less than 1,000, there is a good chance that the solution does not take much time. We use the Google or-tools solver.
  3. We find 1-dimensional clusters in the resulting path with the HDBSCAN algorithm and assign colors accordingly.
  4. Time series are smoothed by convolving with the Slepian window.
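To show what the distance in step 1 measures, here is a sketch of Dynamic Time Warping in Python. It is the textbook dynamic-programming version with quadratic cost per pair, not the FastDTW approximation that labours uses, and the absolute difference as the local cost is an assumption made for the example.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic DTW: cheapest monotonic alignment of two series."""
    inf = float("inf")
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


# The same activity pattern stretched in time has zero distance.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # prints 0.0
print(dtw_distance([0, 0], [1, 1]))           # prints 2.0
```

This tolerance to stretching is why two developers with similar but shifted activity end up adjacent in the plot.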

This plot shows how the development team evolved through time. It also reveals "commit flashmobs" such as Hacktoberfest. For example, here are the insights from the tensorflow/tensorflow plot above:

  1. "Tensorflow Gardener" is classified as the only outlier.
  2. The "blue" group of developers covers the global maintainers and a few people who left (at the top).
  3. The "red" group shows how core developers join the project or become less active.

Added vs changed lines through time

tensorflow/tensorflow added and changed lines through time.

hercules --devs [--people-dict=/path/to/identities]
labours -m old-vs-new -o <name>

--devs from the previous section also makes it possible to plot how many lines were added and how many existing lines were changed (deleted or replaced) through time. This plot is smoothed.

Efforts through time

kubernetes/kubernetes efforts through time.

hercules --devs [--people-dict=/path/to/identities]
labours -m devs-efforts -o <name>

Besides, --devs makes it possible to plot how many lines have been changed (added or removed) by each developer. The upper part of the plot is the accumulated (integrated) lower part. It is impossible to have the same scale for both parts, so the lower values are scaled, and hence there are no lower Y axis ticks. There is a difference between the efforts plot and the ownership plot, although changing lines correlates with owning lines.

Sentiment (positive and negative comments)

It can be clearly seen that Django comments were positive/optimistic in the beginning, but later became negative/pessimistic.

hercules --sentiment --pb https://github.com/django/django | labours -m sentiment -f pb

We extract new and changed comments from source code on every commit, apply the BiDiSentiment general-purpose sentiment recurrent neural network and plot the results. Requires libtensorflow. For example, "sadly, we need to hide the rect from the documentation finder for now" is negative, and "Theano has a built-in optimization for logsumexp (...) so we can just write the expression directly" is positive. Don't expect too much though - as noted, the sentiment model is general purpose and code comments have a different nature, so there is no magic (for now).

Hercules must be built with "tensorflow" tag - it is not by default:

make TAGS=tensorflow

Such a build requires libtensorflow.

Everything in a single pass

hercules --burndown --burndown-files --burndown-people --couples --shotness --devs [--people-dict=/path/to/identities]
labours -m all

Plugins

Hercules has a plugin system which allows running custom analyses. See PLUGINS.md.

Merging

hercules combine is the command which joins several analysis results in Protocol Buffers format together.

hercules --burndown --pb https://github.com/go-git/go-git > go-git.pb
hercules --burndown --pb https://github.com/src-d/hercules > hercules.pb
hercules combine go-git.pb hercules.pb | labours -f pb -m burndown-project --resample M

Bad unicode errors

YAML does not support the whole range of Unicode characters and the parser on labours side may raise exceptions. Filter the output from hercules through fix_yaml_unicode.py to discard such offending characters.

hercules --burndown --burndown-people https://github.com/... | python3 fix_yaml_unicode.py | labours -m people
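For readers who want to see what such a filter involves, the following Python sketch drops every character outside the printable set defined by the YAML specification. It is not the contents of fix_yaml_unicode.py, which may work differently; `drop_unprintable` is a hypothetical name, and a particular YAML parser may accept a narrower range than the specification.

```python
import re

# Negated character class built from the YAML "c-printable" ranges:
# tab, LF, CR, U+0020-U+007E, U+0085, U+00A0-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_NON_PRINTABLE = re.compile(
    "[^\x09\x0A\x0D\x20-\x7E\x85\xA0-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def drop_unprintable(text: str) -> str:
    """Remove characters that the YAML specification does not allow in a stream."""
    return _NON_PRINTABLE.sub("", text)


print(drop_unprintable("ok\x00\x07bad"))  # prints okbad
```

Used as a pipe stage, the function would be applied to each line read from standard input before writing it back out.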

Plotting

These options affect all plots:

labours [--style=white|black] [--backend=] [--size=Y,X]

--style sets the general style of the plot (see labours --help). --background changes the plot background to be either white or black. --backend chooses the Matplotlib backend. --size sets the size of the figure in inches. The default is 12,9.

On macOS, you need to pin the default Matplotlib backend with

echo "backend: TkAgg" > ~/.matplotlib/matplotlibrc

These options are effective in burndown charts only:

labours [--text-size] [--relative]

--text-size changes the font size, --relative activates the stretched burndown layout.

Custom plotting backend

It is possible to output all the information needed to draw the plots in JSON format. Simply append .json to the output (-o) and you are done. The data format is not fully specified and depends on the Python code which generates it. Each JSON file should contain "type" which reflects the plot kind.
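A custom backend would typically start by reading the "type" field and routing the data to a matching drawing routine. The Python sketch below shows that routing only. The `dispatch` function, the "example" type value and the "values" key are all hypothetical, since the data format is not fully specified.

```python
import json
from typing import Any, Callable, Dict


def dispatch(document: str, renderers: Dict[str, Callable[[Dict[str, Any]], Any]]) -> Any:
    """Parse a JSON dump and hand it to the renderer registered for its "type"."""
    data = json.loads(document)
    kind = data["type"]
    if kind not in renderers:
        raise ValueError(f"no renderer registered for plot type {kind!r}")
    return renderers[kind](data)


# Hypothetical document and renderer, for illustration only.
document = '{"type": "example", "values": [1, 2]}'
print(dispatch(document, {"example": lambda data: sum(data["values"])}))  # prints 3
```

Failing loudly on an unknown type is a deliberate choice here, because the set of plot kinds depends on the Python code that generated the file.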

Caveats

  1. Processing all the commits may fail in some rare cases. If you get an error similar to #106 please report there and specify --first-parent as a workaround.
  2. Burndown collection may fail with an Out-Of-Memory error. See the next section for the workarounds.
  3. Parsing YAML in Python is slow when the number of internal objects is big. hercules' output for the Linux kernel in "couples" mode is 1.5 GB and takes more than an hour / 180GB RAM to be parsed. However, most of the repositories are parsed within a minute. Try using Protocol Buffers instead (hercules --pb and labours -f pb).
  4. To speed up YAML parsing, install libyaml:
    # Debian, Ubuntu
    apt install libyaml-dev
    # macOS
    brew install yaml-cpp libyaml
    
    # you might need to re-install pyyaml for the changes to take effect
    pip uninstall pyyaml
    pip --no-cache-dir install pyyaml
    

Burndown Out-Of-Memory

If the analyzed repository is big and extensively uses branching, the burndown stats collection may fail with an OOM. You should try the following:

  1. Read the repo from disk instead of cloning into memory.
  2. Use --skip-blacklist to avoid analyzing the unwanted files. It is also possible to constrain the --language.
  3. Use the hibernation feature: --hibernation-distance 10 --burndown-hibernation-threshold=1000. Play with those two numbers to start hibernating right before the OOM.
  4. Hibernate on disk: --burndown-hibernation-disk --burndown-hibernation-dir /path.
  5. Use --first-parent, and you win.

Roadmap

  • Switch from src-d/go-git to go-git/go-git. Upgrade the codebase to be compatible with the latest Go version.
  • Update the docs regarding the copyrights and such.
  • Fix the reported bugs.
  • Remove the dependency on Babelfish for parsing the code. It is abandoned and a better alternative should be found.
  • Remove the ad-hoc analyses added while source{d} was agonizing.
