  • Language: Python
  • License: Apache License 2.0
  • Created about 8 years ago
  • Updated 12 months ago

Repository Details

Analyze how a Git repo grows over time

Some scripts to analyze Git repos. Produces cool-looking graphs like this (running it on git itself):

[Figure: example plot generated from the git repository]

Installing

Run pip install git-of-theseus

Running

First, run git-of-theseus-analyze <path to repo> (see git-of-theseus-analyze --help for the available configuration options). This analyzes the repository and might take quite some time.

After that, you can generate plots! Some examples:

  1. Running git-of-theseus-stack-plot cohorts.json creates a stack plot showing the total amount of code broken down into cohorts (the year the code was added)
  2. Running git-of-theseus-line-plot authors.json --normalize shows a plot of the % of code contributed by the top 20 authors
  3. Running git-of-theseus-survival-plot survival.json plots how long lines of code survive

Pass --help to any of these commands to see its options.

If you want to plot multiple repositories, you have to run git-of-theseus-analyze separately for each project and store the data in separate directories using the --outdir flag. Then you can run git-of-theseus-survival-plot <foo/survival.json> <bar/survival.json> (optionally with the --exp-fit flag to fit an exponential decay).
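
The multi-repository workflow above can be scripted. The following is a minimal sketch, not part of Git of Theseus itself: the helper names build_commands and run_all and the theseus-out directory are hypothetical, and only the commands and flags mentioned above (--outdir, --exp-fit) are used.

```python
import subprocess
from pathlib import Path


def build_commands(repos, out_root="theseus-out", exp_fit=False):
    """Build one analyze command per repo plus a single survival-plot command.

    `out_root` is a hypothetical directory name chosen for this sketch.
    """
    analyze_cmds = []
    survival_files = []
    for repo in repos:
        # Give every repository its own output directory via --outdir.
        outdir = Path(out_root) / Path(repo).name
        analyze_cmds.append(
            ["git-of-theseus-analyze", str(repo), "--outdir", str(outdir)]
        )
        survival_files.append(str(outdir / "survival.json"))
    plot_cmd = ["git-of-theseus-survival-plot", *survival_files]
    if exp_fit:
        plot_cmd.append("--exp-fit")
    return analyze_cmds, plot_cmd


def run_all(repos, **kwargs):
    """Run the analysis for each repo, then plot all survival curves together."""
    analyze_cmds, plot_cmd = build_commands(repos, **kwargs)
    for cmd in analyze_cmds + [plot_cmd]:
        subprocess.run(cmd, check=True)  # stop at the first failing command
```

Calling run_all(["path/to/foo", "path/to/bar"], exp_fit=True) would analyze both repositories in turn and then plot their survival curves in one figure.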

Help

If you see AttributeError: Unknown property labels, upgrade matplotlib: pip install matplotlib --upgrade

Some pics

Survival of a line of code in a set of interesting repos:

[Figure: survival curves for a set of repositories]

This curve is produced by the git-of-theseus-survival-plot script and shows the percentage of lines in a commit that are still present after x years. It aggregates over all commits, regardless of when they were made. For x=0 it therefore includes all commits, whereas for x>0 some commits are excluded because they are not yet x years old (counting them would require looking into the future). The survival curves are estimated using the Kaplan-Meier estimator.
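
To illustrate the idea, here is a minimal sketch of a Kaplan-Meier estimate in plain Python. It is not the code Git of Theseus uses. The input format is an assumption made for this example: one duration per line of code, plus a flag saying whether the line was actually removed (True) or was still present when observation stopped (False, i.e. censored).

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each time t where at least one removal occurred.

    durations: how long each line was followed (e.g. in years)
    observed:  True if the line was removed at that time, False if censored
    """
    removals = Counter(t for t, seen in zip(durations, observed) if seen)
    exits = Counter(durations)  # removals and censored lines both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        if removals[t]:
            # Multiply by the fraction of at-risk lines that survive time t.
            survival *= 1.0 - removals[t] / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve
```

Censored lines never count as removals, but they shrink the risk set for later times, which is how the estimator avoids "looking into the future" for recent commits.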

You can also add an exponential fit:

[Figure: survival curves with an exponential fit]
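
As a rough illustration of what an exponential fit means here, the sketch below fits S(t) = exp(-k t) to a survival curve by least squares on log S(t) and reports the corresponding half-life. This fitting procedure is an assumption chosen for simplicity; the --exp-fit flag may use a different method.

```python
import math


def fit_exponential_decay(times, survival):
    """Fit S(t) = exp(-k * t) through the origin of (t, log S).

    Returns (k, half_life). Points with t <= 0 or S <= 0 are skipped
    because they carry no information for the log-linear fit.
    """
    pairs = [(t, s) for t, s in zip(times, survival) if t > 0 and s > 0]
    # Minimizing sum((log s + k*t)^2) over k gives the closed form below.
    numerator = sum(t * math.log(s) for t, s in pairs)
    denominator = sum(t * t for t, _ in pairs)
    k = -numerator / denominator
    half_life = math.log(2) / k
    return k, half_life
```

The half-life is the time after which half of the lines in an average commit have been replaced, which is a convenient single number for comparing repositories.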

Linux – stack plot:

[Figure: stack plot for Linux]

This curve is produced by the git-of-theseus-stack-plot script and shows the total number of lines in a repo broken down into cohorts by the year the code was added.
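
The cohort breakdown can be sketched as follows. This is an illustration only, under an assumed input: for each sampled point in time, a list containing the year each surviving line was added (for example, derived from blame data). The function name stack_series is hypothetical, and the actual layout of cohorts.json is not shown here.

```python
from collections import Counter


def stack_series(snapshots):
    """Turn per-snapshot lists of "year added" into one series per cohort.

    snapshots: list of iterables; each holds the year every surviving
               line was added, as seen at one sampled point in time.
    Returns {year: [line count at snapshot 0, snapshot 1, ...]}.
    """
    counts = [Counter(years) for years in snapshots]
    cohorts = sorted({year for c in counts for year in c})
    # A cohort that does not exist yet at a snapshot simply counts as 0.
    return {year: [c[year] for c in counts] for year in cohorts}
```

Stacking these series on top of each other gives the total number of lines at each point in time, with each band showing how much code from a given year is still around.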

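Node – stack plot:

[Figure: stack plot for Node]

Rails – stack plot:

[Figure: stack plot for Rails]

Tensorflow – stack plot:

[Figure: stack plot for Tensorflow]

Rust – stack plot:

[Figure: stack plot for Rust]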

Plotting other stuff

git-of-theseus-analyze will write exts.json, cohorts.json and authors.json. You can run git-of-theseus-stack-plot authors.json to plot author statistics as well, or git-of-theseus-stack-plot exts.json to plot file extension statistics. For author statistics, you might want to create a .mailmap file in the root directory of the repository to deduplicate authors. If you need to create a .mailmap file, the following command lists the distinct author-email combinations in a repository:

Mac / Linux

git log --pretty=format:"%an %ae" | sort | uniq

Windows Powershell

git log --pretty=format:"%an %ae" | Sort-Object | Select-Object -Unique
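
As a starting point for writing the .mailmap file, the sketch below groups the "name email" lines printed by the git log command above (without the sort and uniq steps, so that counts are meaningful) and suggests one entry per email address that appears under several names. The function name suggest_mailmap is hypothetical, and picking the most frequent name as the canonical one is an assumption; review the suggestions by hand.

```python
from collections import Counter, defaultdict


def suggest_mailmap(log_lines):
    """Suggest "Proper Name <email>" entries for emails used with several names.

    log_lines: lines in the form produced by --pretty=format:"%an %ae",
               i.e. the author name followed by the email as the last word.
    """
    names_by_email = defaultdict(Counter)
    for line in log_lines:
        line = line.strip()
        if " " not in line:
            continue  # skip blank or malformed lines
        name, email = line.rsplit(" ", 1)
        # Grouping addresses case-insensitively is an assumption of this sketch.
        names_by_email[email.lower()][name] += 1

    entries = []
    for email, names in sorted(names_by_email.items()):
        if len(names) > 1:
            canonical = names.most_common(1)[0][0]
            entries.append(f"{canonical} <{email}>")
    return entries
```

This only catches one person committing under several names with the same address. The same person using several email addresses still has to be merged manually.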

For instance, here are the author statistics for Kubernetes:

[Figure: author statistics for Kubernetes]

You can also normalize it to 100%. Here are the author statistics for Git:

[Figure: author statistics for Git, normalized to 100%]
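
Normalizing to 100% means dividing each author's line count by the total at the same point in time. A minimal sketch, assuming the data is held as one equally long list of counts per author (the function name and data layout are hypothetical, not the format of authors.json):

```python
def normalize_to_percent(series):
    """Rescale each time step so the values across all labels sum to 100.

    series: {label: [count at t0, count at t1, ...]}, all lists equally long.
    """
    n_steps = len(next(iter(series.values())))
    totals = [sum(values[i] for values in series.values()) for i in range(n_steps)]
    return {
        label: [
            100.0 * value / total if total else 0.0  # avoid dividing by zero
            for value, total in zip(values, totals)
        ]
        for label, values in series.items()
    }
```

A normalized plot hides growth in absolute size, but makes it easier to see how each author's share of the code base shifts over time.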

Other stuff

Markovtsev Vadim implemented a very similar analysis that claims to be 20%-6x faster than Git of Theseus. It's named Hercules and there's a great blog post about all the complexity going into the analysis of Git history.
