A curated repository of software engineering repository mining data sets

Awesome Empirical Software Engineering

A curated repository of data sets and tools that can be used for conducting evidence-based, data-driven research on software systems. This research approach is often termed experimental, or empirical software engineering. Many of the data sets can also be useful in research using search-based software engineering methods. The repository is named after the Mining Software Repositories (MSR) conference series. For examples of such work see the MSR conference's Hall of Fame.

  • This list requires your input for its continuous improvement. Read the contribution guide for instructions on how you can contribute. Alternatively, you can send me an email if you find the process too cumbersome or confusing.
  • For more awesome lists, see awesome.

Contents

Repositories

Data Sets

  • AndroidTimeMachine - Graph-based dataset of commit history of 8,431 real-world Android apps.
  • AndroZoo - Collection of Android Applications.
  • Bug Prediction Dataset - Collection of models and metrics from Eclipse JDT Core, PDE UI, Equinox Framework, Lucene, Mylyn, and their histories.
  • Code Reviews - Code reviews of OpenStack, LibreOffice, AOSP, Qt, Eclipse.
  • CoREBench - Collection of 70 realistically Complex Regression Errors that were systematically extracted from the repositories and bug reports of four open-source software projects: Make, Grep, Findutils, and Coreutils.
  • Cryptocurrency GitHub Activity and Market Cap Dataset - Activity such as commits, stars, prices, and market cap of over 200 cryptocurrency projects on GitHub over time. Raw, historic data is also available.
  • Defects4J - Collection of 395 reproducible bugs collected with the goal of advancing software testing research.
  • Eclipse AERI stacktraces - Collection of stacktraces of Exceptions encountered by users of the Eclipse IDE, as retrieved by the AERI reporting system.
  • Enron Spreadsheets and Emails - All the spreadsheets and emails used in the paper 'Enron's Spreadsheets and Related Emails: A Dataset and Analysis'.
  • Findbugs-maven - Set of FindBugs reports for the Java projects of the Maven repository.
  • GHTorrent - Scalable, queryable, offline mirror of data offered through the GitHub REST API.
  • GitHub Bug Dataset - Bug Dataset of 15 Java open-source projects characterized by static source code metrics.
  • GitHub on Google BigQuery - GitHub data accessible through Google's BigQuery platform.
  • Grammar Zoo - Collection of grammars of DSLs and GPLs, some extracted from metamodels and document schemata.
  • KaVE - Developer tool interaction data.
  • Linux Kernel 4.21 Call Graphs - The Linux Kernel 4.21 Call Graphs produced using CScout.
  • Maven metrics - Collection of software complexity & sizing metrics for the Maven Repository.
  • Maven Dependency Graph - Snapshot of the whole Maven Central taken on September 6, 2018, stored in a graph database.
  • mzdata - Multi-extract and multi-level dataset of Mozilla issue tracking history.
  • npm-miner - Analysis results of five open-source software quality tools (eslint, escomplex, nsp, jsinspect, and sonarjs) for 2000 popular (in terms of stars and downloads) npm packages.
  • OCL Expressions on GitHub - Data set of 9188 OCL expressions originating from 504 EMF meta-models in 245 systematically selected GitHub repositories.
  • RepoReapers Data Set - Data set containing a collection of engineered software projects from GHTorrent.
  • Software Heritage Graph Dataset - Graph of the development history and file metadata of >80 million software projects from various forges (GitHub, GitLab, Debian, PyPI, Google Code, etc.) in a deduplicated and unified representation; described in an accompanying paper.
  • STAMINA - (STAte Machine INference Approaches) data are used to benchmark techniques for learning deterministic finite state machines (FSMs).
  • Stack Exchange - Anonymized dump of all user-contributed content on the Stack Exchange network.
  • TravisTorrent - Provides free and easy-to-use Travis CI build analyses.
  • Ultimate Debian Database (UDD) - Data about various aspects of Debian (e.g. packages, bugs, maintainers) in the same SQL database.
  • Unified Bug Dataset - Static source code based datasets, which include the Bugcatchers Bug Dataset, the Bug Prediction Dataset, the Eclipse Bug Dataset, the GitHub Bug Dataset, and some datasets from the PROMISE repository.
  • Unix history - Git repository with 46 years of Unix history evolution.

Tools

  • astminer - Library and tool for mining of path-based representations of code and other data derived from ASTs.
  • Boa - Domain-specific language and infrastructure that eases mining software repositories.
  • buckwheat - Multi-language tokenizer for extracting identifiers from source code.
  • ckjm - Chidamber and Kemerer Java Metrics.
  • Coming - A Java framework for analyzing code changes and mining instances of change patterns from Git repositories.
  • CryptOSS - Mine GitHub activity and market cap data for cryptocurrency projects.
  • DbDeo - Extract embedded SQL statements and detect database schema smells.
  • Designite - Compute source code metrics and detect a variety of implementation, design, and architecture smells for C#.
  • DesigniteJava - Compute source code metrics and detect a variety of implementation and design smells for Java.
  • Diggit - Agile Ruby Tool to analyze Git repositories.
  • GrimoireLab - Free/Libre/Open Source tools for Software Development Analytics.
  • MetricMiner - Lean Java DSL to mine and extract data (e.g. commits, developers, modifications, diffs) from Git and SVN repositories.
  • Maven-miner - Java tools and infrastructure to resolve the whole Maven dependency graph, hosted in Maven Central, in the form of a Neo4j Graph.
  • Perceval - Fetch repository data from tens of back-ends.
  • Puppeteer - Detect configuration smells in Puppet code.
  • PyDriller - Python Framework to analyse Git repositories.
  • qmcalc - Calculate quality metrics from C source code.
  • reaper - Python tool to compute a score for a repository from GHTorrent. The score quantifies the extent to which the project contained within the repository is engineered.
  • RefactoringMiner - Library/API for detection of refactorings in changes of Java code.
  • VulData7 - Java framework enabling the automated collection of commits fixing vulnerabilities that are reported in NVD (links NVD with Git).
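
As a rough illustration of the kind of analysis the repository-mining tools above automate, the following sketch counts commits per author from `git log` output. It uses only the Python standard library and the `git` command line. The function names (`parse_log`, `commits_per_author`, `read_log`) and the field separator are assumptions made for this example and are not taken from any tool in the list.

```python
import subprocess
from collections import Counter

# Hypothetical field separator for this sketch (ASCII unit separator);
# chosen because it is unlikely to appear in author names or dates.
SEP = "\x1f"
# git pretty-format placeholders: %H commit hash, %an author name,
# %aI author date in strict ISO 8601 format.
LOG_FORMAT = SEP.join(["%H", "%an", "%aI"])


def parse_log(text):
    """Turn raw `git log` output into a list of (hash, author, date) tuples."""
    commits = []
    for line in text.split("\n"):
        parts = line.split(SEP)
        if len(parts) == 3:  # skip blank or malformed lines
            commits.append(tuple(parts))
    return commits


def commits_per_author(commits):
    """Count commits by author name."""
    return Counter(author for _, author, _ in commits)


def read_log(repo_path):
    """Run `git log` in repo_path and return its raw output (needs git on PATH)."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", f"--pretty=format:{LOG_FORMAT}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `commits_per_author(parse_log(read_log(".")))` inside a Git working copy returns a `Counter` keyed by author name. The dedicated tools listed above handle far more (diffs, renames, multiple back-ends), so this is only a starting point for quick experiments.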

Research Outlets

License

CC0

To the extent possible under law, Diomidis Spinellis has waived all copyright and related or neighboring rights to this work.
