Compute various size metrics for a Git repository, flagging those that might cause problems

Happy Git repositories are all alike; every unhappy Git repository is unhappy in its own way. -- Linus Tolstoy

git-sizer

Is your Git repository bursting at the seams?

git-sizer computes various size metrics for a local Git repository, flagging those that might cause you problems or inconvenience. For example:

  • Is the repository too big overall? Ideally, Git repositories should be under 1 GiB, and (without special handling) they start to get unwieldy over 5 GiB. Big repositories take a long time to clone and repack, and take a lot of disk space. Suggestions:

    • Avoid storing generated files (e.g., compiler output, JAR files) in Git. It would be better to regenerate them when necessary, or store them in a package registry or even a fileserver.

    • Avoid storing large media assets in Git. You might want to look into Git-LFS or git-annex, which allow you to version your media assets in Git while actually storing them outside of your repository.

    • Avoid storing file archives (e.g., ZIP files, tarballs) in Git, especially if compressed. Different versions of such files don't delta well against each other, so Git can't store them efficiently. It would be better to store the individual files in your repository, or store the archive elsewhere.

  • Does the repository have too many references (branches and/or tags)? They all have to be transferred to the client for every fetch, even if your clone is up-to-date. Try to limit them to a few tens of thousands at most. Suggestions:

    • Delete unneeded tags and branches.

    • Avoid pushing your "remote-tracking" branches to a shared repository.

    • Consider using "git notes" rather than tags to attach auxiliary information to commits (for example, CI build results).

    • Perhaps store some of your rarely-needed tags and branches in a separate fork of your repository that is not fetched from by normal developers.

  • Does the repository include too many objects? The more objects, the longer it takes for Git to traverse the repository's history, for example when garbage-collecting. Suggestions:

    • Think about whether you are storing very many tiny files that could easily be collected into a few bigger files.

    • Consider breaking your project up into multiple subprojects.

  • Does the repository include gigantic blobs (files)? Git works best with small- to medium-sized files. It's OK to have a few files in the megabyte range, but they should generally be the exception. Suggestions:

    • Consider using Git-LFS for storing your large files, especially those (e.g., media assets) that don't diff and merge usefully.

    • See also the section "Is the repository too big overall?"

  • Does the repository include many, many versions of large text files, each one slightly changed from the one before? Such files delta very well, so they might not cause your repository to grow alarmingly. But it is expensive for Git to reconstruct the full files and to diff them, which it needs to do internally for many operations. Suggestions:

    • Avoid storing log files and database dumps in Git.

    • Avoid storing giant data files (e.g., enormous XML files) in Git, especially if they are modified frequently. Consider using a database instead.

  • Does the repository include gigantic trees (directories)? Every time a file is modified, Git has to create a new copy of every tree (i.e., every directory in the path) leading to the file. Huge trees make this expensive. Moreover, it is very expensive to traverse through history that contains huge trees, for example for git blame. Suggestions:

    • Avoid creating directories with more than a couple of thousand entries each.

    • If you must store very many files, it is better to shard them into a hierarchy of multiple, smaller directories.

  • Does the repository have the same (or very similar) files repeated over and over again at different paths in a single commit? If so, the repository might have a reasonable overall size, but when you check it out it balloons into an enormous working copy. (Taken to an extreme, this is called a "git bomb"; see below.) Suggestions:

    • Perhaps you can achieve your goals more effectively by using tags and branches or a build-time configuration system.

  • Does the repository include absurdly long path names? That's probably not going to work well with other tools. One or two hundred characters should be enough, even if you're writing Java.

  • Are there other bizarre and questionable things in the repository?

    • Annotated tags pointing at one another in long chains?

    • Octopus merges with dozens of parents?

    • Commits with gigantic log messages?
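
The suggestion above to shard very many files into a hierarchy of smaller directories can be sketched as follows. This is only an illustration: the function name sharded_path, the use of SHA-1 over the file name, and the two-level layout are assumptions chosen for the example, not something git-sizer or Git prescribes.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(filename: str, levels: int = 2, width: int = 2) -> str:
    """Return a path that places `filename` under hash-derived subdirectories.

    With the defaults there are at most 256 * 256 leaf directories, so each
    directory holds far fewer entries than a single flat directory would.
    """
    digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()
    # Take `levels` consecutive slices of `width` hex digits as directory names.
    shards = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return str(PurePosixPath(*shards, filename))


if __name__ == "__main__":
    # Hypothetical file names, used only to show the resulting layout.
    for name in ("invoice-000001.json", "invoice-000002.json"):
        print(sharded_path(name))
```

Because the shard is derived from the name alone, the same file always lands in the same directory, and modifying it rewrites only the small trees along its path.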

git-sizer computes many size-related statistics about your repository that can help reveal all of the problems described above. These practices are not wrong per se, but the more that you stretch Git beyond its sweet spot, the less you will be able to enjoy Git's legendary speed and performance. Especially if your Git repository statistics seem out of proportion to your project size, you might be able to make your life easier by adjusting how you use Git.
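
If the reference count turns out to be high, it can help to see which namespaces the references live in before deleting anything. A minimal sketch, assuming the full reference names were captured one per line (for example from git for-each-ref --format='%(refname)'); the function name and the sample names are illustrative only.

```python
from collections import Counter
from typing import Iterable


def count_refs_by_namespace(refnames: Iterable[str]) -> Counter:
    """Count full reference names by their first two path components."""
    counts: Counter = Counter()
    for line in refnames:
        name = line.strip()
        if not name:
            continue  # skip blank lines in the captured output
        # "refs/tags/v1.0" -> "refs/tags"; "refs/remotes/origin/x" -> "refs/remotes"
        namespace = "/".join(name.split("/")[:2])
        counts[namespace] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical reference names for demonstration.
    sample = [
        "refs/heads/main",
        "refs/heads/feature/login",
        "refs/tags/v1.0",
        "refs/remotes/origin/main",
    ]
    for namespace, count in count_refs_by_namespace(sample).most_common():
        print(f"{count:6d}  {namespace}")
```

A large refs/remotes count in a shared repository would point at the "remote-tracking" branches mentioned above.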

Getting started

  1. Make sure that you have the Git command-line client installed, version >= 2.6. NOTE: git-sizer invokes git commands to examine the contents of your repository, so it is required that the git command be in your PATH when you run git-sizer.

  2. Install git-sizer. Either:

    a. Install a released version of git-sizer (recommended):

    1. Go to the releases page and download the ZIP file corresponding to your platform.
    2. Unzip the file.
    3. Move the executable file (git-sizer or git-sizer.exe) into your PATH.

    b. Build and install from source. See the instructions in docs/BUILDING.md.

  3. Change to the directory containing a full, non-shallow clone of the Git repository that you'd like to analyze. Then run

    git-sizer [<option>...]
    

    No options are required. You can learn about available options by typing git-sizer -h or by reading on.
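
The version requirement in step 1 can be checked with a few lines of script. The sketch below assumes the usual "git version X.Y.Z" text printed by git --version; the helper names are made up for this example, and the sample strings are fed in directly rather than read from a live git call.

```python
import re
from typing import Tuple

MINIMUM_GIT_VERSION = (2, 6)  # the requirement stated in step 1


def parse_git_version(version_output: str) -> Tuple[int, int]:
    """Extract (major, minor) from text such as 'git version 2.39.2'."""
    match = re.search(r"git version (\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_output!r}")
    return int(match.group(1)), int(match.group(2))


def git_is_new_enough(version_output: str) -> bool:
    """Return True if the reported Git version is at least 2.6."""
    # Tuples compare numerically, so (2, 10) is correctly newer than (2, 6).
    return parse_git_version(version_output) >= MINIMUM_GIT_VERSION


if __name__ == "__main__":
    # Sample strings; in practice, pass in the output of `git --version`.
    for text in ("git version 2.39.2", "git version 2.5.6"):
        verdict = "OK" if git_is_new_enough(text) else "too old"
        print(f"{text}: {verdict}")
```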

Pro tip: If you add git-sizer to your PATH, then you can run it by typing either git-sizer or git sizer. In the latter case, it is found and run for you by Git, and you can add extra Git options between the two words, like git -C /path/to/my/repo sizer. If you don't add git-sizer to your PATH, then of course you need to type its full path and filename to run it; e.g., /path/to/bin/git-sizer. In either case, the git executable must be in your PATH.

Usage

By default, git-sizer outputs its results in tabular format. For example, let's use it to analyze the Linux repository, using the --verbose option so that all statistics are output:

$ git-sizer --verbose
Processing blobs: 1652370
Processing trees: 3396199
Processing commits: 722647
Matching commits to trees: 722647
Processing annotated tags: 534
Processing references: 539
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size      |           |                                |
| * Commits                    |           |                                |
|   * Count                    |   723 k   | *                              |
|   * Total size               |   525 MiB | **                             |
| * Trees                      |           |                                |
|   * Count                    |  3.40 M   | **                             |
|   * Total size               |  9.00 GiB | ****                           |
|   * Total tree entries       |   264 M   | *****                          |
| * Blobs                      |           |                                |
|   * Count                    |  1.65 M   | *                              |
|   * Total size               |  55.8 GiB | *****                          |
| * Annotated tags             |           |                                |
|   * Count                    |   534     |                                |
| * References                 |           |                                |
|   * Count                    |   539     |                                |
|                              |           |                                |
| Biggest objects              |           |                                |
| * Commits                    |           |                                |
|   * Maximum size         [1] |  72.7 KiB | *                              |
|   * Maximum parents      [2] |    66     | ******                         |
| * Trees                      |           |                                |
|   * Maximum entries      [3] |  1.68 k   | *                              |
| * Blobs                      |           |                                |
|   * Maximum size         [4] |  13.5 MiB | *                              |
|                              |           |                                |
| History structure            |           |                                |
| * Maximum history depth      |   136 k   |                                |
| * Maximum tag depth      [5] |     1     |                                |
|                              |           |                                |
| Biggest checkouts            |           |                                |
| * Number of directories  [6] |  4.38 k   | **                             |
| * Maximum path depth     [7] |    13     | *                              |
| * Maximum path length    [8] |   134 B   | *                              |
| * Number of files        [9] |  62.3 k   | *                              |
| * Total size of files    [9] |   747 MiB |                                |
| * Number of symlinks    [10] |    40     |                                |
| * Number of submodules       |     0     |                                |

[1]  91cc53b0c78596a73fa708cceb7313e7168bb146
[2]  2cde51fbd0f310c8a2c5f977e665c0ac3945b46d
[3]  4f86eed5893207aca2c2da86b35b38f2e1ec1fc8 (refs/heads/master:arch/arm/boot/dts)
[4]  a02b6794337286bc12c907c33d5d75537c240bd0 (refs/heads/master:drivers/gpu/drm/amd/include/asic_reg/vega10/NBIO/nbio_6_1_sh_mask.h)
[5]  5dc01c595e6c6ec9ccda4f6f69c131c0dd945f8c (refs/tags/v2.6.11)
[6]  1459754b9d9acc2ffac8525bed6691e15913c6e2 (589b754df3f37ca0a1f96fccde7f91c59266f38a^{tree})
[7]  78a269635e76ed927e17d7883f2d90313570fdbc (dae09011115133666e47c35673c0564b0a702db7^{tree})
[8]  ce5f2e31d3bdc1186041fdfd27a5ac96e728f2c5 (refs/heads/master^{tree})
[9]  532bdadc08402b7a72a4b45a2e02e5c710b7d626 (e9ef1fe312b533592e39cddc1327463c30b0ed8d^{tree})
[10] f29a5ea76884ac37e1197bef1941f62fda3f7b99 (f5308d1b83eba20e69df5e0926ba7257c8dd9074^{tree})

The output is a table showing the thing that was measured, its numerical value, and a rough indication of which values might be a cause for concern. In all cases, only objects that are reachable from references are included (i.e., not unreachable objects, nor objects that are reachable only from the reflogs).

The "Overall repository size" section includes repository-wide statistics about distinct objects, not including repetition. "Total size" is the sum of the sizes of the corresponding objects in their uncompressed form, measured in bytes. The overall uncompressed size of all objects is a good indication of how expensive commands like git gc --aggressive (and git repack [-f|-F] and git pack-objects --no-reuse-delta), git fsck, and git log [-G|-S] will be. The uncompressed size of trees and commits is a good indication of how expensive reachability traversals will be, including clones and fetches and git gc.

The "Biggest objects" section provides information about the biggest single objects of each type, anywhere in the history.

In the "History structure" section, "maximum history depth" is the longest chain of commits in the history, and "maximum tag depth" reports the longest chain of annotated tags that point at other annotated tags.
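
To make the two definitions concrete, here is a small sketch that computes both values on toy data. It is not git-sizer's implementation; the dictionaries standing in for the commit graph and the tag objects, and the function names, are assumptions made for the example.

```python
from typing import Dict, List


def max_history_depth(parents: Dict[str, List[str]]) -> int:
    """Length of the longest parent chain; a root commit has depth 1."""
    depth: Dict[str, int] = {}
    for start in parents:
        stack = [start]  # explicit stack, so long histories do not hit recursion limits
        while stack:
            commit = stack[-1]
            if commit in depth:
                stack.pop()
                continue
            pending = [p for p in parents[commit] if p not in depth]
            if pending:
                stack.extend(pending)  # resolve the parents first
            else:
                depth[commit] = 1 + max((depth[p] for p in parents[commit]), default=0)
                stack.pop()
    return max(depth.values(), default=0)


def max_tag_depth(tag_targets: Dict[str, str]) -> int:
    """Longest chain of annotated tags; a tag pointing at a non-tag counts as 1."""
    best = 0
    for tag in tag_targets:
        length, current = 0, tag
        while current in tag_targets:  # follow tag -> tag links
            length += 1
            current = tag_targets[current]
        best = max(best, length)
    return best


if __name__ == "__main__":
    # Toy history: "d" is a merge of "b" and "c", which both descend from "a".
    history = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"], "e": ["d"]}
    # Toy tags: t3 -> t2 -> t1 -> commit "e".
    tags = {"t3": "t2", "t2": "t1", "t1": "e"}
    print("maximum history depth:", max_history_depth(history))
    print("maximum tag depth:", max_tag_depth(tags))
```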

The "Biggest checkouts" section is about the sizes of commits as checked out into a working copy. "Maximum path depth" is the largest number of path components for files in the working copy, and "maximum path length" is the longest path in terms of bytes. "Total size of files" is the sum of all file sizes in the single biggest commit, including multiplicities if the same file appears multiple times.
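
The difference between sizes counted with multiplicities and sizes of distinct objects is easiest to see on a tiny example. The sketch below works on a hand-written list of files for one hypothetical commit; the Entry type, the key names, and the sample data are assumptions for illustration and do not mirror git-sizer's internals or its JSON output.

```python
from typing import Dict, Iterable, NamedTuple


class Entry(NamedTuple):
    path: str      # path relative to the repository root
    blob_id: str   # object name of the file's contents
    size: int      # uncompressed size in bytes


def checkout_stats(entries: Iterable[Entry]) -> Dict[str, int]:
    """Summarize one commit as it would appear in a working copy."""
    entries = list(entries)
    unique_sizes = {e.blob_id: e.size for e in entries}  # one size per distinct blob
    return {
        # Number of path components, counting the file name itself.
        "max_path_depth": max((len(e.path.split("/")) for e in entries), default=0),
        # Path length is measured in bytes, not characters.
        "max_path_length": max((len(e.path.encode("utf-8")) for e in entries), default=0),
        "number_of_files": len(entries),
        # Every occurrence counts, even if the same blob appears at several paths.
        "total_size_of_files": sum(e.size for e in entries),
        # For comparison: each distinct blob counted once.
        "distinct_blob_size": sum(unique_sizes.values()),
    }


if __name__ == "__main__":
    commit = [
        Entry("README.md", "b1", 100),
        Entry("docs/guide/intro.md", "b2", 2000),
        Entry("vendor/a/LICENSE", "b3", 1000),
        Entry("vendor/b/LICENSE", "b3", 1000),  # same blob at a second path
    ]
    for key, value in checkout_stats(commit).items():
        print(f"{key}: {value}")
```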

The "Value" column displays counts, using units "k" (thousand), "M" (million), "G" (billion) etc., and sizes, using units "B" (bytes), "KiB" (1024 bytes), "MiB" (1024 KiB), etc. Note that if a value overflows its counter (which should only happen for malicious repositories), the corresponding value is displayed as ∞ in tabular form, or truncated to 2³²-1 or 2⁶⁴-1 (depending on the size of the counter) in JSON mode.
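
The unit conventions can be reproduced with a short helper, which is handy when comparing exact numbers from the JSON output with the table. This is a sketch of the convention only, assuming roughly three significant digits as in the table above; git-sizer's own rounding rules may differ at the boundaries, and the function names are invented for the example.

```python
from typing import Sequence

COUNT_PREFIXES = ("", "k", "M", "G", "T", "P")       # powers of 1000
SIZE_PREFIXES = ("", "Ki", "Mi", "Gi", "Ti", "Pi")   # powers of 1024


def humanize(value: int, base: int, prefixes: Sequence[str], unit: str = "") -> str:
    """Format `value` with about three significant digits and a unit prefix."""
    scaled = float(value)
    index = 0
    while scaled >= base and index < len(prefixes) - 1:
        scaled /= base
        index += 1
    if index == 0:
        number = str(value)  # small values are shown exactly
    elif scaled < 10:
        number = f"{scaled:.2f}"
    elif scaled < 100:
        number = f"{scaled:.1f}"
    else:
        number = f"{scaled:.0f}"
    return f"{number} {prefixes[index]}{unit}".rstrip()


def format_count(value: int) -> str:
    return humanize(value, 1000, COUNT_PREFIXES)


def format_size(value: int) -> str:
    return humanize(value, 1024, SIZE_PREFIXES, "B")


if __name__ == "__main__":
    # Counts taken from the progress lines of the Linux example above.
    for count in (1652370, 3396199, 722647, 539):
        print(format_count(count))
    print(format_size(1536))
```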

The "Level of concern" column uses asterisks to indicate values that seem high compared with "typical" Git repositories. The more asterisks, the more inconvenience this aspect of your repository might be expected to cause. Exclamation points indicate values that are extremely high (i.e., equivalent to more than 30 asterisks).
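
As a rough model of this column, the sketch below turns a numerical level of concern into the marker string and applies a reporting threshold. How git-sizer derives the level from a raw value is not covered here; the function names and the rounding-down of fractional levels are assumptions.

```python
def concern_marker(level: float, width: int = 30) -> str:
    """Render a level of concern as asterisks, or as exclamation points above `width`."""
    if level > width:
        return "!" * width  # extremely high: more than `width` asterisks' worth
    return "*" * int(level)  # assumption: fractional levels are rounded down


def is_reported(level: float, threshold: float) -> bool:
    """Keep only statistics at or above the threshold (e.g. 30 for critical ones)."""
    return level >= threshold


if __name__ == "__main__":
    for level in (0.5, 1, 6, 30, 1000):
        print(f"{level:>6}  |{concern_marker(level):<30}|")
```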

The footnotes list the SHA-1s of the "biggest" objects referenced in the table, along with a more human-readable <commit>:<path> description of where that object is located in the repository's history. Given the name of a large object, you could, for example, type

git cat-file -p <commit>:<path>

at the command line to view the contents of the object. (Use --names=none if you'd rather omit these footnotes.)
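
If you want to script this step, the footnote lines are regular enough to parse. A minimal sketch, assuming the layout shown in the example output above (a bracketed number, a 40-character SHA-1, and an optional parenthesized description); the names Footnote, parse_footnote, and cat_file_command are invented for this example.

```python
import re
from typing import List, NamedTuple, Optional

FOOTNOTE = re.compile(r"^\[(\d+)\]\s+([0-9a-f]{40})(?:\s+\((.*)\))?\s*$")


class Footnote(NamedTuple):
    number: int
    object_id: str
    description: Optional[str]  # e.g. "<commit>:<path>", or None if absent


def parse_footnote(line: str) -> Optional[Footnote]:
    """Parse one footnote line, or return None if the line is not a footnote."""
    match = FOOTNOTE.match(line.strip())
    if match is None:
        return None
    return Footnote(int(match.group(1)), match.group(2), match.group(3))


def cat_file_command(footnote: Footnote) -> List[str]:
    """Build the `git cat-file -p` invocation for the footnote's object."""
    return ["git", "cat-file", "-p", footnote.description or footnote.object_id]


if __name__ == "__main__":
    line = "[3]  4f86eed5893207aca2c2da86b35b38f2e1ec1fc8 (refs/heads/master:arch/arm/boot/dts)"
    note = parse_footnote(line)
    if note is not None:
        print(" ".join(cat_file_command(note)))
```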

By default, only statistics above a minimal level of concern are reported. Use --verbose (as above) to request that all statistics be output. Use --threshold=<value> to suppress the reporting of statistics below a specified level of concern. (<value> is interpreted as a numerical value corresponding to the number of asterisks.) Use --critical to report only statistics with a critical level of concern (equivalent to --threshold=30).

If you'd like the output in machine-readable format, including exact numbers, use the --json option. You can use --json-version=1 or --json-version=2 to choose between old and new style JSON output.

To get a list of other options, run

git-sizer -h
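
When git-sizer is called from a script, it can be convenient to assemble the command line in one place. The helper below is a sketch that uses only the options described in this section plus the git -C form from the pro tip; the function and parameter names are hypothetical, and passing --json together with --json-version is a choice made for this example.

```python
from typing import List, Optional


def git_sizer_command(
    repo: Optional[str] = None,
    verbose: bool = False,
    threshold: Optional[int] = None,
    critical: bool = False,
    json_version: Optional[int] = None,
    show_names: bool = True,
) -> List[str]:
    """Build an argument list for running git-sizer."""
    if repo is not None:
        # Let Git locate git-sizer in PATH and change to the repository first.
        command = ["git", "-C", repo, "sizer"]
    else:
        command = ["git-sizer"]
    if verbose:
        command.append("--verbose")
    if threshold is not None:
        command.append(f"--threshold={threshold}")
    if critical:
        command.append("--critical")
    if json_version is not None:
        command += ["--json", f"--json-version={json_version}"]
    if not show_names:
        command.append("--names=none")  # omit the footnotes
    return command


if __name__ == "__main__":
    print(" ".join(git_sizer_command(repo="/path/to/my/repo", json_version=2)))
```

The resulting list can be handed to subprocess.run, provided that both git and git-sizer are in your PATH.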

The Linux repository is large by most standards. As you can see, it is pushing some of Git's limits. And indeed, some Git operations on the Linux repository (e.g., git fsck, git gc) do take a while. But due to its sane structure, none of its dimensions are wildly out of proportion to the size of the code base, so the kernel project is managed successfully using Git.

Here is the non-verbose output for one of the famous "git bomb" repositories:

$ git-sizer
[...]
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Biggest checkouts            |           |                                |
| * Number of directories  [1] |  1.11 G   | !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! |
| * Maximum path depth     [1] |    11     | *                              |
| * Number of files        [1] |     ∞     | !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! |
| * Total size of files    [2] |  83.8 GiB | !!!!!!!!!!!!!!!!!!!!!!!!!!!!!! |

[1]  c1971b07ce6888558e2178a121804774c4201b17 (refs/heads/master^{tree})
[2]  d9513477b01825130c48c4bebed114c4b2d50401 (18ed56cbc5012117e24a603e7c072cf65d36d469^{tree})

This repository is mischievously constructed to have a pathological tree structure, with the same directories repeated over and over again. As a result, even though the entire repository is less than 20 kb in size, when checked out it would explode into over a billion directories containing over ten billion files. (git-sizer prints ∞ for the blob count because the true number has overflowed the 32-bit counter used for that field.)
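
The arithmetic behind such a blow-up is easy to reproduce. The sketch below builds a hypothetical structure of ten tree objects, each of whose ten entries point at the same child, and counts what a checkout would create. The layout is an assumption chosen to reach a similar scale, not a description of the actual repository, and the function names are invented for the example.

```python
from typing import Dict, List, Tuple

MAX_UINT32 = 2**32 - 1

# A tree is a list of (entry name, kind, object id); kind is "tree" or "blob".
Tree = List[Tuple[str, str, str]]


def build_repeated_trees(levels: int = 10, fanout: int = 10) -> Dict[str, Tree]:
    """Hypothetical layout: every entry at a level points at the same child object."""
    trees: Dict[str, Tree] = {}
    for level in range(levels):
        if level == levels - 1:
            entries = [(f"f{i}", "blob", "blob0") for i in range(fanout)]
        else:
            entries = [(f"d{i}", "tree", f"tree{level + 1}") for i in range(fanout)]
        trees[f"tree{level}"] = entries
    return trees


def expanded_counts(trees: Dict[str, Tree], root: str) -> Tuple[int, int]:
    """Return (directories, files) that checking out `root` would create."""
    memo: Dict[str, Tuple[int, int]] = {}

    def visit(tree_id: str) -> Tuple[int, int]:
        if tree_id not in memo:  # each distinct tree is examined only once
            dirs, files = 1, 0   # count this directory itself
            for _name, kind, object_id in trees[tree_id]:
                if kind == "tree":
                    sub_dirs, sub_files = visit(object_id)
                    dirs += sub_dirs
                    files += sub_files
                else:
                    files += 1
            memo[tree_id] = (dirs, files)
        return memo[tree_id]

    return visit(root)


if __name__ == "__main__":
    trees = build_repeated_trees()
    dirs, files = expanded_counts(trees, "tree0")
    print("stored tree objects:", len(trees))
    print("directories on checkout:", dirs)
    # A 32-bit counter cannot hold the file count, hence the ∞ in the table.
    print("files on checkout:", "∞" if files > MAX_UINT32 else files)
```

Ten small tree objects and a single blob are enough to describe a working copy that no disk could hold, which is why the stored size says so little about the checked-out size.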

Contributing

git-sizer is in regular use and is still under active development. If you would like to help out, please see CONTRIBUTING.md.
