• Stars
    star
    1,335
  • Rank 35,213 (Top 0.7 %)
  • Language
    Rust
  • License
    MIT License
  • Created almost 6 years ago
  • Updated over 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

git, implemented in rust, for fun and education πŸ¦€

git-rs

Implementing git in rust for fun and education!

If you're looking for a native Rust Git implementation ready for use in anger, you might consider looking at gitoxide instead!

This is actually my second stab at it, so big blocks will land in place from my first attempt. I'm trying again this year after reading more of "Programming Rust" (Blandy, Orendorff).

TODO

  • Read objects from loose store
  • Read objects from pack store
    • Read packfile indexes
    • Read delta'd objects
    • Fix interface so we don't need to run open for each read()
    • BUG: certain OFS deltas are misapplied.
      • Isolate the error case
      • Fix it
  • Load refs off of disk
  • Parse git signatures ("Identity"'s)
  • Create iterator for walking commit graph
  • Create iterator for walking trees
    • Materialize trees to disk (post gitindex?)
  • Create index from packfile
    • Rename Storage trait to Queryable
    • Rework object loading API from <Type + Boxed reader> to "we take a writable object"
      • Carry the rework out through StorageSet
    • Create the index
    • Wrap it in a nice API
  • refs v2
    • Load refs on demand
    • Load packed-refs
  • .git/index support
    • Read git index cache
    • Write git index cache
  • Create interface for writing new objects
  • Add benchmarks
  • Code coverage
  • Create packfile from list of objects (API TKTK)
  • Network protocol
    • receive-pack
    • send-pack
  • Try publishing to crates
    • Write documentation
    • Use crate in another project

PLAN

2022-04-30 Update

  • I did a bit of optimizing and my (completely unscientific) benchmarking has us pretty close to native git log!
  • I'm currently measuring the performance of git log --pretty=oneline >/dev/null vs. git_rs_log >/dev/null against a local checkout of nodejs/node.
    • This is on a M1 Pro Max Macbook Pro.
    • git_rs_log started out at ~500ms for a complete walk of the repo history. Vanilla git was seeing ~280-300ms.
  • I'm used to using DTrace + flamegraphs to profile, but to my dismay using DTrace requires booting into recovery mode on modern macOS & disabling system integrity protection.
    • at least, that was what my first investigation turned up. It looks like there may be other options I missed.
  • My coworker Eric suggested using macOS's Instruments.app instead via cargo-instruments which worked a treat.
  • Running cargo instruments necessitated exposing a way to set the current working directory for git_rs_log, so I added clap.
  • I'm pleased to report we were able to bring git_rs_log down to 300-320ms for a full walk of node's history. Here's what I did:
    • Switching deflate2 from it's miniz_oxide backend (the default, written in rust) to native zlib was the biggest boost
    • Switching a sort_by_key to sort_unstable_by_key in packfile index reads was a small (5-7ms) win. (Packfile indices have a fanout table, a list of ids in ascending order by id value, and a list of offsets-in-the-packfile whose order corresponds to the ids. In order to use a packfile index to read from a packfile, though, you need to be able to read the offset for your incoming id request and the next id in the packfile in offset order. Hence the sort_by_key call – once we've loaded the ids and offsets, we have to keep a mapping of position of id -> position in packfile.)
    • Tuning up the commit parser gave me another 10ms or so. Out of expediency, I had originally treated commits as HTTP transaction-like – newline-separated key/value headers followed by a double newline then the message. Now I actually store the well-known fields directly on the struct in parsed form. (There's still room for improvement here, too!)
      • This required adding an Id::from_ascii_bytes(&[u8]) for hexadecimal-encoded ids
        • Before this you'd have to bounce through std::str::from_utf8 which validates that the vector contains valid utf8 before we validate that it only contains hexadecimal chars; now we can do both in one step.
  • I'm pretty happy with that performance (for now), so I'm looking for something to pick up next. Options include:
    • Support for the worktree index file, .git/index. This is the start of the path for writing objects to the Git database.
    • Better support for refs.
    • Support for SHA256 object format. (Vanilla Git supports this now, so it'd be interesting to dig into how it works.)
    • Another Rust project, for a change of pace.
      • Postgres change data capture support, a la Debezium (but not tied to kafka.)
      • A return to WASM text parsing (I have a private project called "watto" for this.)

2022-04-27 Update

  • I'm back! I finally have some free time (and, maybe more importantly, available attention) so I'm revisiting this project after a few years.
  • Most recently, I fixed a bug with the "identity" parser. "Identities" are the bits of commit and tag metadata that look roughly like Humanname <email> 10100100100 +7300.
    • It turns out my parsing had a bug: it was dropping the last character of the email.
    • This was surfaced to me -- not by tests as it should have been, to my embarrassment -- but by trying to run git_rs_log against the Node repository. It turns out someone had committed without an email: Foobar <> 1010020203 -4000.
    • My off-by-one error turned this into a panic, with the program safely -- if unexpectedly -- crashing on that input.
    • I fixed the parser bug and golfed the parser itself down using match statements and constants.
  • While I was in that part of the code, I did a little editorializing: renaming identity to human_metadata.
  • I also took the oppportunity to lazily parse the human metadata. There's no need to walk that entire bytestring unless someone asks for it.
    • It turns out that we do ask for it during the course of git log: if we have multiple branches we need to load up the commit metadata to compare timestamps, as the output order depends on commit timestamp.
    • But for straightline chains of commits we don't need to load any of that up.
      • This saves ~10-20 milliseconds on git log in the node repository.
  • Burying the lede: git_rs_log is about 100-200ms slower than git log --pretty=oneline, run against the node repository.
    • Well, that certainly seems like a useful north star, does it not?
    • Where are we spending time that git isn't?
  • Buoyed by my recent deep dive into the LLVM ecosystem, I briefly explored profile-guided optimization.
    • I'm happy to report that I got a working setup and understood the results.
    • I'm less happy to report that, well, the results weren't stunning. This kind of checks out: if the performance gap is down to the number of I/O system calls we're making, assuming git makes fewer system calls that's where our perf gap will be.
    • So that's my current number one goal.
  • My number two goal is to modernize this repo and bring it up to the Rust standards that I picked up from $dayjob (and in particular, via @fishrock123.)
    • That means:
      • implementing more standard traits on types,
      • adding integration tests,
      • adding type docs,
      • and being a little bit more circumspect about what the crate exposes as a public API.

2019-02-08 Update

  • It's been a minute!
  • As you might have seen, figuring out packfile indexing has forced a lot of changes on the repo.
    • There's now a src/pack/read.rs file that holds generic read implementations for any BufRead + Seek.
    • The signature of the Storage trait changed -- instead of returning a boxed read object, it now accepts a Write destination.
    • Further, Storage is now Queryable (a better name!).
      • Because we moved from returning a Box to accepting generic Write, we could no longer box Queryables.
        • I didn't know this about Rust, so TIL!
      • StorageSet objects had to be rethought as a result -- they could no longer contain Box'd Storage objects.
        • Instead, we put the compiler to work -- because storage sets are known at compile time, I implemented Queryable for the unit type, (), two types (S, T), and arrays of single types Vec<T>.
        • This means that a StorageSet may hold a single, top-level Queryable, which might contain nested heterogenous Queryable definitions.
          • It gives me warm, fuzzy feelings πŸ’ž
  • You might also note that we're not actually done indexing packfiles. 😱
    • Here's the sitch: in order to create a packfile index, you have to run a CRC32 over the compressed bytes in the packfile.
    • The ZlibDecoder will pull more bytes from the underlying stream than it needs, so you can't take the route of handing it a CrcReader and get good results.
    • It's got to be a multi-pass deal.
      • The current plan is: run one pass to get offsets, un-delta'd shas and types.
      • Run a second pass to resolve CRCs and decompress deltas. This can be done in parallel.

2019-01-23 Update

  • It's time to start indexing packfiles.
    • This'll let us start talking to external servers and cloning things!
  • However, it's kind of a pain.
    • Packfiles (viewed as a store) aren't hugely useful until you have an index, so I had designed them as an object that takes an optional index from outside.
      • My thinking was that if an index was not given, we would build one in-memory.
    • That just blew up in my face, a little bit. πŸ’₯
    • In order to build an index from a packfile you have to iterate over all of the objects.
      • For each object, you want to record the offset and the SHA1 id of the object at that offset.
      • However, the object might be an offset or a reference delta.
        • That means that in order to index a packfile, you've got to be able to read delta'd objects at offsets within the packfile (implying you already have the Packfile instance created) and outside of the packfile ( implying you have a StorageSet.)
        • In other words: my assumptions about the program design are wrong.
    • So, in the next day or so I'll be reversing course.
      • It should be possible to produce a Packfile as a non-store object and iterate over it.
      • The "store" form of a packfile should be the combination of a Packfile and an Index (a PackfileStore.)
        • This means I'll be splitting the logic of src/stores/mmap_pack into "sequential packfile reads" and "random access packfile reads (with an index.)"
  • It's fun to be wrong πŸŽ‰

2019-01-21 Update

  • Well, that was a fun bug. Let's walk through it, shall we?
    • This occasionally showed up when a delta would decode another delta'd object.
      • I found a hash that would reliably fail to load.
      • We'd fail the read because the incoming base object would not match the 2nd delta's "base size". Here.
      • Removing the check to see if I got the deltas wrong would cause the thread to panic -- the delta's base size wasn't a lie.
    • First, I switched back to my old mmap-less packfile implementation, because I recently touched that code. "Revert the thing you touched last" is a winning strategy in these cases: doesn't cost expensive thinking, quickly puts bounds around the bug.
      • Alas, the old packfile implementation also had this bug. No dice.
    • I compared the output of this git implementation to my JS implementation.
      • I confirmed that the output of the JS implementation worked by comparing its output for the hash of concern to vanilla git.
      • After confirming that, I logged out the offsets being read from the file and the expected sizes. I compared this to similar debugging output I added to git-rs.
        • The offsets are the bound values sent into the packfile. For the outermost read ("Give me the object represented by eff4c6de1bd15cb0d28581e4da17035dbf9e8596"), the offsets come from the packfile index.
        • For OFS_DELTA ("offset delta") types, the start offset is obtained by reading a varint from the packfile and subtracting it from the current start offset.
      • The offsets and expected sizes matched!
        • This meant that:
          1. I was reading the correct data from the packfile
          2. I was reading varints correctly
          3. The bug must be in the application of the delta to a base object
      • From there I added logging above these state transitions, noting the particulars of the operation.
        • I added the same logging to the JS implementation, and found that (aside from totally bailing out ahead of the 2nd delta application) the commands were the same.
        • So it wasn't even that my delta code was wrong: it was my Read state machine.
    • At this point, I was like: "This is a read state machine bug. I know this."
      • So, one of the things this state machine does is carefully bail out if it can't write all of the bytes for a command. ("If there remains an extent to write, record the next state and break the loop.")
      • However, at this point we've already consumed the last command. There are no more instructions.
      • So if this function were to be called again, ...
    • The fix turned out to be simple, as these fixes usually are.
      • (I need to write a test for this, I know. I know. Pile your shame on me.)
  • So what did we learn?
    • Always test your state machines, folks.
    • (I've said it once, and I'm saying it again.) Malleable reference implementations will save your bacon.
      • Make sure you can trust your reference implementation.
  • Anyway. The tree reader works now! 🌲🌳🌲

2019-01-19 Update

  • It's slightly faster! πŸŽ‰
    • mmap sped things along nicely, shaving 20ms off of our runtime.
    • We're still reliably slower than git, though. It might be because we load the refset immediately.
    • I kept the immutable "file per read" packfile store around; I think it may come in handy in the future.
    • It would be excellent to capture this as a benchmark instead of running it ad-hoc.
  • I integrated the tree walking iterator and got a nice surprise:
    • There's a bug in my OFS delta code!
    • This is interesting, because it only appears for certain blobs in certain repositories.
      • Otherwise other OFS deltas seem to resolve cleanly.
      • Case in point: many of the commits I load as a test of the commit walk-er are OFS-delta'd.
    • Also of note: I've split from src/bin.rs into dedicated binaries for tree walking and commit walking.
  • Today's theme: isolate the bug in a test case.
    • EOD Update: It's really helpful to have a reference implementation.
    • I've confirmed that the reference implementation can read the object that breaks this project.
      • We are reading the same offsets, as well (phew)
    • I've further confirmed that swapping out the packfile implementation for the older, slower packfile doesn't affect anything.
    • I suspect this means there's either a problem in my delta code (highly possible!), my varint decoding code (very possible), or the Read implementation for Deltas. Yay, narrowed down results!

2019-01-15 Update

  • I added an (experimental) git_rs::walk::tree iterator to take a Tree and yield a path + a blob for each item.
    • It's probably slower than it should be: for each item it has to clone a PathBuf, because I couldn't work out the lifetimes.
    • If you know how to fix that, please open an issue and let me know πŸ’ž
  • I took some time to clean up the warnings during builds.
    • Oh! I also installed Clippy which warns about higher level antipatterns in Rust!
  • I'm still noodling over the 2-3x slowdown between vanilla git and Our Git.
    • I think I might create two packfile interfaces -- one "generic" and one "mmap"'d, to see if one or the other makes up the difference in performance.
      • This also has the virtue of being unsafe code, which is something I have not yet used in Rust!

2019-01-06 Update

  • I wrote an iterator for commits! The first cut kept a Vec of (Id, Commit) around, so we could always pick the most recent "next" commit out of the graph (since commits may have many parents.)
    • But in finishing up the collections section of "Programming Rust" I noticed that BinaryHeap was available, which keeps entries in sorted order. You don't often get to choose the underlying storage mechanism of your collections in JS, so this hadn't occurred to me!
    • Anyway. I swapped out the Vec for a BinaryHeap in this commit. Because this pushes the ordering into an Ord impl for a type, this opens up the possibility of using the one iterator definition for multiple different orderings. Neat!
  • Testing against a couple long-lived repo, the results coming out of git_rs are exactly the same as git!
    • However, it takes about twice the time: 60ms for git_rs where git takes 30ms.
    • I think I have a lead on this, and it has to do with packfile stores: each read from a packfile opens a new File instance.
  • I've added a TODO section to keep track of what comes next!

2019-01-02 Update

  • I implemented ref loading. It was a bit of a pain! Translating to and from Path types took a bit of doing.
  • I've been trying to read up on Rust idioms -- I found a couple of resources:
  • As a result, I've implemented FromStr for Id, (hopefully) giving it a more idiomatic API -- let id: Id = str.parse()?

2018-12-27 Update

  • Rust is feeling more natural. This chain felt natural to write. I was even able to cross-index a list with only a minimum of fighting the borrow checker.
  • I split the objects interface into Type + boxed read with a method for reifying the data into an Object. This feels good! It lets folks check to see, for example, if they're referring to a Blob without having to load the entire Blob into memory.
  • The closure interface for the loose interface works pretty well, but pack interfaces need to be able to ask the overarching set of stores for a reference due to REF_DELTA objects. This is a bummer, because it really quickly turns into "fighting the borrow checker." Right now I think the way forward is to build a StorageSet that holds a Vec of heterogenous Box<Storage> objects, where Storage is a new trait that specifies get(&Id, &StorageSet).
    • A sidenote re: the loose store: it feels kind of odd to have to produce a git_rs::Error instead of a std::io::Error. Room for improvement!
  • Oh! It was pretty easy to add a binary to this lib crate. And now we can git log other repos!

2018-12-21 Update

  • Decided to focus on moving a bit slower and making sure I have tests for primitives this time around.
  • Moved away from my original Box<Write> trait object design for object instance reading & storage format in favor of generics.

More Repositories

1

beefy

local development server that aims to make using browserify fast and fun
JavaScript
801
star
2

raf

requestAnimationFrame polyfill library
JavaScript
740
star
3

plate

a javascript template library, aimed at being compatible with django's template language.
JavaScript
183
star
4

ormnomnom

an orm that does 80% of the work and gets out of the way for the remaining 20%
JavaScript
169
star
5

cssauron

create matching selectors from css for your very own nested object hierarchy
JavaScript
113
star
6

bops

buffer operations
JavaScript
100
star
7

wilson

a node.js framework that glues together routes, views, orms, and things. designed to help tim allen navigate his day.
JavaScript
95
star
8

varint

use msb to create integer values of varying sizes
JavaScript
90
star
9

vkey

map ev.keyCode to human names
JavaScript
81
star
10

nojs

☁️☁️ nothing to see, a PLAN, and the ability to load lodash ☁️☁️
Python
73
star
11

node-python

a binding between node.js (really, the V8 engine) and python. super beta super buggy super great
C++
67
star
12

sse-stream

expose html5 server sent events (sse) as a writable stream
JavaScript
57
star
13

clawbang

#!/usr/bin/env πŸ¦€
Rust
52
star
14

glslmin

CLI tool to minify and bundle glsl programs
C
37
star
15

fullscreen

fullscreen polyfill api that presents an event emitter
JavaScript
36
star
16

reverse

πŸ’ž An HTTP router that provides reversing capabilities.
JavaScript
33
star
17

inflate

pure javascript inflate implemented as a through stream
JavaScript
33
star
18

provenance

visualize dependencies between functions in your JS
JavaScript
32
star
19

spatial-events

3d spatially aware event emitter
JavaScript
30
star
20

digraph-tag

ES6 string template tag for quickly generating directed graph data
JavaScript
30
star
21

browservefy

simplehttpserver / webrick replacement that automatically responds to a given entry point path with the results of browserify <that path>
JavaScript
28
star
22

python-javascript

A pure python implementation of JavaScript, just because.
Python
25
star
23

tide-http-auth

HTTP auth for tide! Pretty Basic, if you'll Bearer with me
Rust
25
star
24

jik

use css selectors to grep your JS codebase
JavaScript
24
star
25

node-runforcover

A require hook for node-bunker to emit coverage data.
JavaScript
24
star
26

rewrite-js

CLI tool to transform javascript programs using falafel
JavaScript
24
star
27

aabb-2d

2d axis aligned bounding boxes
JavaScript
20
star
28

setup-yq

JavaScript
20
star
29

aabb-3d

3d axis aligned bounding boxes
JavaScript
20
star
30

tracejs

Expand Node.js Error.stack traces into usable objects providing context and highlighting
JavaScript
20
star
31

kb-controls

present a polling interface for keyboard state given a binding object
JavaScript
19
star
32

asm-tag

compile x64 assembly into a callable function
JavaScript
19
star
33

escontrol

JavaScript
18
star
34

add-event-listener

add event listeners in IE and ... everywhere else
JavaScript
17
star
35

voxel-physical

create objects that have aabbs and respond to accel and vel updates
JavaScript
16
star
36

drag-stream

streamable mouse drag data
JavaScript
16
star
37

jabbascript

lightweight syntax additions to javascript, from a curmudgeon who loves javascript just the way it is
JavaScript
16
star
38

node-piano

A node.js require-hook for instrumenting your javascript code
JavaScript
15
star
39

mdn-cli

a CLI for accessing mozilla dev network docs in your terminal
Rust
14
star
40

cssauron-falafel

falafel bindings for cssauron
JavaScript
14
star
41

platoon

a javascript testing framework whose goals are to work gracefully with callbacks, both in node.js and the browser
JavaScript
14
star
42

essim

JavaScript
14
star
43

json-parse-stream

streaming json parser
JavaScript
14
star
44

stream-exhaust

Ensure that a stream is flowing data without mutating it
JavaScript
13
star
45

domnode-dom

DOMNode streams for HTMLElements
JavaScript
13
star
46

redispump

pipe stdout to a redis pubsub channel
JavaScript
13
star
47

jsl

a modular js linter
JavaScript
12
star
48

interact

a readable stream of mouse view events, wrapping up pointer-lock and drag-stream
JavaScript
12
star
49

simplee

A simple EventEmitter utility library for client-side use. Mimics Node.js's EventEmitter object.
JavaScript
12
star
50

yrc

you're real cool - a javascript parser written in c
C
12
star
51

tempisfugit

a node.js async binding to git. holy crap.
JavaScript
12
star
52

django-butter

Butter forms for Django (wokka-wokka-wokka)
Python
12
star
53

djamocha

Django and Mocha unit tests (with a CLI test runner)
JavaScript
11
star
54

npm-get-dependents

get all dependents of an npm package
JavaScript
11
star
55

tinybabygame

a little baby game i wrote in javascript using canvas.
JavaScript
10
star
56

voxel-control

manipulate voxel-physical objects using FPS-style controls
JavaScript
10
star
57

utensil

CLI tool to fork, run, and monitor node.js servers
JavaScript
10
star
58

worker.js

A semi-polyfill api for web workers. Aimed at making the experience fun (and not at all painful!)
JavaScript
9
star
59

git-to-js

translate git raw objects into javascript objects
JavaScript
9
star
60

narrativ

kind of a rip-off of docco, looking to make generating complex trees of literate-programming documentation easier.
JavaScript
9
star
61

directed-graph-to-dag

given a directed graph, return a set of edges to reverse to remove any cycles
JavaScript
9
star
62

waiter

a domain specific language/api for RESTful interfaces, inspired by Dolt
Python
9
star
63

normalize-css

normalize.css (from http://necolas.github.com/normalize.css/)
CSS
9
star
64

simplify-dag

given a directed acyclic graph, contract straight-line runs to single vertices
JavaScript
8
star
65

w3c-blob

w3c dom blob api in node and browser
JavaScript
8
star
66

phone-sensor

turn your phone into a sensor for great justice.
JavaScript
8
star
67

pointer-lock

pointer lock polyfill that presents an eventemitter / stream api
JavaScript
8
star
68

gsv

github search vehicle
JavaScript
8
star
69

postpie

a backend and transport for pieshop that uses ry's node_postgres binding!
JavaScript
8
star
70

tag_utils

a utility library that aims to alleviate some of the pain of writing django template tags using surlex.
Python
8
star
71

a-wild-version-appears

sometimes versions happen and you want to alert your users
JavaScript
8
star
72

django-bluebird

A Twitter OAuth multi-site authentication backend for Django, and so much more
Python
8
star
73

git-transport-protocol

a r/w stream that wraps a r/w stream and formats the data according to the git transfer protocol
JavaScript
7
star
74

nappingcat

a single-user ssh framework inspired by django
Python
7
star
75

git-the-latest

JavaScript
7
star
76

zigzag

zigzag signed integer encoding and decoding
JavaScript
7
star
77

reverse.js

reverse match your urls in clientside javascript (and node).
JavaScript
7
star
78

dom-event-stream

create a readable stream of dom events given an element
JavaScript
7
star
79

hq

aw heq – a jq-like CLI tool for modifying HTML trees based on CSS selectors
Rust
7
star
80

ls-stream

readable stream of file paths + stat objects
JavaScript
7
star
81

bfy-worker

web workers for browserify made fast, fun, and easy
JavaScript
7
star
82

keyframely

A JQuery-less rewrite of https://github.com/KuraFire/runloop -- usable in Node or in-browser.
JavaScript
7
star
83

git-fetch-pack

git's smart fetch-pack protocol
JavaScript
7
star
84

gl-triangle-strip-indexer

create element indices for triangle strip meshes
JavaScript
6
star
85

texture.js

browserify-compatible webgl 2D texture lib.
JavaScript
6
star
86

dst.js

Date.prototype.isDST() // true or false
JavaScript
6
star
87

tz.js

timezone sniffer in javascript.
JavaScript
6
star
88

git-packidx-parser

git pack index file parser
JavaScript
6
star
89

tar-parse

streaming tar parser
JavaScript
6
star
90

fpsjs

An in-browser multiplayer FPS using WebGL, pointer lock, WebSockets, and WebWorkers (oh my!)
JavaScript
6
star
91

scoped

command line tool exposing lexical-scope
JavaScript
6
star
92

dag-to-layers

assign each vertex of a dag to a layer, inserting dummy vertices
JavaScript
6
star
93

wilson-example

an example project using the wilson node.js framework
JavaScript
6
star
94

clone-packages

clone packages from one registry to another
JavaScript
6
star
95

pacman.js

A bitstream packer targeting canvas elements.
JavaScript
6
star
96

count-docula

An mdast-based tool for generating and testing documentation
JavaScript
6
star
97

cascadiajs-2013

my little parser talk abstract, transcript, demos for cascadiaJS 2013
JavaScript
6
star
98

spatial-trigger

enter/exit events for bounding boxes based on events from spatial-events
JavaScript
5
star
99

discard-stream

a transform stream that buffers up to N items, discarding on further writes
JavaScript
5
star
100

flows

a prototype streams implementation
JavaScript
5
star