hop
A simple archive format designed for quickly reading individual files without extracting the entire archive. It may eventually be used in Bun.
25x faster than unzip and 10x faster than tar at reading individual files (uncompressed)
Format | Random access | Fast extraction | Fast archiving | Compression | Encryption | Append |
---|---|---|---|---|---|---|
hop | β | | | | | |
tar | | | | | | |
zip | β | | | | | |
Features:
- Faster at printing individual files than tar & zip (compression disabled)
- Faster extraction than zip, comparable to tar (compression disabled)
- Faster archiving than zip, comparable to tar (compression disabled)
Anti-features:
- Single-threaded (but doesn't need to be)
- I wrote it in about 3 hours and there are no tests
- No checksums yet. Probably not a good idea to use this for untrusted data until that's fixed.
- Ignores symlinks
- Can't be larger than 4 GB
- Archives are read-only and file names are not normalized across platforms
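
The 4 GB limit listed above is consistent with the schema further down, where every offset and length is a uint32. The README does not state the cause, so treat the following as a small arithmetic check under that assumption. The names UINT32_MAX and fits_in_uint32 are made up for this illustration.

```python
# Assumption: the size limit follows from the uint32 offsets in the schema.
UINT32_MAX = 2**32 - 1  # largest value a uint32 field can hold


def fits_in_uint32(offset: int, length: int) -> bool:
    """Return True if both values can be stored in uint32 fields."""
    return 0 <= offset <= UINT32_MAX and 0 <= length <= UINT32_MAX


# Number of distinct byte offsets a uint32 can address, expressed in GiB.
print((UINT32_MAX + 1) / 2**30)  # 4.0
```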
Usage
Download the binary from /releases
To create an archive:
hop ./path-to-folder
To extract an archive:
hop archive.hop
To print one file from the archive:
hop archive.hop package.json
Why?
Why can't software read many tiny files with similar performance characteristics as individual files?
- Reading and writing lots of tiny files incurs significant syscall overhead, and (npm) packages often have lots of tiny files. Zip files are unacceptably slow when read from like a directory. Tar files extract quickly, but are slow at non-sequential access.
- Reading directory entries (ls) in large directory trees is slow
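
To make the syscall argument concrete, here is a rough Python sketch (not part of hop) that reads the same data two ways: one open/read/close per tiny file, versus a single open on a packed blob followed by one positioned read per entry. The pack helper and its in-memory index are hypothetical and are not hop's format. os.pread is only available on Unix-like systems.

```python
import os
import tempfile


def pack(paths, blob_path):
    """Concatenate files into one blob; return {path: (offset, length)}."""
    index, offset = {}, 0
    with open(blob_path, "wb") as blob:
        for p in paths:
            with open(p, "rb") as f:
                data = f.read()
            blob.write(data)
            index[p] = (offset, len(data))
            offset += len(data)
    return index


def read_individually(paths):
    """At least an open, a read and a close for every single file."""
    out = {}
    for p in paths:
        with open(p, "rb") as f:
            out[p] = f.read()
    return out


def read_from_blob(blob_path, index):
    """One open and one close in total, plus one pread per entry."""
    fd = os.open(blob_path, os.O_RDONLY)
    try:
        return {p: os.pread(fd, length, off) for p, (off, length) in index.items()}
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(100):
        p = os.path.join(tmp, f"file{i}.txt")
        with open(p, "wb") as f:
            f.write(f"contents {i}\n".encode())
        paths.append(p)
    blob_path = os.path.join(tmp, "packed.bin")
    index = pack(paths, blob_path)
    # Both strategies must return identical bytes for every path.
    same = read_individually(paths) == read_from_blob(blob_path, index)
    print(same)
```

Both paths return the same bytes; the difference is how many times the program has to cross into the kernel to get them.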
Some benchmarks
On macOS 12 with an M1X
Using tigerbeetle github repo as an example
Extracting:
Archiving:
On an Ubuntu AMD64 server
Extracting a node_modules folder
Why faster?
- It stores an array of hashes, one per file path, and the list of files is sorted lexicographically. This makes non-sequential access faster than tar, but can make creating new archives slower.
- Does not store directories, only files
- .hop files are read-only (more precisely, one could append but would have to rewrite all metadata)
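- Uses copy_file_range, which copies bytes between file descriptors inside the kernel instead of passing them through user space
- packed struct makes serialization & deserialization very fast because there is very little encoding or decoding to do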
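
The first point in the list above can be sketched roughly as follows. This is a minimal Python illustration, not hop's code. The README does not say which hash function hop uses or whether it scans or bisects the hash array, so CRC-32 (via zlib) and a plain scan are assumptions, and the Index class is hypothetical.

```python
import zlib
from array import array


def name_hash(path: str) -> int:
    # Assumption: CRC-32 stands in for whatever 32-bit hash hop really uses.
    return zlib.crc32(path.encode("utf-8"))


class Index:
    def __init__(self, paths):
        # File names are kept in lexicographic order, as the README describes.
        self.names = sorted(paths)
        # Parallel, compact array of hashes: hashes[i] belongs to names[i].
        self.hashes = array("L", (name_hash(p) for p in self.names))

    def find(self, path: str) -> int:
        """Return the position of `path` in the file list, or -1 if absent."""
        wanted = name_hash(path)
        for i, h in enumerate(self.hashes):
            # Integer comparison is cheap; strings are compared only on a
            # hash match, so a collision cannot return the wrong file.
            if h == wanted and self.names[i] == path:
                return i
        return -1


index = Index(["src/main.zig", "package.json", "README.md"])
print(index.find("package.json"))  # 1
print(index.find("missing.txt"))   # -1
```

The lookup never touches file contents: it only walks a small array of integers and confirms the name on a match.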
How does it work?
- File contents go at the top, file metadata goes at the bottom
- This is the metadata it currently stores:
package Hop;

struct StringPointer {
  uint32 off;
  uint32 len;
}

struct File {
  StringPointer name;
  uint32 name_hash;
  uint32 chmod;
  uint32 mtime;
  uint32 ctime;
  StringPointer data;
}

message Archive {
  uint32 version = 1;
  uint32 content_offset = 2;
  File[] files = 3;
  uint32[] name_hashes = 4;
  byte[] metadata = 5;
}
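
A toy Python sketch of the layout described above: file contents first, fixed-width metadata records after them. It is not hop's on-disk format. The 32-byte record mirrors the fields of File in the schema, but the byte order, the placement of the file names, the trailing footer that locates the metadata, and the use of CRC-32 for name_hash are all assumptions made for illustration. write_archive and read_file are hypothetical helpers.

```python
import struct
import zlib
from typing import Optional

# name.off, name.len, name_hash, chmod, mtime, ctime, data.off, data.len
FILE = struct.Struct("<8I")    # eight uint32 fields, 32 bytes, no padding
FOOTER = struct.Struct("<2I")  # hypothetical: metadata offset, file count


def write_archive(files: dict) -> bytes:
    """files maps name -> bytes. Contents go first, metadata goes last."""
    names = sorted(files)
    body = bytearray()
    data_ptrs = []
    for name in names:
        data_ptrs.append((len(body), len(files[name])))
        body += files[name]
    name_ptrs = []
    for name in names:
        raw = name.encode("utf-8")
        name_ptrs.append((len(body), len(raw)))
        body += raw
    meta_off = len(body)
    for name, (noff, nlen), (doff, dlen) in zip(names, name_ptrs, data_ptrs):
        h = zlib.crc32(name.encode("utf-8"))
        # chmod, mtime and ctime are placeholder values in this sketch.
        body += FILE.pack(noff, nlen, h, 0o644, 0, 0, doff, dlen)
    body += FOOTER.pack(meta_off, len(names))
    return bytes(body)


def read_file(archive: bytes, wanted: str) -> Optional[bytes]:
    """Find one file by walking fixed-width records; no full extraction."""
    meta_off, count = FOOTER.unpack_from(archive, len(archive) - FOOTER.size)
    for i in range(count):
        noff, nlen, _h, _mode, _mt, _ct, doff, dlen = FILE.unpack_from(
            archive, meta_off + i * FILE.size)
        if archive[noff:noff + nlen].decode("utf-8") == wanted:
            return archive[doff:doff + dlen]
    return None


archive = write_archive({"package.json": b'{"name":"demo"}', "README.md": b"# demo\n"})
print(read_file(archive, "package.json"))  # b'{"name":"demo"}'
```

Because every record has the same width, the reader can jump straight to record i with a multiplication, which is the property the packed struct point relies on.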