Repository Details

Create an index on a compressed text file

zindex creates and queries an index on a compressed, line-based text file in a time- and space-efficient way.

The itch I had

I have many multigigabyte gzipped text log files and I'd like to be able to find data in them by an index. There's a key on each line that a simple regex can pull out. However, finding a particular record requires zgrep, which takes ages as it has to decompress and scan through gigabytes of preceding data to get to each record.

Enter zindex, which builds an index and also stores decompression checkpoints along the way, allowing lightning-fast random access. Pulling out single lines by either line number or index entry is then almost instant, even for huge files. The indices themselves are small too, typically ~10% of the compressed file size for a simple unique numeric index.
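
The checkpoint idea can be illustrated with a small Python sketch. This is not zindex's implementation (zindex is a compiled program built on zlib); the function names build_checkpoints and read_at are hypothetical, and the checkpoints here are in-memory copies of Python's zlib decompressor rather than anything stored in an index file. It only shows why saved decompression state lets a reader start near the wanted offset instead of at the beginning of the file.

```python
import gzip
import zlib


def build_checkpoints(compressed, span, chunk=1024):
    """Decompress once, saving decompressor state roughly every `span` output bytes.

    Returns a list of (uncompressed_offset, compressed_offset, saved_state).
    """
    d = zlib.decompressobj(wbits=31)  # wbits=31 selects the gzip container
    checkpoints = [(0, 0, d.copy())]
    out_pos = 0
    next_mark = span
    for in_pos in range(0, len(compressed), chunk):
        out_pos += len(d.decompress(compressed[in_pos:in_pos + chunk]))
        # Skip checkpoints once the stream has ended; nothing follows them.
        if not d.eof and out_pos >= next_mark:
            checkpoints.append((out_pos, in_pos + chunk, d.copy()))
            next_mark = out_pos + span
    return checkpoints


def read_at(compressed, checkpoints, offset, length, chunk=1024):
    """Return `length` uncompressed bytes at `offset`, resuming from a checkpoint."""
    # Pick the last checkpoint at or before the requested offset.
    out_pos, in_pos, saved = max(
        (c for c in checkpoints if c[0] <= offset), key=lambda c: c[0]
    )
    d = saved.copy()  # copy so the stored checkpoint stays reusable
    buf = b""
    needed = (offset - out_pos) + length
    while len(buf) < needed and in_pos < len(compressed):
        buf += d.decompress(compressed[in_pos:in_pos + chunk])
        in_pos += chunk
    start = offset - out_pos
    return buf[start:start + length]


# Made-up sample data: 20,000 short log-like lines, gzipped in memory.
data = b"".join(b"id:%d payload\n" % i for i in range(20000))
compressed = gzip.compress(data)
checkpoints = build_checkpoints(compressed, span=64 * 1024)
snippet = read_at(compressed, checkpoints, 200000, 40)
```

With a 64 KiB span, a read at any offset decompresses at most roughly one span of data plus the requested bytes, regardless of how far into the file the offset lies.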

Creating an index

zindex needs to be told what part of each line constitutes the index. This can be done by a regular expression, by field, or by piping each line through an external program.

By default, zindex creates an index named file.gz.zindex when asked to index file.gz.

Example: create an index on lines matching a numeric regular expression. The capture group indicates the part that's to be indexed, and the --numeric and --unique options declare that each line has a unique, numeric index.

$ zindex file.gz --regex 'id:([0-9]+)' --numeric --unique
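
To make the role of the capture group concrete, here is a minimal Python sketch that applies the same pattern to a single line. It uses Python's re module rather than whatever regex engine zindex uses, the helper name numeric_key is hypothetical, and returning None for a non-matching line is an assumption of this sketch, not documented zindex behaviour.

```python
import re

KEY = re.compile(r"id:([0-9]+)")  # same pattern as in the zindex example


def numeric_key(line):
    """Return the captured id as an int, or None if the line has no match."""
    m = KEY.search(line)
    return int(m.group(1)) if m else None


# Made-up log line for illustration.
key = numeric_key("2015-06-01 12:00:00 id:1023 status=ok")
```

Only the text inside the parentheses becomes the key; the "id:" prefix is used to locate it but is not part of it.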

Example: create an index on the second field of a CSV file:

$ zindex file.gz --delimiter , --field 2
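
As a rough picture of what a field index keys on, the following Python sketch pulls the second field out of a line. It assumes fields are numbered from 1 (consistent with the configuration example, where field 1 is the first field) and uses a plain split with no CSV quoting rules; whether zindex treats quoted delimiters specially is not stated here. The helper name field_key is hypothetical.

```python
def field_key(line, delimiter=",", field=2):
    """Return the 1-based `field` of `line` split on `delimiter`, or None if absent."""
    parts = line.rstrip("\n").split(delimiter)
    return parts[field - 1] if 1 <= field <= len(parts) else None


# Made-up CSV line for illustration.
key = field_key("2015-06-01,ORD-42,shipped\n")
```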

Example: create an index on a JSON field orderId.id in any of the items in the document root's actions array (requires jq). The jq query creates an array of all the orderId.ids, then joins them with a space to ensure each individual line piped to jq creates a single line of output, with multiple matches separated by spaces (which is the default separator).

$ zindex file.gz --pipe "jq --raw-output --unbuffered '[.actions[].orderId.id] | join(\" \")'"
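
For readers less familiar with jq, this Python sketch approximates what the query above produces for one input line: every orderId.id under actions, joined by spaces on a single output line. The function name order_ids and the sample document are hypothetical, and unlike jq this version raises an error when a key is missing.

```python
import json


def order_ids(line):
    """Approximate '[.actions[].orderId.id] | join(" ")' for one line of JSON."""
    doc = json.loads(line)
    # One output line per input line; multiple ids are separated by spaces.
    return " ".join(str(action["orderId"]["id"]) for action in doc["actions"])


# Made-up JSON line for illustration.
sample = '{"actions": [{"orderId": {"id": "A1"}}, {"orderId": {"id": "B2"}}]}'
keys = order_ids(sample)
```

Each space-separated value on the output line then acts as a separate index entry for that input line.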

Multiple indices, and configuring index creation through a JSON configuration file, are supported; see below.

Querying the index

The zq program is used to query an index. It's given the name of the compressed file and a list of queries. For example:

$ zq file.gz 1023 4443 554

It's also possible to output by line number, so to print lines 1 and 1000 from a file:

$ zq file.gz --line 1 1000

Building from source

zindex uses CMake for its basic building (though it has a bootstrapping Makefile), and requires a C++11-compatible compiler (GCC 4.8 or above, or clang 3.4 or above). It also requires zlib. With a suitable compiler available, building ought to be as simple as:

$ git clone https://github.com/mattgodbolt/zindex.git
$ cd zindex
$ make

Binaries are left in build/Release.

Additionally a static binary can be built if you're happy to dip your toe into CMake:

$ cd path/to/build/directory
$ cmake path/to/zindex/checkout/dir -DStatic:BOOL=On -DCMAKE_BUILD_TYPE=Release
$ make

Multiple indices

To support more than one index, or for easier configuration than all the command-line flags that might be needed, there is a JSON configuration format. Pass the --config <yourconfigfile>.json option and put something like this in the configuration file:

{ 
    "indexes": [
        {
            "type": "field",
            "delimiter": "\t",
            "fieldNum": 1
        },
        {
            "name": "secondary",
            "type": "field",
            "delimiter": "\t",
            "fieldNum": 2
        }
    ]
}

This creates two indices, one on the first field and one on the second field, as delimited by tabs. One can then specify which index to query with the -i <index> option of zq.
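
If the configuration is produced by a script rather than written by hand, a sketch along these lines may help avoid escaping mistakes, such as the tab delimiter. It only reproduces the keys shown in the example above; the function name write_config and the output path are hypothetical.

```python
import json

# Same structure as the example configuration above.
config = {
    "indexes": [
        {"type": "field", "delimiter": "\t", "fieldNum": 1},
        {"name": "secondary", "type": "field", "delimiter": "\t", "fieldNum": 2},
    ]
}


def write_config(path, cfg=config):
    """Write the configuration as JSON; json.dump escapes the tab as \\t."""
    with open(path, "w") as f:
        json.dump(cfg, f, indent=4)
        f.write("\n")
```

The resulting file would then be passed to zindex with the --config option described above.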

Issues and feature requests

See the issue tracker for TODOs and known bugs. Please raise bugs there, and feel free to submit suggestions there also.

Feel free to contact me if you prefer email over bug trackers.
