• Stars
    star
    10,377
  • Rank 3,321 (Top 0.07 %)
  • Language
    Rust
  • License
    The Unlicense
  • Created about 10 years ago
  • Updated 5 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

A fast CSV command line toolkit written in Rust.

xsv is a command line program for indexing, slicing, analyzing, splitting and joining CSV files. Commands should be simple, fast and composable:

  1. Simple tasks should be easy.
  2. Performance trade offs should be exposed in the CLI interface.
  3. Composition should not come at the expense of performance.

This README contains information on how to install xsv, in addition to a quick tour of several commands.

Linux build status Windows build status

Dual-licensed under MIT or the UNLICENSE.

Available commands

  • cat - Concatenate CSV files by row or by column.
  • count - Count the rows in a CSV file. (Instantaneous with an index.)
  • fixlengths - Force a CSV file to have same-length records by either padding or truncating them.
  • flatten - A flattened view of CSV records. Useful for viewing one record at a time. e.g., xsv slice -i 5 data.csv | xsv flatten.
  • fmt - Reformat CSV data with different delimiters, record terminators or quoting rules. (Supports ASCII delimited data.)
  • frequency - Build frequency tables of each column in CSV data. (Uses parallelism to go faster if an index is present.)
  • headers - Show the headers of CSV data. Or show the intersection of all headers between many CSV files.
  • index - Create an index for a CSV file. This is very quick and provides constant time indexing into the CSV file.
  • input - Read CSV data with exotic quoting/escaping rules.
  • join - Inner, outer and cross joins. Uses a simple hash index to make it fast.
  • partition - Partition CSV data based on a column value.
  • sample - Randomly draw rows from CSV data using reservoir sampling (i.e., use memory proportional to the size of the sample).
  • reverse - Reverse order of rows in CSV data.
  • search - Run a regex over CSV data. Applies the regex to each field individually and shows only matching rows.
  • select - Select or re-order columns from CSV data.
  • slice - Slice rows from any part of a CSV file. When an index is present, this only has to parse the rows in the slice (instead of all rows leading up to the start of the slice).
  • sort - Sort CSV data.
  • split - Split one CSV file into many CSV files of N chunks.
  • stats - Show basic types and statistics of each column in the CSV file. (i.e., mean, standard deviation, median, range, etc.)
  • table - Show aligned output of any CSV data using elastic tabstops.

A whirlwind tour

Let's say you're playing with some of the data from the Data Science Toolkit, which contains several CSV files. Maybe you're interested in the population counts of each city in the world. So grab the data and start examining it:

$ curl -LO https://burntsushi.net/stuff/worldcitiespop.csv
$ xsv headers worldcitiespop.csv
1   Country
2   City
3   AccentCity
4   Region
5   Population
6   Latitude
7   Longitude

The next thing you might want to do is get an overview of the kind of data that appears in each column. The stats command will do this for you:

$ xsv stats worldcitiespop.csv --everything | xsv table
field       type     min            max            min_length  max_length  mean          stddev         median     mode         cardinality
Country     Unicode  ad             zw             2           2                                                   cn           234
City        Unicode   bab el ahmar  Þykkvibaer     1           91                                                  san jose     2351892
AccentCity  Unicode   Bâb el Ahmar  ïn Bou Chella  1           91                                                  San Antonio  2375760
Region      Unicode  00             Z9             0           2                                        13         04           397
Population  Integer  7              31480498       0           8           47719.570634  302885.559204  10779                   28754
Latitude    Float    -54.933333     82.483333      1           12          27.188166     21.952614      32.497222  51.15        1038349
Longitude   Float    -179.983333    180            1           14          37.08886      63.22301       35.28      23.8         1167162

The xsv table command takes any CSV data and formats it into aligned columns using elastic tabstops. You'll notice that it even gets alignment right with respect to Unicode characters.

So, this command takes about 12 seconds to run on my machine, but we can speed it up by creating an index and re-running the command:

$ xsv index worldcitiespop.csv
$ xsv stats worldcitiespop.csv --everything | xsv table
...

Which cuts it down to about 8 seconds on my machine. (And creating the index takes less than 2 seconds.)

Notably, the same type of "statistics" command in another CSV command line toolkit takes about 2 minutes to produce similar statistics on the same data set.

Creating an index gives us more than just faster statistics gathering. It also makes slice operations extremely fast because only the sliced portion has to be parsed. For example, let's say you wanted to grab the last 10 records:

$ xsv count worldcitiespop.csv
3173958
$ xsv slice worldcitiespop.csv -s 3173948 | xsv table
Country  City               AccentCity         Region  Population  Latitude     Longitude
zw       zibalonkwe         Zibalonkwe         06                  -19.8333333  27.4666667
zw       zibunkululu        Zibunkululu        06                  -19.6666667  27.6166667
zw       ziga               Ziga               06                  -19.2166667  27.4833333
zw       zikamanas village  Zikamanas Village  00                  -18.2166667  27.95
zw       zimbabwe           Zimbabwe           07                  -20.2666667  30.9166667
zw       zimre park         Zimre Park         04                  -17.8661111  31.2136111
zw       ziyakamanas        Ziyakamanas        00                  -18.2166667  27.95
zw       zizalisari         Zizalisari         04                  -17.7588889  31.0105556
zw       zuzumba            Zuzumba            06                  -20.0333333  27.9333333
zw       zvishavane         Zvishavane         07      79876       -20.3333333  30.0333333

These commands are instantaneous because they run in time and memory proportional to the size of the slice (which means they will scale to arbitrarily large CSV data).

Switching gears a little bit, you might not always want to see every column in the CSV data. In this case, maybe we only care about the country, city and population. So let's take a look at 10 random rows:

$ xsv select Country,AccentCity,Population worldcitiespop.csv \
  | xsv sample 10 \
  | xsv table
Country  AccentCity       Population
cn       Guankoushang
za       Klipdrift
ma       Ouled Hammou
fr       Les Gravues
la       Ban Phadèng
de       Lüdenscheid      80045
qa       Umm ash Shubrum
bd       Panditgoan
us       Appleton
ua       Lukashenkivske

Whoops! It seems some cities don't have population counts. How pervasive is that?

$ xsv frequency worldcitiespop.csv --limit 5
field,value,count
Country,cn,238985
Country,ru,215938
Country,id,176546
Country,us,141989
Country,ir,123872
City,san jose,328
City,san antonio,320
City,santa rosa,296
City,santa cruz,282
City,san juan,255
AccentCity,San Antonio,317
AccentCity,Santa Rosa,296
AccentCity,Santa Cruz,281
AccentCity,San Juan,254
AccentCity,San Miguel,254
Region,04,159916
Region,02,142158
Region,07,126867
Region,03,122161
Region,05,118441
Population,(NULL),3125978
Population,2310,12
Population,3097,11
Population,983,11
Population,2684,11
Latitude,51.15,777
Latitude,51.083333,772
Latitude,50.933333,769
Latitude,51.116667,769
Latitude,51.133333,767
Longitude,23.8,484
Longitude,23.2,477
Longitude,23.05,476
Longitude,25.3,474
Longitude,23.1,459

(The xsv frequency command builds a frequency table for each column in the CSV data. This one only took 5 seconds.)

So it seems that most cities do not have a population count associated with them at all. No matter—we can adjust our previous command so that it only shows rows with a population count:

$ xsv search -s Population '[0-9]' worldcitiespop.csv \
  | xsv select Country,AccentCity,Population \
  | xsv sample 10 \
  | xsv table
Country  AccentCity       Population
es       Barañáin         22264
es       Puerto Real      36946
at       Moosburg         4602
hu       Hejobaba         1949
ru       Polyarnyye Zori  15092
gr       Kandíla          1245
is       Ólafsvík         992
hu       Decs             4210
bg       Sliven           94252
gb       Leatherhead      43544

Erk. Which country is at? No clue, but the Data Science Toolkit has a CSV file called countrynames.csv. Let's grab it and do a join so we can see which countries these are:

curl -LO https://gist.githubusercontent.com/anonymous/063cb470e56e64e98cf1/raw/98e2589b801f6ca3ff900b01a87fbb7452eb35c7/countrynames.csv
$ xsv headers countrynames.csv
1   Abbrev
2   Country
$ xsv join --no-case  Country sample.csv Abbrev countrynames.csv | xsv table
Country  AccentCity       Population  Abbrev  Country
es       Barañáin         22264       ES      Spain
es       Puerto Real      36946       ES      Spain
at       Moosburg         4602        AT      Austria
hu       Hejobaba         1949        HU      Hungary
ru       Polyarnyye Zori  15092       RU      Russian Federation | Russia
gr       Kandíla          1245        GR      Greece
is       Ólafsvík         992         IS      Iceland
hu       Decs             4210        HU      Hungary
bg       Sliven           94252       BG      Bulgaria
gb       Leatherhead      43544       GB      Great Britain | UK | England | Scotland | Wales | Northern Ireland | United Kingdom

Whoops, now we have two columns called Country and an Abbrev column that we no longer need. This is easy to fix by re-ordering columns with the xsv select command:

$ xsv join --no-case  Country sample.csv Abbrev countrynames.csv \
  | xsv select 'Country[1],AccentCity,Population' \
  | xsv table
Country                                                                              AccentCity       Population
Spain                                                                                Barañáin         22264
Spain                                                                                Puerto Real      36946
Austria                                                                              Moosburg         4602
Hungary                                                                              Hejobaba         1949
Russian Federation | Russia                                                          Polyarnyye Zori  15092
Greece                                                                               Kandíla          1245
Iceland                                                                              Ólafsvík         992
Hungary                                                                              Decs             4210
Bulgaria                                                                             Sliven           94252
Great Britain | UK | England | Scotland | Wales | Northern Ireland | United Kingdom  Leatherhead      43544

Perhaps we can do this with the original CSV data? Indeed we can—because joins in xsv are fast.

$ xsv join --no-case Abbrev countrynames.csv Country worldcitiespop.csv \
  | xsv select '!Abbrev,Country[1]' \
  > worldcitiespop_countrynames.csv
$ xsv sample 10 worldcitiespop_countrynames.csv | xsv table
Country                      City                   AccentCity             Region  Population  Latitude    Longitude
Sri Lanka                    miriswatte             Miriswatte             36                  7.2333333   79.9
Romania                      livezile               Livezile               26      1985        44.512222   22.863333
Indonesia                    tawainalu              Tawainalu              22                  -4.0225     121.9273
Russian Federation | Russia  otar                   Otar                   45                  56.975278   48.305278
France                       le breuil-bois robert  le Breuil-Bois Robert  A8                  48.945567   1.717026
France                       lissac                 Lissac                 B1                  45.103094   1.464927
Albania                      lumalasi               Lumalasi               46                  40.6586111  20.7363889
China                        motzushih              Motzushih              11                  27.65       111.966667
Russian Federation | Russia  svakino                Svakino                69                  55.60211    34.559785
Romania                      tirgu pancesti         Tirgu Pancesti         38                  46.216667   27.1

The !Abbrev,Country[1] syntax means, "remove the Abbrev column and remove the second occurrence of the Country column." Since we joined with countrynames.csv first, the first Country name (fully expanded) is now included in the CSV data.

This xsv join command takes about 7 seconds on my machine. The performance comes from constructing a very simple hash index of one of the CSV data files given. The join command does an inner join by default, but it also has left, right and full outer join support too.

Installation

Binaries for Windows, Linux and macOS are available from Github.

If you're a macOS Homebrew user, then you can install xsv from homebrew-core:

$ brew install xsv

If you're a macOS MacPorts user, then you can install xsv from the official ports:

$ sudo port install xsv

If you're a Nix/NixOS user, you can install xsv from nixpkgs:

$ nix-env -i xsv

Alternatively, you can compile from source by installing Cargo (Rust's package manager) and installing xsv using Cargo:

cargo install xsv

Compiling from this repository also works similarly:

git clone git://github.com/BurntSushi/xsv
cd xsv
cargo build --release

Compilation will probably take a few minutes depending on your machine. The binary will end up in ./target/release/xsv.

Benchmarks

I've compiled some very rough benchmarks of various xsv commands.

Motivation

Here are several valid criticisms of this project:

  1. You shouldn't be working with CSV data because CSV is a terrible format.
  2. If your data is gigabytes in size, then CSV is the wrong storage type.
  3. Various SQL databases provide all of the operations available in xsv with more sophisticated indexing support. And the performance is a zillion times better.

I'm sure there are more criticisms, but the impetus for this project was a 40GB CSV file that was handed to me. I was tasked with figuring out the shape of the data inside of it and coming up with a way to integrate it into our existing system. It was then that I realized that every single CSV tool I knew about was woefully inadequate. They were just too slow or didn't provide enough flexibility. (Another project I had comprised of a few dozen CSV files. They were smaller than 40GB, but they were each supposed to represent the same kind of data. But they all had different column and unintuitive column names. Useful CSV inspection tools were critical here—and they had to be reasonably fast.)

The key ingredients for helping me with my task were indexing, random sampling, searching, slicing and selecting columns. All of these things made dealing with 40GB of CSV data a bit more manageable (or dozens of CSV files).

Getting handed a large CSV file once was enough to launch me on this quest. From conversations I've had with others, CSV data files this large don't seem to be a rare event. Therefore, I believe there is room for a tool that has a hope of dealing with data that large.

Naming collision

This project is unrelated to another similar project with the same name: https://mj.ucw.cz/sw/xsv/

More Repositories

1

ripgrep

ripgrep recursively searches directories for a regex pattern while respecting your gitignore
Rust
48,517
star
2

toml

TOML parser for Golang with reflection.
Go
4,464
star
3

quickcheck

Automated property based testing for Rust (with shrinking).
Rust
2,408
star
4

erd

Translates a plain text description of a relational database schema to a graphical entity-relationship diagram.
Haskell
1,805
star
5

fst

Represent large sets and maps compactly with finite state transducers.
Rust
1,771
star
6

jiff

A date-time library for Rust that encourages you to jump into the pit of success.
Rust
1,736
star
7

rust-csv

A CSV parser for Rust, with Serde support.
Rust
1,710
star
8

walkdir

Rust library for walking directories recursively.
Rust
1,283
star
9

nflgame

An API to retrieve and read NFL Game Center JSON data. It can work with real-time data, which can be used for fantasy football.
Python
1,257
star
10

nfldb

A library to manage and update NFL data in a relational database.
Python
1,079
star
11

aho-corasick

A fast implementation of Aho-Corasick in Rust.
Rust
1,028
star
12

byteorder

Rust library for reading/writing numbers in big-endian and little-endian.
Rust
980
star
13

wingo

A fully-featured window manager written in Go.
Go
958
star
14

memchr

Optimized string search routines for Rust.
Rust
888
star
15

bstr

A string type for Rust that is not required to be valid UTF-8.
Rust
795
star
16

advent-of-code

Rust solutions to AoC 2018
Rust
479
star
17

xgb

The X Go Binding is a low-level API to communicate with the X server. It is modeled on XCB and supports many X extensions.
Go
472
star
18

termcolor

Cross platform terminal colors for Rust.
Rust
462
star
19

rust-snappy

Snappy compression implemented in Rust (including the Snappy frame format).
Rust
449
star
20

go-sumtype

A simple utility for running exhaustiveness checks on Go "sum types."
Go
421
star
21

chan

Multi-producer, multi-consumer concurrent channel for Rust.
Rust
392
star
22

regex-automata

A low level regular expression library that uses deterministic finite automata.
Rust
352
star
23

cargo-benchcmp

A small utility to compare Rust micro-benchmarks.
Rust
342
star
24

suffix

Fast suffix arrays for Rust (with Unicode support).
Rust
262
star
25

rure-go

Go bindings to Rust's regex engine.
Go
250
star
26

tabwriter

Elastic tabstops for Rust.
Rust
247
star
27

rebar

A biased barometer for gauging the relative speed of some regex engines on a curated set of tasks.
Python
227
star
28

imdb-rename

A command line tool to rename media files based on titles from IMDb.
Rust
226
star
29

critcmp

A command line tool for comparing benchmarks run by Criterion.
Rust
216
star
30

ty

Easy parametric polymorphism at run time using completely unidiomatic Go.
Go
198
star
31

xgbutil

A utility library to make use of the X Go Binding easier. (Implements EWMH and ICCCM specs, key binding support, etc.)
Go
191
star
32

pytyle3

An updated (and much faster) version of pytyle that uses xpybutil and is compatible with Openbox Multihead.
Python
181
star
33

dotfiles

My configuration files and personal collection of scripts.
Vim Script
154
star
34

rsc-regexp

Translations of a simple C program to Rust.
Rust
138
star
35

rust-cbor

CBOR (binary JSON) for Rust with automatic type based decoding and encoding.
Rust
129
star
36

chan-signal

Respond to OS signals with channels.
Rust
125
star
37

goim

Goim is a robust command line utility to maintain and query the Internet Movie Database (IMDb).
Go
117
star
38

same-file

Cross platform Rust library for checking whether two file paths are the same file.
Rust
101
star
39

clibs

A smattering of miscellaneous C libraries. Includes sane argument parsing, a thread-safe multi-producer/multi-consumer queue, and implementation of common data structures (hashmaps, vectors and linked lists).
C
98
star
40

ucd-generate

A command line tool to generate Unicode tables as source code.
Rust
95
star
41

nflvid

An experimental library to map play meta data to footage of that play.
Python
91
star
42

rust-stats

Basic statistical functions on streams for Rust.
Rust
87
star
43

migration

Package migration for Golang automatically handles versioning of a database schema by applying a series of migrations supplied by the client.
Go
81
star
44

winapi-util

Safe wrappers for various Windows specific APIs.
Rust
64
star
45

xpybutil

An incomplete xcb-util port plus some extras
Python
62
star
46

graphics-go

Automatically exported from code.google.com/p/graphics-go
Go
60
star
47

rust-pcre2

High level Rust bindings to PCRE2.
C
56
star
48

blog

My blog.
Rust
52
star
49

rust-sorts

Implementations of common sorting algorithms in Rust with comprehensive tests and benchmarks.
Rust
51
star
50

openbox-multihead

Openbox with patches for enhanced multihead support.
C
46
star
51

nakala

A low level embedded information retrieval system.
Rust
45
star
52

nflfan

View your fantasy teams with nfldb using a web interface.
JavaScript
43
star
53

globset

A globbing library for Rust.
Rust
42
star
54

utf8-ranges

Convert contiguous ranges of Unicode codepoints to UTF-8 byte ranges.
Rust
42
star
55

rtmpdump-ksv

rtmpdump with ksv's patch. Intended to track upstream git://git.ffmpeg.org/rtmpdump as well.
C
40
star
56

regexp

A regular expression library implemented in Rust.
Rust
37
star
57

xdg

A Go package for reading config and data files according to the XDG Base Directory specification.
Go
35
star
58

locker

A simple Golang package for conveniently using named read/write locks. Useful for synchronizing access to session based storage in web applications.
Go
34
star
59

nflcmd

A collection of command line utilities for viewing NFL statistics and rankings with nfldb.
Python
32
star
60

notes

A collection of small notes that aren't appropriate for my blog.
31
star
61

mempool

A fast thread safe memory pool for reusing allocations.
Rust
29
star
62

gribble

A command oriented language whose environment is defined through Go struct types by reflection.
Go
28
star
63

vcr

A simple wrapper tool around ffmpeg to capture video from a VCR.
Rust
27
star
64

encoding_rs_io

Streaming I/O adapters for the encoding_rs crate.
Rust
25
star
65

rust-cmail

A simple command line utility for periodically sending email containing the output of long-running commands.
Rust
21
star
66

cluster

A simple API for managing a network cluster with smart peer discovery.
Go
19
star
67

pager-multihead

A pager that supports per-monitor desktops (compatible with Openbox Multihead and Wingo)
Python
15
star
68

cablastp

Performs BLAST on compressed proteomic data.
Go
15
star
69

rust-error-handling-case-study

Code for the case study in my blog post: http://blog.burntsushi.net/rust-error-handling
Rust
15
star
70

rg-cratesio-typosquat

The source code of the 'rg' crate. It is an intentional typo-squat that redirects folks to 'ripgrep'.
Rust
15
star
71

imgv

An image viewer for Linux written in Go.
Go
14
star
72

cmd

A convenience library for executing commands in Go, including executing commands in parallel with a pool.
Go
14
star
73

fanfoot

View your fantasy football leagues and get text alerts when one of your players scores.
Python
12
star
74

cmail

cmail runs a command and sends the output to your email address at certain intervals.
Go
12
star
75

gohead

An xrandr wrapper script to manage multi-monitor configurations. With hooks.
Go
12
star
76

burntsushi-blog

A small Go application for my old blog.
CSS
12
star
77

intern

A simple package for interning strings, with a focus on efficiently representing dense pairwise data.
Go
11
star
78

crev-proofs

My crev reviews.
10
star
79

pytyle1

A lightweight X11 tool for simulating tiling in a stacking window manager.
Python
9
star
80

cif

A golang package for reading and writing data in the Crystallographic Information File (CIF) format. It mostly conforms to the CIF 1.1 specification.
Go
9
star
81

rucd

WIP
Rust
8
star
82

qcsv

An API to read and analyze CSV files by inferring types for each column of data.
Python
8
star
83

pyndow

A window manager written in Python
Python
8
star
84

csql

Package csql provides convenience functions for use with the types and functions defined in the standard library `database/sql` package.
Go
6
star
85

freetype-go

A fork of freetype-go with bounding box calculations.
Go
6
star
86

sqlsess

Simple database backed session management. Integrates with Gorilla's sessions package.
Go
6
star
87

go-wayland-simple-shm

C
5
star
88

sqlauth

A simple Golang package that provides database backed user authentication with bcrypt.
Vim Script
4
star
89

lcmweb

A Go web application for coding documents with the Linguistic Category Model.
JavaScript
4
star
90

bcbgo

Computational biology tools for the BCB group at Tufts University.
Go
4
star
91

fex

A framework for specifying and executing experiments.
Haskell
3
star
92

present

My presentations.
HTML
3
star
93

memchr-2.6-mov-regression

Rust
3
star
94

genecentric

A tool to generate between-pathway modules and perform GO enrichment on them.
Python
3
star
95

rust-docs

A silly repo for managing my Rust crate documentation.
Python
3
star
96

pcre2-mirror

A git mirror for PCRE2's SVN repository at svn://vcs.exim.org/pcre2/code
2
star
97

xpyb

A clone of xorg-xpyb.
C
2
star
98

burntsushi-homepage

A small PHP web application for my old homepage.
PHP
2
star
99

window-marker

Use vim-like marks on windows.
Python
2
star
100

sudoku

An attempt at a sudoku solver in Haskell.
Haskell
1
star