Randomly sample lines from a csv, tsv, or other line-based data file

Subsample

subsample is a command-line tool for sampling data from a large, newline-separated dataset (typically a CSV-like file).

Installation

subsample is distributed with pip. Once you've installed pip, simply run:

> pip install subsample

and subsample will be installed into your Python environment.

Usage

subsample requires one argument, the input file. If the input file is -, data will be read from standard input (in this case, only the reservoir and approximate algorithms can be used).

Simple Example

To take a sample of size 1000 from the file big_data.csv, run subsample as follows:

> subsample -n 1000 big_data.csv

This will print 1000 random lines from the file to the terminal.

File Redirection

Usually we want to save the sample to another file instead. subsample doesn't have file output built-in; instead it relies on the output redirection features of your terminal. To save to big_data_sample.csv, run the following command:

> subsample -n 1000 big_data.csv > big_data_sample.csv

Sampling from STDIN

To use standard input as the source, use - as the filename, e.g.:

> subsample -n 1000 - < big_data.csv > big_data_sample.csv

Note that only reservoir and approximate sampling support stdin. Two-pass sampling must read the input twice, so it requires a seekable input file.

Header Rows

CSV files often have a header row with the column names. You can pass the -r flag to subsample to preserve the header row:

> subsample -n 1000 big_data.csv -r > big_data_sample.csv

Rarely, you may need to sample from a file with a header spanning multiple rows. The -r argument takes an optional number of rows to preserve as a header:

> subsample -n 1000 -r 3 data_with_header.csv > sample_with_header.csv

Note that if the -r argument is directly before the input filename, it must have an argument or else it will try to interpret the input filename as the number of header rows and fail. Putting the -r argument after the input filename will avoid this.
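
To illustrate what preserving a header means, here is a minimal Python sketch. It is not subsample's actual code; the function name sample_with_header and its parameters are hypothetical. It sets aside the first header_rows lines, samples only from the remaining lines, and returns the header first.

```python
import itertools
import random


def sample_with_header(lines, sample_size, header_rows=1):
    """Return the header rows followed by a random sample of the data rows.

    Hypothetical helper for illustration only; it loads the data rows
    into memory, which a real tool for large files would avoid.
    """
    it = iter(lines)
    # Peel off the header so it can never be chosen as a sample row.
    header = list(itertools.islice(it, header_rows))
    body = list(it)
    # Never ask for more rows than the data actually contains.
    k = min(sample_size, len(body))
    return header + random.sample(body, k)
```

With header_rows=3, the first three lines would be kept verbatim, mirroring the -r 3 example above.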

Random Seed

The output of subsample is random and depends on the computer's random state. Sometimes you may want to take a sample in a way that can be reproduced. You can pass a random seed to subsample with the -s flag to accomplish this:

> subsample -s 45906345 data_file.csv > reproducible_sample.csv
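
The effect of a seed can be sketched in Python with the standard random module. This is an illustration of the general idea, not subsample's implementation, and seeded_sample is a hypothetical name. Two generators created with the same seed produce the same sequence of choices, so the same input yields the same sample.

```python
import random


def seeded_sample(lines, sample_size, seed):
    """Draw a sample using a private generator seeded with `seed`.

    Hypothetical helper for illustration; using random.Random(seed)
    rather than random.seed() leaves the global random state untouched.
    """
    rng = random.Random(seed)
    return rng.sample(list(lines), sample_size)


first = seeded_sample(range(1000), 20, seed=45906345)
second = seeded_sample(range(1000), 20, seed=45906345)
# Same seed and same input give an identical sample.
assert first == second
```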

Sampling Algorithms

Algorithm Comparison

subsample implements three sampling algorithms, each with its own strengths and weaknesses.

                         Reservoir        Approximate      Two-pass
flag                     -res             -app             -tp
stdin-compatible         yes              yes              no
space complexity         O(ss*rs)         O(1)             O(1)
fixed sample size        compatible       not compatible   compatible
fractional sample size   not compatible   compatible       compatible
sample order             random           source           source

For space complexity, ss is the number of records in the sample and rs is the maximum size of a record.

Sample order is the order of the records returned. Only reservoir sampling gives results in random order; approximate and two-pass return results in the same order as the source data.

Reservoir Sampling

Reservoir sampling (see "Random Sampling with a Reservoir", Vitter 1985) is a method of sampling from a stream of unknown size where the sample size is fixed in advance. It is a one-pass algorithm and uses space proportional to the amount of data in the sample.

Reservoir sampling is the default algorithm used by subsample. For consistency, it can also be invoked with the argument --reservoir.

When using reservoir sampling, the sample size must be fixed rather than fractional.

Example:

> subsample --reservoir -n 1000 big_data.csv > sample_data.csv
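
The idea can be sketched in a few lines of Python. This is the simplest reservoir variant (often called Algorithm R), shown under the assumption that it conveys the technique; it is not necessarily the variant subsample uses, and reservoir_sample is a hypothetical name.

```python
import random


def reservoir_sample(stream, sample_size, seed=None):
    """Return up to `sample_size` records chosen uniformly from `stream`.

    Illustrative sketch only. Memory use is proportional to the sample,
    and the stream is read exactly once, so its length need not be known.
    """
    rng = random.Random(seed)
    reservoir = []
    for index, record in enumerate(stream):
        if index < sample_size:
            # Fill the reservoir with the first `sample_size` records.
            reservoir.append(record)
        else:
            # Pick a slot in [0, index]; the record is kept only if the
            # slot falls inside the reservoir, which happens with
            # probability sample_size / (index + 1).
            slot = rng.randint(0, index)
            if slot < sample_size:
                reservoir[slot] = record
    return reservoir
```

Because the function only iterates over stream, it works equally well on a file object or on sys.stdin, which is why this approach is compatible with standard input.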

Approximate Sampling

Approximate sampling simply includes each row in the sample with a probability given as the sample proportion. It is a stateless algorithm with minimal space requirements. Samples will have on average a size of fraction * population_size, but it will vary between each invocation. Because of this, approximate sampling is only useful when the sample size does not have to be exact (hence the name).

Example:

> subsample --approximate -f 0.15 my_data.csv > my_sample.csv

Equivalently, supply a percentage instead of a fraction by switching the -f to a -p:

> subsample --approximate -p 15 my_data.csv > my_sample.csv
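
A minimal Python sketch of this approach follows. It is an illustration rather than subsample's code, and approximate_sample is a hypothetical name. Each record is kept or dropped by an independent random draw, so no state is carried between records and the output preserves source order.

```python
import random


def approximate_sample(stream, fraction, seed=None):
    """Yield each record of `stream` independently with probability `fraction`.

    Illustrative sketch only. The number of records yielded varies from
    run to run; on average it is fraction * population_size.
    """
    rng = random.Random(seed)
    for record in stream:
        # rng.random() is uniform on [0, 1), so this test succeeds
        # with probability `fraction`.
        if rng.random() < fraction:
            yield record
```

A percentage such as -p 15 corresponds to fraction = 15 / 100 in this sketch.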

Two-Pass Sampling

As the name implies, two-pass sampling uses two passes: the first counts the number of records (i.e. the population size) and the second emits the records which are part of the sample. Because the input must be read twice, it is not compatible with stdin as an input.

Example:

> subsample --two-pass -n 1000 my_data.csv > my_sample.csv

Two-pass sampling also accepts the sample size as a fraction or percent:

> subsample --two-pass -p 15 my_data.csv > my_sample.csv
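
One way to realize two passes with constant extra memory is selection sampling (Knuth's Algorithm S), sketched below in Python. This is an assumption about how such a sampler could work, not a description of subsample's internals, and two_pass_sample is a hypothetical name.

```python
import random


def two_pass_sample(path, sample_size, seed=None):
    """Yield exactly min(sample_size, population) lines of the file at
    `path`, in source order.

    Illustrative sketch only. The file is opened twice, which is why a
    non-seekable stream such as stdin cannot be used.
    """
    rng = random.Random(seed)

    # Pass 1: count the records to learn the population size.
    with open(path) as f:
        remaining = sum(1 for _ in f)
    needed = min(sample_size, remaining)

    # Pass 2: keep each line with probability needed / remaining,
    # updating both counters as we go.
    with open(path) as f:
        for line in f:
            if needed == 0:
                break
            if rng.random() * remaining < needed:
                yield line
                needed -= 1
            remaining -= 1
```

For a fractional or percentage sample size, the count from the first pass can be used to convert the fraction into a fixed number of records before the second pass begins.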

Tests

A simple GNU Make-driven testing script is included. Run make test from subsample's base directory after installing to run some regression tests.

Due to the randomness inherent to random sampling, testing is limited to checking that the output is the same when the random seed is unchanged. This serves mainly to find new bugs introduced by changes in the future and does not imply that the code itself is correct (in the sense that the sample is truly random).
