• Stars
    star
    127
  • Rank 281,177 (Top 6 %)
  • Language
    Python
  • License
    MIT License
  • Created over 15 years ago
  • Updated about 5 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Utilities for processing tab-separated files

tsvutils -- utilities for processing tab-separated files

tsvutils are scripts that can convert and manipulate tabular data in the TSV file format: tab-separated values, sometimes with a header. They build on top of standard Unix utilities to allow ad-hoc, efficient, and reliable processing and summarization of tabular data from the shell.

Overview of scripts

Convert into tsv:

  • csv2tsv - convert from Excel-compatible csv.
  • json2tsv - convert from concatenated JSON records.
  • xlsx2tsv - convert from Excel's .xlsx format.
  • others: eq2tsv ssv2tsv uniq2tsv yaml2tsv ...

Manipulate tsv:

  • tsvawk - gives you column names in your awk.
  • hwrap - wraps pipeline process but preserves stdin's header.
  • tsvcat - concatenate tsv's, aligning common columns.
  • namecut - like 'cut' but with header names.
  • tabsort, tabawk - wrappers for tab delimitation.

Convert out of tsv:

  • tsv2csv - convert tsv to Excel-compatible csv.
  • tsv2my - load tsv into a new MySQL table.
  • tsv2fmt - format as ASCII-art table.
  • tsv2html - format as HTML table.
  • others: tsv2yaml tsv2tex ...

By "tsv" we mean honest-to-goodness tab-separated values, often with a header. No quoting, escaping, or comments. All rows should have the same number of fields. Rows end with a unix \n newline. Cell values cannot have tabs or newlines.

These conditions are all enforced in scripts that output tsv. For programs that take tsv input, if these assumptions do not hold, the script's behavior is undefined.

TSV is an easy format for other programs to handle:

  • After stripping the newline, split("\t") correctly parses a row.
  • To strip out the header beforehand: "tail -n+2" or "tail +2".

Weak naming convention: programs that don't work well with headers call that format "tab"; ones that either need a header or are agnostic call that "tsv". E.g., for tabsort you don't want to sort the header, but tsv2my is impossible without the header. csv2tsv and tsv2csv are agnostic, since a csv file may or may not have a header.

Examples: pipelines

The TSV format is intended to work with many other pipeline-friendly programs. Examples include:

  • General
    • cat, head, tail, tail -n+X, cut, merge, diff, comm, sort, uniq, uniq -c, wc -l
  • Multipurpose
    • perl -pe, ruby -ne, awk, sed, tr
  • SQL to TSV
    • echo 'select a,b from bla' | mysql
    • echo a b | ssv2tsv; echo "select a,b from bla" | sqlite3 -separator $(echo -e '\t')
    • echo a b | ssv2tsv; echo "select a,b from bla" | psql -tqAF $(echo -e '\t')
  • GUI to TSV
    • Excel: copy-and-paste cells <-> text as tsv (though kills double quotes)
    • Web browsers: copy rendered HTML table -> text as tsv
  • Misc

The tsvutils scripts' comments include further examples.

Examples: Named columns in programs

Here are examples of parsing TSV-with headers in several script-y languages, such that you get to refer to columns by their names, instead of positions. This makes the scripts much more maintainable.

Python has a built-in facility for TSV-with-headers:

tsv_reader = lambda f: csv.DictReader(f, dialect=None, delimiter='\t', quoting=csv.QUOTE_NONE)
for record in tsv_reader(sys.stdin):
  print record  # => hash of key/values

Or equivalently:

cols = sys.stdin.readline()[:-1]
for line in sys.stdin:
  vals = line[:-1].split("\t")
  record = dict((cols[j],vals[j]) for j in range(len(cols)))
  print record  # => hash of key/values

Ruby:

cols = STDIN.readline.chomp.split("\t")
STDIN.each do |line|
  vals = line.chomp.split("\t")
  record = (0...cols.size).map {|j| [cols[j], vals[j]]}.to_h
  p record  # => hash of key/values
end

R has a built-in facility:

data = read.delim("data.tsv", sep="\t")

Installation

It's probably useful to look at or tweak these scripts, so you're best off just putting the entire directory on your PATH.

The philosophy of tsvutils

There are many data processing and analysis situations where data consists of tables. A "table" is a list of flat records each with the same set of named attributes, where it's easy to manipulate a particular attribute across all records -- a "column". The main data structures in SQL, R, and Excel are tables.

TSV-with-headers sits in a sweet spot on the spectrum of data format complexity.

  • A more complex alternative is to encode in arbitrarily nested structures (XML, JSON). These have greater representational capacity, but are less convenient for data analysis. Since they can have high structural complexity, it's often error-prone to use them -- ad-hoc querying is generally difficult. Furthermore, it's wasteful of space to repeat key names over and over if all records are known to have the same set of keys. Finally, when doing data analysis, especially statistical analysis, you want to turn columns into vectors, which presupposes a flatter, more table-like structure.
  • A simpler alternative is a table with positional, un-named columns. The main weakness is that for more than several columns, it's hard to remember which column is which. Named columns improve maintainability.

But: SQL databases and Excel spreadsheets are often inconvenient data management environments compared to the filesystem on the unix commandline. Unfortunately, the most common file format for tables is CSV, which is complex and has several incompatible versions. It plays only moderately nicely with the unix commandline, which is the best ad-hoc processing tool for filesystem data. Often the only correct way to handle CSV is to use a parsing library, but it's inconvenient to fire up a python/perl/ruby session just to do simple sanity checks and casually inspect data.

To balance these needs, so far I've found that TSV-with-headers is the most convenient canonical format for table data in the filesystem/commandline environment, or at least the lingua franca in shell pipelines. These utilities are just a little bit of glue to make TSV play nicely with CSV, Excel, MySQL, and Unix tools. Interfaces in and out of other table-centric environments could easily be added.

On the philosophy of having NO escaping or special data value conventions: If you want those things in your data, make up your own convention (like backslash escaping, URL escaping, or whatever) and have your application be aware of it. Our philosophy is, a data processing utility should ignore that stuff in order to have safe and predictable behavior. I've seen too many bugs because some intermediate program imposed a special meaning on "NA" or "\N" or "NULL", etc., when really a program further downstream should have had sole responsibility for this interpretation.

In Conclusion

Hope you enjoy tsvutils!

More Repositories

1

ark-tweet-nlp

CMU ARK Twitter Part-of-Speech Tagger
Java
575
star
2

tweetmotif

Topical search for Twitter. See twokenize.py, emoticons.py for tokenization.
Python
162
star
3

stanford_corenlp_pywrapper

Java
151
star
4

awkspeed

Speed testing for a data munging task
C++
44
star
5

arkref

http://www.ark.cs.cmu.edu/ARKref/
Java
32
star
6

scalacheat

cheat sheet for scala syntax
Shell
32
star
7

parseviz

Visualize constituent and dependency parses as PDF or image formats, through GraphViz.
Python
31
star
8

OConnor_IREvents_ACL2013

Replication software, data, and supplementary materials for the paper: O'Connor, Stewart and Smith, ACL-2013, "Learning to Extract International Relations from Political Context"
C++
26
star
9

mte

MiTextExplorer - interactive browser of text and document covariates.
Java
24
star
10

myutil

Java
23
star
11

dlanalysis

a bunch of R code for various statistical analyses
R
21
star
12

conplot

Console ascii art plotter - quick-and-dirty data visualization, e.g. for log statistics
Python
18
star
13

running_stat

Running variance / standard deviation calculation (C++ and Python)
Python
14
star
14

cmdutils

Some command-line utilities, mostly for data manipulation and inspection.
Python
13
star
15

muc4_proc

preprocessing of the MUC4 dataset
Python
11
star
16

bow

A patched version of bow & rainbow 20020213 that compiles with modern gcc 4.0.1, OSX 10.5
C
11
star
17

twitter_geo_preproc

A preprocessing script to get geo-coded tweets from the Streaming API
Python
9
star
18

gfl_syntax

Graph Fragment Language for Easy Syntactic Annotation
Python
8
star
19

nlp_jobs

research code from rion and brendan when writing snow, o'connor, jurafsky, ng EMNLP-2008 "cheap and fast, but is it good?"
Ruby
6
star
20

stanfordnlp-util

java utilities for stanford nlp
Java
5
star
21

gigaword_conversion

Python
3
star
22

glmnet_starter

Starter code for the glmnet package (elastic net regressions)
R
2
star
23

slmunge

Scripts to munge certain machine learning sparse data formats, including SVMLight/LibSVM
Python
2
star
24

twitter_geo_viz

REALLY HALFBAKED DO NOT USE YOU MAY CRASH OUR SERVER
JavaScript
2
star
25

namefreedom

data and analysis of country names versus democratic freedoms
2
star
26

viewdb

HTML report of an SQL DB's schema and data
Python
1
star
27

super_tuesday_2020

analysis of Super Tuesday exit poll data
HTML
1
star
28

flex-for-morpha

Patched version of GNU Flex 2.5.35 to compile "morpha"
C
1
star
29

beta_explorer

1
star
30

flightstats

Python
1
star
31

randomsearch

web app to randomly choose which search engine to use per query
Python
1
star