  • Language
    Go
  • License
    MIT License
  • Created over 8 years ago
  • Updated about 1 year ago

A cross-platform, efficient and practical CSV/TSV toolkit in Golang

csvtk - a cross-platform, efficient and practical CSV/TSV toolkit

Introduction

Like the FASTA/Q formats in bioinformatics, CSV/TSV are basic and ubiquitous file formats in both bioinformatics and data science.

People usually use spreadsheet software like MS Excel to process tabular data. However, this is all done by clicking and typing, which is not automated and is time-consuming to repeat, especially when you want to apply similar operations to different datasets or for different purposes.

You can also accomplish some CSV/TSV manipulations using shell commands, but more code is needed to handle the header line. Shell commands do not support selecting columns with column names either.
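
To illustrate the extra header handling mentioned above, here is a minimal sketch of selecting a column by name with plain awk. The function name select_by_name and the sample data are made up for illustration, and the comma splitting is naive: it assumes no quoted fields that contain commas.

```shell
# select_by_name NAME: print the column whose header equals NAME.
# Assumption: fields are separated by plain commas, with no quoting.
select_by_name() {
    awk -F',' -v name="$1" '
        NR == 1 {
            # Locate the requested column in the header row.
            for (i = 1; i <= NF; i++) if ($i == name) col = i
            if (!col) { print "column not found: " name > "/dev/stderr"; exit 1 }
        }
        { print $col }
    '
}

# Illustrative data only.
printf 'id,first_name,username\n11,Rob,rob\n2,Ken,ken\n' | select_by_name username
```

With csvtk, a selection by name is written directly as csvtk cut -f username, with no header bookkeeping.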

csvtk is convenient for rapid data investigation and also easy to integrate into analysis pipelines. It could save you lots of time in (not) writing Python/R scripts.

Features

  • Cross-platform (Linux/Windows/Mac OS X/OpenBSD/FreeBSD)
  • Lightweight and out-of-the-box: no dependencies, no compilation, no configuration
  • Fast, with multi-CPU support (some commands)
  • Practical functions provided by 53 subcommands
  • Supports STDIN and gzipped input/output files, so it is easy to use in pipes
  • Most of the subcommands support unselecting fields and fuzzy fields, e.g. -f "-id,-name" for all fields except "id" and "name", -F -f "a.*" for all fields with prefix "a.".
  • Support some common plots (see usage)
  • Seamless support for data with a meta line declaring the separator (e.g., sep=,), as used by MS Excel
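
The fuzzy-field patterns in the list above look like shell globs. As a rough illustration, assuming that * matches any run of characters (an assumption about the semantics, not a description of how csvtk implements it), the following sketch shows which column names a pattern such as "a.*" would pick. The helper match_fields and the column names are hypothetical.

```shell
# match_fields PATTERN NAME...: print the names that match a glob-style PATTERN.
match_fields() {
    pattern=$1
    shift
    for name in "$@"; do
        # An unquoted $pattern in a case label is interpreted as a glob.
        case $name in
            $pattern) printf '%s\n' "$name" ;;
        esac
    done
}

# Hypothetical column names: only those with the prefix "a." are printed.
match_fields 'a.*' id a.x a.y name
```

Unselecting, as in -f "-id,-name", would be the complement: keep every name that is not listed.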

Subcommands

53 subcommands in total.

Information

  • headers: prints headers
  • dim: dimensions of CSV file
  • nrow: print number of records
  • ncol: print number of columns
  • summary: summary statistics of selected numeric or text fields (groupby group fields)
  • watch: online monitoring and histogram of selected field
  • corr: calculate Pearson correlation between numeric columns

Format conversion

  • pretty: converts CSV to a readable aligned table
  • csv2tab: converts CSV to tabular format
  • tab2csv: converts tabular format to CSV
  • space2tab: converts space delimited format to TSV
  • csv2md: converts CSV to markdown format
  • csv2rst: converts CSV to reStructuredText format
  • csv2json: converts CSV to JSON format
  • csv2xlsx: converts CSV/TSV files to XLSX file
  • xlsx2csv: converts XLSX to CSV format

Set operations

  • head: prints first N records
  • concat: concatenates CSV/TSV files by rows
  • sample: sampling by proportion
  • cut: select and arrange fields
  • grep: greps data by selected fields with patterns/regular expressions
  • uniq: unique data without sorting
  • freq: frequencies of selected fields
  • inter: intersection of multiple files
  • filter: filters rows by values of selected fields with arithmetic expression
  • filter2: filters rows by awk-like arithmetic/string expressions
  • join: join files by selected fields (inner, left and outer join)
  • split: splits CSV/TSV into multiple files according to column values
  • splitxlsx: splits XLSX sheet into multiple sheets according to column values
  • comb: compute combinations of items at every row

Edit

  • fix: fix CSV/TSV with different numbers of columns in rows
  • fix-quotes: fix malformed CSV/TSV caused by double-quotes
  • del-quotes: remove extra double-quotes added by fix-quotes
  • add-header: add column names
  • del-header: delete column names
  • rename: renames column names with new names
  • rename2: renames column names by regular expression
  • replace: replaces data of selected fields by regular expression
  • round: round float to n decimal places
  • mutate: creates new columns from selected fields by regular expression
  • mutate2: creates a new column from selected fields by awk-like arithmetic/string expressions
  • fmtdate: format date of selected fields

Transform

  • transpose: transposes CSV data
  • sep: separate column into multiple columns
  • gather: gather columns into key-value pairs, like tidyr::gather/pivot_longer
  • spread: spread a key-value pair across multiple columns, like tidyr::spread/pivot_wider
  • unfold: unfold multiple values in cells of a field
  • fold: fold multiple values of a field into cells of groups

Ordering

  • sort: sorts by selected fields

Plotting

Misc

  • cat: stream file and report progress
  • version: print version information and check for update
  • genautocomplete: generate shell autocompletion script (bash|zsh|fish|powershell)

Installation

Download Page

csvtk is implemented in the Go programming language. Executable binary files for most popular operating systems are freely available on the release page.

Method 1: Download binaries (latest stable/dev version)

Just download the compressed executable file for your operating system and decompress it with the tar -zxvf *.tar.gz command or other tools. Then:

  1. For Linux-like systems

    1. If you have root privileges, simply copy it to /usr/local/bin:

       sudo cp csvtk /usr/local/bin/
      
    2. Or copy to anywhere in the environment variable PATH:

       mkdir -p $HOME/bin/; cp csvtk $HOME/bin/
      
  2. For Windows, just copy csvtk.exe to C:\WINDOWS\system32.
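
Before relying on the second Linux option, it may help to confirm that the target directory is actually listed in PATH. A minimal sketch follows; the helper name on_path is introduced here for illustration and is not part of csvtk.

```shell
# on_path DIR: succeed if DIR is one of the colon-separated entries of PATH.
on_path() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

if on_path "$HOME/bin"; then
    echo "$HOME/bin is on PATH"
else
    # Usually fixed by adding this line to ~/.bashrc or ~/.zshrc.
    echo "add this to your shell profile: export PATH=\"\$HOME/bin:\$PATH\""
fi
```

Wrapping PATH in colons on both sides lets the first and last entries match the same pattern as the middle ones.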

Method 2: Install via conda (latest stable version)

conda install -c bioconda csvtk

Method 3: Install via homebrew

brew install csvtk

Method 4: For Go developer (latest stable/dev version)

go get -u github.com/shenwei356/csvtk/csvtk

Method 5: For ArchLinux AUR users (may not be the latest)

yaourt -S csvtk

Command-line completion

Bash:

# generate the completion script
csvtk genautocomplete --shell bash

# configure, if you have not done so before.
# install bash-completion if the "complete" command is not found.
echo "for bcfile in ~/.bash_completion.d/* ; do source \$bcfile; done" >> ~/.bash_completion
echo "source ~/.bash_completion" >> ~/.bashrc

Zsh:

# generate the completion script
csvtk genautocomplete --shell zsh --file ~/.zfunc/_csvtk

# configure, if you have not done so before
echo 'fpath=( ~/.zfunc "${fpath[@]}" )' >> ~/.zshrc
echo "autoload -U compinit; compinit" >> ~/.zshrc

fish:

csvtk genautocomplete --shell fish --file ~/.config/fish/completions/csvtk.fish

Compared to csvkit

Attention: this comparison with csvkit has not been updated for many years.

| Features              | csvtk | csvkit | Note                                           |
|:----------------------|:------|:-------|:-----------------------------------------------|
| Read Gzip             | Yes   | Yes    | read gzip files                                |
| Fields ranges         | Yes   | Yes    | e.g. -f 1-4,6                                  |
| Unselect fields       | Yes   | --     | e.g. -1 for excluding first column             |
| Fuzzy fields          | Yes   | --     | e.g. ab* for columns with name prefix "ab"     |
| Reorder fields        | Yes   | Yes    | it means -f 1,2 is different from -f 2,1       |
| Rename columns        | Yes   | --     | rename with new name(s) or from existing names |
| Sort by multiple keys | Yes   | Yes    | bash sort like operations                      |
| Sort by number        | Yes   | --     | e.g. -k 1:n                                    |
| Multiple sort         | Yes   | --     | e.g. -k 2:r -k 1:nr                            |
| Pretty output         | Yes   | Yes    | convert CSV to readable aligned table          |
| Unique data           | Yes   | --     | unique data of selected fields                 |
| Frequency             | Yes   | --     | frequencies of selected fields                 |
| Sampling              | Yes   | --     | sampling by proportion                         |
| Mutate fields         | Yes   | --     | create new columns from selected fields        |
| Replace               | Yes   | --     | replace data of selected fields                |

Similar tools:

  • csvkit - A suite of utilities for converting to and working with CSV, the king of tabular file formats. http://csvkit.rtfd.org/
  • xsv - A fast CSV toolkit written in Rust.
  • miller - Miller is like sed, awk, cut, join, and sort for name-indexed data such as CSV and tabular JSON http://johnkerl.org/miller
  • tsv-utils - Command line utilities for tab-separated value files written in the D programming language.

Examples

More examples and tutorial.

Attention

  1. By default, csvtk assumes input files have a header row; if not, switch on the flag -H.

  2. By default, csvtk handles CSV files; use the flag -t for tab-delimited files.

  3. Column names should be unique.

  4. By default, lines starting with # are ignored. If the header row starts with #, please assign another rare symbol to the flag -C, e.g., $.

  5. Do not mix field (column) numbers and names when specifying the columns to operate on.

  6. The CSV parser requires all lines to have the same number of fields/columns. Even lines with only spaces will cause an error. Use -I/--ignore-illegal-row to skip these lines if necessary. You can also use "csvtk fix" to fix files with different numbers of columns in rows.

  7. If double-quotes exist in fields not enclosed with double-quotes, e.g.,

     x,a "b" c,1
    

    It would report an error:

     bare `"` in non-quoted-field.
    

    Please switch on the flag -l or use csvtk fix-quotes to fix it.

  8. If some fields have only a double-quote, either at the beginning or at the end, e.g.,

     x,d "e","a" b c,1
    

    It would report an error:

     extraneous or missing " in quoted-field
    

    Please use csvtk fix-quotes to fix it, and use csvtk del-quotes to reset to the original format as needed.
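
For item 6, a quick way to see which rows would trip the parser is to count fields per line before running csvtk. The following is a minimal sketch with plain awk; the function name report_ragged_rows and the sample data are made up, and the count is naive: it assumes no quoted fields that contain commas.

```shell
# report_ragged_rows: print "LINE:COUNT" for every row whose field count
# differs from that of the first row.
# Assumption: fields are separated by plain commas, with no quoting.
report_ragged_rows() {
    awk -F',' '
        NR == 1 { expected = NF; next }
        NF != expected { printf "%d:%d\n", NR, NF }
    '
}

# Illustrative data: line 3 is short one field, line 4 has one extra.
printf 'id,name,age\n1,Rob,60\n2,Ken\n3,Bob,50,extra\n' | report_ragged_rows
```

Rows reported this way are candidates for -I/--ignore-illegal-row or for repair with csvtk fix.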

Examples

  1. Pretty result

     $ csvtk pretty names.csv
     id   first_name   last_name   username
     --   ----------   ---------   --------
     11   Rob          Pike        rob
     2    Ken          Thompson    ken
     4    Robert       Griesemer   gri
     1    Robert       Thompson    abc
     NA   Robert       Abel        123
    
     $ csvtk pretty names.csv -S 3line
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
      id   first_name   last_name   username
     ----------------------------------------
      11   Rob          Pike        rob
      2    Ken          Thompson    ken
      4    Robert       Griesemer   gri
      1    Robert       Thompson    abc
      NA   Robert       Abel        123
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
     $ csvtk pretty names.csv -S bold -w 5 -m 1-
     ┏━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓
     ┃  id   ┃ first_name ┃ last_name ┃ username ┃
     ┣━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━┫
     ┃  11   ┃    Rob     ┃   Pike    ┃   rob    ┃
     ┣━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━┫
     ┃   2   ┃    Ken     ┃ Thompson  ┃   ken    ┃
     ┣━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━┫
     ┃   4   ┃   Robert   ┃ Griesemer ┃   gri    ┃
     ┣━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━┫
     ┃   1   ┃   Robert   ┃ Thompson  ┃   abc    ┃
     ┣━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━┫
     ┃  NA   ┃   Robert   ┃   Abel    ┃   123    ┃
     ┗━━━━━━━┻━━━━━━━━━━━━┻━━━━━━━━━━━┻━━━━━━━━━━┛
    
  2. Summary of selected numeric fields, supporting "group-by"

     $ cat testdata/digitals2.csv \
         | csvtk summary -i -f f4:sum,f5:sum -g f1,f2 \
         | csvtk pretty
     f1    f2     f4:sum   f5:sum
     bar   xyz    7.00     106.00
     bar   xyz2   4.00     4.00
     foo   bar    6.00     3.00
     foo   bar2   4.50     5.00
    
  3. Select fields/columns (cut)

    • By index: csvtk cut -f 1,2
    • By names: csvtk cut -f first_name,username
    • Unselect: csvtk cut -f -1,-2 or csvtk cut -f -first_name
    • Fuzzy fields: csvtk cut -F -f "*_name,username"
    • Field ranges: csvtk cut -f 2-4 for column 2,3,4 or csvtk cut -f -3--1 for discarding column 1,2,3
    • All fields: csvtk cut -f 1- or csvtk cut -F -f "*"
  4. Search by selected fields (grep) (matched parts will be highlighted in red)

    • By exact matching: csvtk grep -f first_name -p Robert -p Rob
    • By regular expression: csvtk grep -f first_name -r -p Rob
    • By pattern list: csvtk grep -f first_name -P name_list.txt
    • Remove rows containing missing data (NA): csvtk grep -F -f "*" -r -p "^$" -v
  5. Rename column names (rename and rename2)

    • Setting new names: csvtk rename -f A,B -n a,b or csvtk rename -f 1-3 -n a,b,c
    • Replacing with original names by regular expression: csvtk rename2 -f 1- -p "(.*)" -r 'prefix_$1' for adding a prefix to all column names.
  6. Edit data with regular expression (replace)

    • Remove Chinese characters: csvtk replace -F -f "*_name" -p "\p{Han}+" -r ""
  7. Create new column from selected fields by regular expression (mutate)

    • By default, copy a column: csvtk mutate -f id
    • Extract prefix of data as group name (get "A" from "A.1" as group name): csvtk mutate -f sample -n group -p "^(.+?)\." --after sample
  8. Sort by multiple keys (sort)

    • By single column: csvtk sort -k 1 or csvtk sort -k last_name
    • By multiple columns: csvtk sort -k 1,2 or csvtk sort -k 1 -k 2 or csvtk sort -k last_name,age
    • Sort by number: csvtk sort -k 1:n or csvtk sort -k 1:nr for reverse number
    • Complex sort: csvtk sort -k region -k age:n -k id:nr
    • In natural order: csvtk sort -k chr:N
  9. Join multiple files by keys (join)

    • All files have same key column: csvtk join -f id file1.csv file2.csv
    • Files have different key columns: csvtk join -f "username;username;name" names.csv phone.csv adress.csv -k
  10. Filter by numbers (filter)

    • Single field: csvtk filter -f "id>0"
    • Multiple fields: csvtk filter -f "1-3>0"
    • Using --any to print a record if any of the fields satisfies the condition: csvtk filter -f "1-3>0" --any
    • Fuzzy fields: csvtk filter -F -f "A*!=0"
  11. Filter rows by awk-like arithmetic/string expressions (filter2)

    • Using field index: csvtk filter2 -f '$3>0'
    • Using column names: csvtk filter2 -f '$id > 0'
    • Both arithmetic and string expressions: csvtk filter2 -f '$id > 3 || $username=="ken"'
    • More complicated: csvtk filter2 -H -t -f '$1 > 2 && $2 % 2 == 0'
  12. Plotting

    • plot histogram with data of the second column:

        csvtk -t plot hist testdata/grouped_data.tsv.gz -f 2 | display
      

      histogram.png

    • plot boxplot with data of the "GC Content" (third) column, group information is the "Group" column.

        csvtk -t plot box testdata/grouped_data.tsv.gz -g "Group" \
            -f "GC Content" --width 3 | display
      

      boxplot.png

    • plot horizontal boxplot with data of the "Length" (second) column, group information is the "Group" column.

       csvtk -t plot box testdata/grouped_data.tsv.gz -g "Group" -f "Length"  \
           --height 3 --width 5 --horiz --title "Horiz box plot" | display
      

    boxplot2.png

    • plot line plot with X-Y data

        csvtk -t plot line testdata/xy.tsv -x X -y Y -g Group | display
      

      lineplot.png

    • plot scatter plot with X-Y data

        csvtk -t plot line testdata/xy.tsv -x X -y Y -g Group --scatter | display
      

      scatter.png
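
Since filter2 (item 11) describes its expressions as awk-like, it can help to compare it with plain awk. The sketch below mirrors the expression $id > 3 || $username=="ken" on sample rows taken from the names.csv table shown in item 1. It is an illustration only: the function name filter_rows is made up, fields are split naively on commas, and columns are addressed by position because plain awk does not know column names.

```shell
# filter_rows: keep the header plus rows where id > 3 or username is "ken".
# Assumption: id is column 1 and username is column 4, with no quoted fields.
filter_rows() {
    awk -F',' 'NR == 1 || $1 > 3 || $4 == "ken"'
}

# Sample rows copied from the names.csv listing (the NA row is left out).
printf '%s\n' \
    'id,first_name,last_name,username' \
    '11,Rob,Pike,rob' \
    '2,Ken,Thompson,ken' \
    '4,Robert,Griesemer,gri' \
    '1,Robert,Thompson,abc' | filter_rows
```

The NR == 1 clause is the manual header handling that csvtk takes care of; in filter2 the same condition can be written with column names.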

Acknowledgements

We are grateful to Zhiluo Deng and Li Peng for suggesting features and reporting bugs.

Thanks to Albert Vilella for feature suggestions, which make csvtk feature-rich.

Contact

Create an issue to report bugs, propose new functions or ask for help.

Or leave a comment.

License

MIT License
