csvplus

Package csvplus extends the standard Go encoding/csv package with fluent interface, lazy stream processing operations, indices and joins.

The library is primarily designed for ETL-like processes. It is most useful where the advanced searching and joining capabilities of a fully-featured SQL database are not required, but at the same time the data transformations needed still include SQL-like operations.

License: BSD 3-Clause

Examples

Simple sequential processing:

people := csvplus.FromFile("people.csv").SelectColumns("name", "surname", "id")

err := csvplus.Take(people).
	Filter(csvplus.Like(csvplus.Row{"name": "Amelia"})).
	Map(func(row csvplus.Row) csvplus.Row { row["name"] = "Julia"; return row }).
	ToCsvFile("out.csv", "name", "surname")

if err != nil {
	return err
}

More involved example:

customers := csvplus.FromFile("people.csv").SelectColumns("id", "name", "surname")
custIndex, err := csvplus.Take(customers).UniqueIndexOn("id")

if err != nil {
	return err
}

products := csvplus.FromFile("stock.csv").SelectColumns("prod_id", "product", "price")
prodIndex, err := csvplus.Take(products).UniqueIndexOn("prod_id")

if err != nil {
	return err
}

orders := csvplus.FromFile("orders.csv").SelectColumns("cust_id", "prod_id", "qty", "ts")
iter := csvplus.Take(orders).Join(custIndex, "cust_id").Join(prodIndex)

return iter(func(row csvplus.Row) error {
	// prints lines like:
	//	John Doe bought 38 oranges for £0.03 each on 2016-09-14T08:48:22+01:00
	_, e := fmt.Printf("%s %s bought %s %ss for £%s each on %s\n",
		row["name"], row["surname"], row["qty"], row["product"], row["price"], row["ts"])
	return e
})

Design principles

The package functionality is built on operations over the following entities:

  • type Row
  • type DataSource
  • type Index

Type Row

Row represents one row from a DataSource. It is a map from column names to the string values under those columns on the current row. The package expects a unique name to be assigned to every column at source. Compared with integer indices, named columns are more convenient when complex transformations are applied to each row during processing.
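
As an illustration of the row-as-map idea, here is a minimal sketch built only on the standard encoding/csv package. The local Row type and the parseRows helper are assumptions made for this example; they are not the csvplus implementation, and the first record is assumed to hold the unique column names.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// Row maps a column name to the string value in that column.
// Local illustrative type, not the csvplus type itself.
type Row map[string]string

// parseRows (hypothetical helper) treats the first CSV record as the
// header and returns one Row per subsequent record.
func parseRows(data string) ([]Row, error) {
	reader := csv.NewReader(strings.NewReader(data))

	header, err := reader.Read()
	if err != nil {
		return nil, err
	}

	var rows []Row

	for {
		// By default csv.Reader rejects records whose field count
		// differs from the first record, so record[i] is always valid.
		record, err := reader.Read()
		if err == io.EOF {
			return rows, nil
		}
		if err != nil {
			return nil, err
		}

		row := make(Row, len(header))
		for i, name := range header {
			row[name] = record[i]
		}
		rows = append(rows, row)
	}
}

func main() {
	rows, err := parseRows("id,name,surname\n1,Amelia,Pond\n2,John,Doe\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}

	// Values are addressed by column name rather than by position.
	for _, row := range rows {
		fmt.Println(row["name"], row["surname"])
	}
}
```

Addressing values by name means a transformation keeps working when columns are reordered or dropped upstream, which is the convenience the paragraph above refers to.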

Type DataSource

Type DataSource represents any source of zero or more rows, such as a .csv file. It is a function that, when invoked, feeds the given callback with the data from its source, one Row at a time. The type also defines a number of operations that make it easy to compose processing steps on a DataSource, forming a so-called fluent interface. All these operations are 'lazy': they are not performed immediately; instead, each of them returns a new DataSource.

There are also a number of convenience operations that actually invoke the DataSource function to produce a specific type of output:

  • IndexOn to build an index on the specified column(s);
  • UniqueIndexOn to build a unique index on the specified column(s);
  • ToCsv to serialise the DataSource to the given io.Writer in .csv format;
  • ToCsvFile to store the DataSource in the specified file in .csv format;
  • ToJSON to serialise the DataSource to the given io.Writer in JSON format;
  • ToJSONFile to store the DataSource in the specified file in JSON format;
  • ToRows to convert the DataSource to a slice of Rows.
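
The split between lazy operations and operations that actually run the pipeline can be modelled in a few lines. The sketch below is a simplified stand-in written for this example, not the library's code: the local DataSource, RowFunc and fromRows definitions are assumptions, and the method names merely echo the ones used above.

```go
package main

import "fmt"

// Row maps a column name to its string value (local illustrative type).
type Row map[string]string

// RowFunc is the callback that receives rows one at a time.
type RowFunc func(Row) error

// DataSource is a function that feeds its rows to the given callback.
type DataSource func(RowFunc) error

// fromRows (hypothetical helper) makes a DataSource over an in-memory slice.
func fromRows(rows []Row) DataSource {
	return func(fn RowFunc) error {
		for _, row := range rows {
			// Pass a copy so later stages cannot modify the source data.
			cp := make(Row, len(row))
			for k, v := range row {
				cp[k] = v
			}
			if err := fn(cp); err != nil {
				return err
			}
		}
		return nil
	}
}

// Filter is lazy: it only wraps src in a new DataSource.
func (src DataSource) Filter(pred func(Row) bool) DataSource {
	return func(fn RowFunc) error {
		return src(func(row Row) error {
			if pred(row) {
				return fn(row)
			}
			return nil
		})
	}
}

// Map is lazy as well: the mapping runs only when the result is invoked.
func (src DataSource) Map(mf func(Row) Row) DataSource {
	return func(fn RowFunc) error {
		return src(func(row Row) error { return fn(mf(row)) })
	}
}

// ToRows invokes the source and collects every row it produces.
func (src DataSource) ToRows() ([]Row, error) {
	var out []Row
	err := src(func(row Row) error {
		out = append(out, row)
		return nil
	})
	return out, err
}

func main() {
	people := fromRows([]Row{
		{"name": "Amelia", "surname": "Pond"},
		{"name": "John", "surname": "Doe"},
	})

	// Building the pipeline reads nothing yet.
	pipeline := people.
		Filter(func(r Row) bool { return r["name"] == "Amelia" }).
		Map(func(r Row) Row { r["name"] = "Julia"; return r })

	// Only this call pulls rows through the Filter and Map stages.
	rows, err := pipeline.ToRows()
	fmt.Println(rows, err)
}
```

Because every stage is just a closure around the previous one, rows flow through the whole chain one at a time and no intermediate collection is built until an operation such as ToRows asks for one.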

Type Index

Index is a sorted collection of rows. The sorting is performed on the columns specified when the index is created. Iteration over an index yields a sorted sequence of rows. An Index can be joined with a DataSource. The type has operations for finding rows and creating sub-indices in O(log(n)) time. Another useful operation is resolving duplicates. Building an index takes O(n*log(n)) time. Note that building an index requires the entire dataset to be read into memory, so care should be taken when indexing huge datasets. An index can also be stored to, or loaded from, a disk file.
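
The complexity figures above follow from keeping the rows sorted and searching them with binary search. The following sketch shows that technique for a single column using the standard sort package; the index type and the buildIndex and find names are hypothetical and say nothing about how csvplus stores its indices internally.

```go
package main

import (
	"fmt"
	"sort"
)

// Row maps a column name to its string value (local illustrative type).
type Row map[string]string

// index keeps rows sorted by one column (hypothetical type).
type index struct {
	column string
	rows   []Row
}

// buildIndex copies the rows and sorts the copy: O(n*log(n)) time,
// with the whole dataset held in memory.
func buildIndex(rows []Row, column string) *index {
	sorted := make([]Row, len(rows))
	copy(sorted, rows)

	// Stable sort keeps rows with equal keys in their original order.
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i][column] < sorted[j][column]
	})

	return &index{column: column, rows: sorted}
}

// find returns all rows whose indexed column equals key.
// Two binary searches locate the bounds of the matching run: O(log(n)).
func (ix *index) find(key string) []Row {
	lo := sort.Search(len(ix.rows), func(i int) bool {
		return ix.rows[i][ix.column] >= key
	})
	hi := sort.Search(len(ix.rows), func(i int) bool {
		return ix.rows[i][ix.column] > key
	})
	return ix.rows[lo:hi]
}

func main() {
	ix := buildIndex([]Row{
		{"id": "3", "name": "Carol"},
		{"id": "1", "name": "Amelia"},
		{"id": "2", "name": "John"},
	}, "id")

	for _, row := range ix.find("2") {
		fmt.Println(row["id"], row["name"])
	}
}
```

Keys are compared as strings here, so "10" sorts before "9". A key with more than one matching row is what a duplicate looks like in this model: find simply returns a longer slice.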

For more details see the documentation.

Project status

The project is in a usable state usually called "beta". Tested on Linux Mint 18.3 using Go version 1.10.2.
