This repository was archived on 19 April 2020.

[DEPRECATED] See https://github.com/p-ranav/csv2

[DEPRECATED APRIL 2020]

This library is now deprecated. Check out the second implementation of this library here: https://github.com/p-ranav/csv2.

Reading CSV files

Simply include reader.hpp and you're good to go.

#include <csv/reader.hpp>

To start parsing CSV files, create a csv::Reader object and call .read(filename).

csv::Reader foo;
foo.read("test.csv");

This .read method is non-blocking. The reader spawns multiple threads to tokenize the file stream and build a "list of dictionaries". While the reader is doing its thing, you can start post-processing the rows it has parsed so far using this iterator pattern:

while(foo.busy()) {
  if (foo.ready()) {
    auto row = foo.next_row();  // Each row is a csv::unordered_flat_map (github.com/martinus/robin-hood-hashing)
    auto foo = row["foo"];      // You can use it just like an std::unordered_map
    auto bar = row["bar"];
    // do something
  }
}
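
To make the polling pattern above concrete, here is a minimal sketch of how a busy()/ready()/next_row() interface could be built on top of a worker thread and a locked queue. RowQueue and drain are hypothetical names, and this is not the library's implementation: rows are plain strings and "parsing" is just copying each line into the queue.

```cpp
#include <atomic>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a non-blocking reader (not the library's code).
class RowQueue {
public:
  // Starts a worker thread that queues the given lines one at a time.
  explicit RowQueue(std::vector<std::string> lines)
      : worker_([this, lines = std::move(lines)] {
          for (const auto& line : lines) {
            std::lock_guard<std::mutex> lock(mutex_);
            rows_.push_back(line);  // pretend each line is a parsed row
          }
          done_ = true;  // set only after the last row is queued
        }) {}

  ~RowQueue() { worker_.join(); }

  // True while the worker is still producing or rows remain unread.
  bool busy() { return !done_ || ready(); }

  // True when at least one row can be taken without waiting.
  bool ready() {
    std::lock_guard<std::mutex> lock(mutex_);
    return !rows_.empty();
  }

  // Call only after ready() returned true (single consumer assumed).
  std::string next_row() {
    std::lock_guard<std::mutex> lock(mutex_);
    std::string row = std::move(rows_.front());
    rows_.pop_front();
    return row;
  }

private:
  std::mutex mutex_;
  std::deque<std::string> rows_;
  std::atomic<bool> done_{false};
  std::thread worker_;  // declared last so the other members exist first
};

// Same loop shape as the reader example: poll until nothing is left.
std::vector<std::string> drain(RowQueue& queue) {
  std::vector<std::string> rows;
  while (queue.busy()) {
    if (queue.ready()) rows.push_back(queue.next_row());
  }
  return rows;
}
```

The loop spins while it waits, which keeps the sketch short; a condition variable would avoid burning CPU if that matters for your workload.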

If instead you'd like to wait for all the rows to get processed, you can call .rows(), which is a convenience method that executes the above while loop:

auto rows = foo.rows();           // blocks until the CSV is fully processed
for (auto& row : rows) {          // Example: [{"foo": "1", "bar": "2"}, {"foo": "3", "bar": "4"}, ...] 
  auto foo = row["foo"];
  // do something
}

Dialects

This csv library comes with three standard dialects:

| Name | Description |
| --- | --- |
| excel | The excel dialect defines the usual properties of an Excel-generated CSV file |
| excel_tab | The excel_tab dialect defines the usual properties of an Excel-generated TAB-delimited file |
| unix | The unix dialect defines the usual properties of a CSV file generated on UNIX systems, i.e. using '\n' as line terminator and quoting all fields |

Configuring Custom Dialects

Custom dialects can be constructed with .configure_dialect(...):

csv::Reader csv;
csv.configure_dialect("my fancy dialect")
  .delimiter("")
  .quote_character('"')
  .double_quote(true)
  .skip_initial_space(false)
  .trim_characters(' ', '\t')
  .ignore_columns("foo", "bar")
  .header(true)
  .skip_empty_rows(true);

csv.read("foo.csv");
for (auto& row : csv.rows()) {
  // do something
}

| Property | Data Type | Description |
| --- | --- | --- |
| delimiter | std::string | specifies the character sequence which should separate fields (aka columns). Default = "," |
| quote_character | char | specifies a one-character string to use as the quoting character. Default = '"' |
| double_quote | bool | controls the handling of quotes inside fields. If true, two consecutive quotes should be interpreted as one. Default = true |
| skip_initial_space | bool | specifies how to interpret whitespace which immediately follows a delimiter; if false, it means that whitespace immediately after a delimiter should be treated as part of the following field. Default = false |
| trim_characters | std::vector<char> | specifies the list of characters to trim from every value in the CSV. Default = {} - nothing trimmed |
| ignore_columns | std::vector<std::string> | specifies the list of columns to ignore. These columns will be stripped during the parsing process. Default = {} - no column ignored |
| header | bool | indicates whether the file includes a header row. If true the first row in the file is a header row, not data. Default = true |
| column_names | std::vector<std::string> | specifies the list of column names. This is useful when the first row of the CSV isn't a header. Default = {} |
| skip_empty_rows | bool | specifies how empty rows should be interpreted. If this is set to true, empty rows are skipped. Default = false |
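
The quote_character and double_quote properties are easier to follow with a worked example. The sketch below is not the library's tokenizer; split_quoted is a hypothetical helper that uses a single-character delimiter for brevity and shows how a doubled quote inside a quoted field collapses to one literal quote.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits one line into fields on a single-character delimiter, honoring a
// quote character. With double_quote enabled, two consecutive quotes inside
// a quoted field produce one literal quote.
std::vector<std::string> split_quoted(const std::string& line,
                                      char delimiter = ',',
                                      char quote = '"',
                                      bool double_quote = true) {
  std::vector<std::string> fields;
  std::string current;
  bool in_quotes = false;
  for (std::size_t i = 0; i < line.size(); ++i) {
    const char c = line[i];
    if (c == quote) {
      if (in_quotes && double_quote && i + 1 < line.size() &&
          line[i + 1] == quote) {
        current += quote;  // escaped quote: keep one, skip the other
        ++i;
      } else {
        in_quotes = !in_quotes;  // opening or closing quote
      }
    } else if (c == delimiter && !in_quotes) {
      fields.push_back(current);  // delimiter outside quotes ends the field
      current.clear();
    } else {
      current += c;
    }
  }
  fields.push_back(current);  // last field has no trailing delimiter
  return fields;
}
```

Given the line a,"b,c","say ""hi""" this yields three fields: a, then b,c, then say "hi".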

The line terminator is '\n' by default. I use std::getline and handle stripping out '\r' from line endings. So, for now, this is not configurable in custom dialects.
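
As a rough illustration of that approach (a sketch under the assumption that only a single trailing '\r' needs removing, not the library's actual code), reading lines with std::getline and dropping the carriage return could look like this. read_lines is a hypothetical name.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads all lines from a stream, splitting on '\n' and removing one
// trailing '\r' so that Windows-style "\r\n" endings are handled too.
std::vector<std::string> read_lines(std::istream& in) {
  std::vector<std::string> lines;
  std::string line;
  while (std::getline(in, line)) {
    if (!line.empty() && line.back() == '\r') {
      line.pop_back();
    }
    lines.push_back(line);
  }
  return lines;
}
```

Any std::istream works here, for example a std::ifstream for a file or a std::istringstream for an in-memory string.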

Multi-character Delimiters

Consider this strange, messed up log file:

[Thread ID] :: [Log Level] :: [Log Message] :: {Timestamp}
04 :: INFO :: Hello World ::             1555164718
02        :: DEBUG :: Warning! Foo has happened                :: 1555463132

To parse this file, simply configure a new dialect that splits on "::" and trims whitespace, braces, and bracket characters.

csv::Reader csv;
csv.configure_dialect("my strange dialect")
  .delimiter("::")
  .trim_characters(' ', '[', ']', '{', '}');   

csv.read("test.csv");
for (auto& row : csv.rows()) {
  auto thread_id = row["Thread ID"];    // "04"
  auto log_level = row["Log Level"];    // "INFO"
  auto message = row["Log Message"];    // "Hello World"
  // do something
}
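
For readers curious what splitting on "::" and trimming amounts to, here is a small self-contained sketch. The helpers trim and split_and_trim are hypothetical and assume that trimming applies only to the edges of each field; the source does not say how the library treats trim characters in the middle of a value.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Removes leading and trailing characters that appear in trim_chars.
std::string trim(const std::string& s, const std::string& trim_chars) {
  const std::size_t first = s.find_first_not_of(trim_chars);
  if (first == std::string::npos) return "";  // nothing but trim characters
  const std::size_t last = s.find_last_not_of(trim_chars);
  return s.substr(first, last - first + 1);
}

// Splits a line on a (possibly multi-character) delimiter and trims
// every resulting field.
std::vector<std::string> split_and_trim(const std::string& line,
                                        const std::string& delimiter,
                                        const std::string& trim_chars) {
  std::vector<std::string> fields;
  if (delimiter.empty()) {  // avoid looping forever on an empty delimiter
    fields.push_back(trim(line, trim_chars));
    return fields;
  }
  std::size_t start = 0;
  while (true) {
    const std::size_t pos = line.find(delimiter, start);
    if (pos == std::string::npos) {
      fields.push_back(trim(line.substr(start), trim_chars));
      break;
    }
    fields.push_back(trim(line.substr(start, pos - start), trim_chars));
    start = pos + delimiter.size();
  }
  return fields;
}
```

Calling split_and_trim on the header line of the log file with delimiter "::" and trim characters " []{}" gives Thread ID, Log Level, Log Message, and Timestamp.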

Ignoring Columns

Consider the following CSV. Let's say you don't care about the columns age and gender. Here, you can use .ignore_columns and provide a list of columns to ignore.

name, age, gender, email, department
Mark Johnson, 50, M, [email protected], BA
John Stevenson, 35, M, [email protected], IT
Jane Barkley, 25, F, [email protected], MGT

You can configure the dialect to ignore these columns like so:

csv::Reader csv;
csv.configure_dialect("ignore age and gender")
  .delimiter(", ")
  .ignore_columns("age", "gender");

csv.read("test.csv");
auto rows = csv.rows();
// Your rows are:
// [{"name": "Mark Johnson", "email": "[email protected]", "department": "BA"},
//  {"name": "John Stevenson", "email": "[email protected]", "department": "IT"},
//  {"name": "Jane Barkley", "email": "[email protected]", "department": "MGT"}]
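
Conceptually, ignoring columns means dropping header/value pairs while each row dictionary is built. The sketch below shows that idea with a std::map; make_row and the Row alias are hypothetical, and the library itself may strip columns at a different stage of parsing.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using Row = std::map<std::string, std::string>;

// Pairs each header name with the value in the same position, leaving out
// any column whose name appears in the ignored list.
Row make_row(const std::vector<std::string>& header,
             const std::vector<std::string>& values,
             const std::vector<std::string>& ignored) {
  Row row;
  for (std::size_t i = 0; i < header.size() && i < values.size(); ++i) {
    const bool skip =
        std::find(ignored.begin(), ignored.end(), header[i]) != ignored.end();
    if (!skip) {
      row[header[i]] = values[i];
    }
  }
  return row;
}
```

With the header from the example and an ignored list of age and gender, each resulting row holds only name, email, and department.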

No Header?

Sometimes you have CSV files with no header row:

9 52 1
52 91 0
91 135 0
135 174 0
174 218 0
218 260 0
260 301 0
301 341 0
341 383 0
...

If you want to prevent the reader from parsing the first row as a header, simply:

  • Set .header to false
  • Provide a list of column names with .column_names(...)

csv::Reader csv;
csv.configure_dialect("no headers")
  .delimiter(" ")                      // the sample file above is space-separated
  .header(false)
  .column_names("foo", "bar", "baz");

The CSV rows will now look like this:

[{"foo": "9", "bar": "52", "baz": "1"}, {"foo": "52", "bar": "91", "baz": "0"}, ...]

If .column_names is not called, then the reader simply generates dictionary keys like so:

[{"0": "9", "1": "52", "2": "1"}, {"0": "52", "1": "91", "2": "0"}, ...]
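
The generated keys are simply the zero-based column positions written as strings. A minimal sketch of that fallback, using a hypothetical helper named default_column_names, might be:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Produces the keys "0", "1", ..., one per column, for files that have no
// header row and no explicit column names.
std::vector<std::string> default_column_names(std::size_t count) {
  std::vector<std::string> names;
  names.reserve(count);
  for (std::size_t i = 0; i < count; ++i) {
    names.push_back(std::to_string(i));
  }
  return names;
}
```

For the three-column file above, default_column_names(3) returns the keys 0, 1, and 2.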

Dealing with Empty Rows

Sometimes you have to deal with a CSV file that has empty lines; either in the middle or at the end of the file:

a,b,c
1,2,3

4,5,6

10,11,12



Here's how this gets parsed by default:

csv::Reader csv;
csv.read("inputs/empty_lines.csv");
auto rows = csv.rows();
// [{"a": "1", "b": "2", "c": "3"}, {"a": "", "b": "", "c": ""}, {"a": "4", "b": "5", "c": "6"}, {"a": "", ...}]

If you don't care for these empty rows, simply call .skip_empty_rows(true):

csv::Reader csv;
csv.configure_dialect()
  .skip_empty_rows(true);
csv.read("inputs/empty_lines.csv");
auto rows = csv.rows();
// [{"a": "1", "b": "2", "c": "3"}, {"a": "4", "b": "5", "c": "6"}, {"a": "10", "b": "11", "c": "12"}]
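
The effect of skip_empty_rows can be pictured as a filter applied while lines are read. The sketch below is an assumption about the general idea rather than the library's code; read_rows is a hypothetical helper that treats a row as empty when the line has zero characters.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Reads lines from a stream. When skip_empty_rows is true, lines with no
// characters are dropped; otherwise they are kept as empty rows.
std::vector<std::string> read_rows(std::istream& in, bool skip_empty_rows) {
  std::vector<std::string> rows;
  std::string line;
  while (std::getline(in, line)) {
    if (skip_empty_rows && line.empty()) {
      continue;
    }
    rows.push_back(line);
  }
  return rows;
}
```

On the sample file above, this keeps the header plus three data lines when skipping is enabled, and keeps every blank line when it is not.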

Reading first N rows

If you know exactly how many rows to parse, you can help out the reader by using the .read(filename, num_rows) overloaded method. This saves the reader from trying to figure out the number of lines in the CSV file. You can use this method to parse the first N rows of the file instead of parsing all of it.

csv::Reader foo;
foo.read("bar.csv", 1000);
auto rows = foo.rows();

Note: Do not pass a num_rows greater than the actual number of rows in the file. If you do, the reader will loop forever.
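
If you are not sure how many rows a file has, one defensive option is to stop at whichever comes first: the requested count or the end of the input. The sketch below shows that idea on a plain std::istream. read_first_n is a hypothetical helper and does not describe how the library's .read(filename, num_rows) behaves.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Returns at most num_rows lines, stopping early if the stream ends first.
std::vector<std::string> read_first_n(std::istream& in, std::size_t num_rows) {
  std::vector<std::string> rows;
  std::string line;
  while (rows.size() < num_rows && std::getline(in, line)) {
    rows.push_back(line);
  }
  return rows;
}
```

Asking for 10 rows from a 3-line input simply returns 3 rows instead of waiting for more.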

Performance Benchmark

// benchmark.cpp
void parse(const std::string& filename) {
  csv::Reader foo;
  foo.read(filename);
  std::vector<csv::unordered_flat_map<std::string_view, std::string>> rows;
  while (foo.busy()) {
    if (foo.ready()) {
      auto row = foo.next_row();
      rows.push_back(row);
    }
  }
}

$ g++ -pthread -std=c++17 -O3 -Iinclude/ -o test benchmark.cpp
$ time ./test

Each test is run 30 times on an Intel(R) Core(TM) i7-6650U @ 2.20 GHz CPU.

Here are the average-case execution times:

| Dataset | File Size | Rows | Cols | Time |
| --- | --- | --- | --- | --- |
| Demographic Statistics By Zip Code | 27 KB | 237 | 46 | 0.026s |
| Simple 3-column CSV | 14.1 MB | 761,817 | 3 | 0.523s |
| Majestic Million | 77.7 MB | 1,000,000 | 12 | 1.972s |
| Crimes 2001 - Present | 1.50 GB | 6,846,406 | 22 | 32.411s |
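
The figures above were collected with the shell's time command. If you would rather average repeated runs inside the program, a sketch along the following lines could work; average_seconds is a hypothetical helper and was not used to produce the numbers in the table.

```cpp
#include <chrono>
#include <functional>

// Runs the task the given number of times and returns the mean wall-clock
// duration of one run, in seconds.
double average_seconds(const std::function<void()>& task, int runs) {
  using clock = std::chrono::steady_clock;
  double total = 0.0;
  for (int i = 0; i < runs; ++i) {
    const auto start = clock::now();
    task();
    total += std::chrono::duration<double>(clock::now() - start).count();
  }
  return runs > 0 ? total / runs : 0.0;
}
```

For example, passing a lambda that calls parse(filename) with runs set to 30 would mirror the 30-run setup described above.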

Writing CSV files

Simply include writer.hpp and you're good to go.

#include <csv/writer.hpp>

To start writing CSV files, create a csv::Writer object and provide a filename:

csv::Writer foo("test.csv");

Constructing a writer spawns a worker thread that is ready to start writing rows. Using .configure_dialect, configure the dialect to be used by the writer. This is where you can specify the column names:

foo.configure_dialect()
  .delimiter(", ")
  .column_names("a", "b", "c");

Now it's time to write rows. You can do this in multiple ways:

foo.write_row("1", "2", "3");                                     // parameter packing
foo.write_row({"4", "5", "6"});                                   // std::vector
foo.write_row(std::map<std::string, std::string>{                 // std::map
  {"a", "7"}, {"b", "8"}, {"c", "9"} });
foo.write_row(std::unordered_map<std::string, std::string>{       // std::unordered_map
  {"a", "7"}, {"b", "8"}, {"c", "9"} });
foo.write_row(csv::unordered_flat_map<std::string, std::string>{  // csv::unordered_flat_map
  {"a", "7"}, {"b", "8"}, {"c", "9"} });

You can also omit one or more values dynamically when using maps:

foo.write_row(std::map<std::string, std::string>{                 // std::map
  {"a", "7"}, {"c", "9"} });                                      // omitting "b"
foo.write_row(std::unordered_map<std::string, std::string>{       // std::unordered_map
  {"b", "8"}, {"c", "9"} });                                      // omitting "a"
foo.write_row(csv::unordered_flat_map<std::string, std::string>{  // csv::unordered_flat_map
  {"a", "7"}, {"b", "8"} });                                      // omitting "c"
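
To picture what a row with omitted values might look like on disk, the sketch below formats a line from a map by walking the configured column order. It assumes a missing key becomes an empty field, which the source does not confirm, and format_row is a hypothetical helper rather than part of csv::Writer.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Builds one output line by visiting the columns in order. A column with
// no entry in the map is written as an empty field (an assumption).
std::string format_row(const std::vector<std::string>& columns,
                       const std::map<std::string, std::string>& values,
                       const std::string& delimiter) {
  std::string line;
  for (std::size_t i = 0; i < columns.size(); ++i) {
    if (i > 0) line += delimiter;
    const auto it = values.find(columns[i]);
    if (it != values.end()) line += it->second;
  }
  return line;
}
```

Walking the column list rather than the map keeps the output order stable regardless of how the map orders its keys.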

Finally, once you're done writing rows, call .close() to stop the worker thread and close the file stream.

foo.close();

Here's an example writing 3 million lines of CSV to a file:

csv::Writer foo("test.csv");
foo.configure_dialect()
  .delimiter(", ")
  .column_names("a", "b", "c");

for (long i = 0; i < 3000000; i++) {
  auto x = std::to_string(i % 100);
  auto y = std::to_string((i + 1) % 100);
  auto z = std::to_string((i + 2) % 100);
  foo.write_row(x, y, z);
}
foo.close();

The above code takes about 1.8 seconds to execute on my Surface Pro 4.

Steps For Contributors

Contributions are welcome. Have a look at the CONTRIBUTING.md document for more information.

git clone https://github.com/p-ranav/csv.git
cd csv
git submodule update --init --recursive
mkdir build
cd build
cmake .. -DCSV_BUILD_TESTS=ON
cmake --build . --config Debug
ctest --output-on-failure -C Debug

Steps For Users

git clone https://github.com/p-ranav/csv.git
cd csv
mkdir build
cd build
cmake ../.
sudo make install

License

The project is available under the MIT license.
