

CleverCSV is a Python package for handling messy CSV files. It provides a drop-in replacement for the builtin CSV module with improved dialect detection, and comes with a handy command line application for working with CSV files.



CleverCSV provides a drop-in replacement for the Python csv package with improved dialect detection for messy CSV files. It also provides a handy command line tool that can standardize a messy file or generate Python code to import it.

Contents: Quick Start | Introduction | Installation | Usage | Python Library | Command-Line Tool | Version Control Integration | Contributing | Notes


Quick Start

See the Introduction for more details about CleverCSV. If you're in a hurry, below is a quick overview of how to get started with the CleverCSV Python package and the command line interface.

For the Python package:

# Import the package
>>> import clevercsv

# Load the file as a list of rows
# This uses the imdb.csv file in the examples directory
>>> rows = clevercsv.read_table('./imdb.csv')

# Load the file as a Pandas Dataframe
# Note that df = pd.read_csv('./imdb.csv') would fail here
>>> df = clevercsv.read_dataframe('./imdb.csv')

# Use CleverCSV as drop-in replacement for the Python CSV module
# This follows the Sniffer example: https://docs.python.org/3/library/csv.html#csv.Sniffer
# Note that csv.Sniffer would fail here
>>> with open('./imdb.csv', newline='') as csvfile:
...     dialect = clevercsv.Sniffer().sniff(csvfile.read())
...     csvfile.seek(0)
...     reader = clevercsv.reader(csvfile, dialect)
...     rows = list(reader)

And for the command line interface:

# Install the full version of CleverCSV (this includes the command line interface)
$ pip install clevercsv[full]

# Detect the dialect
$ clevercsv detect ./imdb.csv
Detected: SimpleDialect(',', '', '\\')

# Generate code to import the file
$ clevercsv code ./imdb.csv

import clevercsv

with open("./imdb.csv", "r", newline="", encoding="utf-8") as fp:
    reader = clevercsv.reader(fp, delimiter=",", quotechar="", escapechar="\\")
    rows = list(reader)

# Explore the CSV file as a Pandas dataframe
$ clevercsv explore -p imdb.csv
Dropping you into an interactive shell.
CleverCSV has loaded the data into the variable: df
>>> df

Introduction

  • CSV files are awesome! They are lightweight, easy to share, human-readable, version-controllable, and supported by many systems and tools!
  • CSV files are terrible! They can have many different formats, multiple tables, headers or no headers, escape characters, and there's no support for recording metadata!

CleverCSV is a Python package that aims to solve some of the pain points of CSV files, while maintaining many of the good things. The package automatically detects (with high accuracy) the format (dialect) of CSV files, thus making it easier to simply point to a CSV file and load it, without the need for human inspection. In the future, we hope to solve some of the other issues of CSV files too.

CleverCSV is based on science. We investigated thousands of real-world CSV files to find a robust way to automatically detect the dialect of a file. This may seem like an easy problem, but to a computer a CSV file is simply a long string, and every dialect will give you some table. In CleverCSV we use a technique based on the patterns of row lengths of the parsed file and the data type of the resulting cells. With our method we achieve 97% accuracy for dialect detection, with a 21% improvement on non-standard (messy) CSV files compared to the Python standard library.
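
To make the "every dialect will give you some table" point concrete, here is a toy sketch that parses one sample with several candidate delimiters and compares how regular the resulting row lengths are. The sample data, `row_lengths`, and `toy_consistency` are made up for illustration. This is not CleverCSV's actual score, which also takes the data types of the cells into account.

```python
import csv
import io
from collections import Counter

# Hypothetical sample: semicolon-delimited, with commas inside a free-text column.
SAMPLE = (
    "name;city;note\n"
    "Ada;London;likes tea, coffee\n"
    "Alan;Wilmslow;runs, cycles\n"
)


def row_lengths(text, delimiter):
    """Parse text with the given delimiter and return the length of each row."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [len(row) for row in reader]


def toy_consistency(text, delimiter):
    """Fraction of rows that share the most common row length.

    Single-column results score zero, because any character that never
    occurs in the file would otherwise look perfectly consistent.
    """
    lengths = row_lengths(text, delimiter)
    most_common_length, count = Counter(lengths).most_common(1)[0]
    if most_common_length < 2:
        return 0.0
    return count / len(lengths)


# Every candidate yields *some* table; only one of them is regular.
scores = {d: toy_consistency(SAMPLE, d) for d in ",;|\t"}
best = max(scores, key=scores.get)
print(scores)
print("best delimiter:", repr(best))
```

The comma still produces a table here, just an irregular one, which is why row-length patterns alone are a useful but incomplete signal.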

We think this kind of work can be very valuable for working data scientists and programmers, and we hope that you find CleverCSV useful (if there's a problem, please open an issue!). Since the academic world counts citations, please cite CleverCSV if you use the package. Here's a BibTeX entry you can use:

@article{van2019wrangling,
        title = {Wrangling Messy {CSV} Files by Detecting Row and Type Patterns},
        author = {{van den Burg}, G. J. J. and Naz{\'a}bal, A. and Sutton, C.},
        journal = {Data Mining and Knowledge Discovery},
        year = {2019},
        volume = {33},
        number = {6},
        pages = {1799--1820},
        issn = {1573-756X},
        doi = {10.1007/s10618-019-00646-y},
}

And of course, if you like the package please spread the word! You can do this by Tweeting about it (#CleverCSV) or clicking the ⭐️ on GitHub!

Installation

CleverCSV is available on PyPI. You can install either the full version, which includes the command line interface and all optional dependencies, using

$ pip install clevercsv[full]

or you can install a lighter, core version of CleverCSV with

$ pip install clevercsv

Usage

CleverCSV consists of a Python library and a command line tool called clevercsv.

Python Library

We designed CleverCSV to provide a drop-in replacement for the built-in csv module, with some useful functionality added to it. If you simply want to replace the built-in csv module with CleverCSV, import it as follows and use it as you would use the csv module.

import clevercsv

CleverCSV provides an improved version of the dialect sniffer in the CSV module, but it also adds some useful wrapper functions. These functions automatically detect the dialect and aim to make working with CSV files easier. We currently have the following helper functions:

  • detect_dialect: takes a path to a CSV file and returns the detected dialect
  • read_table: automatically detects the dialect and encoding of the file, and returns the data as a list of rows. A version that returns a generator is also available: stream_table
  • read_dataframe: detects the dialect and encoding of the file and then uses Pandas to read the CSV into a DataFrame. Note that this function requires Pandas to be installed.
  • read_dicts: detect the dialect and return the rows of the file as dictionaries, assuming the first row contains the headers. A streaming version called stream_dicts is also available.
  • write_table: write a table (a list of lists) to a file using the RFC-4180 dialect.
  • write_dicts: write a list of dictionaries to a file using the RFC-4180 dialect.
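
As a rough sketch of how these helpers fit together, the snippet below writes a small table, detects its dialect, and reads it back as rows and as dictionaries. The file name, the sample data, and the `helper_round_trip` function are hypothetical. The import guard is only there so the snippet also loads on a system without CleverCSV, and the detected dialect is assumed to expose a `delimiter` attribute as in the `SimpleDialect` output shown in the Quick Start.

```python
import os
import tempfile

try:
    import clevercsv
except ImportError:  # lets the snippet load where CleverCSV is not installed
    clevercsv = None


def helper_round_trip(directory):
    """Write a small table, then detect its dialect and read it back."""
    if clevercsv is None:
        raise RuntimeError("CleverCSV is required: pip install clevercsv")

    path = os.path.join(directory, "fruit.csv")  # hypothetical file name
    table = [["name", "qty"], ["apple", "3"], ["pear", "5"], ["plum", "7"]]

    clevercsv.write_table(table, path)        # written using the RFC-4180 dialect
    dialect = clevercsv.detect_dialect(path)  # detected dialect of the file
    rows = clevercsv.read_table(path)         # data as a list of rows
    records = clevercsv.read_dicts(path)      # first row used as the keys
    return dialect, rows, records


if clevercsv is not None:
    with tempfile.TemporaryDirectory() as tmp:
        dialect, rows, records = helper_round_trip(tmp)
        print(dialect)
        print(rows)
        print(records)
```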

Of course, you can also use the traditional way of loading a CSV file, as in the Python CSV module:

import clevercsv

with open("data.csv", "r", newline="") as fp:
    # you can use verbose=True to see what CleverCSV does
    dialect = clevercsv.Sniffer().sniff(fp.read(), verbose=False)
    fp.seek(0)
    reader = clevercsv.reader(fp, dialect)
    rows = list(reader)

Since CleverCSV v0.8.0, dialect detection is a lot faster than in previous versions. However, for large files, you can speed up detection even more by supplying a sample of the document to the sniffer instead of the whole file, for example:

dialect = clevercsv.Sniffer().sniff(fp.read(10000))
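
Put together with the loading pattern above, sample-based detection might look like the sketch below. The function name `read_with_sampled_dialect` and its `num_chars` default are illustrative choices, not part of the package. Because the `Sniffer` and `reader` interface mirrors the standard library, the sketch falls back to the built-in csv module when CleverCSV is not installed, at the cost of weaker detection on messy files.

```python
try:
    import clevercsv as csvlib  # preferred: improved dialect detection
except ImportError:
    import csv as csvlib  # same Sniffer/reader interface, weaker detection


def read_with_sampled_dialect(path, num_chars=10000):
    """Detect the dialect from the first num_chars characters, then parse the whole file."""
    with open(path, "r", newline="") as fp:
        # Only the sample is handed to the sniffer ...
        dialect = csvlib.Sniffer().sniff(fp.read(num_chars))
        # ... but parsing restarts from the beginning of the file.
        fp.seek(0)
        return list(csvlib.reader(fp, dialect))
```

A sample that is too short, or that ends in the middle of an unusual row, can mislead any detector, so it is worth choosing a sample size that covers a representative number of rows.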

You can also speed up encoding detection by installing cCharDet; it is used automatically when it is available on the system.

That's the basics! If you want more details, you can look at the code of the package, the test suite, or the API documentation. If you run into any issues or have comments or suggestions, please open an issue on GitHub.

Command-Line Tool

To use the command line tool, make sure that you install the full version of CleverCSV (see above).

The clevercsv command line application has a number of handy features to make working with CSV files easier. For instance, it can be used to view a CSV file on the command line while automatically detecting the dialect. It can also generate Python code for importing data from a file with the correct dialect. The full help text is as follows:

usage: clevercsv [-h] [-V] [-v] command ...

Available commands:
  help         Display help information
  detect       Detect the dialect of a CSV file
  view         View the CSV file on the command line using TabView
  standardize  Convert a CSV file to one that conforms to RFC-4180
  code         Generate Python code to import a CSV file
  explore      Explore the CSV file in an interactive Python shell

Each of the commands has further options (for instance, the code and explore commands have support for importing the CSV file as a Pandas DataFrame). Use clevercsv help <command> or man clevercsv <command> for more information. Below are some examples for each command.

Note that each command accepts the -n or --num-chars flag to set the number of characters used to detect the dialect. This can be especially helpful to speed up dialect detection on large files.

Code

Code generation is useful when you don't want to detect the dialect of the same file over and over again. You simply run the following command and copy the generated code to a Python script!

$ clevercsv code imdb.csv

# Code generated with CleverCSV

import clevercsv

with open("imdb.csv", "r", newline="", encoding="utf-8") as fp:
    reader = clevercsv.reader(fp, delimiter=",", quotechar="", escapechar="\\")
    rows = list(reader)

We also have a version that reads a Pandas dataframe:

$ clevercsv code --pandas imdb.csv

# Code generated with CleverCSV

import clevercsv

df = clevercsv.read_dataframe("imdb.csv", delimiter=",", quotechar="", escapechar="\\")

Detect

Detection is useful when you only want to know the dialect.

$ clevercsv detect imdb.csv
Detected: SimpleDialect(',', '', '\\')

The --plain flag gives the components of the dialect on separate lines, which makes combining it with grep easier.

$ clevercsv detect --plain imdb.csv
delimiter = ,
quotechar =
escapechar = \
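
If you would rather consume the plain output from Python than from grep, a small parser is enough. The sketch below assumes the output has exactly the `key = value` layout shown above; `parse_plain_dialect` is a hypothetical helper, not part of CleverCSV.

```python
def parse_plain_dialect(text):
    """Turn 'key = value' lines into a dict, keeping values exactly as printed."""
    dialect = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop only the single space after '=' so that a space or tab
        # delimiter is not stripped away by accident.
        if value.startswith(" "):
            value = value[1:]
        dialect[key.strip()] = value
    return dialect


# The output shown above, as it would arrive on standard output.
plain_output = "delimiter = ,\nquotechar =\nescapechar = \\\n"
params = parse_plain_dialect(plain_output)
print(params)
```

An empty value, as for quotechar here, means that the file does not use that character.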

Explore

The explore command is great for a command-line based workflow, or when you quickly want to start working with a CSV file in Python. This command detects the dialect of a CSV file and starts an interactive Python shell with the file already loaded! You can either have the file loaded as a list of lists:

$ clevercsv explore milk.csv
Dropping you into an interactive shell.

CleverCSV has loaded the data into the variable: rows
>>>
>>> len(rows)
381

or you can load the file as a Pandas dataframe:

$ clevercsv explore -p imdb.csv
Dropping you into an interactive shell.

CleverCSV has loaded the data into the variable: df
>>>
>>> df.head()
                   fn        tid  ... War Western
0  titles01/tt0012349  tt0012349  ...   0       0
1  titles01/tt0015864  tt0015864  ...   0       0
2  titles01/tt0017136  tt0017136  ...   0       0
3  titles01/tt0017925  tt0017925  ...   0       0
4  titles01/tt0021749  tt0021749  ...   0       0

[5 rows x 44 columns]

Standardize

Use the standardize command when you want to rewrite a file using the RFC-4180 standard:

$ clevercsv standardize --output imdb_standard.csv imdb.csv

In this particular example the use of the escape character is replaced by using quotes.
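
To illustrate what replacing escape characters by quotes means, the sketch below does the conversion by hand with the standard library. The sample record is made up, the dialect is hard-coded rather than detected, and this is not how the standardize command is implemented. It only shows the shape of the input and the output.

```python
import csv
import io

# Hypothetical record that uses backslash escapes instead of quotes.
messy = 'title,year\nSay \\"Hi\\"\\, World,1999\n'

# Read with the escape-based dialect: no quoting, backslash as escape character.
reader = csv.reader(
    io.StringIO(messy),
    delimiter=",",
    quoting=csv.QUOTE_NONE,
    escapechar="\\",
)
rows = list(reader)

# Write with the default "excel" dialect: comma delimiter, double quotes
# around cells that need them, embedded quotes doubled, CRLF line endings.
out = io.StringIO(newline="")
csv.writer(out).writerows(rows)
standardized = out.getvalue()

print(rows)
print(standardized)
```

After the conversion, the cell that relied on backslashes is wrapped in double quotes and its embedded quotes are doubled, so readers that follow RFC-4180 can parse it without knowing about the escape character.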

View

This command allows you to view the file in the terminal. The dialect is of course detected using CleverCSV! Both this command and the standardize command support the --transpose flag, if you want to transpose the file before viewing or saving:

$ clevercsv view --transpose imdb.csv

Version Control Integration

If you'd like to make sure that you never commit a messy (non-standard) CSV file to your repository, you can install a pre-commit hook. First, install pre-commit using the installation instructions. Next, add the following configuration to the .pre-commit-config.yaml file in your repository:

repos:
  - repo: https://github.com/alan-turing-institute/CleverCSV-pre-commit
    rev: v0.6.6   # or any later version
    hooks:
      - id: clevercsv-standardize

Finally, run pre-commit install to set up the git hook. Pre-commit will now use CleverCSV to standardize your CSV files following RFC-4180 whenever you commit a CSV file to your repository.

Contributing

If you want to encourage development of CleverCSV, the best thing to do now is to spread the word!

If you encounter an issue in CleverCSV, please open an issue or submit a pull request. Don't hesitate: you're helping to make this project better for everyone! If GitHub's not your thing but you still want to contact us, you can send an email to gertjanvandenburg at gmail dot com instead. You can also ask questions on Gitter.

Note that all contributions to the project must adhere to the Code of Conduct.

The CleverCSV package was originally written by Gertjan van den Burg and came out of scientific research on wrangling messy CSV files by Gertjan van den Burg, Alfredo Nazabal, and Charles Sutton.

Notes

CleverCSV is licensed under the MIT license. Please cite our research if you use CleverCSV in your work.

Copyright (c) 2018-2021 The Alan Turing Institute.
