  • Stars: 371
  • Rank: 115,103 (Top 3%)
  • Language: Python
  • License: MIT License
  • Created about 10 years ago
  • Updated over 3 years ago

Repository Details

CSVs are awesome, yet they're pretty dumb. Let's get them smarter!

Smart and awesome CSV utils

smartcsv is a Python utility to read and parse CSVs based on model definitions. Instead of just parsing the CSV into lists (like the builtin csv module), it lets you specify models with named attributes. On top of that it adds validation, custom parsing, failure control and readable error messages.

>>> reader = smartcsv.reader(file_object, columns=COLUMNS, fail_fast=False)
>>> my_object = next(reader)
>>> my_object['title']  # Accessed by model name.
'iPhone 5c Blue'
>>> my_object['price']  # Value transform included
Decimal('799.99')
>>> my_object['currency']  # Based on choices = ['USD', 'YEN']
'USD'
>>> my_object['url']  # custom validator lambda v: v.startswith('http')
'https://www.apple.com/iphone.jpg'

# Nice errors
>>> from pprint import pprint as pp
>>> pp(reader.errors['rows'])
{
    17: {  # The row number
        'row': ['', '', ...],  # The complete row, for reference
        'errors': {  # Description of the errors
            'url': 'Validation failed',
            'currency': "Invalid choice. Expected ['USD', 'YEN']. Got 'AUD' instead."
        }
    }
}

Installation

pip install smartcsv

Usage

For the full range of usage, see the test package (99% coverage).

The basic idea is to define a spec for the columns of your CSV. Assume the following CSV file:

title,category,subcategory,currency,price,url,image_url
iPhone 5c blue,Phones,Smartphones,USD,399,http://apple.com/iphone,http://apple.com/iphone.jpg
iPad mini,Tablets,Apple,USD,699,http://apple.com/iphone,http://apple.com/iphone.jpg

First you need to define the spec for your columns. This is an example (the one used in tests):

CURRENCIES = ('USD', 'ARS', 'JPY')

COLUMNS_1 = [
    {'name': 'title', 'required': True},
    {'name': 'category', 'required': True},
    {'name': 'subcategory', 'required': False},
    {
        'name': 'currency',
        'required': True,
        'choices': CURRENCIES
    },
    {
        'name': 'price',
        'required': True,
        'validator': is_number
    },
    {
        'name': 'url',
        'required': True,
        'validator': lambda c: c.startswith('http')
    },
    {
        'name': 'image_url',
        'required': False,
        'validator': lambda c: c.startswith('http')
    },
]
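The spec above refers to an is_number validator that is not defined in this README. A minimal sketch, assuming a validator is any callable that receives the raw string value and returns a truthy result when the value is acceptable (as the lambda c: c.startswith('http') entries suggest):

```python
from decimal import Decimal, InvalidOperation


def is_number(value):
    """Hypothetical validator: True if `value` parses as a finite decimal number."""
    try:
        return Decimal(value.strip()).is_finite()
    except (InvalidOperation, AttributeError):
        # InvalidOperation: not a number; AttributeError: value is not a string.
        return False


print(is_number('399'))    # True
print(is_number('12.50'))  # True
print(is_number('abc'))    # False
```

Using Decimal rather than float keeps the check consistent with parsing prices as Decimal values, and is_finite() rejects inputs such as 'NaN' or 'Infinity'.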

You can then use smartcsv to parse the CSV:

import smartcsv
with open('my-csv.csv', 'r') as f:
    reader = smartcsv.reader(f, columns=COLUMNS_1)
    for obj in reader:
        print(obj['title'])

smartcsv.reader is built on the builtin csv module and accepts a csv dialect.
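Because dialects come from the standard csv module, a custom one can be registered there first. The sketch below uses only the standard library to define a semicolon-delimited dialect. The dialect name semicolon is made up for illustration, and how exactly smartcsv.reader takes it (for instance as a dialect= keyword) is an assumption to check against the library's signature.

```python
import csv
import io

# Register a hypothetical dialect for semicolon-separated files.
csv.register_dialect('semicolon', delimiter=';', quoting=csv.QUOTE_MINIMAL)

sample = io.StringIO('title;price\n"iPhone; blue";399\n')
rows = list(csv.reader(sample, dialect='semicolon'))
print(rows)  # [['title', 'price'], ['iPhone; blue', '399']]
```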

More advanced usage

Errors

By default smartcsv raises a smartcsv.exceptions.InvalidCSVException when it encounters an error in a column (a missing required field, a value outside the allowed choices, a validation failure, etc.). The exception carries a descriptive error message:

from smartcsv.exceptions import InvalidCSVException

# Assuming the price field is missing
try:
    item = next(reader)
except InvalidCSVException as e:
    print(e.errors)
    # {'price': 'Field required and not provided.'}

To avoid failing fast (raising an exception on the first failure), pass fail_fast=False. No exceptions are raised; instead, the errors are reported on the reader object, with the row number and the details of each error. For example, assuming a CSV with an error in the second row:

reader = smartcsv.reader(f, columns=COLUMNS_1, fail_fast=False)
for obj in reader:
    # Every row is processed without raising an exception.
    print(obj['title'])

error_row = reader.errors['rows'][1]  # Rows are 0-indexed, so the second row has index 1.
print(error_row['row'])  # Print original row data
print(error_row['errors'].keys())  # currency  (the currency column)
print(error_row['errors']['currency'])  # Invalid currency... (nice error explanation)
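When many rows fail, it can help to flatten that structure into one line per problem. A minimal sketch, assuming reader.errors has the shape used above (a 'rows' mapping from row index to a dict with 'row' and 'errors' keys). summarize_errors is a hypothetical helper, not part of smartcsv, and the sample dict stands in for a real reader.errors:

```python
def summarize_errors(errors):
    """Return one 'row N, column: message' string per reported error."""
    lines = []
    for index, detail in sorted(errors.get('rows', {}).items()):
        for column, message in sorted(detail['errors'].items()):
            lines.append('row %d, %s: %s' % (index, column, message))
    return lines


# Stand-in for reader.errors after iterating with fail_fast=False.
sample_errors = {
    'rows': {
        1: {
            'row': ['iPad mini', 'Tablets', 'Apple', 'AUD', '699', 'ftp://x', ''],
            'errors': {
                'currency': 'Invalid choice.',
                'url': 'Validation failed',
            },
        },
    },
}

for line in summarize_errors(sample_errors):
    print(line)
```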

You can also pass a max_failures parameter: the reader counts failures and raises an exception once that threshold is exceeded.

Strip white spaces

By default the strip_white_spaces option is set to True, so leading and trailing whitespace is removed from every value. Example:

sample.csv
title,price
   Some Product  ,  55.5  

row['title'] will be "Some Product" and row['price'] will be "55.5", with the surrounding spaces stripped.
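For comparison, the builtin csv module leaves that padding in place. The sketch below uses only the standard library to show the raw values next to the stripped values one would expect with strip_white_spaces enabled. It illustrates the effect, not smartcsv's internals.

```python
import csv
import io

sample = io.StringIO('title,price\n   Some Product  ,  55.5  \n')
reader = csv.DictReader(sample)
raw = next(reader)
print(raw)  # values keep their surrounding spaces

# Strip each value by hand, mimicking the effect of the option.
stripped = {key: value.strip() for key, value in raw.items()}
print(stripped)  # {'title': 'Some Product', 'price': '55.5'}
```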

Skip lines

sample.csv
GENERATED BY AWESOME SCRIPT
2014-08-12

title,price
Some Product,55.5

The first 3 lines don't contain any useful data, so we skip them with skip_lines=3:

reader = smartcsv.reader(f, columns=COLUMNS_1, fail_fast=False, skip_lines=3)
for obj in reader:
    print(obj['title'])
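The same idea can be sketched with the standard library alone, which shows what skipping lines amounts to: the first three physical lines are consumed before the header is read. This uses itertools.islice and csv.DictReader rather than smartcsv, so it illustrates the behaviour and is not the library's implementation.

```python
import csv
import io
from itertools import islice

sample = io.StringIO(
    'GENERATED BY AWESOME SCRIPT\n'
    '2014-08-12\n'
    '\n'
    'title,price\n'
    'Some Product,55.5\n'
)

skip_lines = 3
# Consume the first `skip_lines` lines so the header is the next line read.
for _ in islice(sample, skip_lines):
    pass

rows = list(csv.DictReader(sample))
print(rows[0]['title'])  # Some Product
```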

Stop on the first error

fail_fast defaults to True; you can also pass fail_fast=True explicitly. The reader then stops as soon as it encounters an error in the CSV file. The error can be a mismatch between your column spec and the value found in the file, or a validation failure.

reader = smartcsv.reader(f, columns=COLUMNS_1, fail_fast=True)
for obj in reader:
    print(obj['title'])

Contributing

Fork, code, watch your tests pass, submit a PR. To test:

$ python setup.py test  # Run tests in your venv
$ tox  # Make sure it passes in all versions.

Integration tests

There are "integration" tests included under tests/integration. They are not run by the default test runner. The idea of those tests is to have real examples of use cases for smartcsv documented. You'll have to run them manually:

py.test tests/integration/lpnk/test_lpnk.py
