• This repository has been archived on 19/Feb/2021
  • Language
    Python
  • License
    BSD 3-Clause "New...
  • Created over 10 years ago
  • Updated over 3 years ago

Repository Details

Generate a diff between two tabular datasets expressed in CSV files.

csvdiff

This project is no longer maintained. Please consider forking it if you have the interest and time.

Overview

Generate a diff between two CSV files on the command-line.

csvdiff compares the semantic contents of two CSV files, ignoring row and column ordering, so that the diff shows only what has actually changed. This is useful when you're comparing the output of an automated system from one day to the next.

It's also useful for maintaining patches to third-party data. csvdiff writes its diffs as JSON, so they can be stored and later applied using the matching csvpatch command. If upstream data changes, you can fetch the new version and easily re-apply your changes to it.

Installing

First, you'll need Python and pip. Then run:

pip install csvdiff

Examples

For example, suppose we have a.csv:

id,name,amount
1,bob,20
2,eva,63
3,sarah,7
4,jeff,19
6,fred,10

After some changes and corrections to the data, we now have b.csv:

id,name,amount
1,bob,23       <--- changed
3,sarah,7
4,jeff,19
5,mira,81      <--- added
6,fred,13      <--- changed

Now we can ask for a summary of differences:

$ csvdiff --style=summary id a.csv b.csv
1 rows removed (20.0%)
1 rows added (20.0%)
2 rows changed (40.0%)

Or look at the full diff pretty printed, to make it more readable:

$ csvdiff --style=pretty --output=diff.json id a.csv b.csv
$ cat diff.json
{
  "_index": [
    "id"
  ],
  "added": [
    {
      "amount": "81",
      "id": "5",
      "name": "mira"
    }
  ],
  "changed": [
    {
      "fields": {
        "amount": {
          "from": "20",
          "to": "23"
        }
      },
      "key": [
        "1"
      ]
    },
    {
      "fields": {
        "amount": {
          "from": "10",
          "to": "13"
        }
      },
      "key": [
        "6"
      ]
    }
  ],
  "removed": [
    {
      "amount": "63",
      "id": "2",
      "name": "eva"
    }
  ]
}
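To make the diff structure above concrete, here is a minimal sketch of a keyed comparison that produces the same shape from the two example files. It is not csvdiff's implementation: the helper names `load` and `keyed_diff` are hypothetical, it uses only the standard library, and it assumes both files share the same columns.

```python
import csv
import io
import json

A_CSV = """id,name,amount
1,bob,20
2,eva,63
3,sarah,7
4,jeff,19
6,fred,10
"""

B_CSV = """id,name,amount
1,bob,23
3,sarah,7
4,jeff,19
5,mira,81
6,fred,13
"""


def load(text, index):
    """Index rows by their key columns, so row order stops mattering."""
    rows = csv.DictReader(io.StringIO(text))
    return {tuple(row[c] for c in index): dict(row) for row in rows}


def keyed_diff(a_text, b_text, index):
    """Hypothetical helper: compare two CSV texts row by row, matched on key."""
    a, b = load(a_text, index), load(b_text, index)
    changed = []
    for key in sorted(a.keys() & b.keys()):
        # Record only the fields whose values differ between the two rows.
        fields = {
            col: {"from": a[key][col], "to": b[key][col]}
            for col in a[key]
            if a[key][col] != b[key][col]
        }
        if fields:
            changed.append({"fields": fields, "key": list(key)})
    return {
        "_index": list(index),
        "added": [b[k] for k in sorted(b.keys() - a.keys())],
        "changed": changed,
        "removed": [a[k] for k in sorted(a.keys() - b.keys())],
    }


diff = keyed_diff(A_CSV, B_CSV, ["id"])
print(json.dumps(diff, indent=2, sort_keys=True))
```

Because rows are looked up by key and fields by column name, reordering rows or columns in either file leaves the result unchanged, which is the property the overview describes.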

If you want to exclude columns from the comparison, you can do so by specifying a comma-separated list of column names to ignore. For example:

$ csvdiff --style=summary --ignore-columns=amount id a.csv b.csv
1 rows removed (20.0%)
1 rows added (20.0%)
0 rows changed (0%)
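One way to picture the effect of ignoring a column is to drop it from every row before comparing. The sketch below illustrates that idea with the rows that a.csv and b.csv have in common; the function `changed_keys` and its `ignore` parameter are hypothetical, and csvdiff's own handling may differ in detail.

```python
A_ROWS = [
    {"id": "1", "name": "bob", "amount": "20"},
    {"id": "3", "name": "sarah", "amount": "7"},
    {"id": "4", "name": "jeff", "amount": "19"},
    {"id": "6", "name": "fred", "amount": "10"},
]
B_ROWS = [
    {"id": "1", "name": "bob", "amount": "23"},
    {"id": "3", "name": "sarah", "amount": "7"},
    {"id": "4", "name": "jeff", "amount": "19"},
    {"id": "6", "name": "fred", "amount": "13"},
]


def changed_keys(a_rows, b_rows, index, ignore=()):
    """Return keys of rows present in both inputs whose remaining fields differ."""

    def project(row):
        # Drop ignored columns so they cannot contribute to a difference.
        return {col: val for col, val in row.items() if col not in ignore}

    def by_key(rows):
        return {tuple(r[c] for c in index): project(r) for r in rows}

    a, b = by_key(a_rows), by_key(b_rows)
    return sorted(k for k in a.keys() & b.keys() if a[k] != b[k])


print(changed_keys(A_ROWS, B_ROWS, ["id"]))
print(changed_keys(A_ROWS, B_ROWS, ["id"], ignore=("amount",)))
```

With `amount` ignored, no shared row counts as changed, which matches the zero changed rows in the summary above.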

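You can also compare numeric fields only up to a given number of significant figures with the --significance option. Use negative values to compare at orders of magnitude. The file c.csv used below is not shown; judging by the output, it contains the same rows as a.csv, two of which differ by small numeric amounts: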

$ csvdiff --style=summary id a.csv c.csv
0 rows removed (0.0%)
0 rows added (0.0%)
2 rows changed (40.0%)
$ csvdiff --style=summary id --significance=-1 a.csv c.csv
files are identical
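As a rough illustration of how a negative significance can make small numeric differences disappear, the sketch below rounds both values with Python's built-in `round` before comparing them. This is an assumption about the semantics, not csvdiff's documented rule: it treats the setting as the `ndigits` argument of `round`, so `-1` rounds to the nearest ten. The function name `same_at_significance` is hypothetical.

```python
def same_at_significance(x, y, significance):
    """Compare two numeric strings after rounding (assumed semantics)."""
    # A negative ndigits rounds to the left of the decimal point:
    # -1 rounds to tens, -2 to hundreds, and so on.
    return round(float(x), significance) == round(float(y), significance)


# Values taken from a.csv and b.csv; both pairs differ by 3.
print(same_at_significance("20", "23", 0))
print(same_at_significance("20", "23", -1))
print(same_at_significance("10", "13", -1))
```

Under this reading, 20 and 23 differ when compared as whole numbers but both round to 20 at the nearest ten, so a row whose only difference is that field would no longer be reported as changed.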

Diffs generated this way contain all the data that's changed, and can be reapplied later if the original data changes. For example, suppose more data gets added to a.csv, giving us a-plus.csv:

id,name,amount
1,bob,20
2,eva,63
3,sarah,7
4,jeff,19
6,fred,10
8,henry,9

We can reapply our changes with the csvpatch command:

$ csvpatch --input=diff.json --output=b-plus.csv a-plus.csv
$ cat b-plus.csv
id,name,amount
1,bob,23
3,sarah,7
4,jeff,19
5,mira,81
6,fred,13
8,henry,9
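The patching step can be sketched in a few lines: remove the rows listed under "removed", rewrite the fields listed under "changed", and append the rows listed under "added". The code below is a simplified illustration rather than csvpatch itself. The function `apply_patch` is hypothetical, and it skips the checks a real tool would likely make, such as verifying that each "from" value still matches the current data.

```python
DIFF = {
    "_index": ["id"],
    "added": [{"amount": "81", "id": "5", "name": "mira"}],
    "changed": [
        {"fields": {"amount": {"from": "20", "to": "23"}}, "key": ["1"]},
        {"fields": {"amount": {"from": "10", "to": "13"}}, "key": ["6"]},
    ],
    "removed": [{"amount": "63", "id": "2", "name": "eva"}],
}

A_PLUS = [
    {"id": "1", "name": "bob", "amount": "20"},
    {"id": "2", "name": "eva", "amount": "63"},
    {"id": "3", "name": "sarah", "amount": "7"},
    {"id": "4", "name": "jeff", "amount": "19"},
    {"id": "6", "name": "fred", "amount": "10"},
    {"id": "8", "name": "henry", "amount": "9"},
]


def apply_patch(diff, records):
    """Hypothetical helper: apply a csvdiff-style patch to a list of rows."""
    index = diff["_index"]

    def key_of(row):
        return tuple(row[col] for col in index)

    by_key = {key_of(r): dict(r) for r in records}
    for row in diff["removed"]:
        by_key.pop(key_of(row), None)
    for change in diff["changed"]:
        target = by_key[tuple(change["key"])]
        for col, delta in change["fields"].items():
            target[col] = delta["to"]
    for row in diff["added"]:
        by_key[key_of(row)] = dict(row)
    # Sort by key for a stable output order (string comparison).
    return [by_key[k] for k in sorted(by_key)]


for row in apply_patch(DIFF, A_PLUS):
    print(row["id"], row["name"], row["amount"], sep=",")
```

The row for henry is untouched by the patch and so carries through to the result, just as it does in b-plus.csv above.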

This can be useful if you're using csvdiff to transform data that's outside your control. In this case, you maintain the patch file and simply reapply it when the upstream data provider gives you a fresh file.

For more usage options, run csvdiff --help or csvpatch --help.

API

The main entry points are the diff_files and diff_records functions:

import csvdiff

patch = csvdiff.diff_files('a.csv', 'b.csv', ['id'])

# just show the changed rows
print(patch['changed'])

Using diff_records instead:

import csvdiff

records_a = [{'id': 1, 'name': 'Alice'},
             {'id': 2, 'name': 'Bob'}]
records_b = [{'id': 1, 'name': 'Alice'},
             {'id': 2, 'name': 'Jeff'}]

patch = csvdiff.diff_records(records_a, records_b, ['id'])
print(patch['changed'])

See the matching patch_file and patch_records functions for applying patches.

License

BSD license
