  • Stars: 480
  • Rank: 91,562 (Top 2%)
  • Language: Python
  • License: Apache License 2.0
  • Created: over 6 years ago
  • Updated: about 2 months ago

Repository Details

Pandas, Polars, and Spark DataFrame comparison for humans and more!

DataComPy

DataComPy is a package to compare two Pandas DataFrames. It originally started as something of a replacement for SAS's PROC COMPARE for Pandas DataFrames, offering more functionality than just Pandas.DataFrame.equals(Pandas.DataFrame) (it prints out some stats and lets you tweak how close matches have to be). It was then extended to carry that functionality over to Spark DataFrames.

Quick Installation

pip install datacompy

or

conda install datacompy

Installing extras

If you would like to use Spark or any of the other backends, please make sure you install via extras:

pip install datacompy[spark]
pip install datacompy[dask]
pip install datacompy[duckdb]
pip install datacompy[polars]
pip install datacompy[ray]

Pandas Detail

DataComPy will try to join two dataframes either on a list of join columns, or on indexes. If the two dataframes have duplicates based on join values, the match process sorts by the remaining fields and joins based on that row number.
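
A minimal sketch (with hypothetical data) of how duplicate join values are paired up; within each duplicated group, rows are ordered by the remaining columns before joining:

import pandas as pd
import datacompy

# Both frames contain two rows with acct_id == 1 (hypothetical data)
df1 = pd.DataFrame({"acct_id": [1, 1, 2], "dollar_amt": [110.0, 100.0, 200.0]})
df2 = pd.DataFrame({"acct_id": [1, 1, 2], "dollar_amt": [100.0, 110.0, 200.0]})

# After sorting by the remaining fields within the acct_id == 1 group,
# the 100.0 rows pair with each other and the 110.0 rows pair with each other.
compare = datacompy.Compare(df1, df2, join_columns="acct_id")
print(compare.matches())  # expected: True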

Column-wise comparisons attempt to match values even when dtypes don't match. So if, for example, you have a column with decimal.Decimal values in one dataframe and an identically-named column with float64 dtype in another, it will tell you that the dtypes are different but will still try to compare the values.
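
For example, a minimal sketch of that situation (hypothetical data; the exact report wording may vary between versions):

from decimal import Decimal
import pandas as pd
import datacompy

df1 = pd.DataFrame({"acct_id": [1, 2], "dollar_amt": [Decimal("1.05"), Decimal("2.10")]})
df2 = pd.DataFrame({"acct_id": [1, 2], "dollar_amt": [1.05, 2.10]})  # float64

# The report flags the dtype difference (object vs. float64), but the values
# themselves are still compared and should come back as equal here.
compare = datacompy.Compare(df1, df2, join_columns="acct_id")
print(compare.report())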

Basic Usage

from io import StringIO
import pandas as pd
import datacompy

data1 = """acct_id,dollar_amt,name,float_fld,date_fld
10000001234,123.45,George Maharis,14530.1555,2017-01-01
10000001235,0.45,Michael Bluth,1,2017-01-01
10000001236,1345,George Bluth,,2017-01-01
10000001237,123456,Bob Loblaw,345.12,2017-01-01
10000001239,1.05,Lucille Bluth,,2017-01-01
"""

data2 = """acct_id,dollar_amt,name,float_fld
10000001234,123.4,George Michael Bluth,14530.155
10000001235,0.45,Michael Bluth,
10000001236,1345,George Bluth,1
10000001237,123456,Robert Loblaw,345.12
10000001238,1.05,Loose Seal Bluth,111
"""

df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))

compare = datacompy.Compare(
    df1,
    df2,
    join_columns='acct_id',  #You can also specify a list of columns
    abs_tol=0, #Optional, defaults to 0
    rel_tol=0, #Optional, defaults to 0
    df1_name='Original', #Optional, defaults to 'df1'
    df2_name='New' #Optional, defaults to 'df2'
    )
compare.matches(ignore_extra_columns=False)
# False

# This method prints out a human-readable report summarizing and sampling differences
print(compare.report())

See docs for more detailed usage instructions and an example of the report output.

Things that are happening behind the scenes

  • You pass in two dataframes (df1, df2) to datacompy.Compare and a column to join on (or list of columns) to join_columns. By default the comparison needs to match values exactly, but you can pass in abs_tol and/or rel_tol to apply absolute and/or relative tolerances for numeric columns.

    • You can pass in on_index=True instead of join_columns to join on the index.
  • The class validates that you passed in dataframes, that they contain all of the columns in join_columns, and that their column names are otherwise unique. The class also lowercases all column names to disambiguate.

  • On initialization the class validates inputs, and runs the comparison.

  • Compare.matches() will return True if the dataframes match, False otherwise.

    • You can pass in ignore_extra_columns=True so that non-overlapping column names alone don't make it return False (overlapping columns are still checked)
    • NOTE: if you only want to validate whether a dataframe matches exactly or not, you should look at pandas.testing.assert_frame_equal. The main use case for datacompy is when you need to interpret the difference between two dataframes.
  • Compare also has some shortcuts like

    • intersect_rows, df1_unq_rows, df2_unq_rows for getting the intersection, df1-only, and df2-only records (DataFrames)
    • intersect_columns(), df1_unq_columns(), df2_unq_columns() for getting the intersection, df1-only, and df2-only columns (sets)
  • You can turn on logging to see more detailed logs. (A short sketch combining several of these options follows this list.)
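
A minimal sketch pulling several of the items above together. It assumes the df1/df2 frames and the datacompy import from the Basic Usage example are already in scope, and the tolerance values are arbitrary:

import logging

# Turn on logging to see more detail about what Compare is doing
logging.basicConfig(level=logging.DEBUG)

compare = datacompy.Compare(
    df1,
    df2,
    join_columns='acct_id',
    abs_tol=0.01,   # absolute tolerance for numeric columns
    rel_tol=0.001,  # relative tolerance for numeric columns
)

# True/False, ignoring columns that only exist in one of the frames
compare.matches(ignore_extra_columns=True)

# Row-level shortcuts (DataFrames)
compare.intersect_rows.head()
compare.df1_unq_rows.head()
compare.df2_unq_rows.head()

# Column-level shortcuts (sets)
compare.intersect_columns()
compare.df1_unq_columns()
compare.df2_unq_columns()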

Fugue Detail

Fugue is a Python library that provides a unified interface for data processing on Pandas, DuckDB, Polars, Arrow, Spark, Dask, Ray, and many other backends. DataComPy integrates with Fugue to provide a simple way to compare data across these backends.

Basic Usage

The following usage example compares two Pandas dataframes; it is equivalent to the Pandas example above.

from io import StringIO
import pandas as pd
import datacompy

data1 = """acct_id,dollar_amt,name,float_fld,date_fld
10000001234,123.45,George Maharis,14530.1555,2017-01-01
10000001235,0.45,Michael Bluth,1,2017-01-01
10000001236,1345,George Bluth,,2017-01-01
10000001237,123456,Bob Loblaw,345.12,2017-01-01
10000001239,1.05,Lucille Bluth,,2017-01-01
"""

data2 = """acct_id,dollar_amt,name,float_fld
10000001234,123.4,George Michael Bluth,14530.155
10000001235,0.45,Michael Bluth,
10000001236,1345,George Bluth,1
10000001237,123456,Robert Loblaw,345.12
10000001238,1.05,Loose Seal Bluth,111
"""

df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))

datacompy.is_match(
    df1,
    df2,
    join_columns='acct_id',  #You can also specify a list of columns
    abs_tol=0, #Optional, defaults to 0
    rel_tol=0, #Optional, defaults to 0
    df1_name='Original', #Optional, defaults to 'df1'
    df2_name='New' #Optional, defaults to 'df2'
)
# False

# This method prints out a human-readable report summarizing and sampling differences
print(datacompy.report(
    df1,
    df2,
    join_columns='acct_id',  #You can also specify a list of columns
    abs_tol=0, #Optional, defaults to 0
    rel_tol=0, #Optional, defaults to 0
    df1_name='Original', #Optional, defaults to 'df1'
    df2_name='New' #Optional, defaults to 'df2'
))

To compare dataframes from different backends, you just need to replace df1 and df2 with the dataframes you want to compare: Pandas dataframes, DuckDB relations, Polars dataframes, Arrow tables, Spark dataframes, Dask dataframes, or Ray datasets. For example, to compare a Pandas dataframe with a Spark dataframe:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_df2 = spark.createDataFrame(df2)
datacompy.is_match(
    df1,
    spark_df2,
    join_columns='acct_id',
)
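
Likewise, a Polars dataframe can stand in for either side (a sketch assuming the polars extra is installed and that df1/df2 from the example above are in scope):

import polars as pl

polars_df2 = pl.from_pandas(df2)
datacompy.is_match(
    df1,          # a Pandas dataframe
    polars_df2,   # a Polars dataframe
    join_columns='acct_id',
)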

Notice that in order to use a specific backend, you need to have the corresponding library installed. For example, if you want to compare Ray datasets, you must do

pip install datacompy[ray]

How it works

DataComPy uses Fugue to partition the two dataframes into chunks, and then compares each chunk in parallel using the Pandas-based Compare. The comparison results are then aggregated to produce the final result. Unlike the join operation used in SparkCompare, the Fugue version uses cogroup -> map style semantics (not exactly the same; Fugue adopts a coarser version to achieve good performance), which guarantees a full data comparison with results consistent with the Pandas-based Compare.
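
As an illustration only (this is not the actual Fugue implementation, and chunked_is_match is a hypothetical helper), the chunk-and-aggregate idea looks roughly like hashing the join key into buckets, running the Pandas-based Compare on each bucket, and combining the per-bucket results:

import pandas as pd
import datacompy

def chunked_is_match(df1, df2, join_col, num_buckets=4):
    """Illustrative only: bucket both frames by a hash of the join key,
    run the Pandas-based Compare on each bucket, and AND the results."""
    bucket1 = pd.util.hash_pandas_object(df1[join_col], index=False) % num_buckets
    bucket2 = pd.util.hash_pandas_object(df2[join_col], index=False) % num_buckets
    results = []
    for b in range(num_buckets):
        chunk1 = df1[bucket1 == b]
        chunk2 = df2[bucket2 == b]
        if chunk1.empty and chunk2.empty:
            continue
        results.append(
            datacompy.Compare(chunk1, chunk2, join_columns=join_col).matches()
        )
    return all(results)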

Spark Detail

Important: As of version v0.9.0, SparkCompare uses null-safe (<=>) comparisons.

DataComPy's SparkCompare class will join two dataframes on a list of join columns. It has the capability to map column names that may be different in each dataframe, including in the join columns. You are responsible for creating the dataframes from any source which Spark can handle and for specifying a unique join key. If there are duplicates in either dataframe by join key, the match process will remove the duplicates before joining (and tell you how many duplicates were found).

As with the Pandas-based Compare class, comparisons will be attempted even if dtypes don't match. Any schema differences will be reported in the output as well as in any mismatch reports, so that you can assess whether or not a type mismatch is a problem.

The main reasons why you would choose to use SparkCompare over Compare are that your data is too large to fit into memory, or you're comparing data that works well in a Spark environment, like partitioned Parquet, CSV, or JSON files, or Cerebro tables.

Performance Implications

Spark scales incredibly well, so you can use SparkCompare to compare billions of rows of data, provided you spin up a big enough cluster. Still, joining billions of rows of data is an inherently large task, so there are a couple of things you may want to take into consideration when getting into the cliched realm of "big data":

  • SparkCompare will compare all columns in common in the dataframes and report on the rest. If there are columns in the data that you don't care to compare, use a select statement/method on the dataframe(s) to filter those out (see the sketch after this list). Particularly when reading from wide Parquet files, this can make a huge difference when the columns you don't care about don't have to be read into memory and included in the joined dataframe.
  • For large datasets, adding cache_intermediates=True to the SparkCompare call can help optimize performance by caching certain intermediate dataframes in memory, like the de-duped version of each input dataset, or the joined dataframe. Otherwise, Spark's lazy evaluation will recompute those each time it needs the data in a report or as you access instance attributes. This may be fine for smaller dataframes, but will be costly for larger ones. You do need to ensure that you have enough free cache memory before you do this, so this parameter is set to False by default.
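
A sketch of both suggestions combined. It assumes base_df and compare_df are Spark dataframes like the ones built in the Basic Usage example below, spark is an active SparkSession, and the column list is hypothetical:

# Keep only the columns you actually want to compare before handing the
# dataframes to SparkCompare, and cache intermediates for large inputs.
cols_to_compare = ['acct_id', 'dollar_amt', 'name']   # hypothetical column list

base_trimmed = base_df.select(cols_to_compare)
compare_trimmed = compare_df.select(cols_to_compare)

comparison = datacompy.SparkCompare(
    spark,
    base_trimmed,
    compare_trimmed,
    join_columns=['acct_id'],
    cache_intermediates=True,   # cache de-duped/joined intermediates in memory
)
comparison.report()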

Basic Usage

    import datetime
    import datacompy
    from pyspark.sql import Row

    # This example assumes you have a SparkSession named "spark" in your environment, as you
    # do when running `pyspark` from the terminal or in a Databricks notebook (Spark v2.0 and higher)

    data1 = [
        Row(acct_id=10000001234, dollar_amt=123.45, name='George Maharis', float_fld=14530.1555,
            date_fld=datetime.date(2017, 1, 1)),
        Row(acct_id=10000001235, dollar_amt=0.45, name='Michael Bluth', float_fld=1.0,
            date_fld=datetime.date(2017, 1, 1)),
        Row(acct_id=10000001236, dollar_amt=1345.0, name='George Bluth', float_fld=None,
            date_fld=datetime.date(2017, 1, 1)),
        Row(acct_id=10000001237, dollar_amt=123456.0, name='Bob Loblaw', float_fld=345.12,
            date_fld=datetime.date(2017, 1, 1)),
        Row(acct_id=10000001239, dollar_amt=1.05, name='Lucille Bluth', float_fld=None,
            date_fld=datetime.date(2017, 1, 1))
    ]

    data2 = [
        Row(acct_id=10000001234, dollar_amt=123.4, name='George Michael Bluth', float_fld=14530.155),
        Row(acct_id=10000001235, dollar_amt=0.45, name='Michael Bluth', float_fld=None),
        Row(acct_id=10000001236, dollar_amt=1345.0, name='George Bluth', float_fld=1.0),
        Row(acct_id=10000001237, dollar_amt=123456.0, name='Robert Loblaw', float_fld=345.12),
        Row(acct_id=10000001238, dollar_amt=1.05, name='Loose Seal Bluth', float_fld=111.0)
    ]

    base_df = spark.createDataFrame(data1)
    compare_df = spark.createDataFrame(data2)

    comparison = datacompy.SparkCompare(spark, base_df, compare_df, join_columns=['acct_id'])

    # This prints out a human-readable report summarizing differences
    comparison.report()

Using SparkCompare on EMR or standalone Spark

  1. Set proxy variables
  2. Create a virtual environment, if desired (virtualenv venv; source venv/bin/activate)
  3. Pip install datacompy and requirements
  4. Ensure your SPARK_HOME environment variable is set (this is probably /usr/lib/spark but may differ based on your installation)
  5. Augment your PYTHONPATH environment variable with export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$SPARK_HOME/python:$PYTHONPATH (note that your version of py4j may differ depending on the version of Spark you're using)

Using SparkCompare on Databricks

  1. Clone this repository locally
  2. Create a datacompy egg by running python setup.py bdist_egg from the repo root directory.
  3. From the Databricks front page, click the "Library" link under the "New" section.
  4. On the New library page:
     a. Change source to "Upload Python Egg or PyPi"
     b. Under "Upload Egg", Library Name should be "datacompy"
     c. Drag the egg file in datacompy/dist/ to the "Drop library egg here to upload" box
     d. Click the "Create Library" button
  5. Once the library has been created, from the library page (which you can find in your /Users/{login} workspace), you can choose clusters to attach the library to.
  6. import datacompy in a notebook attached to the cluster that the library is attached to and enjoy!

Contributors

We welcome and appreciate your contributions! Before we can accept any contributions, we ask that you please be sure to sign the Contributor License Agreement (CLA).

This project adheres to the Open Source Code of Conduct. By participating, you are expected to honor this code.

Roadmap

Roadmap details can be found here
