  • Stars: 180
  • Rank: 213,097 (Top 5%)
  • Language: Python
  • License: GNU General Publi...
  • Created almost 8 years ago
  • Updated over 1 year ago

Repository Details

A validation library for Pandas data frames using user-friendly schemas

PandasSchema

For the full documentation, refer to the GitHub Pages website.


PandasSchema is a module for validating tabulated data, such as CSVs (Comma Separated Value files), and TSVs (Tab Separated Value files). It uses the incredibly powerful data analysis tool Pandas to do so quickly and efficiently.

For example, say your code expects a CSV that looks a bit like this:

Given Name,Family Name,Age,Sex,Customer ID
Gerald,Hampton,82,Male,2582GABK
Yuuwa,Miyake,27,Male,7951WVLW
Edyta,Majewska,50,Female,7758NSID

Now you want to be able to ensure that the data in your CSV is in the correct format:

import pandas as pd
from io import StringIO
from pandas_schema import Column, Schema
from pandas_schema.validation import (
    LeadingWhitespaceValidation,
    TrailingWhitespaceValidation,
    MatchesPatternValidation,
    InRangeValidation,
    InListValidation,
)

# Each Column pairs a column name with the validations its values must pass
schema = Schema([
    Column('Given Name', [LeadingWhitespaceValidation(), TrailingWhitespaceValidation()]),
    Column('Family Name', [LeadingWhitespaceValidation(), TrailingWhitespaceValidation()]),
    Column('Age', [InRangeValidation(0, 120)]),
    Column('Sex', [InListValidation(['Male', 'Female', 'Other'])]),
    Column('Customer ID', [MatchesPatternValidation(r'\d{4}[A-Z]{4}')])
])

test_data = pd.read_csv(StringIO('''Given Name,Family Name,Age,Sex,Customer ID
Gerald ,Hampton,82,Male,2582GABK
Yuuwa,Miyake,270,male,7951WVLW
Edyta,Majewska ,50,Female,775ANSID
'''))

errors = schema.validate(test_data)

for error in errors:
    print(error)

PandasSchema would then output:

{row: 0, column: "Given Name"}: "Gerald " contains trailing whitespace
{row: 1, column: "Age"}: "270" was not in the range [0, 120)
{row: 1, column: "Sex"}: "male" is not in the list of legal options (Male, Female, Other)
{row: 2, column: "Family Name"}: "Majewska " contains trailing whitespace
{row: 2, column: "Customer ID"}: "775ANSID" does not match the pattern "\d{4}[A-Z]{4}"
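
To make the five rules in the schema above concrete, here is a minimal sketch that applies the same checks row by row using only the Python standard library. It is not how PandasSchema is implemented (the library works on Pandas data frames); the helper name `check_rows` and the short error labels are hypothetical, and the use of an unanchored `re.search` for the pattern check is an assumption made for this illustration.

```python
import csv
import re
from io import StringIO


def check_rows(csv_text):
    """Hypothetical helper (not part of PandasSchema): re-implements the
    example schema's rules and returns (row, column, problem) tuples."""
    errors = []
    for row_index, row in enumerate(csv.DictReader(StringIO(csv_text))):
        # Whitespace rules for the two name columns
        for column in ('Given Name', 'Family Name'):
            value = row[column]
            if value != value.lstrip():
                errors.append((row_index, column, 'leading whitespace'))
            if value != value.rstrip():
                errors.append((row_index, column, 'trailing whitespace'))
        # Half-open range [0, 120), matching the message in the output above
        if not 0 <= float(row['Age']) < 120:
            errors.append((row_index, 'Age', 'out of range'))
        # Membership in a fixed list of legal values (case-sensitive)
        if row['Sex'] not in ('Male', 'Female', 'Other'):
            errors.append((row_index, 'Sex', 'not in list'))
        # Assumption: unanchored search for four digits then four capitals
        if re.search(r'\d{4}[A-Z]{4}', row['Customer ID']) is None:
            errors.append((row_index, 'Customer ID', 'pattern mismatch'))
    return errors


test_csv = '''Given Name,Family Name,Age,Sex,Customer ID
Gerald ,Hampton,82,Male,2582GABK
Yuuwa,Miyake,270,male,7951WVLW
Edyta,Majewska ,50,Female,775ANSID
'''

errors = check_rows(test_csv)
for error in errors:
    print(error)
```

On the sample data this sketch flags the same five cells as the PandasSchema output above: rows 0 and 2 for trailing whitespace, row 1 for the age and sex values, and row 2 for the customer ID.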

More Repositories

1. vue-cwl (Vue, 29 stars)
   Visualizer of CWL (Common Workflow Language) workflows for Vue

2. emoji_pix (Rust, 22 stars)
   A simple command-line utility (and Rust crate!) for converting from a conventional image file (e.g. a PNG file) into a pixel-art version constructed with emoji

3. pokemontcgscraper (JavaScript, 20 stars)
   Scrapes the official Pokemon website for pokemon card data

4. Unipressed (Python, 18 stars)
   Comprehensive Python client for the Uniprot REST API

5. RustLangRetweet (Rust, 17 stars)
   Rust bot that runs periodically on AWS Lambda and retweets any Tweets matching a query

6. wordnet-sqlite (JavaScript, 17 stars)
   A node package exposing an SQLite database of the Princeton University WordNet database

7. ArgparsePrompt (Python, 11 stars)
   Wrapper for the built-in Argparse, allowing missing command-line arguments to be filled in by the user via interactive prompts

8. TidyMultiqc (R, 10 stars)
   Converts 'MultiQC' Reports into Tidy Data Frames

9. koa-pg-session (JavaScript, 10 stars)
   A model implementation of sessions for koa using postgres as the backend

10. node-mtg-json (JavaScript, 9 stars)
    Exposes an API for downloading and accessing the mtgJson file for Magic the Gathering Cards

11. AmplifyCountDirective (TypeScript, 9 stars)
    Count the number of items in your DynamoDB tables using Amplify

12. PkmnCardsScraper (JavaScript, 8 stars)
    Scrapes pkmncards.com for pokemon card data

13. UnreleasedArenaData (Python, 3 stars)
    Parses useful information out of the Magic Arena data files

14. AflTablesScraper (Python, 2 stars)
    A Python scraper for the AFL Tables website

15. jsTreeBind (JavaScript, 2 stars)
    A jQuery plugin that allows the use of data binding frameworks (Angular, Ember, Knockout etc.) with the jsTree UI component

16. Asynchronize (Python, 2 stars)
    Python package for converting callback functions to asynchronous coroutines

17. DockerCli (Python, 2 stars)
    A command line tool for running executable docker images, and automatically mounting in any files used in the command

18. PipeChain (Python, 2 stars)
    Functional pipelines in Python using method chaining

19. Properon (JavaScript, 1 star)
    A free and open source web application for the generation of publication-quality gene diagrams

20. Top8Scraper (Python, 1 star)
    A web scraper for the Magic tournament info stored on mtgtop8

21. DashAwesomeQueryBuilder (JavaScript, 1 star)
    Dash layer around https://github.com/ukrbublik/react-awesome-query-builder

22. refviewers (JavaScript, 1 star)
    A simple web app for recommending reviewers for your journal article

23. 1000gFastqc (HTML, 1 star)
    A collection of 1000 Genomes samples run through FastQC and then MultiQC, for use in testing QC pipelines etc

24. LastPdfPage (Python, 1 star)
    For each page number in a PDF, selects only the last page with that number, and then regenerates the PDF

25. WithPartial (Python, 1 star)
    A utility for functional piping in Python that allows you to access any function in any scope as a partial.

26. Regulagity (Python, 1 star)
    Regulagity (reh-gew-la-git-ee): A tool for measuring the frequency on a git repo using a variety of single summary statistics

27. WdlParserPackaging (Python, 1 star)
    A repo for packaging the official Python WDL parser

28. CustomClass (JavaScript, 1 star)
    Modify the internal behaviour of your JavaScript classes

29. botany (HTML, 1 star)
    A css/js library for creating dynamic tree-view components using declarative techniques

30. mmcifix (Python, 1 star)
    Fixes mmCIF protein structure files

31. 1000g-megaqc (1 star)
    1000 genomes data processed for use as a MegaQC test set