
tableschema-py


A Python implementation of the Table Schema standard.

[Important Notice] We have released Frictionless Framework. This framework provides improved tableschema functionality extended to be a complete data solution. The change is not breaking for existing software, so no action is required. Please read the Migration Guide from tableschema to Frictionless Framework.

Features

  • Table to work with data tables described by Table Schema
  • Schema representing Table Schema
  • Field representing Table Schema field
  • validate to validate Table Schema
  • infer to infer Table Schema from data
  • built-in command-line interface to validate and infer schemas
  • storage/plugins system to connect tables to different storage backends like SQL Database

Getting Started

Installation

The package uses semantic versioning, which means that major versions could include breaking changes. It's highly recommended to specify a tableschema version range in your setup/requirements file, e.g. tableschema>=1.0,<2.0.

$ pip install tableschema
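For example, a requirements file entry pinning the major version range (the exact bounds depend on the release line you target):

```
tableschema>=1.0,<2.0
```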

Documentation

Introduction

Let's start with a simple example:

from tableschema import Table

# Create table
table = Table('path.csv', schema='schema.json')

# Print schema descriptor
print(table.schema.descriptor)

# Print cast rows in a dict form
for keyed_row in table.iter(keyed=True):
    print(keyed_row)

Working with Table

A table is a core concept in a tabular data world. It represents data with metadata (Table Schema). Let's see how we can use it in practice.

Suppose we have a local CSV file. It could also be inline data or a remote link - all are supported by the Table class (except local files for in-browser usage, of course). But say it's data.csv for now:

city,location
london,"51.50,-0.11"
paris,"48.85,2.30"
rome,N/A

Let's create and read a table instance. We use the static Table.load method and the table.read method with the keyed option to get an array of keyed rows:

table = Table('data.csv')
table.headers # ['city', 'location']
table.read(keyed=True)
# [
#   {city: 'london', location: '51.50,-0.11'},
#   {city: 'paris', location: '48.85,2.30'},
#   {city: 'rome', location: 'N/A'},
# ]

As we can see, our locations are just strings. But they should be geopoints. Also, Rome's location is not available, but it's just a string N/A instead of None. First we have to infer Table Schema:

table.infer()
table.schema.descriptor
# { fields:
#   [ { name: 'city', type: 'string', format: 'default' },
#     { name: 'location', type: 'geopoint', format: 'default' } ],
#  missingValues: [ '' ] }
table.read(keyed=True)
# Fails with a data validation error

Let's fix the "not available" location. There is a missingValues property in Table Schema specification. As a first try we set missingValues to N/A in table.schema.descriptor. The schema descriptor can be changed in-place, but all changes should also be committed using table.schema.commit():

table.schema.descriptor['missingValues'] = 'N/A'
table.schema.commit()
table.schema.valid # false
table.schema.errors
# [<ValidationError: "'N/A' is not of type 'array'">]

As good citizens, we've decided to check our schema descriptor's validity. And it's not valid! We should use an array for the missingValues property. Also, don't forget to include the empty string as a valid missing value:

table.schema.descriptor['missingValues'] = ['', 'N/A']
table.schema.commit()
table.schema.valid # true

All good. It looks like we're ready to read our data again:

table.read(keyed=True)
# [
#   {city: 'london', location: [51.50,-0.11]},
#   {city: 'paris', location: [48.85,2.30]},
#   {city: 'rome', location: null},
# ]

Now we see that:

  • locations are arrays with numeric latitude and longitude
  • Rome's location is a native Python None

And because there are no errors after reading, we can be sure that our data is valid against our schema. Let's save it:

table.schema.save('schema.json')
table.save('data.csv')

Our data.csv looks the same because it has been stringified back to csv format. But now we have schema.json:

{
    "fields": [
        {
            "name": "city",
            "type": "string",
            "format": "default"
        },
        {
            "name": "location",
            "type": "geopoint",
            "format": "default"
        }
    ],
    "missingValues": [
        "",
        "N/A"
    ]
}

If we decide to improve it even more, we could update the schema file and open it again - this time providing a schema path:

table = Table('data.csv', schema='schema.json')
# Continue the work

As already mentioned a given schema can be used to validate data (see the Schema section for schema specification details). In default mode invalid data rows immediately trigger an exception in the table.iter()/table.write() methods.

Suppose this schema-invalid local file invalid_data.csv:

key,value
zero,0
one,not_an_integer
two,2

We're going to validate the data against the following schema:

table = Table(
    'invalid_data.csv',
    schema={'fields': [{'name': 'key'}, {'name': 'value', 'type': 'integer'}]})

Iterating over the data triggers an exception due to the failed cast of 'not_an_integer' to int:

for row in table.iter():
    print(row)

# Traceback (most recent call last):
# ...
# tableschema.exceptions.CastError: There are 1 cast errors (see exception.errors) for row "3"

Hint: The row number count starts with 1 and also includes header lines.

(Note: You can optionally switch off iter()/read() value casting using the cast parameter, see reference below.)

By providing a custom exception handler (a callable) to these methods, you can treat exceptions at your own discretion - i.e. "fail late" and, for example, gather a validation report on the whole data:

errors = []
def exc_handler(exc, row_number=None, row_data=None, error_data=None):
    errors.append((exc, row_number, row_data, error_data))

for row in table.iter(exc_handler=exc_handler):
    print(row)

# ['zero', 0]
# ['one', FailedCast('not_an_integer')]
# ['two', 2]

print(errors)

# [(CastError('There are 1 cast errors (see exception.errors) for row "3"',),
#   3,
#   OrderedDict([('key', 'one'), ('value', 'not_an_integer')]),
#   OrderedDict([('value', 'not_an_integer')]))]

Note that

  • Data rows are yielded even though the data is schema-invalid; this is because our custom exception handler chooses not to raise exceptions (but rather collects them in the errors list).
  • Data field values that can't be cast properly (if the iter()/read() cast parameter is set to True, which is the default) are wrapped into a FailedCast "value holder". This allows the data consumer to distinguish uncast values from successfully cast values. FailedCast instances can only get yielded when custom exception handling is in place.
  • The custom exception handler callable must support the function signature specified in the iter()/read() sections of the Table class API reference.

Working with Schema

A model of a schema with helpful methods for working with the schema and supported data. Schema instances can be initialized with a schema source as a url to a JSON file or a JSON object. The schema is initially validated (see validate below). By default, validation errors are stored in schema.errors, but in strict mode they are raised instantly.

Let's create a blank schema. It's not valid because descriptor.fields property is required by the Table Schema specification:

from tableschema import Schema

schema = Schema()
schema.valid # false
schema.errors
# [<ValidationError: "'fields' is a required property">]

To avoid creating a schema descriptor by hand we will use a schema.infer method to infer the descriptor from given data:

schema.infer([
  ['id', 'age', 'name'],
  ['1','39','Paul'],
  ['2','23','Jimmy'],
  ['3','36','Jane'],
  ['4','28','Judy'],
])
schema.valid # true
schema.descriptor
#{ fields:
#   [ { name: 'id', type: 'integer', format: 'default' },
#     { name: 'age', type: 'integer', format: 'default' },
#     { name: 'name', type: 'string', format: 'default' } ],
#  missingValues: [ '' ] }

Now we have an inferred schema and it's valid. We can cast data rows against our schema. We provide a string input which will be cast correspondingly:

schema.cast_row(['5', '66', 'Sam'])
# [ 5, 66, 'Sam' ]

But if we try to provide a missing value for the age field, the cast will fail, because the only valid "missing" value at this point is an empty string. Let's update our schema:

schema.cast_row(['6', 'N/A', 'Walt'])
# Cast error
schema.descriptor['missingValues'] = ['', 'N/A']
schema.commit()
schema.cast_row(['6', 'N/A', 'Walt'])
# [ 6, None, 'Walt' ]

We can save the schema to a local file, and resume work on it at any time by loading it from that file:

schema.save('schema.json')
schema = Schema('schema.json')

Working with Field

from tableschema import Field

# Init field
field = Field({'name': 'name', 'type': 'number'})

# Cast a value
field.cast_value('12345') # -> 12345

Data values can be cast to native Python objects with a Field instance. Field instances can be initialized with field descriptors. This allows formats and constraints to be defined.

Casting a value checks that the value is of the expected type, is in the correct format, and complies with any constraints imposed by the schema; for example, a date value must be in ISO 8601 format to cast against a field of type date. Values that can't be cast, including values that don't meet the field's constraints, raise a cast error.

API Reference

cli

cli()

Command-line interface

Usage: tableschema [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  infer     Infer a schema from data.
  info      Return info on this version of Table Schema
  validate  Validate that a supposed schema is in fact a Table Schema.

Table

Table(self,
      source,
      schema=None,
      strict=False,
      post_cast=[],
      storage=None,
      **options)

Table representation

Arguments

  • source (str/list[]): data source one of:
    • local file (path)
    • remote file (url)
    • array of arrays representing the rows
  • schema (any): data schema in all forms supported by Schema class
  • strict (bool): strictness option to pass to Schema constructor
  • post_cast (function[]): list of post cast processors
  • storage (None/str): storage name like sql or bigquery
  • options (dict): tabulator or storage's options

Raises

  • TableSchemaException: raises on any error

table.hash

Table's SHA256 hash if it's available.

Returns the hash if the table has already been read in full (e.g. using table.read); otherwise returns None. In the middle of an iteration it returns the hash of the contents read so far.

Returns

str/None: SHA256 hash

table.headers

Table's headers if available

Returns

str[]: headers

table.schema

Returns schema class instance if available

Returns

Schema: schema

table.size

Table's size in BYTES if it's available

Returns the size if the table has already been read in full (e.g. using table.read); otherwise returns None. In the middle of an iteration it returns the size of the contents read so far.

Returns

int/None: size in BYTES

table.iter

table.iter(keyed=False,
           extended=False,
           cast=True,
           integrity=False,
           relations=False,
           foreign_keys_values=False,
           exc_handler=None)

Iterates through the table data and emits rows cast based on table schema.

Arguments

keyed (bool):
    yield keyed rows in a form of `{header1: value1, header2: value2}`
    (default is false; the form of rows is `[value1, value2]`)

extended (bool):
    yield extended rows in a form of `[rowNumber, [header1, header2], [value1, value2]]`
    (default is false; the form of rows is `[value1, value2]`)

cast (bool):
    disable data casting if false
    (default is true)

integrity (dict):
    dictionary in a form of `{'size': <bytes>, 'hash': '<sha256>'}`
    to check integrity of the table when it's read completely.
    Both keys are optional.

relations (dict):
    dictionary of foreign key references in a form
    of `{resource1: [{field1: value1, field2: value2}, ...], ...}`.
    If provided, foreign key fields will be checked and resolved
    to one of their references (note: one-to-many foreign keys are not completely resolved).

foreign_keys_values (dict):
    three-level dictionary of foreign key references optimized
    to speed up validation process in a form of
    `{resource1: {(fk_field1, fk_field2): {(value1, value2): {one_keyedrow}, ... }}}`.
    If not provided but relations is true, it will be created
    before the validation process by the index_foreign_keys_values method.

exc_handler (func):
    optional custom exception handler callable.
    Can be used to defer raising errors (i.e. "fail late"), e.g.
    for data validation purposes. Must support the signature below

Custom exception handler

def exc_handler(exc, row_number=None, row_data=None, error_data=None):
    '''Custom exception handler (example)

    # Arguments:
        exc(Exception):
            Deferred exception instance
        row_number(int):
            Data row number that triggers exception exc
        row_data(OrderedDict):
            Invalid data row source data
        error_data(OrderedDict):
            Data row source data field subset responsible for the error, if
            applicable (e.g. invalid primary or foreign key fields). May be
            identical to row_data.
    '''
    # ...

Raises

  • TableSchemaException: base class of any error
  • CastError: data cast error
  • IntegrityError: integrity checking error
  • UniqueKeyError: unique key constraint violation
  • UnresolvedFKError: unresolved foreign key reference error

Returns

Iterator[list]: yields rows

table.read

table.read(keyed=False,
           extended=False,
           cast=True,
           limit=None,
           integrity=False,
           relations=False,
           foreign_keys_values=False,
           exc_handler=None)

Read the whole table and return as array of rows

It has the same API as table.iter except for

Arguments

  • limit (int): limit count of rows to read and return

Returns

list[]: returns rows

table.infer

table.infer(limit=100,
            confidence=0.75,
            missing_values=[''],
            guesser_cls=None,
            resolver_cls=None)

Infer a schema for the table.

It will infer and set Table Schema to table.schema based on table data.

Arguments

  • limit (int): limit rows sample size
  • confidence (float): how many casting errors are allowed (as a ratio, between 0 and 1)
  • missing_values (str[]): list of missing values (by default [''])
  • guesser_cls (class): you can implement inferring strategies by providing type-guessing and type-resolving classes [experimental]
  • resolver_cls (class): you can implement inferring strategies by providing type-guessing and type-resolving classes [experimental]

Returns

dict: Table Schema descriptor

table.save

table.save(target, storage=None, **options)

Save data source to file locally in CSV format with , (comma) delimiter

To save schema use table.schema.save()

Arguments

  • target (str): saving target (e.g. file path)
  • storage (None/str): storage name like sql or bigquery
  • options (dict): tabulator or storage options

Raises

  • TableSchemaException: raises an error if there is saving problem

Returns

True/Storage: returns true or storage instance

table.index_foreign_keys_values

table.index_foreign_keys_values(relations)

Creates a three-level dictionary of foreign key references

The index is optimized to speed up the validation process, in the form {resource1: {(fk_field1, fk_field2): {(value1, value2): {one_keyedrow}, ... }}}.

For each foreign key of the schema it will iterate through the corresponding relations['resource'] to create an index (i.e. a dict) of existing values for the foreign fields, storing one keyed row for each value combination.

The optimization relies on indexing the possible values for one foreign key in a hashmap to speed up later resolution.

This method is public to allow creating the index once and applying it to multiple tables sharing the same schema (typically grouped resources in a data package).

Notes

  • the second key of the output is a tuple of the foreign fields, a proxy identifier of the foreign key
  • the same relation resource can be indexed multiple times, as a schema can contain more than one foreign key pointing to the same resource

Arguments

  • relations (dict): dict of foreign key references in a form of {resource1: [{field1: value1, field2: value2}, ...], ...}. It must contain all resources referenced in the schema's foreign keys definition.

Returns

dict: returns a three-level dictionary of foreign key references optimized to speed up the validation process, in a form of {resource1: {(fk_field1, fk_field2): {(value1, value2): {one_keyedrow}, ... }}}
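The shape of the index can be sketched with plain dictionaries (the resource and field names below are hypothetical):

```python
# Hypothetical relations input: all rows of the referenced 'cities' resource
relations = {
    'cities': [
        {'id': 1, 'name': 'london'},
        {'id': 2, 'name': 'paris'},
    ],
}

# For a foreign key over the ('name',) field, the index maps
# {resource: {fk_fields_tuple: {values_tuple: keyed_row}}}
index = {
    'cities': {
        ('name',): {
            ('london',): {'id': 1, 'name': 'london'},
            ('paris',): {'id': 2, 'name': 'paris'},
        },
    },
}

# Resolution then becomes a pair of O(1) lookups instead of a scan
resolved = index['cities'][('name',)][('london',)]
```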

Schema

Schema(self, descriptor={}, strict=False)

Schema representation

Arguments

  • descriptor (str/dict): schema descriptor, one of: local path, remote url, or dictionary
  • strict (bool): flag to specify validation behaviour: if false, errors are collected in schema.errors; if true, validation errors are raised immediately

Raises

  • TableSchemaException: raise any error that occurs during the process

schema.descriptor

Schema's descriptor

Returns

dict: descriptor

schema.errors

Validation errors

Always empty in strict mode.

Returns

Exception[]: validation errors

schema.field_names

Schema's field names

Returns

str[]: an array of field names

schema.fields

Schema's fields

Returns

Field[]: an array of field instances

schema.foreign_keys

Schema's foreign keys

Returns

dict[]: foreign keys

schema.headers

Schema's field names

Returns

str[]: an array of field names

schema.missing_values

Schema's missing values

Returns

str[]: missing values

schema.primary_key

Schema's primary keys

Returns

str[]: primary keys

schema.valid

Validation status

Always true in strict mode.

Returns

bool: validation status

schema.get_field

schema.get_field(name)

Get schema's field by name.

Use schema.update_field if you want to modify the field descriptor

Arguments

  • name (str): schema field name

Returns

Field/None: Field instance or None if not found


schema.add_field

schema.add_field(descriptor)

Add new field to schema.

The schema descriptor will be validated with newly added field descriptor.

Arguments

  • descriptor (dict): field descriptor

Raises

  • TableSchemaException: raises any error that occurs during the process

Returns

Field/None: added Field instance or None if not added

schema.update_field

schema.update_field(name, update)

Update existing descriptor field by name

Arguments

  • name (str): schema field name
  • update (dict): update to apply to field's descriptor

Returns

bool: true on success and false if no field is found to be modified

schema.remove_field

schema.remove_field(name)

Remove a schema field by name.

The schema descriptor will be validated after field descriptor removal.

Arguments

  • name (str): schema field name

Raises

  • TableSchemaException: raises any error that occurs during the process

Returns

Field/None: removed Field instance or None if not found

schema.cast_row

schema.cast_row(row, fail_fast=False, row_number=None, exc_handler=None)

Cast row based on field types and formats.

Arguments

  • row (any[]): data row as an array of values

Returns

any[]: returns cast data row

schema.infer

schema.infer(rows,
             headers=1,
             confidence=0.75,
             guesser_cls=None,
             resolver_cls=None)

Infer and set schema.descriptor based on data sample.

Arguments

  • rows (list[]): array of arrays representing rows.
  • headers (int/str[]): data sample headers, one of: a row number containing the headers (in which case rows should include the header row), or an array of headers (in which case rows should NOT include a header row)
  • confidence (float): how many casting errors are allowed (as a ratio, between 0 and 1)
  • guesser_cls (class): you can implement inferring strategies by providing type-guessing and type-resolving classes [experimental]
  • resolver_cls (class): you can implement inferring strategies by providing type-guessing and type-resolving classes [experimental]

Returns

dict: Table Schema descriptor

schema.commit

schema.commit(strict=None)

Update schema instance if there are in-place changes in the descriptor.

Example

from tableschema import Schema
descriptor = {'fields': [{'name': 'my_field', 'title': 'My Field', 'type': 'string'}]}
schema = Schema(descriptor)
print(schema.get_field('my_field').descriptor['type']) # string

# Update descriptor by field position
schema.descriptor['fields'][0]['type'] = 'number'
# Update descriptor by field name
schema.update_field('my_field', {'title': 'My Pretty Field'}) # True

# Changes are not committed yet
print(schema.get_field('my_field').descriptor['type']) # string
print(schema.get_field('my_field').descriptor['title']) # My Field

# Commit change
schema.commit()
print(schema.get_field('my_field').descriptor['type']) # number
print(schema.get_field('my_field').descriptor['title']) # My Pretty Field

Arguments

  • strict (bool): alter strict mode for further work

Raises

  • TableSchemaException: raises any error that occurs during the process

Returns

bool: true on success and false if not modified

schema.save

schema.save(target, ensure_ascii=True)

Save schema descriptor to target destination.

Arguments

  • target (str): path where to save a descriptor
  • ensure_ascii (bool): if true (default), escape non-ASCII characters in the saved JSON (passed through to json.dump)

Raises

  • TableSchemaException: raises any error that occurs during the process

Returns

bool: true on success

Field

Field(self, descriptor, missing_values=[''], schema=None)

Field representation

Arguments

  • descriptor (dict): schema field descriptor
  • missing_values (str[]): an array of strings representing missing values

Raises

  • TableSchemaException: raises any error that occurs during the process

field.constraints

Field constraints

Returns

dict: dict of field constraints

field.descriptor

Field's descriptor

Returns

dict: descriptor

field.format

Field format

Returns

str: field format

field.missing_values

Field's missing values

Returns

str[]: missing values

field.name

Field name

Returns

str: field name

field.required

Whether field is required

Returns

bool: true if required

field.schema

Returns a schema instance if the field belongs to some schema

Returns

Schema: field's schema

field.type

Field type

Returns

str: field type

field.cast_value

field.cast_value(value, constraints=True)

Cast given value according to the field type and format.

Arguments

  • value (any): value to cast against field
  • constraints (bool/str[]): constraints configuration - set it to false to disable constraint checks, or pass an array of constraint names to check, e.g. ['minimum', 'maximum']

Raises

  • TableSchemaException: raises any error that occurs during the process

Returns

any: returns cast value

field.test_value

field.test_value(value, constraints=True)

Test whether value is compliant to the field.

Arguments

  • value (any): value to test against the field
  • constraints (bool/str[]): constraints configuration

Returns

bool: returns if value is compliant to the field

Storage

Storage(self, **options)

Storage factory/interface

For users

Use Storage.connect to instantiate a storage

For instantiation of concrete storage instances, tableschema.Storage provides a unified factory method connect (which uses the plugin system under the hood):

# pip install tableschema_sql
from tableschema import Storage

storage = Storage.connect('sql', **options)
storage.create('bucket', descriptor)
storage.write('bucket', rows)
storage.read('bucket')

For integrators

The library includes an interface declaration for implementing tabular Storage. This interface allows different data storage systems like SQL to be used with the tableschema.Table class (load/save), as well as at the data package level:

Storage

An implementor must follow the tableschema.Storage interface to write their own storage backend. Concrete storage backends could include additional functionality specific to a concrete storage system. See plugins below to learn how to integrate a custom storage plugin into your workflow.

storage.buckets

Return list of storage bucket names.

A bucket is a special term which has almost the same meaning as a table. You should think of a bucket as a table stored in the storage.

Raises

  • exceptions.StorageError: raises on any error

Returns

str[]: return list of bucket names

storage.connect

storage.connect(name, **options)

Create tabular storage based on storage name.

This method is static: Storage.connect()

Arguments

  • name (str): storage name like sql
  • options (dict): concrete storage options

Raises

  • StorageError: raises on any error

Returns

Storage: returns Storage instance

storage.create

storage.create(bucket, descriptor, force=False)

Create one/multiple buckets.

Arguments

  • bucket (str/list): bucket name or list of bucket names
  • descriptor (dict/dict[]): schema descriptor or list of descriptors
  • force (bool): whether to delete and re-create already existing buckets

Raises

  • exceptions.StorageError: raises on any error

storage.delete

storage.delete(bucket=None, ignore=False)

Delete one/multiple/all buckets.

Arguments

  • bucket (str/list/None): bucket name or list of bucket names to delete. If None, all buckets will be deleted
  • ignore (bool): don't raise an error on non-existent bucket deletion

Raises

  • exceptions.StorageError: raises on any error

storage.describe

storage.describe(bucket, descriptor=None)

Get/set bucket's Table Schema descriptor

Arguments

  • bucket (str): bucket name
  • descriptor (dict/None): schema descriptor to set

Raises

  • exceptions.StorageError: raises on any error

Returns

dict: returns Table Schema descriptor

storage.iter

storage.iter(bucket)

Return an iterator of typed values based on the schema of this bucket.

Arguments

  • bucket (str): bucket name

Raises

  • exceptions.StorageError: raises on any error

Returns

list[]: yields data rows

storage.read

storage.read(bucket)

Read typed values based on the schema of this bucket.

Arguments

  • bucket (str): bucket name

Raises

  • exceptions.StorageError: raises on any error

Returns

list[]: returns data rows

storage.write

storage.write(bucket, rows)

This method writes data rows into storage.

It should store values of unsupported types as strings internally (like csv does).

Arguments

  • bucket (str): bucket name
  • rows (list[]): data rows to write

Raises

  • exceptions.StorageError: raises on any error

validate

validate(descriptor)

Validate descriptor

Arguments

  • descriptor (dict): schema descriptor to validate

Raises

  • ValidationError: on validation errors

Returns

bool: True

infer

infer(source,
      headers=1,
      limit=100,
      confidence=0.75,
      missing_values=[''],
      guesser_cls=None,
      resolver_cls=None,
      **options)

Infer source schema.

Arguments

  • source (any): source as path, url or inline data
  • headers (int/str[]): headers rows number or headers list
  • confidence (float): how many casting errors are allowed (as a ratio, between 0 and 1)
  • missing_values (str[]): list of missing values (by default [''])
  • guesser_cls (class): you can implement inferring strategies by providing type-guessing and type-resolving classes [experimental]
  • resolver_cls (class): you can implement inferring strategies by providing type-guessing and type-resolving classes [experimental]

Raises

  • TableSchemaException: raises any error that occurs during the process

Returns

dict: returns schema descriptor

FailedCast

FailedCast(self, value)

Wrap an original data field value that failed to be properly cast.

FailedCast allows values to be further processed/yielded while still letting the consuming side distinguish uncast values.

Delegates attribute access and the basic rich comparison methods to the underlying object. Supports default user-defined-class hashability, i.e. it is hashable based on object identity (not on the wrapped value).

Arguments

  • value (any): value

DataPackageException

DataPackageException(self, message, errors=[])

Base class for all DataPackage/TableSchema exceptions.

If there are multiple errors, they can be read from the exception object:

try:
    pass  # lib action
except DataPackageException as exception:
    if exception.multiple:
        for error in exception.errors:
            pass  # handle error

datapackageexception.errors

List of nested errors

Returns

DataPackageException[]: list of nested errors

datapackageexception.multiple

Whether it's a nested exception

Returns

bool: whether it's a nested exception

TableSchemaException

TableSchemaException(self, message, errors=[])

Base class for all TableSchema exceptions.

LoadError

LoadError(self, message, errors=[])

All loading errors.

ValidationError

ValidationError(self, message, errors=[])

All validation errors.

CastError

CastError(self, message, errors=[])

All value cast errors.

IntegrityError

IntegrityError(self, message, errors=[])

All integrity errors.

UniqueKeyError

UniqueKeyError(self, message, errors=[])

Unique key constraint violation (CastError subclass)

RelationError

RelationError(self, message, errors=[])

All relations errors.

UnresolvedFKError

UnresolvedFKError(self, message, errors=[])

Unresolved foreign key reference error (RelationError subclass).

StorageError

StorageError(self, message, errors=[])

All storage errors.

Experimental

This API is experimental and can be changed/removed in the future

There is an experimental environment variable TABLESCHEMA_PRESERVE_MISSING_VALUES which, if it is set, affects how data casting works.

By default, missing values are resolved to None. When this flag is set, missing values are passed through as-is. For example:

missing_values.py

from tableschema import Field

field = Field({'type': 'number'}, missing_values=['-'])
print(field.cast_value('3'))
print(field.cast_value('-'))

Running this script in different modes:

$ python missing_values.py
3
None
$ TABLESCHEMA_PRESERVE_MISSING_VALUES=1 python missing_values.py
3
-

The flag affects all of the library's APIs and software built on top of tableschema. For example, Data Package Pipelines:

$ TABLESCHEMA_PRESERVE_MISSING_VALUES=1 dpp run ./my_pipeline

Contributing

The project follows the Open Knowledge International coding standards.

Recommended way to get started is to create and activate a project virtual environment. To install package and development dependencies into active environment:

$ make install

To run tests with linting and coverage:

$ make test

Changelog

Only breaking and the most important changes are described here. The full changelog and documentation for all released versions can be found in the nicely formatted commit history.

v1.20

  • Added --json flag to the CLI (#287)

v1.19

  • Deduplicate field names if guessing in infer

v1.18

  • Publish field.ERROR/cast_function/check_functions

v1.17

  • Added schema.missing_values and field.missing_values

v1.16

  • Fixed the way we parse geopoint:
    • as a string it can be in 3 forms ("default", "array", "object") BUT
    • as native object it can only be a list/tuple

v1.15

  • Added an experimental TABLESCHEMA_PRESERVE_MISSING_VALUES environment variable flag

v1.14

  • Allow providing custom guesser and resolver to table.infer and infer

v1.13

  • Added missing_values argument to the infer function (#269)

v1.12

  • Support optional custom exception handling for table.iter/read (#259)

v1.11

  • Added preserve_missing_values parameter to field.cast_value

v1.10

  • Added an ability to check table's integrity while reading

v1.9

  • Implemented the table.size and table.hash properties

v1.8

  • Added table.index_foreign_keys_values and improved foreign key checks performance

v1.7

  • Added field.schema property

v1.6

  • In strict mode raise an exception if there are problems in field construction

v1.5

  • Allow providing custom guesser and resolver to schema infer

v1.4

  • Added schema.update_field method

v1.3

  • Support datetime with no time for date casting

v1.2

  • Support floats like 1.0 for integer casting

v1.1

  • Added the confidence parameter to infer

v1.0
