  • Stars: 191
  • Rank 201,682 (Top 4 %)
  • Language: Python
  • License: MIT License
  • Created almost 9 years ago
  • Updated 6 months ago

Repository Details

A Python library for working with Data Packages.

datapackage-py

A library for working with Data Packages.

[Important Notice] We have released Frictionless Framework. This framework provides improved datapackage functionality extended to be a complete data solution. The change is not breaking for existing software, so no action is required. Please read the Migration Guide from datapackage to Frictionless Framework.

Features

  • Package class for working with data packages
  • Resource class for working with data resources
  • Profile class for working with profiles
  • validate function for validating data package descriptors
  • infer function for inferring data package descriptors

Getting Started

Installation

The package uses semantic versioning, which means that major versions could include breaking changes. It's highly recommended to specify a datapackage version range in your setup/requirements file, e.g. datapackage>=1.0,<2.0.

$ pip install datapackage

OSX 10.14+

If you receive an error about the cchardet package when installing datapackage on Mac OSX 10.14 (Mojave) or higher, follow these steps:

  1. Make sure you have the latest Xcode command line tools by running the following in terminal: xcode-select --install
  2. Then go to https://developer.apple.com/download/more/ and download the command line tools. Note, this requires an Apple ID.
  3. Then, in terminal, run open /Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg

You can read more about these steps in this post.

Documentation

Introduction

Let's start with a simple example:

from datapackage import Package

package = Package('datapackage.json')
package.get_resource('resource').read()

Working with Package

A class for working with data packages. It provides various capabilities like loading a local or remote data package, inferring a data package descriptor, saving a data package descriptor and many more.

Consider we have some local csv files in a data directory. Let's create a data package based on this data using the Package class:

data/cities.csv

city,location
london,"51.50,-0.11"
paris,"48.85,2.30"
rome,"41.89,12.51"

data/population.csv

city,year,population
london,2017,8780000
paris,2017,2240000
rome,2017,2860000

First we create a blank data package:

package = Package()

Now we're ready to infer a data package descriptor based on the data files we have. Because we have two csv files, we use the glob pattern **/*.csv:

package.infer('**/*.csv')
package.descriptor
#{ profile: 'tabular-data-package',
#  resources:
#   [ { path: 'data/cities.csv',
#       profile: 'tabular-data-resource',
#       encoding: 'utf-8',
#       name: 'cities',
#       format: 'csv',
#       mediatype: 'text/csv',
#       schema: [Object] },
#     { path: 'data/population.csv',
#       profile: 'tabular-data-resource',
#       encoding: 'utf-8',
#       name: 'population',
#       format: 'csv',
#       mediatype: 'text/csv',
#       schema: [Object] } ] }

The infer method has found all our files and inspected them to extract useful metadata like profile, encoding, format, Table Schema etc. Let's tweak it a little bit:

package.descriptor['resources'][1]['schema']['fields'][1]['type'] = 'year'
package.commit()
package.valid # True

Because our resources are tabular, we can read them as tabular data:

package.get_resource('population').read(keyed=True)
#[ { city: 'london', year: 2017, population: 8780000 },
#  { city: 'paris', year: 2017, population: 2240000 },
#  { city: 'rome', year: 2017, population: 2860000 } ]

Let's save our descriptor on the disk as a zip-file:

package.save('datapackage.zip')

To continue the work with the data package we just load it again but this time using local datapackage.zip:

package = Package('datapackage.zip')
# Continue the work

This was only a basic introduction to the Package class. To learn more, take a look at the Package class API reference.

Working with Resource

A class for working with data resources. You can read or iterate tabular resources using the iter/read methods and any resource as bytes using the raw_iter/raw_read methods.

Consider we have a local csv file. It could also be inline data or a remote link; all are supported by the Resource class (except local files for in-browser usage, of course). But say it's data.csv for now:

city,location
london,"51.50,-0.11"
paris,"48.85,2.30"
rome,N/A

Let's create and read a resource. Because the resource is tabular, we can use the resource.read method with the keyed option to get an array of keyed rows:

from datapackage import Resource

resource = Resource({'path': 'data.csv'})
resource.tabular # True
resource.read(keyed=True)
# [
#   {city: 'london', location: '51.50,-0.11'},
#   {city: 'paris', location: '48.85,2.30'},
#   {city: 'rome', location: 'N/A'},
# ]
resource.headers
# ['city', 'location']
# (reading has to be started first)

As we can see, our locations are just strings, but they should be geopoints. Also, Rome's location is not available, but it's just an N/A string instead of Python None. First we have to infer resource metadata:

resource.infer()
resource.descriptor
#{ path: 'data.csv',
#  profile: 'tabular-data-resource',
#  encoding: 'utf-8',
#  name: 'data',
#  format: 'csv',
#  mediatype: 'text/csv',
# schema: { fields: [ [Object], [Object] ], missingValues: [ '' ] } }
resource.read(keyed=True)
# Fails with a data validation error

Let's fix the unavailable location. There is a missingValues property in the Table Schema specification. As a first try, we set missingValues to N/A in resource.descriptor['schema']. A resource descriptor can be changed in place, but all changes should be committed with resource.commit():

resource.descriptor['schema']['missingValues'] = 'N/A'
resource.commit()
resource.valid # False
resource.errors
# [<ValidationError: "'N/A' is not of type 'array'">]

As good citizens we've decided to check the resource descriptor's validity. And it's not valid! We should use an array for the missingValues property. Also, don't forget to keep an empty string as a missing value:

resource.descriptor['schema']['missingValues'] = ['', 'N/A']
resource.commit()
resource.valid # True

All good. It looks like we're ready to read our data again:

resource.read(keyed=True)
# [
#   {city: 'london', location: [51.50,-0.11]},
#   {city: 'paris', location: [48.85,2.30]},
#   {city: 'rome', location: None},
# ]

Now we see that:

  • locations are arrays with numeric latitude and longitude
  • Rome's location is a native Python None
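
To make the effect of missingValues concrete, here is a small standalone sketch in plain Python. It does not use the library: cast_location is a hypothetical helper written for this illustration, and the real geopoint cast in tableschema may return a different type. It only mirrors the two observations above.

```python
def cast_location(value, missing_values=("",)):
    """Cast a "number,number" string to a list of two floats.

    Hypothetical helper for illustration only; values listed in
    missing_values become None, mirroring the missingValues property.
    """
    if value in missing_values:
        return None
    first, second = (float(part) for part in value.split(","))
    return [first, second]


rows = [
    {"city": "london", "location": "51.50,-0.11"},
    {"city": "paris", "location": "48.85,2.30"},
    {"city": "rome", "location": "N/A"},
]

# With 'N/A' declared as a missing value, Rome's location becomes None
cast_rows = [
    {"city": row["city"], "location": cast_location(row["location"], ("", "N/A"))}
    for row in rows
]
```

Without 'N/A' in missing_values the helper raises a ValueError for Rome's row, which loosely corresponds to the data validation error seen earlier.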

And because there are no errors on data reading, we can be sure that our data is valid against our schema. Let's save our resource descriptor:

resource.save('dataresource.json')

Let's check the newly created dataresource.json. It contains the path to our data file, the inferred metadata and our missingValues tweak:

{
    "path": "data.csv",
    "profile": "tabular-data-resource",
    "encoding": "utf-8",
    "name": "data",
    "format": "csv",
    "mediatype": "text/csv",
    "schema": {
        "fields": [
            {
                "name": "city",
                "type": "string",
                "format": "default"
            },
            {
                "name": "location",
                "type": "geopoint",
                "format": "default"
            }
        ],
        "missingValues": [
            "",
            "N/A"
        ]
    }
}

If we decide to improve it even more we could update the dataresource.json file and then open it again using local file name:

resource = Resource('dataresource.json')
# Continue the work

This was only a basic introduction to the Resource class. To learn more, take a look at the Resource class API reference.

Working with Group

A class representing a group of tabular resources. Groups can be used to read multiple resources as one or to export them, for example, to a database as one table. To define a group, add the group: <name> field to the corresponding resources. The group's metadata will be created from the "leading" resource's metadata (the first resource with the group name).

Consider we have a data package with two tables partitioned by a year and a shared schema stored separately:

cars-2017.csv

name,value
bmw,2017
tesla,2017
nissan,2017

cars-2018.csv

name,value
bmw,2018
tesla,2018
nissan,2018

cars.schema.json

{
    "fields": [
        {
            "name": "name",
            "type": "string"
        },
        {
            "name": "value",
            "type": "integer"
        }
    ]
}

datapackage.json

{
    "name": "datapackage",
    "resources": [
        {
            "group": "cars",
            "name": "cars-2017",
            "path": "cars-2017.csv",
            "profile": "tabular-data-resource",
            "schema": "cars.schema.json"
        },
        {
            "group": "cars",
            "name": "cars-2018",
            "path": "cars-2018.csv",
            "profile": "tabular-data-resource",
            "schema": "cars.schema.json"
        }
    ]
}

Let's read the resources separately:

package = Package('datapackage.json')
package.get_resource('cars-2017').read(keyed=True) == [
    {'name': 'bmw', 'value': 2017},
    {'name': 'tesla', 'value': 2017},
    {'name': 'nissan', 'value': 2017},
]
package.get_resource('cars-2018').read(keyed=True) == [
    {'name': 'bmw', 'value': 2018},
    {'name': 'tesla', 'value': 2018},
    {'name': 'nissan', 'value': 2018},
]

On the other hand, these resources are defined with a group: cars field. It means we can treat them as a group:

package = Package('datapackage.json')
package.get_group('cars').read(keyed=True) == [
    {'name': 'bmw', 'value': 2017},
    {'name': 'tesla', 'value': 2017},
    {'name': 'nissan', 'value': 2017},
    {'name': 'bmw', 'value': 2018},
    {'name': 'tesla', 'value': 2018},
    {'name': 'nissan', 'value': 2018},
]
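
Conceptually, reading a group is a concatenation of its resources in order. The sketch below reproduces the result above with the standard library only. It is not how the library is implemented, and read_keyed is a hypothetical helper that hard-codes the shared schema (name as string, value as integer).

```python
import csv
import io
from itertools import chain

CARS_2017 = "name,value\nbmw,2017\ntesla,2017\nnissan,2017\n"
CARS_2018 = "name,value\nbmw,2018\ntesla,2018\nnissan,2018\n"


def read_keyed(text):
    """Yield keyed rows, casting 'value' to int as the shared schema says."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"name": row["name"], "value": int(row["value"])}


# Rows of the first ("leading") resource come first, then the next one
group_rows = list(chain(read_keyed(CARS_2017), read_keyed(CARS_2018)))
```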

We can use this approach when we need to save the data package to a storage, for example, to an SQL database. The merge_groups flag enables the grouping behaviour:

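from sqlalchemy import create_engine

# An in-memory SQLite engine; SQL storage requires the tableschema-sql plugin
engine = create_engine('sqlite://')
package = Package('datapackage.json')
package.save(storage='sql', engine=engine)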
# SQL tables:
# - cars-2017
# - cars-2018
package.save(storage='sql', engine=engine, merge_groups=True)
# SQL tables:
# - cars

Working with Profile

A component representing a JSON Schema profile from the Profiles Registry:

from datapackage import Profile, exceptions

profile = Profile('data-package')

profile.name # data-package
profile.jsonschema # JSON Schema contents

# descriptor is a data package descriptor dict defined elsewhere
try:
    valid = profile.validate(descriptor)
except exceptions.ValidationError as exception:
    for error in exception.errors:
        # handle individual error
        pass

Working with Foreign Keys

The library supports foreign keys described in the Table Schema specification. It means that if your data package descriptor uses the resources[].schema.foreignKeys property for some resources, data integrity will be checked on reading operations.

Consider we have a data package:

DESCRIPTOR = {
  'resources': [
    {
      'name': 'teams',
      'data': [
        ['id', 'name', 'city'],
        ['1', 'Arsenal', 'London'],
        ['2', 'Real', 'Madrid'],
        ['3', 'Bayern', 'Munich'],
      ],
      'schema': {
        'fields': [
          {'name': 'id', 'type': 'integer'},
          {'name': 'name', 'type': 'string'},
          {'name': 'city', 'type': 'string'},
        ],
        'foreignKeys': [
          {
            'fields': 'city',
            'reference': {'resource': 'cities', 'fields': 'name'},
          },
        ],
      },
    }, {
      'name': 'cities',
      'data': [
        ['name', 'country'],
        ['London', 'England'],
        ['Madrid', 'Spain'],
      ],
    },
  ],
}

Let's check relations for a teams resource:

from datapackage import Package

package = Package(DESCRIPTOR)
teams = package.get_resource('teams')
teams.check_relations()
# tableschema.exceptions.RelationError: Foreign key "['city']" violation in row "4"

As we can see, there is a foreign key violation. That's because our lookup table cities doesn't have the city of Munich, but we have a team from there. We need to fix it in the cities resource:

package.descriptor['resources'][1]['data'].append(['Munich', 'Germany'])
package.commit()
teams = package.get_resource('teams')
teams.check_relations()
# True

Fixed! But checking is not the only available operation. We can use the relations argument of the resource.iter/read methods to dereference a resource's relations:

teams.read(keyed=True, relations=True)
#[{'id': 1, 'name': 'Arsenal', 'city': {'name': 'London', 'country': 'England'}},
# {'id': 2, 'name': 'Real', 'city': {'name': 'Madrid', 'country': 'Spain'}},
# {'id': 3, 'name': 'Bayern', 'city': {'name': 'Munich', 'country': 'Germany'}}]

Instead of a plain city name we've got a dictionary containing the city data. The resource.iter/read methods will fail with the same error as resource.check_relations if there is an integrity issue, but only if the relations=True flag is passed.
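
The check and the dereferencing described in this section can be pictured with a standalone sketch. It uses plain dictionaries instead of the library. dereference is a hypothetical helper, LookupError stands in for the library's RelationError, and numbering data rows from 2 (header as row 1) is an assumption chosen to match the row "4" reported above.

```python
def dereference(rows, lookup_rows, field, ref_field):
    """Replace rows[field] with the matching lookup row, or raise LookupError."""
    index = {row[ref_field]: row for row in lookup_rows}
    resolved = []
    # Assumption: data rows are numbered from 2 because row 1 is the header
    for number, row in enumerate(rows, start=2):
        key = row[field]
        if key not in index:
            raise LookupError(
                'Foreign key "%s" violation in row "%d"' % (field, number)
            )
        resolved.append({**row, field: index[key]})
    return resolved


teams = [
    {"id": 1, "name": "Arsenal", "city": "London"},
    {"id": 2, "name": "Real", "city": "Madrid"},
    {"id": 3, "name": "Bayern", "city": "Munich"},
]
cities = [
    {"name": "London", "country": "England"},
    {"name": "Madrid", "country": "Spain"},
]

try:
    dereference(teams, cities, "city", "name")
except LookupError as error:
    violation = str(error)  # Munich is missing from the lookup rows

# After adding Munich the same call succeeds and returns nested city rows
cities.append({"name": "Munich", "country": "Germany"})
resolved_teams = dereference(teams, cities, "city", "name")
```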

Working with validate/infer

A standalone function to validate a data package descriptor:

from datapackage import validate, exceptions

# descriptor is a local path, a URL or a dict defined elsewhere
try:
    valid = validate(descriptor)
except exceptions.ValidationError as exception:
    for error in exception.errors:
        # handle individual error
        pass

A standalone function to infer a data package descriptor:

from datapackage import infer

descriptor = infer('**/*.csv')
#{ profile: 'tabular-data-package',
#  resources:
#   [ { path: 'data/cities.csv',
#       profile: 'tabular-data-resource',
#       encoding: 'utf-8',
#       name: 'cities',
#       format: 'csv',
#       mediatype: 'text/csv',
#       schema: [Object] },
#     { path: 'data/population.csv',
#       profile: 'tabular-data-resource',
#       encoding: 'utf-8',
#       name: 'population',
#       format: 'csv',
#       mediatype: 'text/csv',
#       schema: [Object] } ] }

Frequently Asked Questions

Accessing data behind a proxy server?

Before the package = Package("https://xxx.json") call set these environment variables:

import os

os.environ["HTTP_PROXY"] = 'xxx'
os.environ["HTTPS_PROXY"] = 'xxx'
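
A quick way to confirm that the variables are visible inside the Python process is to ask the standard library which proxies it detects. This sketch does not touch datapackage. The proxy address is a placeholder, and whether a given HTTP client honors these variables is not verified here.

```python
import os
import urllib.request

# Placeholder proxy address (an assumption for illustration only)
os.environ["HTTP_PROXY"] = "http://proxy.example.com:8080"
os.environ["HTTPS_PROXY"] = "http://proxy.example.com:8080"

# Show which proxy settings Python's standard library now detects
proxies = urllib.request.getproxies()
print(proxies.get("http"), proxies.get("https"))
```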

API Reference

cli

cli()

Command-line interface

Usage: datapackage [OPTIONS] COMMAND [ARGS]...

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  infer
  validate

Package

Package(self,
        descriptor=None,
        base_path=None,
        strict=False,
        unsafe=False,
        storage=None,
        schema=None,
        default_base_path=None,
        **options)

Package representation

Arguments

  • descriptor (str/dict): data package descriptor as local path, url or object
  • base_path (str): base path for all relative paths
  • strict (bool): strict flag to alter validation behavior. Setting it to True leads to throwing errors on any operation with invalid descriptor
  • unsafe (bool): if True unsafe paths will be allowed. For more information see https://specs.frictionlessdata.io/data-resource/#data-location. Defaults to False
  • storage (str/tableschema.Storage): storage name like sql or storage instance
  • options (dict): storage options to use for storage creation

Raises

  • DataPackageException: raises error if something goes wrong

package.base_path

Package's base path

Returns

str/None: returns the data package base path

package.descriptor

Package's descriptor

Returns

dict: descriptor

package.errors

Validation errors

Always empty in strict mode.

Returns

Exception[]: validation errors

package.profile

Package's profile

Returns

Profile: an instance of Profile class

package.resource_names

Package's resource names

Returns

str[]: returns an array of resource names

package.resources

Package's resources

Returns

Resource[]: returns an array of Resource instances

package.valid

Validation status

Always true in strict mode.

Returns

bool: validation status

package.get_resource

package.get_resource(name)

Get data package resource by name.

Arguments

  • name (str): data resource name

Returns

Resource/None: returns a Resource instance or None if not found

package.add_resource

package.add_resource(descriptor)

Add new resource to data package.

The data package descriptor will be validated with newly added resource descriptor.

Arguments

  • descriptor (dict): data resource descriptor

Raises

  • DataPackageException: raises error if something goes wrong

Returns

Resource/None: returns the added Resource instance or None if not added

package.remove_resource

package.remove_resource(name)

Remove data package resource by name.

The data package descriptor will be validated after resource descriptor removal.

Arguments

  • name (str): data resource name

Raises

  • DataPackageException: raises error if something goes wrong

Returns

Resource/None: returns the removed Resource instance or None if not found

package.get_group

package.get_group(name)

Returns a group of tabular resources by name.

For more information about groups see Group.

Arguments

  • name (str): name of a group of resources

Raises

  • DataPackageException: raises error if something goes wrong

Returns

Group/None: returns a Group instance or None if not found

package.infer

package.infer(pattern=False)

Infer a data package metadata.

The pattern argument works only for local files.

If pattern is not provided, only existing resources will be inferred (metadata like encoding, profile etc. is added). If pattern is provided, new resources with file names matching the pattern will be added and inferred. It commits changes to the data package instance.

Arguments

  • pattern (str): glob pattern for new resources

Returns

dict: returns data package descriptor
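
Before calling infer with a pattern, it can help to preview which local files a glob pattern would match. The following sketch uses only the standard library. preview_matches is a hypothetical helper, and the library's own matching may differ in detail (for example in how it treats hidden files or base paths).

```python
import glob
import os


def preview_matches(pattern, base_path="."):
    """Return sorted paths, relative to base_path, matching a glob pattern."""
    # recursive=True lets "**" match nested directories
    matches = glob.glob(os.path.join(base_path, pattern), recursive=True)
    return sorted(os.path.relpath(path, base_path) for path in matches)
```

For the directory layout used in the introduction, calling preview_matches('**/*.csv') from the project root would be expected to list the two files under data.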

package.commit

package.commit(strict=None)

Update data package instance if there are in-place changes in the descriptor.

Example

package = Package({
    'name': 'package',
    'resources': [{'name': 'resource', 'data': ['data']}]
})

package.name # package
package.descriptor['name'] = 'renamed-package'
package.name # package
package.commit()
package.name # renamed-package

Arguments

  • strict (bool): alter strict mode for further work

Raises

  • DataPackageException: raises error if something goes wrong

Returns

bool: returns true on success and false if not modified

package.save

package.save(target=None,
             storage=None,
             merge_groups=False,
             to_base_path=False,
             **options)

Saves this data package

It saves the package to storage if the storage argument is passed, saves the data package's descriptor to a json file if the target argument ends with .json, or saves the data package to a zip file otherwise.

Example

It creates a zip file at target with the contents of this Data Package and its resources. Every resource whose content lives in the local filesystem will be copied to the zip file. Consider the following Data Package descriptor:

{
    "name": "gdp",
    "resources": [
        {"name": "local", "format": "CSV", "path": "data.csv"},
        {"name": "inline", "data": [4, 8, 15, 16, 23, 42]},
        {"name": "remote", "url": "http://someplace.com/data.csv"}
    ]
}

The final structure of the zip file will be:

./datapackage.json
./data/local.csv

The contents of datapackage.json are the same as the returned datapackage.descriptor. The resources' file names are generated based on their name and format fields if they exist. If a resource has no name, resource-X is used, where X is the index of the resource in the resources list (starting at zero). If the resource has a format, it is lowercased and appended to the name, becoming "name.format".
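
The naming rule just described can be written down as a short function. This is a sketch of the rule as stated in this section, not the library's code. zip_member_name is a hypothetical helper, and the data/ prefix is taken from the example structure above.

```python
def zip_member_name(resource, index):
    """Build the in-archive path for a local resource (illustrative only)."""
    # Fall back to resource-X when the resource has no name
    name = resource.get("name") or "resource-%d" % index
    file_format = resource.get("format")
    if file_format:
        # The format is lowercased and appended, giving "name.format"
        name = "%s.%s" % (name, file_format.lower())
    # The example above places data files under a data/ directory
    return "data/" + name


local = {"name": "local", "format": "CSV", "path": "data.csv"}
member = zip_member_name(local, 0)
```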

Arguments

  • target (string/filelike): the file path or a file-like object where the contents of this Data Package will be saved into.
  • storage (str/tableschema.Storage): storage name like sql or storage instance
  • merge_groups (bool): save all the group's tabular resources into one bucket if a storage is provided (for example into one SQL table). Read more about Group.
  • to_base_path (bool): save the package to the package's base path using the "<base_path>/<target>" route
  • options (dict): storage options to use for storage creation

Raises

  • DataPackageException: raises if there was some error writing the package

Returns

bool/Storage: on success return true or a Storage instance

Resource

Resource(self,
         descriptor={},
         base_path=None,
         strict=False,
         unsafe=False,
         storage=None,
         package=None,
         **options)

Resource representation

Arguments

  • descriptor (str/dict): data resource descriptor as local path, url or object
  • base_path (str): base path for all relative paths
  • strict (bool): strict flag to alter validation behavior. Setting it to true leads to throwing errors on any operation with invalid descriptor
  • unsafe (bool): if True unsafe paths will be allowed. For more information see https://specs.frictionlessdata.io/data-resource/#data-location. Defaults to False
  • storage (str/tableschema.Storage): storage name like sql or storage instance
  • options (dict): storage options to use for storage creation

Raises

  • DataPackageException: raises error if something goes wrong

resource.data

Return resource data

resource.descriptor

Resource's descriptor

Returns

dict: descriptor

resource.errors

Validation errors

Always empty in strict mode.

Returns

Exception[]: validation errors

resource.group

Group name

Returns

str: group name

resource.headers

Resource's headers

Only for tabular resources (reading has to be started first or it's None)

Returns

str[]/None: returns data source headers

resource.inline

Whether resource inline

Returns

bool: returns true if resource is inline

resource.local

Whether resource local

Returns

bool: returns true if resource is local

resource.multipart

Whether resource multipart

Returns

bool: returns true if resource is multipart

resource.name

Resource name

Returns

str: name

resource.package

Package instance if the resource belongs to some package

Returns

Package/None: a package instance if available

resource.profile

Resource's profile

Returns

Profile: an instance of Profile class

resource.remote

Whether resource remote

Returns

bool: returns true if resource is remote

resource.schema

Resource's schema

Only for tabular resources

For tabular resources it returns Schema instance to interact with data schema. Read API documentation - tableschema.Schema.

Returns

tableschema.Schema: schema

resource.source

Resource's source

Combination of resource.source and resource.inline/local/remote/multipart provides predictable interface to work with resource data.

Returns

list/str: returns data or path property

resource.table

Return resource table

resource.tabular

Whether resource tabular

Returns

bool: returns true if resource is tabular

resource.valid

Validation status

Always true in strict mode.

Returns

bool: validation status

resource.iter

resource.iter(integrity=False, relations=False, **options)

Iterates through the resource data and emits rows cast based on table schema.

Only for tabular resources

Arguments

keyed (bool):
    yield keyed rows in a form of `{header1: value1, header2: value2}`
    (default is false; the form of rows is `[value1, value2]`)

extended (bool):
    yield extended rows in a form of `[rowNumber, [header1, header2], [value1, value2]]`
    (default is false; the form of rows is `[value1, value2]`)

cast (bool):
    disable data casting if false
    (default is true)

integrity (bool):
    if true actual size in BYTES and SHA256 hash of the file
    will be checked against `descriptor.bytes` and `descriptor.hash`
    (other hashing algorithms are not supported and will be skipped silently)

relations (bool):
    if true foreign key fields will be checked and resolved to its references

foreign_keys_values (dict):
    three-level dictionary of foreign key references optimized
    to speed up validation process in a form of
    `{resource1: {(fk_field1, fk_field2): {(value1, value2): {one_keyedrow}, ... }}}`.
    If not provided but relations is true, it will be created
    before the validation process by *index_foreign_keys_values* method

exc_handler (func):
    optional custom exception handler callable.
    Can be used to defer raising errors (i.e. "fail late"), e.g.
    for data validation purposes. Must support the signature below

Custom exception handler

def exc_handler(exc, row_number=None, row_data=None, error_data=None):
    '''Custom exception handler (example)

    # Arguments:
        exc(Exception):
            Deferred exception instance
        row_number(int):
            Data row number that triggers exception exc
        row_data(OrderedDict):
            Invalid data row source data
        error_data(OrderedDict):
            Data row source data field subset responsible for the error, if
            applicable (e.g. invalid primary or foreign key fields). May be
            identical to row_data.
    '''
    # ...
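
As one possible shape for such a handler, the sketch below collects errors in a list instead of raising them, following the "fail late" idea. make_error_collector is a hypothetical helper. Only the handler signature is taken from the documentation above, and the keys of the recorded dictionaries are an arbitrary choice.

```python
def make_error_collector():
    """Return (handler, errors); the handler records errors instead of raising."""
    errors = []

    def handler(exc, row_number=None, row_data=None, error_data=None):
        errors.append({
            "row_number": row_number,
            "message": str(exc),
            "row_data": row_data,
            "error_data": error_data,
        })

    return handler, errors


handler, errors = make_error_collector()
# Hypothetical usage, assuming `resource` is a tabular Resource:
# rows = resource.read(keyed=True, exc_handler=handler)
# After reading, `errors` would hold one entry per deferred exception.
```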

Raises

  • DataPackageException: base class of any error
  • CastError: data cast error
  • IntegrityError: integrity checking error
  • UniqueKeyError: unique key constraint violation
  • UnresolvedFKError: unresolved foreign key reference error

Returns

Iterator[list]: yields rows

resource.read

resource.read(integrity=False,
              relations=False,
              foreign_keys_values=False,
              **options)

Read the whole resource and return as array of rows

Only for tabular resources

It has the same API as resource.iter except for

Arguments

  • limit (int): limit count of rows to read and return

Returns

list[]: returns rows

resource.check_integrity

resource.check_integrity()

Checks resource integrity

Only for tabular resources

It checks size in BYTES and SHA256 hash of the file against descriptor.bytes and descriptor.hash (other hashing algorithms are not supported and will be skipped silently).

Raises

  • exceptions.IntegrityError: raises if there are integrity issues

Returns

bool: returns True if no issues
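
The kind of comparison described here can be sketched with hashlib. This is an illustration rather than the library's implementation. check_integrity below is a standalone hypothetical function, ValueError stands in for IntegrityError, and accepting an optional "sha256:" prefix on the expected hash is an assumption.

```python
import hashlib


def check_integrity(data, expected_bytes, expected_hash):
    """Compare size and SHA256 digest of data with the expected values."""
    if len(data) != expected_bytes:
        raise ValueError("size mismatch: %d != %d" % (len(data), expected_bytes))
    # Assumption: the expected hash may carry a "sha256:" prefix
    expected_hex = expected_hash.split(":", 1)[-1]
    actual_hex = hashlib.sha256(data).hexdigest()
    if actual_hex != expected_hex:
        raise ValueError("SHA256 mismatch")
    return True


payload = b"city,location\nrome,N/A\n"
digest = hashlib.sha256(payload).hexdigest()
ok = check_integrity(payload, len(payload), digest)
```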

resource.check_relations

resource.check_relations(foreign_keys_values=False)

Check relations

Only for tabular resources

It checks foreign keys and raises an exception if there are integrity issues.

Raises

  • exceptions.RelationError: raises if there are relation issues

Returns

bool: returns True if no issues
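
The three-level foreign_keys_values structure described under resource.iter can be illustrated with plain dictionaries. build_foreign_keys_values is a hypothetical helper that only reproduces the documented shape `{resource: {(fields): {(values): row}}}`. The structure the library builds internally may differ.

```python
def build_foreign_keys_values(resources, foreign_keys):
    """Index lookup rows as {resource: {fields: {values: row}}}.

    resources: {resource_name: list of keyed rows}
    foreign_keys: list of (resource_name, tuple_of_field_names)
    """
    index = {}
    for resource_name, fields in foreign_keys:
        by_values = {}
        for row in resources[resource_name]:
            values = tuple(row[field] for field in fields)
            by_values[values] = row
        index.setdefault(resource_name, {})[tuple(fields)] = by_values
    return index


cities = [
    {"name": "London", "country": "England"},
    {"name": "Madrid", "country": "Spain"},
]
fk_index = build_foreign_keys_values({"cities": cities}, [("cities", ("name",))])
```

A lookup then becomes two dictionary accesses, for example fk_index["cities"][("name",)].get(("London",)), which is why precomputing the index once can speed up repeated checks.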

resource.drop_relations

resource.drop_relations()

Drop relations

Only for tabular resources

Remove relations data from memory

Returns

bool: returns True

resource.raw_iter

resource.raw_iter(stream=False)

Iterate over data chunks as bytes.

If stream is true, a file-like object will be returned.

Arguments

  • stream (bool): File-like object will be returned

Returns

bytes[]/filelike: returns bytes[]/filelike

resource.raw_read

resource.raw_read()

Returns resource data as bytes.

Returns

bytes: returns resource data in bytes

resource.infer

resource.infer(**options)

Infer resource metadata

Like name, format, mediatype, encoding, schema and profile. It commits these changes into the resource instance.

Arguments

  • options: options will be passed to tableschema.infer call, for more control on results (e.g. for setting limit, confidence etc.).

Returns

dict: returns resource descriptor

resource.commit

resource.commit(strict=None)

Update resource instance if there are in-place changes in the descriptor.

Arguments

  • strict (bool): alter strict mode for further work

Raises

  • DataPackageException: raises error if something goes wrong

Returns

bool: returns true on success and false if not modified

resource.save

resource.save(target, storage=None, to_base_path=False, **options)

Saves this resource

Into storage if storage argument is passed or saves this resource's descriptor to json file otherwise.

Arguments

  • target (str): path where to save a resource
  • storage (str/tableschema.Storage): storage name like sql or storage instance
  • to_base_path (bool): save the resource to the resource's base path using the "<base_path>/<target>" route
  • options (dict): storage options to use for storage creation

Raises

  • DataPackageException: raises error if something goes wrong

Returns

bool: returns true on success

Group

Group(self, resources)

Group representation

Arguments

  • resources (Resource[]): list of TABULAR resources

group.headers

Group's headers

Returns

str[]/None: returns headers

group.name

Group name

Returns

str: name

group.schema

Group's schema

Returns

tableschema.Schema: schema

group.iter

group.iter(**options)

Iterates through the group data and emits rows cast based on table schema.

It concatenates all the resources and has the same API as resource.iter

group.read

group.read(limit=None, **options)

Read the whole group and return as array of rows

It concatenates all the resources and has the same API as resource.read

group.check_relations

group.check_relations()

Check group's relations

The same as resource.check_relations but without the optional argument foreign_keys_values. This method tests the foreignKeys of the whole group at once, optimizing the process by creating the foreign_keys_values hashmap only once before testing the set of resources.

Profile

Profile(self, profile)

Profile representation

Arguments

  • profile (str): profile name in registry or URL to JSON Schema

Raises

  • DataPackageException: raises error if something goes wrong

profile.jsonschema

JSONSchema content

Returns

dict: returns profile's JSON Schema contents

profile.name

Profile name

Returns

str/None: name if available

profile.validate

profile.validate(descriptor)

Validate a data package descriptor against the profile.

Arguments

  • descriptor (dict): retrieved and dereferenced data package descriptor

Raises

  • ValidationError: raises if not valid

Returns

bool: returns True if valid

validate

validate(descriptor)

Validate a data package descriptor.

Arguments

  • descriptor (str/dict): package descriptor, one of: local path, remote url, or object

Raises

  • ValidationError: raises on invalid

Returns

bool: returns true on valid

infer

infer(pattern, base_path=None)

Infer a data package descriptor.

Argument pattern works only for local files

Arguments

  • pattern (str): glob file pattern

Returns

dict: returns data package descriptor

DataPackageException

DataPackageException(self, message, errors=[])

Base class for all DataPackage/TableSchema exceptions.

If there are multiple errors, they can be read from the exception object:

from datapackage.exceptions import DataPackageException

try:
    # lib action
    pass
except DataPackageException as exception:
    if exception.multiple:
        for error in exception.errors:
            # handle error
            pass

datapackageexception.errors

List of nested errors

Returns

DataPackageException[]: list of nested errors

datapackageexception.multiple

Whether it's a nested exception

Returns

bool: whether it's a nested exception
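
Based on the two properties above, a small helper can flatten an exception into a list of messages for logging. This is a sketch. error_messages is a hypothetical helper, and StandInError only imitates the errors and multiple attributes so that the example runs without the library.

```python
def error_messages(exception):
    """Return a flat list of messages from a possibly nested exception."""
    if getattr(exception, "multiple", False):
        return [str(error) for error in exception.errors]
    return [str(exception)]


class StandInError(Exception):
    """Stand-in with the same attributes as described above (illustration)."""

    def __init__(self, message, errors=()):
        super().__init__(message)
        self.errors = list(errors)

    @property
    def multiple(self):
        return bool(self.errors)


nested = StandInError("2 errors", [StandInError("first"), StandInError("second")])
messages = error_messages(nested)
```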

TableSchemaException

TableSchemaException(self, message, errors=[])

Base class for all TableSchema exceptions.

LoadError

LoadError(self, message, errors=[])

All loading errors.

CastError

CastError(self, message, errors=[])

All value cast errors.

IntegrityError

IntegrityError(self, message, errors=[])

All integrity errors.

RelationError

RelationError(self, message, errors=[])

All relations errors.

StorageError

StorageError(self, message, errors=[])

All storage errors.

Contributing

The project follows the Open Knowledge International coding standards.

The recommended way to get started is to create and activate a project virtual environment. To install the package and development dependencies into the active environment:

$ make install

To run tests with linting and coverage:

$ make test

Changelog

Only breaking and the most important changes are described here. The full changelog and documentation for all released versions can be found in the nicely formatted commit history.

v1.15

WARNING: this release can be breaking for some setups; please read the discussions below

v1.14

  • Add experimental options for picking/skipping fields/rows

v1.13

  • Add unsafe option to Package and Resource (#262)

v1.12

  • Use chardet for encoding detection by default. For cchardet: pip install datapackage[cchardet]

v1.11

  • resource/package.save now accepts a to_base_path argument (#254)
  • package.save now returns a Storage instance if available

v1.10

  • Added the ability to check a tabular resource's integrity

v1.9

  • Added resource.package property

v1.8

v1.7

v1.6

  • Added support for custom request session

v1.5

Updated behaviour:

  • Added support for Python 3.7

v1.4

New API added:

  • added skip_rows support to the resource descriptor

v1.3

New API added:

  • property package.base_path is now publicly available

v1.2

Updated behaviour:

  • CLI command $ datapackage infer now outputs only a JSON-formatted data package descriptor.

v1.1

New API added:
