datapackage-py
A library for working with Data Packages.
[Important Notice] We have released Frictionless Framework. This framework provides improved datapackage functionality extended to be a complete data solution. The change is not breaking for existing software, so no action is required. Please read the Migration Guide from datapackage to Frictionless Framework.
Features
- Package class for working with data packages
- Resource class for working with data resources
- Profile class for working with profiles
- validate function for validating data package descriptors
- infer function for inferring data package descriptors
Getting Started
Installation
The package uses semantic versioning, which means that major versions could include breaking changes. It's highly recommended to specify a datapackage version range in your setup/requirements file, e.g. datapackage>=1.0,<2.0.
$ pip install datapackage
OSX 10.14+
If you receive an error about the cchardet package when installing datapackage on Mac OSX 10.14 (Mojave) or higher, follow these steps:
- Make sure you have the latest Xcode by running the following in terminal:
xcode-select --install
- Then go to https://developer.apple.com/download/more/ and download the command line tools. Note, this requires an Apple ID.
- Then, in terminal, run
open /Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg
You can read more about these steps in this post.
Documentation
Introduction
Let's start with a simple example:
from datapackage import Package
package = Package('datapackage.json')
package.get_resource('resource').read()
Working with Package
A class for working with data packages. It provides capabilities such as loading a local or remote data package, inferring a data package descriptor, saving a data package descriptor, and more.
Suppose we have some local CSV files in a data directory. Let's create a data package based on this data using the Package class:
data/cities.csv
city,location
london,"51.50,-0.11"
paris,"48.85,2.30"
rome,"41.89,12.51"
data/population.csv
city,year,population
london,2017,8780000
paris,2017,2240000
rome,2017,2860000
First we create a blank data package:
package = Package()
Now we're ready to infer a data package descriptor based on the data files we have. Because we have two CSV files, we use the glob pattern **/*.csv:
package.infer('**/*.csv')
package.descriptor
#{ profile: 'tabular-data-package',
# resources:
# [ { path: 'data/cities.csv',
# profile: 'tabular-data-resource',
# encoding: 'utf-8',
# name: 'cities',
# format: 'csv',
# mediatype: 'text/csv',
# schema: [Object] },
# { path: 'data/population.csv',
# profile: 'tabular-data-resource',
# encoding: 'utf-8',
# name: 'population',
# format: 'csv',
# mediatype: 'text/csv',
# schema: [Object] } ] }
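To see which files a recursive glob pattern such as **/*.csv picks up, here is a small standard-library sketch. It does not call datapackage at all; the temporary directory layout and the readme.txt file are assumptions made up for the illustration, and the library's own matching may differ in details.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as directory:
    root = Path(directory)
    # Recreate the example layout: two CSV files inside a data/ directory.
    (root / 'data').mkdir()
    (root / 'data' / 'cities.csv').write_text('city,location\n')
    (root / 'data' / 'population.csv').write_text('city,year,population\n')
    # A non-CSV file that the pattern should ignore (illustrative only).
    (root / 'readme.txt').write_text('not a csv file\n')
    # '**' descends into subdirectories; '*.csv' filters by extension.
    matched = sorted(
        path.relative_to(root).as_posix() for path in root.glob('**/*.csv'))

print(matched)
```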
The infer method has found all our files and inspected them to extract useful metadata such as profile, encoding, format, and Table Schema. Let's tweak it a little bit:
package.descriptor['resources'][1]['schema']['fields'][1]['type'] = 'year'
package.commit()
package.valid # True
Because our resources are tabular, we can read them as tabular data:
package.get_resource('population').read(keyed=True)
#[ { city: 'london', year: 2017, population: 8780000 },
# { city: 'paris', year: 2017, population: 2240000 },
# { city: 'rome', year: 2017, population: 2860000 } ]
Let's save our descriptor on the disk as a zip-file:
package.save('datapackage.zip')
To continue working with the data package, we just load it again, this time from the local datapackage.zip:
package = Package('datapackage.zip')
# Continue the work
This was only a basic introduction to the Package class. To learn more, take a look at the Package class API reference.
Working with Resource
A class for working with data resources. You can read or iterate tabular resources using the iter/read methods, and any resource as bytes using the raw_iter/raw_read methods.
Suppose we have a local CSV file. It could also be inline data or a remote link; all are supported by the Resource class (except local files for in-browser usage, of course). But say it's data.csv for now:
city,location
london,"51.50,-0.11"
paris,"48.85,2.30"
rome,N/A
Let's create and read a resource. Because the resource is tabular, we can use the resource.read method with the keyed option to get an array of keyed rows:
from datapackage import Resource
resource = Resource({'path': 'data.csv'})
resource.tabular # True
resource.read(keyed=True)
# [
# {city: 'london', location: '51.50,-0.11'},
# {city: 'paris', location: '48.85,2.30'},
# {city: 'rome', location: 'N/A'},
# ]
resource.headers
# ['city', 'location']
# (reading has to be started first)
As we can see, our locations are just strings, but they should be geopoints. Also, Rome's location is not available, yet it's just an N/A string instead of Python None. First we have to infer resource metadata:
resource.infer()
resource.descriptor
#{ path: 'data.csv',
# profile: 'tabular-data-resource',
# encoding: 'utf-8',
# name: 'data',
# format: 'csv',
# mediatype: 'text/csv',
# schema: { fields: [ [Object], [Object] ], missingValues: [ '' ] } }
resource.read(keyed=True)
# Fails with a data validation error
Let's fix the unavailable location. There is a missingValues property in the Table Schema specification. As a first try, we set missingValues to N/A in resource.descriptor['schema']. The resource descriptor can be changed in place, but all changes must be committed with resource.commit():
resource.descriptor['schema']['missingValues'] = 'N/A'
resource.commit()
resource.valid # False
resource.errors
# [<ValidationError: "'N/A' is not of type 'array'">]
As good citizens, we decided to check the resource descriptor's validity. And it's not valid! We should use an array for the missingValues property. Also, don't forget to include an empty string as a missing value:
resource.descriptor['schema']['missingValues'] = ['', 'N/A']
resource.commit()
resource.valid # True
All good. It looks like we're ready to read our data again:
resource.read(keyed=True)
# [
# {city: 'london', location: [51.50,-0.11]},
# {city: 'paris', location: [48.85,2.30]},
# {city: 'rome', location: None},
# ]
Now we see that:
- locations are arrays with numeric latitude and longitude
- Rome's location is a native Python None
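The effect of missingValues on casting can be illustrated with a few lines of plain Python. This is only a conceptual sketch, not how tableschema implements it: the helper name cast_geopoint is hypothetical, and floats are used for the coordinates for simplicity.

```python
def cast_geopoint(value, missing_values=('',)):
    """Hypothetical helper: cast a "lat,lon" string, honouring missing values."""
    if value in missing_values:
        # A missing value becomes None instead of being cast.
        return None
    latitude, longitude = (float(part) for part in value.split(','))
    return [latitude, longitude]


rows = [
    {'city': 'london', 'location': '51.50,-0.11'},
    {'city': 'rome', 'location': 'N/A'},
]

# With 'N/A' declared as a missing value, Rome's location is cast to None.
cast_rows = [
    dict(row, location=cast_geopoint(row['location'], ('', 'N/A')))
    for row in rows
]

# Without it, the same value cannot be cast and an error is raised.
try:
    cast_geopoint('N/A')
    failed = False
except ValueError:
    failed = True
```

This mirrors the two reads above: the first attempt failed because N/A was treated as data, and the second succeeded once N/A was listed in missingValues.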
And because there are no errors on data reading, we can be sure that our data is valid against our schema. Let's save our resource descriptor:
resource.save('dataresource.json')
Let's check the newly created dataresource.json. It contains the path to our data file, the inferred metadata, and our missingValues tweak:
{
"path": "data.csv",
"profile": "tabular-data-resource",
"encoding": "utf-8",
"name": "data",
"format": "csv",
"mediatype": "text/csv",
"schema": {
"fields": [
{
"name": "city",
"type": "string",
"format": "default"
},
{
"name": "location",
"type": "geopoint",
"format": "default"
}
],
"missingValues": [
"",
"N/A"
]
}
}
If we decide to improve it even more we could update the dataresource.json
file and then open it again using local file name:
resource = Resource('dataresource.json')
# Continue the work
This was only a basic introduction to the Resource class. To learn more, take a look at the Resource class API reference.
Working with Group
A class representing a group of tabular resources. Groups can be used to read multiple resources as one or to export them, for example, to a database as one table. To define a group, add the group: <name> field to the corresponding resources. The group's metadata will be created from the "leading" resource's metadata (the first resource with the group name).
Suppose we have a data package with two tables partitioned by year and a shared schema stored separately:
cars-2017.csv
name,value
bmw,2017
tesla,2017
nissan,2017
cars-2018.csv
name,value
bmw,2018
tesla,2018
nissan,2018
cars.schema.json
{
"fields": [
{
"name": "name",
"type": "string"
},
{
"name": "value",
"type": "integer"
}
]
}
datapackage.json
{
"name": "datapackage",
"resources": [
{
"group": "cars",
"name": "cars-2017",
"path": "cars-2017.csv",
"profile": "tabular-data-resource",
"schema": "cars.schema.json"
},
{
"group": "cars",
"name": "cars-2018",
"path": "cars-2018.csv",
"profile": "tabular-data-resource",
"schema": "cars.schema.json"
}
]
}
Let's read the resources separately:
package = Package('datapackage.json')
package.get_resource('cars-2017').read(keyed=True) == [
{'name': 'bmw', 'value': 2017},
{'name': 'tesla', 'value': 2017},
{'name': 'nissan', 'value': 2017},
]
package.get_resource('cars-2018').read(keyed=True) == [
{'name': 'bmw', 'value': 2018},
{'name': 'tesla', 'value': 2018},
{'name': 'nissan', 'value': 2018},
]
On the other hand, these resources are defined with a group: cars field. It means we can treat them as a group:
package = Package('datapackage.json')
package.get_group('cars').read(keyed=True) == [
{'name': 'bmw', 'value': 2017},
{'name': 'tesla', 'value': 2017},
{'name': 'nissan', 'value': 2017},
{'name': 'bmw', 'value': 2018},
{'name': 'tesla', 'value': 2018},
{'name': 'nissan', 'value': 2018},
]
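Conceptually, reading a group concatenates the rows of every resource that shares the same group name. The sketch below illustrates that idea with the standard library only; it is not the library's implementation, the helper name read_group is hypothetical, and the CSV contents are inlined as strings so the example runs without any files.

```python
import csv
import io

# Inline stand-ins for cars-2017.csv and cars-2018.csv from the example above.
FILES = {
    'cars-2017.csv': 'name,value\nbmw,2017\ntesla,2017\nnissan,2017\n',
    'cars-2018.csv': 'name,value\nbmw,2018\ntesla,2018\nnissan,2018\n',
}

RESOURCES = [
    {'group': 'cars', 'name': 'cars-2017', 'path': 'cars-2017.csv'},
    {'group': 'cars', 'name': 'cars-2018', 'path': 'cars-2018.csv'},
]


def read_group(resources, group_name):
    """Hypothetical helper: concatenate keyed rows of all resources in a group."""
    rows = []
    for resource in resources:
        if resource.get('group') != group_name:
            continue
        reader = csv.DictReader(io.StringIO(FILES[resource['path']]))
        for row in reader:
            # Cast `value` to int, as the shared schema declares it an integer.
            rows.append({'name': row['name'], 'value': int(row['value'])})
    return rows


cars = read_group(RESOURCES, 'cars')
```

Rows keep the order of the resources in the descriptor, so the 2017 rows come before the 2018 rows.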
We can use this approach when we need to save the data package to a storage, for example, to a SQL database. The merge_groups flag enables the grouping behaviour:
package = Package('datapackage.json')
package.save(storage='sql', engine=engine)
# SQL tables:
# - cars-2017
# - cars-2018
package.save(storage='sql', engine=engine, merge_groups=True)
# SQL tables:
# - cars
Working with Profile
A component representing a JSON Schema profile from the Profiles Registry:
from datapackage import Profile, exceptions
profile = Profile('data-package')
profile.name # data-package
profile.jsonschema # JSON Schema contents
try:
    valid = profile.validate(descriptor)
except exceptions.ValidationError as exception:
    for error in exception.errors:
        pass  # handle individual error
Working with Foreign Keys
The library supports foreign keys described in the Table Schema specification. It means that if your data package descriptor uses the resources[].schema.foreignKeys property for some resources, data integrity will be checked on reading operations.
Consider the following data package:
DESCRIPTOR = {
'resources': [
{
'name': 'teams',
'data': [
['id', 'name', 'city'],
['1', 'Arsenal', 'London'],
['2', 'Real', 'Madrid'],
['3', 'Bayern', 'Munich'],
],
'schema': {
'fields': [
{'name': 'id', 'type': 'integer'},
{'name': 'name', 'type': 'string'},
{'name': 'city', 'type': 'string'},
],
'foreignKeys': [
{
'fields': 'city',
'reference': {'resource': 'cities', 'fields': 'name'},
},
],
},
}, {
'name': 'cities',
'data': [
['name', 'country'],
['London', 'England'],
['Madrid', 'Spain'],
],
},
],
}
Let's check relations for the teams resource:
from datapackage import Package
package = Package(DESCRIPTOR)
teams = package.get_resource('teams')
teams.check_relations()
# tableschema.exceptions.RelationError: Foreign key "['city']" violation in row "4"
As we can see, there is a foreign key violation. That's because our lookup table cities doesn't have the city of Munich, but we have a team from there. We need to fix it in the cities resource:
package.descriptor['resources'][1]['data'].append(['Munich', 'Germany'])
package.commit()
teams = package.get_resource('teams')
teams.check_relations()
# True
Fixed! But checking is not the only available operation. We can use the relations argument of the resource.iter/read methods to dereference a resource's relations:
teams.read(keyed=True, relations=True)
#[{'id': 1, 'name': 'Arsenal', 'city': {'name': 'London', 'country': 'England'}},
# {'id': 2, 'name': 'Real', 'city': {'name': 'Madrid', 'country': 'Spain'}},
# {'id': 3, 'name': 'Bayern', 'city': {'name': 'Munich', 'country': 'Germany'}}]
Instead of a plain city name we've got a dictionary containing the city data. The resource.iter/read methods will fail with the same error as resource.check_relations if there is an integrity issue, but only if the relations=True flag is passed.
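The idea behind checking and dereferencing a foreign key can be sketched in plain Python. This is an illustration of the concept only, not the library's code: the helper resolve_relations is hypothetical, it raises a built-in LookupError rather than RelationError, and it counts data rows from 1 (the library's message above counts the header row as well).

```python
TEAMS = [
    {'id': 1, 'name': 'Arsenal', 'city': 'London'},
    {'id': 2, 'name': 'Real', 'city': 'Madrid'},
    {'id': 3, 'name': 'Bayern', 'city': 'Munich'},
]
CITIES = [
    {'name': 'London', 'country': 'England'},
    {'name': 'Madrid', 'country': 'Spain'},
]


def resolve_relations(rows, field, lookup_rows, lookup_field):
    """Hypothetical helper: replace `field` values with the referenced rows."""
    # Index the lookup table once so each reference is a dictionary lookup.
    index = {row[lookup_field]: row for row in lookup_rows}
    resolved = []
    for number, row in enumerate(rows, start=1):
        key = row[field]
        if key not in index:
            raise LookupError(
                'Foreign key "%s" violation in data row %d' % (field, number))
        # Copy the row so the original data is left untouched.
        resolved.append(dict(row, **{field: index[key]}))
    return resolved


try:
    resolve_relations(TEAMS, 'city', CITIES, 'name')
    violation = None
except LookupError as error:
    violation = str(error)

# After adding the missing city, every reference resolves.
CITIES.append({'name': 'Munich', 'country': 'Germany'})
resolved = resolve_relations(TEAMS, 'city', CITIES, 'name')
```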
Working with validate/infer
A standalone function to validate a data package descriptor:
from datapackage import validate, exceptions
try:
    valid = validate(descriptor)
except exceptions.ValidationError as exception:
    for error in exception.errors:
        pass  # handle individual error
A standalone function to infer a data package descriptor:
from datapackage import infer
descriptor = infer('**/*.csv')
#{ profile: 'tabular-data-package',
# resources:
# [ { path: 'data/cities.csv',
# profile: 'tabular-data-resource',
# encoding: 'utf-8',
# name: 'cities',
# format: 'csv',
# mediatype: 'text/csv',
# schema: [Object] },
# { path: 'data/population.csv',
# profile: 'tabular-data-resource',
# encoding: 'utf-8',
# name: 'population',
# format: 'csv',
# mediatype: 'text/csv',
# schema: [Object] } ] }
Frequently Asked Questions
Accessing data behind a proxy server?
Before the package = Package("https://xxx.json")
call set these environment variables:
import os
os.environ["HTTP_PROXY"] = 'xxx'
os.environ["HTTPS_PROXY"] = 'xxx'
API Reference
cli
cli()
Command-line interface
Usage: datapackage [OPTIONS] COMMAND [ARGS]...
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
infer
validate
Package
Package(self,
descriptor=None,
base_path=None,
strict=False,
unsafe=False,
storage=None,
schema=None,
default_base_path=None,
**options)
Package representation
Arguments
- descriptor (str/dict): data package descriptor as local path, url or object
- base_path (str): base path for all relative paths
- strict (bool): strict flag to alter validation behavior. Setting it to True leads to throwing errors on any operation with invalid descriptor
- unsafe (bool): if True unsafe paths will be allowed. For more information see https://specs.frictionlessdata.io/data-resource/#data-location. Defaults to False
- storage (str/tableschema.Storage): storage name like sql or storage instance
- options (dict): storage options to use for storage creation
Raises
DataPackageException
: raises error if something goes wrong
package.base_path
Package's base path
Returns
str/None
: returns the data package base path
package.descriptor
Package's descriptor
Returns
dict
: descriptor
package.errors
Validation errors
Always empty in strict mode.
Returns
Exception[]
: validation errors
package.profile
Package's profile
Returns
Profile
: an instance of Profile
class
package.resource_names
Package's resource names
Returns
str[]
: returns an array of resource names
package.resources
Package's resources
Returns
Resource[]
: returns an array of Resource
instances
package.valid
Validation status
Always true in strict mode.
Returns
bool
: validation status
package.get_resource
package.get_resource(name)
Get data package resource by name.
Arguments
- name (str): data resource name
Returns
Resource/None
: returns Resource
instances or null if not found
package.add_resource
package.add_resource(descriptor)
Add new resource to data package.
The data package descriptor will be validated with newly added resource descriptor.
Arguments
- descriptor (dict): data resource descriptor
Raises
DataPackageException
: raises error if something goes wrong
Returns
Resource/None
: returns added Resource
instance or null if not added
package.remove_resource
package.remove_resource(name)
Remove data package resource by name.
The data package descriptor will be validated after resource descriptor removal.
Arguments
- name (str): data resource name
Raises
DataPackageException
: raises error if something goes wrong
Returns
Resource/None
: returns removed Resource
instances or null if not found
package.get_group
package.get_group(name)
Returns a group of tabular resources by name.
For more information about groups see Group.
Arguments
- name (str): name of a group of resources
Raises
DataPackageException
: raises error if something goes wrong
Returns
Group/None
: returns a Group
instance or null if not found
package.infer
package.infer(pattern=False)
Infer a data package metadata.
The pattern argument works only for local files.
If pattern is not provided, only existing resources will be inferred (metadata like encoding, profile etc. is added). If pattern is provided, new resources with file names matching the pattern will be added and inferred.
It commits changes to the data package instance.
Arguments
- pattern (str): glob pattern for new resources
Returns
dict
: returns data package descriptor
package.commit
package.commit(strict=None)
Update data package instance if there are in-place changes in the descriptor.
Example
package = Package({
'name': 'package',
'resources': [{'name': 'resource', 'data': ['data']}]
})
package.name # package
package.descriptor['name'] = 'renamed-package'
package.name # package
package.commit()
package.name # renamed-package
Arguments
- strict (bool): alter
strict
mode for further work
Raises
DataPackageException
: raises error if something goes wrong
Returns
bool
: returns true on success and false if not modified
package.save
package.save(target=None,
storage=None,
merge_groups=False,
to_base_path=False,
**options)
Saves this data package
It saves it to storage if the storage argument is passed, or saves this data package's descriptor to a JSON file if the target argument ends with .json, or saves this data package to a zip file otherwise.
Example
It creates a zip file into file_or_path
with the contents
of this Data Package and its resources. Every resource which content
lives in the local filesystem will be copied to the zip file.
Consider the following Data Package descriptor:
{
"name": "gdp",
"resources": [
{"name": "local", "format": "CSV", "path": "data.csv"},
{"name": "inline", "data": [4, 8, 15, 16, 23, 42]},
{"name": "remote", "url": "http://someplace.com/data.csv"}
]
}
The final structure of the zip file will be:
./datapackage.json
./data/local.csv
With the contents of datapackage.json being the same as the returned datapackage.descriptor. The resources' file names are generated based on their name and format fields if they exist.
If the resource has no name, resource-X will be used, where X is the index of the resource in the resources list (starting at zero).
If the resource has a format, it'll be lowercased and appended to the name, becoming "name.format".
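The naming rule described above can be written down as a short function. This is a sketch of the rule as stated in the text, not the library's source; the helper name zip_file_name is hypothetical, and it returns only the file name, without the data/ directory shown in the zip layout.

```python
def zip_file_name(descriptor, index):
    """Hypothetical helper mirroring the naming rule described above."""
    # Fall back to resource-X (X is the zero-based index) when there is no name.
    name = descriptor.get('name') or 'resource-%d' % index
    file_format = descriptor.get('format')
    if file_format:
        # The format is lowercased and appended to the name.
        return '%s.%s' % (name, file_format.lower())
    return name


names = [
    zip_file_name({'name': 'local', 'format': 'CSV', 'path': 'data.csv'}, 0),
    zip_file_name({'path': 'other.csv'}, 1),
]
```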
Arguments
- target (string/filelike): the file path or a file-like object where the contents of this Data Package will be saved into.
- storage (str/tableschema.Storage): storage name like sql or storage instance
- merge_groups (bool): save all the group's tabular resources into one bucket if a storage is provided (for example into one SQL table). Read more about Group.
- to_base_path (bool): save the package to the package's base path using the "<base_path>/<target>" route
- options (dict): storage options to use for storage creation
Raises
DataPackageException
: raises if there was some error writing the package
Returns
bool/Storage
: on success return true or a Storage
instance
Resource
Resource(self,
descriptor={},
base_path=None,
strict=False,
unsafe=False,
storage=None,
package=None,
**options)
Resource representation
Arguments
- descriptor (str/dict): data resource descriptor as local path, url or object
- base_path (str): base path for all relative paths
- strict (bool): strict flag to alter validation behavior. Setting it to True leads to throwing errors on any operation with invalid descriptor
- unsafe (bool): if True unsafe paths will be allowed. For more information see https://specs.frictionlessdata.io/data-resource/#data-location. Defaults to False
- storage (str/tableschema.Storage): storage name like sql or storage instance
- options (dict): storage options to use for storage creation
Raises
DataPackageException
: raises error if something goes wrong
resource.data
Return resource data
resource.descriptor
Resource's descriptor
Returns
dict
: descriptor
resource.errors
Validation errors
Always empty in strict mode.
Returns
Exception[]
: validation errors
resource.group
Group name
Returns
str
: group name
resource.headers
Resource's headers
Only for tabular resources (reading has to be started first or it's
None
)
Returns
str[]/None
: returns data source headers
resource.inline
Whether resource inline
Returns
bool
: returns true if resource is inline
resource.local
Whether resource local
Returns
bool
: returns true if resource is local
resource.multipart
Whether resource multipart
Returns
bool
: returns true if resource is multipart
resource.name
Resource name
Returns
str
: name
resource.package
Package instance if the resource belongs to some package
Returns
Package/None
: a package instance if available
resource.profile
Resource's profile
Returns
Profile
: an instance of Profile
class
resource.remote
Whether resource remote
Returns
bool
: returns true if resource is remote
resource.schema
Resource's schema
Only for tabular resources
For tabular resources it returns Schema
instance to interact with data schema.
Read API documentation - tableschema.Schema.
Returns
tableschema.Schema
: schema
resource.source
Resource's source
Combination of resource.source
and resource.inline/local/remote/multipart
provides predictable interface to work with resource data.
Returns
list/str
: returns data
or path
property
resource.table
Return resource table
resource.tabular
Whether resource tabular
Returns
bool
: returns true if resource is tabular
resource.valid
Validation status
Always true in strict mode.
Returns
bool
: validation status
resource.iter
resource.iter(integrity=False, relations=False, **options)
Iterates through the resource data and emits rows cast based on table schema.
Only for tabular resources
Arguments
keyed (bool):
yield keyed rows in a form of `{header1: value1, header2: value2}`
(default is false; the form of rows is `[value1, value2]`)
extended (bool):
yield extended rows in a form of `[rowNumber, [header1, header2], [value1, value2]]`
(default is false; the form of rows is `[value1, value2]`)
cast (bool):
disable data casting if false
(default is true)
integrity (bool):
if true actual size in BYTES and SHA256 hash of the file
will be checked against `descriptor.bytes` and `descriptor.hash`
(other hashing algorithms are not supported and will be skipped silently)
relations (bool):
if true foreign key fields will be checked and resolved to its references
foreign_keys_values (dict):
three-level dictionary of foreign key references optimized
to speed up validation process in a form of
`{resource1: {(fk_field1, fk_field2): {(value1, value2): {one_keyedrow}, ... }}}`.
If not provided but relations is true, it will be created
before the validation process by *index_foreign_keys_values* method
exc_handler (func):
optional custom exception handler callable.
Can be used to defer raising errors (i.e. "fail late"), e.g.
for data validation purposes. Must support the signature below
Custom exception handler
def exc_handler(exc, row_number=None, row_data=None, error_data=None):
    '''Custom exception handler (example)

    # Arguments:
        exc(Exception):
            Deferred exception instance
        row_number(int):
            Data row number that triggers exception exc
        row_data(OrderedDict):
            Invalid data row source data
        error_data(OrderedDict):
            Data row source data field subset responsible for the error, if
            applicable (e.g. invalid primary or foreign key fields). May be
            identical to row_data.
    '''
    # ...
Raises
DataPackageException
: base class of any error
CastError
: data cast error
IntegrityError
: integrity checking error
UniqueKeyError
: unique key constraint violation
UnresolvedFKError
: unresolved foreign key reference error
Returns
Iterator[list]
: yields rows
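As an illustration of the "fail late" idea behind exc_handler, here is a minimal sketch of a handler that records errors instead of raising them. The handler follows the signature documented above; the collected list and the shape of its records are assumptions made for this example, and the final call only simulates what a reader would do on a bad row.

```python
collected = []


def exc_handler(exc, row_number=None, row_data=None, error_data=None):
    """Record the error instead of raising it, so reading can continue."""
    collected.append({
        'row_number': row_number,
        'message': str(exc),
        'error_data': dict(error_data or {}),
    })


# Simulated call, standing in for what the reader would do on a bad row.
exc_handler(
    ValueError('cannot cast value'),
    row_number=3,
    row_data={'city': 'rome', 'location': 'N/A'},
    error_data={'location': 'N/A'},
)
```

A handler like this would be passed through the exc_handler option described above, and the collected records could be reported once reading has finished.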
resource.read
resource.read(integrity=False,
relations=False,
foreign_keys_values=False,
**options)
Read the whole resource and return as array of rows
Only for tabular resources It has the same API as
resource.iter
except for
Arguments
- limit (int): limit count of rows to read and return
Returns
list[]
: returns rows
resource.check_integrity
resource.check_integrity()
Checks resource integrity
Only for tabular resources
It checks size in BYTES and SHA256 hash of the file
against descriptor.bytes
and descriptor.hash
(other hashing algorithms are not supported and will be skipped silently).
Raises
exceptions.IntegrityError
: raises if there are integrity issues
Returns
bool
: returns True if no issues
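To make the integrity check concrete, the sketch below computes the two values it compares, the size in bytes and the SHA256 hash of a file, using only the standard library. The helper name describe_file is hypothetical, and the sha256: prefix on the hash is an assumption based on the Data Resource specification's convention for non-default hash algorithms; check the library's behaviour before relying on the exact format.

```python
import hashlib
import os
import tempfile


def describe_file(path):
    """Hypothetical helper: compute size in bytes and SHA256 digest of a file."""
    digest = hashlib.sha256()
    size = 0
    with open(path, 'rb') as stream:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: stream.read(65536), b''):
            digest.update(chunk)
            size += len(chunk)
    return {'bytes': size, 'hash': 'sha256:' + digest.hexdigest()}


CONTENT = b'city,location\nlondon,"51.50,-0.11"\n'

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, 'data.csv')
    with open(path, 'wb') as stream:
        stream.write(CONTENT)
    integrity = describe_file(path)
```

The resulting values could then be stored as the bytes and hash properties of the resource descriptor.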
resource.check_relations
resource.check_relations(foreign_keys_values=False)
Check relations
Only for tabular resources
It checks foreign keys and raises an exception if there are integrity issues.
Raises
exceptions.RelationError
: raises if there are relation issues
Returns
bool
: returns True if no issues
resource.drop_relations
resource.drop_relations()
Drop relations
Only for tabular resources
Remove relations data from memory
Returns
bool
: returns True
resource.raw_iter
resource.raw_iter(stream=False)
Iterate over data chunks as bytes.
If stream
is true File-like object will be returned.
Arguments
- stream (bool): File-like object will be returned
Returns
bytes[]/filelike
: returns bytes[]/filelike
resource.raw_read
resource.raw_read()
Returns resource data as bytes.
Returns
bytes
: returns resource data in bytes
resource.infer
resource.infer(**options)
Infer resource metadata
Like name, format, mediatype, encoding, schema and profile. It commits these changes into the resource instance.
Arguments
- options: options will be passed to the tableschema.infer call, for more control on results (e.g. for setting limit, confidence etc.).
Returns
dict
: returns resource descriptor
resource.commit
resource.commit(strict=None)
Update resource instance if there are in-place changes in the descriptor.
Arguments
- strict (bool): alter
strict
mode for further work
Raises
DataPackageException
: raises error if something goes wrong
Returns
bool
: returns true on success and false if not modified
resource.save
resource.save(target, storage=None, to_base_path=False, **options)
Saves this resource
Into storage if storage
argument is passed or
saves this resource's descriptor to json file otherwise.
Arguments
- target (str): path where to save a resource
- storage (str/tableschema.Storage): storage name like sql or storage instance
- to_base_path (bool): save the resource to the resource's base path using the "<base_path>/<target>" route
- options (dict): storage options to use for storage creation
Raises
DataPackageException
: raises error if something goes wrong
Returns
bool
: returns true on success
Group
Group(self, resources)
Group representation
Arguments
- Resource[]: list of TABULAR resources
group.headers
Group's headers
Returns
str[]/None
: returns headers
group.name
Group name
Returns
str
: name
group.schema
Resource's schema
Returns
tableschema.Schema
: schema
group.iter
group.iter(**options)
Iterates through the group data and emits rows cast based on table schema.
It concatenates all the resources and has the same API as
resource.iter
group.read
group.read(limit=None, **options)
Read the whole group and return as array of rows
It concatenates all the resources and has the same API as
resource.read
group.check_relations
group.check_relations()
Check group's relations
The same as resource.check_relations but without the optional argument foreign_keys_values. This method will test foreignKeys of the whole group at once, optimizing the process by creating the foreign_keys_values hashmap only once before testing the set of resources.
Profile
Profile(self, profile)
Profile representation
Arguments
- profile (str): profile name in registry or URL to JSON Schema
Raises
DataPackageException
: raises error if something goes wrong
profile.jsonschema
JSONSchema content
Returns
dict
: returns profile's JSON Schema contents
profile.name
Profile name
Returns
str/None
: name if available
profile.validate
profile.validate(descriptor)
Validate a data package descriptor
against the profile.
Arguments
- descriptor (dict): retrieved and dereferenced data package descriptor
Raises
ValidationError
: raises if not valid
Returns
bool
: returns True if valid
validate
validate(descriptor)
Validate a data package descriptor.
Arguments
- descriptor (str/dict): package descriptor (one of): - local path - remote url - object
Raises
ValidationError
: raises on invalid
Returns
bool
: returns true on valid
infer
infer(pattern, base_path=None)
Infer a data package descriptor.
Argument
pattern
works only for local files
Arguments
- pattern (str): glob file pattern
Returns
dict
: returns data package descriptor
DataPackageException
DataPackageException(self, message, errors=[])
Base class for all DataPackage/TableSchema exceptions.
If there are multiple errors, they can be read from the exception object:
from datapackage.exceptions import DataPackageException
try:
    pass  # lib action
except DataPackageException as exception:
    if exception.multiple:
        for error in exception.errors:
            pass  # handle error
datapackageexception.errors
List of nested errors
Returns
DataPackageException[]
: list of nested errors
datapackageexception.multiple
Whether it's a nested exception
Returns
bool
: whether it's a nested exception
TableSchemaException
TableSchemaException(self, message, errors=[])
Base class for all TableSchema exceptions.
LoadError
LoadError(self, message, errors=[])
All loading errors.
CastError
CastError(self, message, errors=[])
All value cast errors.
IntegrityError
IntegrityError(self, message, errors=[])
All integrity errors.
RelationError
RelationError(self, message, errors=[])
All relations errors.
StorageError
StorageError(self, message, errors=[])
All storage errors.
Contributing
The project follows the Open Knowledge International coding standards.
The recommended way to get started is to create and activate a project virtual environment. To install the package and development dependencies into the active environment:
$ make install
To run tests with linting and coverage:
$ make test
Changelog
Only breaking and the most important changes are described here. The full changelog and documentation for all released versions can be found in the nicely formatted commit history.
v1.15
WARNING: it can be breaking for some setups, please read the discussions below
- Fixed header management according to the specs:
v1.14
- Add experimental options for picking/skipping fields/rows
v1.13
- Add unsafe option to Package and Resource (#262)
v1.12
- Use chardet for encoding detection by default. For cchardet: pip install datapackage[cchardet]
v1.11
- resource/package.save now accept a to_base_path argument (#254)
- package.save now returns a Storage instance if available
v1.10
- Added an ability to check tabular resource's integrity
v1.9
- Added resource.package property
v1.8
- Added support for groups of resources
v1.7
- Added support for compression of resources
v1.6
- Added support for custom request session
v1.5
Updated behaviour:
- Added support for Python 3.7
v1.4
New API added:
- added skip_rows support to the resource descriptor
v1.3
New API added:
- property package.base_path is now publicly available
v1.2
Updated behaviour:
- CLI command $ datapackage infer now outputs only a JSON-formatted data package descriptor.
v1.1
New API added:
- Added an integration between Package/Resource and the tableschema.Storage - https://github.com/frictionlessdata/tableschema-py#storage. It allows loading and saving data packages from/to different storages like SQL/BigQuery/etc.