  • Stars
    572
  • Rank 77,995 (Top 2 %)
  • Language
    Python
  • License
    Other
  • Created over 6 years ago
  • Updated 8 months ago

Repository Details

Pooch: A friend to fetch your data files

Documentation (latest) • Documentation (main branch) • Contributing • Contact

Part of the Fatiando a Terra project

About

Does your Python package include sample datasets? Are you shipping them with the code? Are they getting too big?

Pooch is here to help! It will manage a data registry by downloading your data files from a server only when needed and storing them locally in a data cache (a folder on your computer).

Here are Pooch's main features:

  • Pure Python and minimal dependencies.
  • Download a file only if necessary (it's not in the data cache or needs to be updated).
  • Verify download integrity through SHA256 hashes (also used to check if a file needs to be updated).
  • Designed to be extended: plug in custom download (FTP, scp, etc.) and post-processing (unzip, decompress, rename) functions.
  • Includes utilities to unzip/decompress the data upon download to save loading time.
  • Can handle basic HTTP authentication (for servers that require a login) and printing download progress bars.
  • Easily set up an environment variable to overwrite the data cache location.
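The last feature, overriding the cache location through an environment variable, can be illustrated with a minimal standard-library sketch. This is not Pooch's implementation: `resolve_cache_dir` and its `environ` parameter are hypothetical names, and the sketch assumes that an unset or empty variable should fall back to the default folder.

```python
import os
from pathlib import Path


def resolve_cache_dir(default, env=None, environ=os.environ):
    """Return the cache folder, letting an environment variable override it.

    Hypothetical helper for illustration only; `environ` is any mapping of
    variable names to values (defaults to the process environment).
    """
    override = environ.get(env) if env else None
    # An unset or empty variable falls back to the default location.
    return Path(override) if override else Path(default)


# Usage: the variable wins when it is set, otherwise the default is used.
shared = resolve_cache_dir(
    "cache/mypackage",
    env="MYPACKAGE_DATA_DIR",
    environ={"MYPACKAGE_DATA_DIR": "shared/data"},
)
local = resolve_cache_dir("cache/mypackage", env="MYPACKAGE_DATA_DIR", environ={})
print(shared, local)
```

Passing the environment as a mapping keeps the helper free of side effects, which makes the override easy to exercise without touching the real process environment.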

Are you a scientist or researcher? Pooch can help you too!

  • Automatically download your data files so you don't have to keep them in your GitHub repository.
  • Make sure everyone running the code has the same version of the data files (enforced through the SHA256 hashes).
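The idea of downloading only when necessary and enforcing a data version through SHA256 hashes can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not Pooch's actual code: `cache_action` is a hypothetical helper, and it reads the whole file into memory, which is fine for small files only.

```python
import hashlib
from pathlib import Path


def cache_action(path, expected_sha256):
    """Decide what to do with a cached file: 'download', 'update', or 'use'.

    Hypothetical helper for illustration only.
    """
    path = Path(path)
    if not path.exists():
        return "download"  # not in the data cache yet
    # Reads the whole file at once; chunked reading is better for big files.
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    if actual != expected_sha256.lower():
        return "update"  # present, but differs from the expected hash
    return "use"  # present and verified
```

Because the decision depends only on the file content, two people holding the same expected hash end up with byte-identical data files.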

Example

For a scientist downloading a data file for analysis:

import pooch
import pandas as pd

# Download a file and save it locally, returning the path to it.
# Running this again will not cause a download. Pooch will check the hash
# (checksum) of the downloaded file against the given value to make sure
# it's the right file (not corrupted or outdated).
fname_bathymetry = pooch.retrieve(
    url="https://github.com/fatiando-data/caribbean-bathymetry/releases/download/v1/caribbean-bathymetry.csv.xz",
    known_hash="md5:a7332aa6e69c77d49d7fb54b764caa82",
)

# Pooch can also download based on a DOI from certain providers.
fname_gravity = pooch.retrieve(
    url="doi:10.5281/zenodo.5882430/southern-africa-gravity.csv.xz",
    known_hash="md5:1dee324a14e647855366d6eb01a1ef35",
)

# Load the data with Pandas
data_bathymetry = pd.read_csv(fname_bathymetry)
data_gravity = pd.read_csv(fname_gravity)
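The `known_hash` values above carry an algorithm prefix (`md5:` here). Pooch provides `pooch.file_hash` to compute the hash of a local file. As a rough illustration of what such a prefixed string contains, here is a standard-library sketch; `make_known_hash` is a hypothetical name and not part of Pooch.

```python
import hashlib


def make_known_hash(path, alg="sha256", chunk_size=65536):
    """Return a string like 'md5:<hexdigest>' for a local file.

    Hypothetical helper for illustration only. The file is read in chunks
    so that large data files do not need to fit in memory.
    """
    digest = hashlib.new(alg)
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return f"{alg}:{digest.hexdigest()}"
```

Calling `make_known_hash("caribbean-bathymetry.csv.xz", alg="md5")` on a trusted local copy of a file would produce a string in the same form as the `known_hash` arguments in the example.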

For package developers including sample data in their projects:

"""
Module mypackage/datasets.py
"""
import pkg_resources
import pandas
import pooch

# Get the version string from your project. You have one of these, right?
from . import version

# Create a new friend to manage your sample data storage
GOODBOY = pooch.create(
    # Folder where the data will be stored. For a sensible default, use the
    # default cache folder for your OS.
    path=pooch.os_cache("mypackage"),
    # Base URL of the remote data store. Will call .format on this string
    # to insert the version (see below).
    base_url="https://github.com/myproject/mypackage/raw/{version}/data/",
    # Pooches are versioned so that you can use multiple versions of a
    # package simultaneously. Use a PEP 440 compliant version number. The
    # version will be appended to the path.
    version=version,
    # If the version has a "+XX.XXXXX" suffix, we'll assume that this is a
    # dev version and replace the version with this string.
    version_dev="main",
    # An environment variable that overwrites the path.
    env="MYPACKAGE_DATA_DIR",
    # The cache file registry. A dictionary with all files managed by this
    # pooch. Keys are the file names (relative to *base_url*) and values
    # are their respective SHA256 hashes. Files will be downloaded
    # automatically when needed (see fetch_gravity_data).
    registry={"gravity-data.csv": "89y10phsdwhs09whljwc09whcowsdhcwodcydw"}
)
# You can also load the registry from a file. Each line contains a file
# name and its SHA256 hash separated by a space. This makes it easier to
# manage large numbers of data files. The registry file should be packaged
# and distributed with your software.
GOODBOY.load_registry(
    pkg_resources.resource_stream("mypackage", "registry.txt")
)

# Define functions that your users can call to get back the data in memory
def fetch_gravity_data():
    """
    Load some sample gravity data to use in your docs.
    """
    # Fetch the path to a file in the local storage. If it's not there,
    # we'll download it.
    fname = GOODBOY.fetch("gravity-data.csv")
    # Load it with numpy/pandas/etc
    data = pandas.read_csv(fname)
    return data
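The registry file format described in the comments above (one file name and its SHA256 hash per line, separated by a space) is simple enough to generate by hand. Pooch ships `pooch.make_registry` for this purpose. The sketch below only illustrates the format with the standard library: `write_registry` is a hypothetical name, and it assumes that file names contain no spaces and that the output file lives outside the data folder.

```python
import hashlib
from pathlib import Path


def write_registry(data_dir, output):
    """Write one '<relative name> <sha256>' line per file under data_dir.

    Hypothetical helper for illustration only. Returns the lines written.
    """
    data_dir = Path(data_dir)
    lines = []
    # Sort the paths so the registry is stable between runs.
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        # Use forward slashes so names match on every operating system.
        name = path.relative_to(data_dir).as_posix()
        sha256 = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{name} {sha256}")
    Path(output).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

Regenerating the file whenever the sample data changes keeps the hashes in step with the data shipped for each version of the package.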

Projects using Pooch

If you're using Pooch, send us a pull request adding your project to the list.

Getting involved

🗨️ Contact us: Find out more about how to reach us at fatiando.org/contact.

👩🏾‍💻 Contributing to project development: Please read our Contributing Guide to see how you can help and give feedback.

🧑🏾‍🤝‍🧑🏼 Code of conduct: This project is released with a Code of Conduct. By participating in this project you agree to abide by its terms.

Imposter syndrome disclaimer: We want your help. No, really. There may be a little voice inside your head that is telling you that you're not ready, that you aren't skilled enough to contribute. We assure you that the little voice in your head is wrong. Most importantly, there are many valuable ways to contribute besides writing code.

This disclaimer was adapted from the MetPy project.

License

This is free software: you can redistribute it and/or modify it under the terms of the BSD 3-clause License. A copy of this license is provided in LICENSE.txt.

More Repositories

1

verde

Processing and gridding spatial data, machine-learning style
Python
581
star
2

fatiando

DEPRECATED in favor of our newer libraries (see www.fatiando.org). Python toolkit for modeling and inversion in geophysics.
Python
205
star
3

harmonica

Forward modeling, inversion, and processing gravity and magnetic data
Python
194
star
4

rockhound

NOTICE: This library is no longer being developed. Use Ensaio instead (https://www.fatiando.org/ensaio). -- Download geophysical models/datasets and load them in Python
Python
35
star
5

boule

Reference ellipsoids for geodesy and geophysics
Python
33
star
6

ensaio

Practice datasets to probe your code
Python
19
star
7

transform2020

Material for the Verde tutorial at Transform 2020
Jupyter Notebook
17
star
8

tutorials

Tutorials that integrate the Fatiando a Terra software to solve data problems in geoscience
TeX
12
star
9

choclo

Kernel functions for your geophysical models
Python
11
star
10

transform21

Material for the Harmonica tutorial at Transform21
Jupyter Notebook
9
star
11

wavefd

2D finite difference seismic wave propagation
Python
9
star
12

community

Community resources, guidelines, meeting notes, authorship policy, maintenance, etc.
8
star
13

dependente

Inspect Python package dependencies
Python
6
star
14

continuous-integration

THIS REPOSITORY IS READ-ONLY and is no longer actively maintained. We have since moved to GitHub Actions, which makes most of these scripts redundant.
Shell
6
star
15

moulder

Interactive 2D gravity forward modeling.
Python
5
star
16

magali

Modeling and inversion of magnetic microscopy data 🧲🔬
Python
5
star
17

website

Sphinx sources used to generate the www.fatiando.org page
CSS
5
star
18

data

DEPRECATED: Datasets were moved to https://github.com/fatiando-data | Curated sample geoscience data for documentation and tutorials. This repository contains code for downloading and formatting the data for redistribution.
Jupyter Notebook
5
star
19

erizo

DISCONTINUED. Elastic multi-component interpolation of GPS/GNSS ground displacement.
Python
5
star
20

2023-kegs

Abstract and presentation of Fatiando in the KEGS 2023 Symposium
Jupyter Notebook
3
star
21

2021-gsh

Talk about Fatiando for the Geophysical Society of Houston
Jupyter Notebook
3
star
22

agu2021

Invited presentation about Fatiando at AGU2021
JavaScript
3
star
23

geometric

Case study for stripping out fatiando.mesher into a new package
Python
3
star
24

meeting-notes

NOTICE: This repository is archived. Content and functionality has been moved to https://github.com/fatiando/community
2
star
25

logo

The Fatiando a Terra logos, wallpapers, and other media
Jupyter Notebook
2
star
26

egu2021

Presentation submitted to EGU2021 about Boule and Harmonica
Jupyter Notebook
2
star
27

maintenance

NOTICE: This repository is archived. Content and functionality has been moved to https://github.com/fatiando/community
1
star
28

prototypes

DEPRECATED: IPython notebooks with early prototypes of various things
1
star
29

burocrata

Check and insert copyright and license notices into source code
Python
1
star
30

2023-image

Abstract for IMAGE 2023
1
star
31

fatiando.github.io

HTML sources for fatiando.org. DON'T MAKE PULL REQUESTS HERE. Files are updated manually when a new release is made.
HTML
1
star
32

website-nikola

DEPRECATED: Old source code to generate fatiando.org using Nikola. Repo still used to link to the PDFs stored here.
CSS
1
star
33

birs2023-introduction

Slides for a short talk about the Fatiando project for our BIRS 2023 workshop
JavaScript
1
star
34

deeplook

A framework for solving inverse problems
1
star