• This repository has been archived on 22/Feb/2023
  • Language
    Python
  • License
    MIT License

Pandas ExtensionDType/Array backed by Apache Arrow

fletcher

A library that provides a generic set of Pandas ExtensionDType/Array implementations backed by Apache Arrow. They support a wider range of types than Pandas natively supports and also bring a different set of constraints and behaviours that are beneficial in many situations.

🗃️ Archived successfully 🤘

This project has been archived, as development ceased around 2021. With the support of Apache Arrow-backed extension arrays in pandas, the major goal of this project has been fulfilled. As Marc Garcia outlines in his blog post "pandas 2.0 and the Arrow revolution (part I)", Apache Arrow support in pandas is now generally available and here to stay. fletcher has hopefully helped uncover some bugs along the way and inspired the implementation that is now in pandas.

Usage

To use fletcher in Pandas DataFrames, all you need to do is wrap your data in a FletcherChunkedArray or FletcherContinuousArray object. Your data can be a pyarrow.Array, a pyarrow.ChunkedArray, or any type that can be passed to pyarrow.array(…).

import fletcher as fr
import pandas as pd

df = pd.DataFrame({
    'str_chunked': fr.FletcherChunkedArray(['a', 'b', 'c']),
    'str_continuous': fr.FletcherContinuousArray(['a', 'b', 'c']),
})

df.info()

# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3 entries, 0 to 2
# Data columns (total 2 columns):
#  #   Column          Non-Null Count  Dtype                      
# ---  ------          --------------  -----                      
#  0   str_chunked     3 non-null      fletcher_chunked[string]   
#  1   str_continuous  3 non-null      fletcher_continuous[string]
# dtypes: fletcher_chunked[string](1), fletcher_continuous[string](1)
# memory usage: 166.0 bytes

Development

While you can use fletcher in pip-based environments, we strongly recommend using a conda-based development setup with packages from conda-forge.

# Create the conda environment with all necessary dependencies
conda env create

# Activate the newly created environment
conda activate fletcher

# Install fletcher into the current environment
python -m pip install -e . --no-build-isolation --no-use-pep517

# Run the unit tests (you should do this several times during development)
py.test -nauto

# Install pre-commit hooks
# These will then be automatically run on every commit and ensure that files
# are black formatted, have no flake8 issues and mypy checks the type consistency.
pre-commit install

Code formatting is done using black. This keeps the code style consistent, and the formatting is applied automatically by the pre-commit hooks.

Using pandas in development mode

To test and develop against pandas' master or your local fixes, you can install a development version of pandas using:

git clone https://github.com/pandas-dev/pandas
cd pandas

# Install additional pandas dependencies
conda install -y cython

# Build and install pandas
python setup.py build_ext --inplace -j 4
python -m pip install -e . --no-build-isolation --no-use-pep517

This links the development version of pandas into your fletcher conda environment. If you change any Python code in pandas, it is directly reflected in your environment. If you change any Cython code in pandas, you need to re-execute python setup.py build_ext --inplace -j 4.

Using (py)arrow nightlies

To test and develop against the latest development version of Apache Arrow (pyarrow), you can install it from the arrow-nightlies conda channel:

conda install -c arrow-nightlies arrow-cpp pyarrow

Benchmarks

In benchmarks/ we provide a set of benchmarks to compare the performance of fletcher against pandas and to ensure that fletcher itself stays performant. The benchmarks are written using airspeed velocity (asv). While developing the benchmarks, you can run each of them just once using asv dev (use -b <pattern> to run only a selection). To get real benchmark values, use asv run --python=same, which runs the benchmarks multiple times and yields meaningful average runtimes.
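
For orientation, an airspeed velocity benchmark is an ordinary Python class. The sketch below is a hypothetical example and is not taken from benchmarks/: the class name, the parameter values and the measured operation are assumptions, and it times a plain pandas string column rather than a fletcher array.

```python
import pandas as pd


class TimeStringUpper:
    """Hypothetical asv benchmark: upper-casing a pandas string column."""

    # asv runs the benchmark once for every value listed in `params`.
    params = [1_000, 100_000]
    param_names = ["n_rows"]

    def setup(self, n_rows):
        # setup() runs before the timed method and is excluded from the timing.
        self.series = pd.Series(["fletcher"] * n_rows)

    def time_upper(self, n_rows):
        # asv times every method whose name starts with `time_`.
        self.series.str.upper()
```

With a file like this in the benchmark directory, a pattern such as asv dev -b TimeStringUpper would select only this class.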
