
Rust-based WebAssembly bindings to read and write Apache Parquet data

WASM Parquet

WebAssembly bindings to read and write the Apache Parquet format to and from Apache Arrow.

This is designed to be used alongside a JavaScript Arrow implementation, such as the canonical JS Arrow library.

Including all compression codecs, the brotli-encoded WASM bundle is 907KB.

Install

parquet-wasm is published to npm. Install with:

yarn add parquet-wasm
# or
npm install parquet-wasm

API

Two APIs?

These bindings expose two APIs to users because there are two separate implementations of Parquet and Arrow in Rust.

  • parquet and arrow: These are the "official" Rust implementations of Arrow and Parquet. These projects started earlier and may be more feature complete.
  • parquet2 and arrow2: These are safer (in terms of memory access) and claim to be faster, though I haven't written my own benchmarks yet.

Since these parallel projects exist, why not give the user the choice of which to use? In general, the reading API is identical in both, but the write options differ between the two projects.

Choice of bundles

Presumably no one wants to use both parquet and parquet2 at once, so the default bundles split parquet and parquet2 into separate entry points to keep bundle size as small as possible. The following table describes the six available bundles:

Entry point | Rust crates used | Description | Documentation
parquet-wasm/bundler/arrow1 | parquet and arrow | "Bundler" build, to be used in bundlers such as Webpack | Link
parquet-wasm/node/arrow1 | parquet and arrow | Node build, to be used with require in NodeJS | Link
parquet-wasm/esm/arrow1 | parquet and arrow | ESM, to be used directly from the Web as an ES Module | Link
parquet-wasm or parquet-wasm/bundler/arrow2 | parquet2 and arrow2 | "Bundler" build, to be used in bundlers such as Webpack | Link
parquet-wasm/node/arrow2 | parquet2 and arrow2 | Node build, to be used with require in NodeJS | Link
parquet-wasm/esm/arrow2 | parquet2 and arrow2 | ESM, to be used directly from the Web as an ES Module | Link

Note that when using the esm bundles, the default export must be awaited. See here for an example.
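
The await requirement for the esm bundles can be sketched roughly as follows. This is a minimal sketch, assuming the default export is an async initializer function; initEsmBundle and its loadModule parameter are hypothetical names introduced here for illustration and are not part of parquet-wasm. The loader is passed in so the pattern can be shown without a bundler or network access.

```javascript
// Hypothetical helper (not part of parquet-wasm).
// `loadModule` is any function returning a promise for the ESM bundle,
// for example: () => import("parquet-wasm/esm/arrow2")
async function initEsmBundle(loadModule) {
  const mod = await loadModule();
  // Assumption: the default export is an async initializer that must
  // resolve before any other export (readParquet, etc.) is called.
  const wasm = await mod.default();
  // `mod` holds the named exports; `wasm` is whatever the initializer returned.
  return { mod, wasm };
}
```

Only call the named exports on the returned mod after this promise has resolved.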

arrow2 API

This implementation uses the arrow2 and parquet2 Rust crates.

This is the default implementation and is more full-featured, including metadata handling and async reading. Refer to the API documentation for more details and examples.

arrow API

This implementation uses the arrow and parquet Rust crates.

Refer to the API documentation for more details and examples.

Debug functions

These functions are not present in normal builds to cut down on bundle size. To create a custom build, see Custom Builds below.

setPanicHook

setPanicHook(): void

Sets console_error_panic_hook in Rust, which provides better debugging of panics by having more informative console.error messages. Initialize this first if you're getting errors such as RuntimeError: Unreachable executed.

The WASM bundle must be compiled with console_error_panic_hook enabled (via the debug feature) for this function to exist.
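
Because setPanicHook is absent from normal builds, calling it unconditionally would throw in those builds. One defensive way to call it is sketched below; enablePanicHook is a hypothetical helper name, and wasmModule stands for whatever object your parquet-wasm import yields.

```javascript
// Hypothetical helper (not part of parquet-wasm).
// Calls setPanicHook only when the build actually exposes it.
function enablePanicHook(wasmModule) {
  if (typeof wasmModule.setPanicHook === "function") {
    wasmModule.setPanicHook();
    return true; // hook installed
  }
  return false; // normal build: function not present
}
```

The return value lets calling code log whether it is running against a debug build.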

Example

import { tableFromArrays, tableFromIPC, tableToIPC } from "apache-arrow";
import {
  readParquet,
  writeParquet,
  Compression,
  WriterPropertiesBuilder,
} from "parquet-wasm";

// Create Arrow Table in JS
const LENGTH = 2000;
const rainAmounts = Float32Array.from({ length: LENGTH }, () =>
  Number((Math.random() * 20).toFixed(1))
);

const rainDates = Array.from(
  { length: LENGTH },
  (_, i) => new Date(Date.now() - 1000 * 60 * 60 * 24 * i)
);

const rainfall = tableFromArrays({
  precipitation: rainAmounts,
  date: rainDates,
});

// Write Arrow Table to Parquet
const writerProperties = new WriterPropertiesBuilder()
  .setCompression(Compression.ZSTD)
  .build();
const parquetBuffer = writeParquet(
  tableToIPC(rainfall, "stream"),
  writerProperties
);

// Read Parquet buffer back to Arrow Table
const table = tableFromIPC(readParquet(parquetBuffer));
console.log(table.schema.toString());
// Schema<{ 0: precipitation: Float32, 1: date: Date64<MILLISECOND> }>

Published examples

Performance considerations

Tl;dr: Try the readParquetFFI API, new in 0.4.0. This API is less well tested than the "normal" readParquet API, but should be faster and have much lower memory overhead (by a factor of 2). If you hit any bugs, please create a reproducible issue.

Under the hood, parquet-wasm first decodes a Parquet file into Arrow in WebAssembly memory. That WebAssembly memory then needs to be copied into JavaScript for use by Arrow JS. The "normal" read APIs (e.g. readParquet) use the Arrow IPC format to get the data back to JavaScript, which requires another memory copy inside WebAssembly to assemble the various arrays into a single buffer to be copied back to JS.

Instead, the new readParquetFFI API uses Arrow's C Data Interface to be able to copy or view Arrow arrays from within WebAssembly memory without any serialization.

Note that this approach uses the arrow-js-ffi library to parse the Arrow C Data Interface definitions. This library has not yet been tested in production, so it may have bugs!

I wrote an interactive blog post on this approach and the Arrow C Data Interface if you want to read more!
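
The difference between copying and viewing WebAssembly memory can be illustrated with the standard WebAssembly.Memory API alone. The sketch below does not use parquet-wasm or arrow-js-ffi; the memory object, the byte offset, and the values are made up for illustration.

```javascript
// Stand-in for a module's linear memory: one 64 KiB page.
const memory = new WebAssembly.Memory({ initial: 1 });

// Pretend Wasm wrote three float64 values at byte offset 0.
const view = new Float64Array(memory.buffer, 0, 3);
view.set([1.5, 2.5, 3.5]);

// A copy lives in ordinary JavaScript memory, independent of Wasm.
const copy = view.slice();

// A later write to Wasm memory shows through the view but not the copy.
new Float64Array(memory.buffer, 0, 3)[0] = 99;
const viewSeesWrite = view[0] === 99; // true
const copySeesWrite = copy[0] === 99; // false

// Growing the memory detaches the old ArrayBuffer, so the view becomes empty
// while the copy is unaffected.
memory.grow(1);
const viewLengthAfterGrow = view.length; // 0
const copyLengthAfterGrow = copy.length; // 3
```

A view avoids a copy but is only valid while the underlying Wasm memory is neither grown nor freed, whereas a copy stays usable regardless.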

Example

import { Table } from "apache-arrow";
import { parseRecordBatch } from "arrow-js-ffi";
// Edit the `parquet-wasm` import as necessary
import { readParquetFFI, __wasm } from "parquet-wasm/node/arrow2";

// A reference to the WebAssembly memory object. The way to access this is different for each
// environment. In Node, use the __wasm export as shown below. In ESM the memory object will
// be found on the returned default export.
const WASM_MEMORY = __wasm.memory;

const resp = await fetch("https://example.com/file.parquet");
const parquetUint8Array = new Uint8Array(await resp.arrayBuffer());
const wasmArrowTable = readParquetFFI(parquetUint8Array);

const recordBatches = [];
for (let i = 0; i < wasmArrowTable.numBatches(); i++) {
  // Note: Unless you know what you're doing, setting `true` below is recommended to _copy_
  // table data from WebAssembly into JavaScript memory. This may become the default in the
  // future.
  const recordBatch = parseRecordBatch(
    WASM_MEMORY.buffer,
    wasmArrowTable.arrayAddr(i),
    wasmArrowTable.schemaAddr(),
    true
  );
  recordBatches.push(recordBatch);
}

const table = new Table(recordBatches);

// VERY IMPORTANT! You must call `drop` on the Wasm table object when you're done using it
// to release the Wasm memory.
// Note that any access to the pointers in this table is undefined behavior after this call.
// Calling any `wasmArrowTable` method will error.
wasmArrowTable.drop();
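
Since forgetting drop leaks Wasm memory, one option is to wrap the work in try/finally so that drop runs even if parsing throws. This is a minimal sketch; withWasmTable is a hypothetical helper name and not part of parquet-wasm or arrow-js-ffi.

```javascript
// Hypothetical helper (not part of parquet-wasm).
// Runs `fn` with the Wasm-side table and always calls drop() afterwards,
// whether `fn` returns normally or throws.
function withWasmTable(wasmTable, fn) {
  try {
    return fn(wasmTable);
  } finally {
    wasmTable.drop();
  }
}

// Intended usage, with the names from the example above:
// const recordBatches = withWasmTable(readParquetFFI(parquetUint8Array), (t) => {
//   const batches = [];
//   for (let i = 0; i < t.numBatches(); i++) {
//     batches.push(parseRecordBatch(WASM_MEMORY.buffer, t.arrayAddr(i), t.schemaAddr(), true));
//   }
//   return batches;
// });
```

The callback must be synchronous and must return only copied data (the true flag above), because anything still pointing into Wasm memory is invalid once drop has run.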

Compression support

The Parquet specification permits several compression codecs. This library currently supports:

  • Uncompressed
  • Snappy
  • Gzip
  • Brotli
  • ZSTD
  • LZ4 (deprecated)
  • LZ4_RAW (supported in arrow2 only)

LZ4 support in Parquet is a bit messy. As described here, there are two LZ4 compression options in Parquet (as of version 2.9.0). The original version LZ4 is now deprecated; it used an undocumented framing scheme which made interoperability difficult. The specification now reads:

It is strongly suggested that implementors of Parquet writers deprecate this compression codec in their user-facing APIs, and advise users to switch to the newer, interoperable LZ4_RAW codec.

It's currently unknown how widespread ecosystem support for LZ4_RAW is. As of v7, pyarrow writes LZ4_RAW by default and presumably has read support for it as well.

Custom builds

In some cases, you may know ahead of time that your Parquet files will only include a single compression codec, say Snappy, or even no compression at all. In these cases, you may want to create a custom build of parquet-wasm to keep bundle size at a minimum. If you install the Rust toolchain and wasm-pack (see Development), you can create a custom build with only the compression codecs you require.

Note that this project uses Cargo syntax introduced in Rust 1.60, so you need version 1.60 or higher to compile it. To upgrade your toolchain, use rustup update stable.

Example custom builds

Reader-only bundle with Snappy compression using the arrow and parquet crates:

wasm-pack build --no-default-features --features arrow1 --features snappy --features reader

Writer-only bundle with no compression support using the arrow2 and parquet2 crates, targeting Node:

wasm-pack build --target nodejs --no-default-features --features arrow2 --features writer

Debug bundle with reader and writer support, targeting Node, using arrow and parquet crates with all their supported compressions, with console_error_panic_hook enabled:

wasm-pack build \
  --dev \
  --target nodejs \
  --no-default-features \
  --features arrow1 \
  --features reader \
  --features writer \
  --features all_compressions \
  --features debug
# Or, given the fact that the default feature includes several of these features, a shorter version:
wasm-pack build --dev --target nodejs --features debug

Refer to the wasm-pack documentation for more info on flags such as --release, --dev, and --target, and to the Cargo documentation for more info on how to use features.

Available features

By default, the arrow1, all_compressions, reader, and writer features are enabled. Use --no-default-features to remove these defaults.

  • arrow1: Use the arrow and parquet crates
  • arrow2: Use the arrow2 and parquet2 crates
  • reader: Activate read support.
  • writer: Activate write support.
  • async: Activate asynchronous read support (only applies to the arrow2 endpoints).
  • all_compressions: Activate all supported compressions for the crate(s) in use.
  • brotli: Activate Brotli compression.
  • gzip: Activate Gzip compression.
  • snappy: Activate Snappy compression.
  • zstd: Activate ZSTD compression.
  • lz4: Activate LZ4_RAW compression (only applies to the arrow2 endpoints).
  • debug: Expose the setPanicHook function for better error messages for Rust panics. Additionally compiles CLI debug functions.
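
If you script several custom builds, the command lines shown earlier can be assembled from a feature list. The sketch below is purely illustrative: wasmPackCommand and its option names are hypothetical, and it only produces strings using the flags already shown in this README.

```javascript
// Hypothetical helper: assembles a wasm-pack command line from options.
// Only emits flags that appear in the examples above.
function wasmPackCommand({ target, dev = false, defaultFeatures = true, features = [] }) {
  const args = ["wasm-pack", "build"];
  if (dev) args.push("--dev");
  if (target) args.push("--target", target);
  if (!defaultFeatures) args.push("--no-default-features");
  for (const feature of features) args.push("--features", feature);
  return args.join(" ");
}

// Writer-only arrow2 bundle targeting Node, as in the example above.
const writerOnly = wasmPackCommand({
  target: "nodejs",
  defaultFeatures: false,
  features: ["arrow2", "writer"],
});
// "wasm-pack build --target nodejs --no-default-features --features arrow2 --features writer"
```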

Future work

  • Example of pushdown predicate filtering, to download only chunks that match a specific condition
  • Column filtering, to download only certain columns
  • More tests

Acknowledgements

A starting point of my work came from @my-liminal-space's read-parquet-browser (which is also dual licensed MIT and Apache 2).

@domoritz's arrow-wasm was a very helpful reference for bootstrapping Rust-WASM bindings.
