CURRENT STATUS: INACTIVE

This project requires a major overhaul, as well as sorting through and handling dozens of issues and pull requests. Please contact me if you're up for the task.

parquet.js

fully asynchronous, pure node.js implementation of the Parquet file format

This package contains a fully asynchronous, pure JavaScript implementation of the Parquet file format. The implementation conforms with the Parquet specification and is tested for compatibility with Apache's Java reference implementation.

What is Parquet? Parquet is a column-oriented file format; it allows you to write a large amount of structured data to a file, compress it and then read parts of it back out efficiently. The Parquet format is based on Google's Dremel paper.
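To make the column-oriented idea concrete, here is a minimal sketch in plain JavaScript. It does not use parquet.js or the real Parquet on-disk layout; `toColumns` and `readColumn` are hypothetical helper names introduced only for this illustration, and the sketch assumes every row has the same fields.

```javascript
// Illustration only: pivot row objects into one array per field.
// `toColumns` and `readColumn` are hypothetical helpers, not part of parquet.js.
// Assumes every row carries the same set of fields.
function toColumns(rows) {
  const columns = {};
  for (const row of rows) {
    for (const [field, value] of Object.entries(row)) {
      if (!columns[field]) columns[field] = [];
      columns[field].push(value);
    }
  }
  return columns;
}

// Reading a single field only touches that field's array.
function readColumn(columns, field) {
  return columns[field] || [];
}

const rows = [
  { name: 'apples', quantity: 10, price: 2.5 },
  { name: 'oranges', quantity: 10, price: 2.5 },
];
const columns = toColumns(rows);
console.log(JSON.stringify(readColumn(columns, 'name'))); // ["apples","oranges"]
```

Because the values of one field sit next to each other, a reader that only needs `name` can skip the other arrays entirely, which is the property the paragraph above describes.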

Installation

To use parquet.js with node.js, install it using npm:

  $ npm install parquetjs

parquet.js requires node.js >= 8

Usage: Writing files

Once you have installed the parquet.js library, you can import it as a single module:

var parquet = require('parquetjs');

Parquet files have a strict schema, similar to tables in a SQL database. So, in order to produce a Parquet file we first need to declare a new schema. Here is a simple example that shows how to instantiate a ParquetSchema object:

// declare a schema for the `fruits` table
var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  quantity: { type: 'INT64' },
  price: { type: 'DOUBLE' },
  date: { type: 'TIMESTAMP_MILLIS' },
  in_stock: { type: 'BOOLEAN' }
});

Note that the Parquet schema supports nesting, so you can store complex, arbitrarily nested records into a single row (more on that later) while still maintaining good compression.

Once we have a schema, we can create a ParquetWriter object. The writer will take input rows as JSON objects, convert them to the Parquet format and store them on disk.

// create a new ParquetWriter that writes to 'fruits.parquet'
var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');

// append a few rows to the file
await writer.appendRow({name: 'apples', quantity: 10, price: 2.5, date: new Date(), in_stock: true});
await writer.appendRow({name: 'oranges', quantity: 10, price: 2.5, date: new Date(), in_stock: true});

Once we are finished adding rows to the file, we have to tell the writer object to flush the metadata to disk and close the file by calling the close() method:

await writer.close();
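Since close() is what flushes the metadata and closes the file, it can be useful to make sure it runs even when appending a row throws. The following is a minimal sketch that relies only on the appendRow() and close() methods shown above. `writeRows` and `createFakeWriter` are hypothetical names; the fake writer is there only so the sketch runs without parquet.js installed.

```javascript
// Hypothetical helper: append every row, then always close the writer.
// `writer` is any object with async appendRow(row) and close() methods,
// for example the object returned by parquet.ParquetWriter.openFile(...).
async function writeRows(writer, rows) {
  try {
    for (const row of rows) {
      await writer.appendRow(row);
    }
  } finally {
    await writer.close();
  }
}

// Stand-in writer that records calls, so the sketch is runnable on its own.
function createFakeWriter() {
  const calls = [];
  return {
    calls,
    async appendRow(row) { calls.push(['appendRow', row.name]); },
    async close() { calls.push(['close']); },
  };
}

const fakeWriter = createFakeWriter();
const done = writeRows(fakeWriter, [{ name: 'apples' }, { name: 'oranges' }])
  .then(() => console.log(JSON.stringify(fakeWriter.calls)));
```

Whether a partially written file should still be closed normally after a failed append depends on the application; this sketch only shows one way to avoid leaving the file open.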

Usage: Reading files

A Parquet reader lets you retrieve the rows of a Parquet file in order. The basic usage is to create a reader and then retrieve a cursor/iterator which allows you to consume row after row until all rows have been read.

You may open more than one cursor and use them concurrently. All cursors become invalid once close() is called on the reader object.

// create a new ParquetReader that reads from 'fruits.parquet'
let reader = await parquet.ParquetReader.openFile('fruits.parquet');

// create a new cursor
let cursor = reader.getCursor();

// read all records from the file and print them
let record = null;
while (record = await cursor.next()) {
  console.log(record);
}

When creating a cursor, you can optionally request that only a subset of the columns should be read from disk. For example:

// create a new cursor that will only return the `name` and `price` columns
let cursor = reader.getCursor(['name', 'price']);

It is important that you call close() after you are finished reading the file to avoid leaking file descriptors.

await reader.close();
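The read loop and the final close() can be wrapped in one helper so that the reader is closed even if iteration fails. This is a sketch under stated assumptions: it uses only getCursor(), next() and close() as shown above, and it assumes next() resolves to null once all rows have been consumed. `readAllRows` and `createFakeReader` are hypothetical names; the fake reader stands in for a real one so the sketch runs without a Parquet file.

```javascript
// Hypothetical helper: collect every row from a reader, then always close it.
// `reader` is any object with getCursor(columnList) and async close();
// the cursor's async next() is assumed to return null when no rows remain.
async function readAllRows(reader, columnList) {
  const rows = [];
  try {
    const cursor = columnList ? reader.getCursor(columnList) : reader.getCursor();
    let record = null;
    while ((record = await cursor.next())) {
      rows.push(record);
    }
  } finally {
    await reader.close();
  }
  return rows;
}

// Stand-in reader over an in-memory array, for demonstration only.
function createFakeReader(data) {
  const state = { closed: false };
  return {
    state,
    getCursor(columnList) {
      let index = 0;
      return {
        async next() {
          if (index >= data.length) return null;
          const row = data[index++];
          if (!columnList) return row;
          const picked = {};
          for (const column of columnList) picked[column] = row[column];
          return picked;
        },
      };
    },
    async close() { state.closed = true; },
  };
}

const fakeReader = createFakeReader([
  { name: 'apples', price: 2.5, quantity: 10 },
  { name: 'oranges', price: 2.5, quantity: 10 },
]);
const allRows = readAllRows(fakeReader, ['name', 'price']);
allRows.then((rows) => console.log(JSON.stringify(rows)));
```

Collecting all rows into an array is only reasonable for small files; for large files, processing each record inside the loop keeps memory use flat.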

Encodings

Internally, the Parquet format will store values from each field as consecutive arrays which can be compressed/encoded using a number of schemes.

Plain Encoding (PLAIN)

The simplest encoding scheme is the PLAIN encoding. It stores the values as they are, without any compression. The PLAIN encoding is currently the default for all types except BOOLEAN:

var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8', encoding: 'PLAIN' },
});

Run Length Encoding (RLE)

The Parquet hybrid run length and bitpacking encoding compresses runs of numbers very efficiently. Note that the RLE encoding can only be used with BOOLEAN and with types that are stored as INT32 or INT64 (this includes logical types such as UINT_32; see the List of Supported Types & Encodings). The RLE encoding requires an additional bitWidth parameter that contains the maximum number of bits required to store the largest value of the field.

var schema = new parquet.ParquetSchema({
  age: { type: 'UINT_32', encoding: 'RLE', bitWidth: 7 },
});
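If the largest value a field can take is known, the bitWidth can be derived from it. The helper below is a small sketch, not part of parquet.js; `bitWidthFor` is a hypothetical name, and returning at least 1 for a maximum of 0 is a choice made for this sketch rather than a documented requirement.

```javascript
// Hypothetical helper: number of bits needed to represent `maxValue`,
// which must be a non-negative safe integer.
// Returns at least 1 (a choice made for this sketch).
function bitWidthFor(maxValue) {
  if (!Number.isSafeInteger(maxValue) || maxValue < 0) {
    throw new RangeError('maxValue must be a non-negative safe integer');
  }
  let bits = 0;
  let remaining = maxValue;
  while (remaining > 0) {
    bits += 1;
    remaining = Math.floor(remaining / 2);
  }
  return Math.max(bits, 1);
}

console.log(bitWidthFor(127)); // 7
console.log(bitWidthFor(128)); // 8
```

With bitWidth: 7, as in the age example above, the field can hold values from 0 to 127.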

Optional Fields

By default, all fields are required to be present in each row. You can also mark a field as 'optional' which will let you store rows with that field missing:

var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  quantity: { type: 'INT64', optional: true },
});

var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');
await writer.appendRow({name: 'apples', quantity: 10 });
await writer.appendRow({name: 'banana' }); // not in stock
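The source does not describe how parquet.js reports a row that lacks a required field, so it may be worth checking rows before appending them. The following sketch works on the plain field definitions from the schema above. `missingRequiredFields` is a hypothetical helper, and treating null like a missing value is an assumption of this sketch.

```javascript
// The field definitions from the schema above, as a plain object.
const fields = {
  name: { type: 'UTF8' },
  quantity: { type: 'INT64', optional: true },
};

// Hypothetical helper: names of required fields that a row does not provide.
// Assumption: null and undefined both count as "missing".
function missingRequiredFields(fieldDefs, row) {
  return Object.keys(fieldDefs).filter(
    (key) => !fieldDefs[key].optional && (row[key] === undefined || row[key] === null)
  );
}

console.log(JSON.stringify(missingRequiredFields(fields, { name: 'banana' }))); // []
console.log(JSON.stringify(missingRequiredFields(fields, { quantity: 10 })));   // ["name"]
```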

Nested Rows & Arrays

Parquet supports nested schemas that allow you to store rows that have a more complex structure than a simple tuple of scalar values. To declare a schema with a nested field, omit the type in the column definition and add a fields list instead. To store a list of values in a field, mark it as repeated.

Consider this example, which allows us to store a more advanced "fruits" table where each row contains a name, a list of colours and a list of "stock" objects.

// advanced fruits table
var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  colours: { type: 'UTF8', repeated: true },
  stock: {
    repeated: true,
    fields: {
      price: { type: 'DOUBLE' },
      quantity: { type: 'INT64' },
    }
  }
});

// the above schema allows us to store the following rows:
var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');

await writer.appendRow({
  name: 'banana',
  colours: ['yellow'],
  stock: [
    { price: 2.45, quantity: 16 },
    { price: 2.60, quantity: 420 }
  ]
});

await writer.appendRow({
  name: 'apple',
  colours: ['red', 'green'],
  stock: [
    { price: 1.20, quantity: 42 },
    { price: 1.30, quantity: 230 }
  ]
});

await writer.close();

// reading nested rows with a list of explicit columns
let reader = await parquet.ParquetReader.openFile('fruits.parquet');

let cursor = reader.getCursor([['name'], ['stock', 'price']]);
let record = null;
while (record = await cursor.next()) {
  console.log(record);
}

await reader.close();
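In the getCursor() call above, each requested column is given as a path: an array of field names leading from the top of the row to the wanted value. The sketch below illustrates what such a list of paths selects from a nested row. It is plain JavaScript, not the library's implementation, and the exact shape of the records returned by parquet.js is not shown in this document; `project` and `copyPath` are hypothetical names.

```javascript
// Illustration only: keep just the parts of a nested record named by `paths`.
// Each path is an array of field names, e.g. ['stock', 'price'].
function project(record, paths) {
  const out = {};
  for (const path of paths) copyPath(record, out, path);
  return out;
}

function copyPath(src, dst, path) {
  const [head, ...rest] = path;
  if (src === null || typeof src !== 'object' || !(head in src)) return;
  const value = src[head];
  if (rest.length === 0) {
    dst[head] = value;
  } else if (Array.isArray(value)) {
    // Repeated field: apply the remaining path to every element.
    dst[head] = dst[head] || value.map(() => ({}));
    value.forEach((item, i) => copyPath(item, dst[head][i], rest));
  } else {
    dst[head] = dst[head] || {};
    copyPath(value, dst[head], rest);
  }
}

const banana = {
  name: 'banana',
  colours: ['yellow'],
  stock: [
    { price: 2.45, quantity: 16 },
    { price: 2.60, quantity: 420 },
  ],
};
console.log(JSON.stringify(project(banana, [['name'], ['stock', 'price']])));
// {"name":"banana","stock":[{"price":2.45},{"price":2.6}]}
```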

It might not be obvious why one would want nested schemas when the same can, in principle, be achieved by serializing the record using JSON (or a similar scheme) and then storing it in a UTF8 field.

Putting aside the philosophical discussion on the merits of strict typing, knowing about the structure and subtypes of all records (globally) means we do not have to duplicate this metadata (i.e. the field names) for every record. On top of that, knowing about the type of a field allows us to compress the remaining data more efficiently.
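The cost of repeating field names can be estimated with a few lines of JavaScript. This is a rough sketch that counts characters in JSON text, not actual Parquet file sizes; it only looks at top-level keys, and both function names are hypothetical.

```javascript
// Characters spent on field names when every record is serialized as JSON:
// each key is written as a quoted string followed by a colon, once per record.
function jsonFieldNameChars(rows) {
  let total = 0;
  for (const row of rows) {
    for (const key of Object.keys(row)) {
      total += JSON.stringify(key).length + 1;
    }
  }
  return total;
}

// Characters spent on field names when they are declared once, up front.
function schemaFieldNameChars(fieldNames) {
  return fieldNames.reduce((sum, name) => sum + JSON.stringify(name).length + 1, 0);
}

const stockRows = Array.from({ length: 1000 }, (_, i) => ({ price: 2.5, quantity: i }));
console.log(jsonFieldNameChars(stockRows));                // 19000
console.log(schemaFieldNameChars(['price', 'quantity'])); // 19
```

The per-record cost grows linearly with the number of rows, while the schema cost is paid once, which is the duplication the paragraph above refers to.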

List of Supported Types & Encodings

We aim to be feature-complete and add new features as they are added to the Parquet specification; this is the list of currently implemented data types and encodings:

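| Logical Type     | Primitive Type | Encodings  |
|------------------|----------------|------------|
| UTF8             | BYTE_ARRAY     | PLAIN      |
| JSON             | BYTE_ARRAY     | PLAIN      |
| BSON             | BYTE_ARRAY     | PLAIN      |
| BYTE_ARRAY       | BYTE_ARRAY     | PLAIN      |
| TIME_MILLIS      | INT32          | PLAIN, RLE |
| TIME_MICROS      | INT64          | PLAIN, RLE |
| TIMESTAMP_MILLIS | INT64          | PLAIN, RLE |
| TIMESTAMP_MICROS | INT64          | PLAIN, RLE |
| BOOLEAN          | BOOLEAN        | PLAIN, RLE |
| FLOAT            | FLOAT          | PLAIN      |
| DOUBLE           | DOUBLE         | PLAIN      |
| INT32            | INT32          | PLAIN, RLE |
| INT64            | INT64          | PLAIN, RLE |
| INT96            | INT96          | PLAIN      |
| INT_8            | INT32          | PLAIN, RLE |
| INT_16           | INT32          | PLAIN, RLE |
| INT_32           | INT32          | PLAIN, RLE |
| INT_64           | INT64          | PLAIN, RLE |
| UINT_8           | INT32          | PLAIN, RLE |
| UINT_16          | INT32          | PLAIN, RLE |
| UINT_32          | INT32          | PLAIN, RLE |
| UINT_64          | INT64          | PLAIN, RLE |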

Buffering & Row Group Size

When writing a Parquet file, the ParquetWriter will buffer rows in memory until a row group is complete (or close() is called) and then write out the row group to disk.

The size of a row group is configurable by the user and controls the maximum number of rows that are buffered in memory at any given time as well as the number of rows that are co-located on disk:

var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');
writer.setRowGroupSize(8192);
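The buffering behaviour described above can be pictured with a small stand-alone class. This is an illustration of the idea only, not the parquet.js implementation; `RowGroupBuffer` and its `flush` callback are hypothetical.

```javascript
// Illustration of row-group buffering; not the parquet.js implementation.
class RowGroupBuffer {
  constructor(rowGroupSize, flush) {
    this.rowGroupSize = rowGroupSize;
    this.flush = flush; // called with the array of rows of each complete group
    this.rows = [];
  }

  // Buffer a row; write out the group once it is full.
  append(row) {
    this.rows.push(row);
    if (this.rows.length >= this.rowGroupSize) this.flushRows();
  }

  // Write out whatever is still buffered.
  close() {
    if (this.rows.length > 0) this.flushRows();
  }

  flushRows() {
    const group = this.rows;
    this.rows = [];
    this.flush(group);
  }
}

const groupSizes = [];
const rowBuffer = new RowGroupBuffer(3, (group) => groupSizes.push(group.length));
for (let i = 0; i < 7; i++) rowBuffer.append({ id: i });
rowBuffer.close();
console.log(JSON.stringify(groupSizes)); // [3,3,1]
```

A larger row group size means more rows held in memory before each write; a smaller one means more, smaller groups on disk.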

Dependencies

Parquet uses Thrift to encode the schema and other metadata; the actual data is not encoded with Thrift.

Contributions

Please make sure you sign the contributor license agreement in order for us to be able to accept your contribution. We thank you very much!

License

Copyright (c) 2017-2019 ironSource Ltd.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
