• Stars
    star
    317
  • Rank 132,216 (Top 3 %)
  • Language
    C++
  • License
    PostgreSQL License
  • Created almost 6 years ago
  • Updated 7 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Parquet foreign data wrapper for PostgreSQL

build experimental

parquet_fdw

Read-only Apache Parquet foreign data wrapper for PostgreSQL.

Installation

parquet_fdw requires libarrow and libparquet installed in your system (requires version 0.15+, for previous versions use branch arrow-0.14). Please refer to libarrow installation page or building guide. To build parquet_fdw run:

make install

or in case when PostgreSQL is installed in a custom location:

make install PG_CONFIG=/path/to/pg_config

It is possible to pass additional compilation flags through either custom CCFLAGS or standard PG_CFLAGS, PG_CXXFLAGS, PG_CPPFLAGS variables.

After extension was successfully installed run in psql:

create extension parquet_fdw;

Basic usage

To start using parquet_fdw one should first create a server and user mapping. For example:

create server parquet_srv foreign data wrapper parquet_fdw;
create user mapping for postgres server parquet_srv options (user 'postgres');

Now you should be able to create foreign table for Parquet files.

create foreign table userdata (
    id           int,
    first_name   text,
    last_name    text
)
server parquet_srv
options (
    filename '/mnt/userdata1.parquet'
);

Advanced

Currently parquet_fdw supports the following column types:

Arrow type SQL type
INT8 INT2
INT16 INT2
INT32 INT4
INT64 INT8
FLOAT FLOAT4
DOUBLE FLOAT8
TIMESTAMP TIMESTAMP
DATE32 DATE
STRING TEXT
BINARY BYTEA
LIST ARRAY
MAP JSONB

Currently parquet_fdw doesn't support structs and nested lists.

Foreign table may be created for a single Parquet file and for a set of files. It is also possible to specify a user defined function, which would return a list of file paths. Depending on the number of files and table options parquet_fdw may use one of the following execution strategies:

Strategy Description
Single File Basic single file reader
Multifile Reader which process Parquet files one by one in sequential manner
Multifile Merge Reader which merges presorted Parquet files so that the produced result is also ordered; used when sorted option is specified and the query plan implies ordering (e.g. contains ORDER BY clause)
Caching Multifile Merge Same as Multifile Merge, but keeps the number of simultaneously open files limited; used when the number of specified Parquet files exceeds max_open_files

Following table options are supported:

  • filename - space separated list of paths to Parquet files to read;
  • sorted - space separated list of columns that Parquet files are presorted by; that would help postgres to avoid redundant sorting when running query with ORDER BY clause or in other cases when having a presorted set is beneficial (Group Aggregate, Merge Join);
  • files_in_order - specifies that files specified by filename or returned by files_func are ordered according to sorted option and have no intersection rangewise; this allows to use Gather Merge node on top of parallel Multifile scan (default false);
  • use_mmap - whether memory map operations will be used instead of file read operations (default false);
  • use_threads - enables Apache Arrow's parallel columns decoding/decompression (default false);
  • files_func - user defined function that is used by parquet_fdw to retrieve the list of parquet files on each query; function must take one JSONB argument and return text array of full paths to parquet files;
  • files_func_arg - argument for the function, specified by files_func;
  • max_open_files - the limit for the number of Parquet files open simultaneously.

GUC variables:

  • parquet_fdw.use_threads - global switch that allow user to enable or disable threads (default true);
  • parquet_fdw.enable_multifile - enable Multifile reader (default true).
  • parquet_fdw.enable_multifile_merge - enable Multifile Merge reader (default true).

Parallel queries

parquet_fdw also supports parallel query execution (not to confuse with multi-threaded decoding feature of Apache Arrow).

Import

parquet_fdw also supports IMPORT FOREIGN SCHEMA command to discover parquet files in the specified directory on filesystem and create foreign tables according to those files. It can be used as follows:

import foreign schema "/path/to/directory"
from server parquet_srv
into public;

It is important that remote_schema here is a path to a local filesystem directory and is double quoted.

Another way to import parquet files into foreign tables is to use import_parquet or import_parquet_explicit:

create function import_parquet(
    tablename   text,
    schemaname  text,
    servername  text,
    userfunc    regproc,
    args        jsonb,
    options     jsonb)

create function import_parquet_explicit(
    tablename   text,
    schemaname  text,
    servername  text,
    attnames    text[],
    atttypes    regtype[],
    userfunc    regproc,
    args        jsonb,
    options     jsonb)

The only difference between import_parquet and import_parquet_explicit is that the latter allows to specify a set of attributes (columns) to import. attnames and atttypes here are the attributes names and attributes types arrays respectively (see the example below).

userfunc is a user-defined function. It must take a jsonb argument and return a text array of filesystem paths to parquet files to be imported. args is user-specified jsonb object that is passed to userfunc as its argument. A simple implementation of such function and its usage may look like this:

create function list_parquet_files(args jsonb)
returns text[] as
$$
begin
    return array_agg(args->>'dir' || '/' || filename)
           from pg_ls_dir(args->>'dir') as files(filename)
           where filename ~~ '%.parquet';
end
$$
language plpgsql;

select import_parquet_explicit(
    'abc',
    'public',
    'parquet_srv',
    array['one', 'three', 'six'],
    array['int8', 'text', 'bool']::regtype[],
    'list_parquet_files',
    '{"dir": "/path/to/directory"}',
    '{"sorted": "one"}'
);

More Repositories

1

rmq

Message queue system written in Go and backed by Redis
Go
1,479
star
2

go-wrk

a small heavy duty http/https benchmark tool written in go
Go
816
star
3

ios_sdk

This is the iOS SDK of
Objective-C
579
star
4

android_sdk

This is the Android SDK of
Java
573
star
5

redismq

a durable message queue system for go based on redis, see also https://github.com/adjust/rmq
Go
501
star
6

shrimp

a phantomjs based pdf renderer
Ruby
283
star
7

sdks

SDKs of Adjust
141
star
8

unity_sdk

This is the Unity SDK of
C#
133
star
9

react_native_sdk

This is the React Native SDK of
Objective-C
119
star
10

kafka_fdw

kafka foreign database wrapper for postresql
C
101
star
11

gohub

github webhook based deloyment server
Go
60
star
12

gorails

A set of go packages to integrate your Go app into existing Rails project.
Go
60
star
13

flutter_sdk

This is the Flutter SDK of
Dart
54
star
14

web_sdk

JavaScript
40
star
15

pgbundle

bundling postgres extension
Ruby
39
star
16

postgresql_extension_demo

37
star
17

cordova_sdk

This is the Cordova SDK of
JavaScript
35
star
18

istore

development repo for integer hstore replacement in postgres
C
34
star
19

goem

go extension manager
Go
33
star
20

AEProductController

Small wrapper for SKStoreProductViewController that handles tracking links.
Objective-C
33
star
21

pg_cryogen

Compressed append-only pluggable storage for PostgreSQL 12+
C
30
star
22

adobe_air_sdk

This is the Adobe AIR SDK of
Java
24
star
23

rport

Connection management and SQL parallelisation for R analytics on big database clusters
R
23
star
24

nvd3-rails

nvd3 reusable charts for rails 3
Ruby
23
star
25

goenv

go boilerplate code
Go
23
star
26

pg-telemetry

Useful monitoring views for PostgreSQL, packaged as an extension
PLpgSQL
22
star
27

hydra-curl

a bash script to download with hundreds of parallel curls
Shell
17
star
28

pg-base36

a base36 extension for postgres
C
16
star
29

wltree

adjust's patched version of postgres ltree
C
14
star
30

xamarin_sdk

This is the Xamarin SDK of
C#
14
star
31

api-client-r

an R client for the KPI service https://docs.adjust.com/en/kpi-service/
R
14
star
32

unreal_sdk

This is the Unreal SDK of
C++
14
star
33

pg-currency

1 Byte Currency ISO type for PostgreSQL
C
12
star
34

windows_sdk

This is the Windows SDK of
C#
10
star
35

schaufel

C
10
star
36

cocos2dx_sdk

This is the Cocos2d-x SDK of
C++
8
star
37

pg-country

country type for postgres
C
7
star
38

marmalade_sdk

This is the Marmalade SDK of http://www.adjust.com
C++
7
star
39

pg_c_dev

Repo for Extending postgresql with C
PLpgSQL
7
star
40

corona_sdk

This is the Corona SDK of
Java
7
star
41

pg-roleman

Role Management Extension with some Nice Common Functions
PLpgSQL
6
star
42

ajbool

triple bool for postgres
C
6
star
43

adjust_signature_sdk

6
star
44

go_conf

a go package to simplify configuring golang apps with database.yml rails style
Go
5
star
45

gentoo-overlay

adjust Gentoo overlay
Shell
5
star
46

AEPriceMatrix

Tier based currency conversion for iOS
Objective-C
4
star
47

pg-numhstore

a postgres extension to support inthstore and floathstore types
C
4
star
48

postgresql_exporter

Go
4
star
49

csv-gists-r

Facilitates maintaining CSV on gist.github.com directly from R objects
R
4
star
50

react-and-rockets

React & Rockets - Challenge for Adjust Frontend Developer
JavaScript
4
star
51

pg-device_type

a device_type extension for postgres
C
3
star
52

iOS6AdTracking

Objective-C
3
star
53

titanium_sdk

This is the Titanium SDK of
JavaScript
3
star
54

pg_querylog

Show queries running on PostgreSQL backends
C
3
star
55

pg_type_template

An extension template for Postgres type
Jinja
3
star
56

redis_failover

redis failover scripts
Perl
3
star
57

adjust_anes

Google Play Services ANE Builder
Makefile
2
star
58

pg_intmap

Compressed integer-to-integer map
C
2
star
59

githubWorkflows

Shared workflows
2
star
60

pg-ext-actions

Github action to build and test PostgreSQL extensions
Shell
2
star
61

rport_demo

A Demo Rport App
R
2
star
62

goautoneg

fork of goautoneg from https://bitbucket.org/ww/goautoneg
Go
2
star
63

pg-geoip2lookup

A PL/Perl based extension for PostgreSQL for lookup in data in geoip2 dos for PostgreSQL
SQLPL
2
star
64

all_substrings_tokenizer

C Postgres extension for extracting all substrings of a string
C
2
star
65

pg-language

Postgres extension of language enumeration type
PLpgSQL
2
star
66

pg_lock_pool

A postgres Extension to wait on a lock pool
PLpgSQL
2
star
67

file_cache

Filesystem-backed caching for data, big and small
Elixir
2
star
68

pg_spec

ruby minitest based pg_tap like test runner for postgres
PLpgSQL
1
star
69

webdev-assignment

Vue
1
star
70

design-tokens

CSS
1
star
71

airbrake-client-r

An R client for the Airbrake API
R
1
star
72

dumbo

postgres extension fun
Ruby
1
star
73

ios_adobe_extension

Adjust SDK extension for Adobe Experience Platform Mobile
Objective-C
1
star
74

postgres_tools

adjust postgres tools
Perl
1
star
75

michaelbot

Go
1
star
76

pg-ajversion

simple semantic version type for postgres
PLpgSQL
1
star
77

go_demo

gogo demo app
Ruby
1
star
78

mailbot

Perl
1
star
79

hubot

smart ass bot
CoffeeScript
1
star
80

zabbix-api

perl zabbix bindings
Perl
1
star
81

postgres_agg_funcs

aggregation helper functions written in c
C
1
star
82

pg-mvtbl

A postgres Extension to easily move tables around tablespaces
PLpgSQL
1
star
83

smart_banner_sdk

Adjust Smart Banner SDK
TypeScript
1
star
84

dev-docs

Developer documentation for Adjust's SDKs and APIs.
MDX
1
star
85

PGObject-Util-Replication-Slot

Replication Slot monitoring and management for PostgreSQL via Perl/CPAN
Makefile
1
star