• This repository has been archived on 03/Jul/2023
  • Stars
    star
    292
  • Rank 142,152 (Top 3 %)
  • Language
    C
  • License
    Other
  • Created almost 11 years ago
  • Updated over 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

PostgreSQL protocol gateway for Presto distributed SQL query engine

Prestogres

PostgreSQL protocol gateway for Presto

Prestogres is a gateway server that allows clients to use PostgreSQL protocol to run queries on Presto.

You can use any PostgreSQL clients (see also Limitation section):

Prestogres also offers password-based authentication and SSL.

Prestogres internals at Slideshare

Documents


How it works?

Prestogres uses modified version of pgpool-II to rewrite queries before sending them to PostgreSQL. pgpool-II is originally a middleware to provide connection pool and load balancing to PostgreSQL. Prestogres hacked it as following:

  • When a client connects to pgpool-II, the modified pgpool-II runs SELECT setup_system_catalog(...) statement on PostgreSQL.
    • This function is implemented on PostgreSQL using PL/Python.
    • It gets list of tables from Presto, and runs CREATE TABLE for each tables.
    • Those created tables are empty, but clients can get the table schemas.
  • When the client runs a regular SELECT statement, the modified pgpool-II rewrites the query to run SELECT * FROM fetch_presto_query_results(...) statement.
    • This function runs the original query on Presto and returns the results.
  • If the statement is not regular SELECT (such as SET, SELECT from system catalogs, etc.), pgpool-II simply forwards the statement to PostgreSQL without rewriting.

In fact, there're some more tricks. See prestogres/pgsql/prestogres.py for the real behavior.


Limitation

  • Extended query is not supported
    • ODBC driver needs to set:
      • Server side prepare = no property (UseServerSidePrepare=0 at .ini file)
      • Level of rollback on errors = Transaction property (Protocol=7.4-0 or Protocol=6.4 at .ini file)
      • Unicode mode
    • JDBC driver needs to set:
      • protocolVersion=2 property
  • Temporary table is not supported
  • Some SQL commands of Presto don't work
    • Supported:
      • SELECT
      • EXPLAIN
      • INSERT INTO
      • CREATE TABLE
      • CREATE VIEW
    • Not supported:
      • DROP TABLE

Installation

1. Install PostgreSQL >= 9.3

You need to install PostgreSQL separately. Following commands install PostgreSQL 9.3 from postgresql.org:

Ubuntu/Debian:

# add apt source
$ sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
$ wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
$ sudo apt-get update
# install PostgreSQL
$ sudo apt-get install postgresql-9.3 postgresql-contrib-9.3 postgresql-server-dev-9.3 postgresql-plpython-9.3
# install other dependencies
$ sudo apt-get install gcc make libssl-dev libpcre3-dev

RedHat/CentOS:

# add yum source
$ sudo yum install http://yum.postgresql.org/9.3/redhat/rhel-6-x86_64/pgdg-redhat93-9.3-1.noarch.rpm
# install PostgreSQL
$ sudo yum install postgresql93-server postgresql93-contrib postgresql93-devel postgresql93-plpython
# install other dependencies
$ sudo yum install gcc make openssl-devel pcre-devel

Mac OS X:

You can install PostgreSQL using Homebrew.

brew install postgresql

2. Install Prestogres

Download the latest release from releases or clone the git repository. You can install the binary as following:

$ ./configure --program-prefix=prestogres- # if error occurs, add pg_config command $PATH (e.g. $ export PATH=/usr/pgsql-9.3/bin:$PATH)
$ make
$ sudo make install

You can find prestogres-ctl command:

$ prestogres-ctl --help

Running servers

You need to run 2 server programs: pgpool-II and PostgreSQL. You can use prestogres-ctl command to setup & run them as following:

# 1. Configure configuration file (at least presto_server parameter):
$ vi /usr/local/etc/prestogres.conf

# 2. Create a data directory:
$ prestogres-ctl create pgdata
# vi pgdata/postgresql.conf  # edit configuration if necessary

# 3. Start PostgreSQL
$ prestogres-ctl postgres -D pgdata

# 4. Open another shell, and initialize the database to install PL/Python functions
$ prestogres-ctl migrate

# 5. Start pgpool-II:
$ prestogres-ctl pgpool

# 6. Finally, you can connect to pgpool-II using psql command.
#    Database name ('hive') is name of a Presto catalog:
$ psql -h 127.0.0.1 -p 5439 -U presto hive

If configuration is correct, you can run SELECT * FROM sys.node; query. Otherwise, see log messages.

Setting shmem max parameter

Above command fails first time on most of environments! Error message is:

FATAL:  could not create shared memory segment: Cannot allocate memory
DETAIL:  Failed system call was shmget(key=6432001, size=3809280, 03600).
HINT:  This error usually means that PostgreSQL's request for a shared memory segment exceeded
available memory or swap space, or exceeded your kernel's SHMALL parameter.  You can either
reduce the request size or reconfigure the kernel with larger SHMALL.  To reduce the request
size (currently 3809280 bytes), reduce PostgreSQL's shared memory usage, perhaps by reducing
shared_buffers or max_connections.

You need to set 2 kernel parameters to run PostgreSQL.

Linux:

sudo bash -c "echo kernel.shmmax = 17179869184 >> /etc/sysctl.conf"
sudo bash -c "echo kernel.shmall = 4194304 >> /etc/sysctl.conf"
sudo sysctl -p /etc/sysctl.conf

Mac OS X:

$ sudo sysctl -w kern.sysv.shmmax=1073741824
$ sudo sysctl -w kern.sysv.shmall=1073741824

Configuration

Please read pgpool-II documentation for most of parameters used in prestogres.conf file. Following parameters are unique to Prestogres:

  • presto_server: address:port of a presto coordinator.
  • presto_catalog: (optional) catalog name of Presto (such as hive, etc.). By default, login database name is used as the catalog name
  • presto_schema: (optional) schema name of Presto (such as hive, etc.). By default, login database name is used as the schema name
  • presto_external_auth_prog: (optional) path to an external authentication program used by external authentication moethd. See following Authentication section for details.

You can overwrite these parameters for each connecting users (and databases) using prestogres_hba.conf file. See also following Authentication section.

Authentication

By default, Prestogres accepts all connections from 127.0.0.1 without password and rejects any other connections. You can change this behavior by updating $prefix/etc/prestogres_hba.conf file.

See sample prestogres_hba.conf file for details. Basic syntax is:

# TYPE  DATABASE  USER    CIDR-ADDRESS             METHOD        OPTIONS

# trust from 192.168.x.x without password
host    all       all     127.0.0.1/32             trust

# trust from 192.168.x.x without password
host    all       all     192.168.0.0/16           trust

# trust from 10.{1,2}.x.x without password
host    all       all     10.0.0.0/16,10.1.0.0/16  trust

# require password authentication from 10.3.x.x
host    all       all     10.3.0.0/16              md5

# overwrite presto_server address and catalog name if the login database name is altdb
host    altdb     all     0.0.0.0/0                md5           presto_server:alt.presto.example.com:8190,presto_catalog:hive

# run external command to authenticate if login user name is myuser
host    all       myuser  0.0.0.0/0                external      auth_prog:/opt/prestogres/auth.py

md5 method

This authentication method uses a password file ($prefix/etc/prestogres_passwd) to authenticate an user. You can use prestogres passwd command to add an user to this file:

$ prestogres-pg_md5 -pm -u myuser
password: (enter password here)

In prestogres_hba.conf file, you can set following options to the OPTIONS field:

  • presto_server: address:port of a presto coordinator, which overwrites presto_servers parameter in prestogres.conf.
  • presto_catalog: catalog name of Presto, which overwrites presto_catalog parameter in prestogres.conf.
  • presto_schema: schema name of Presto, which overwrites presto_schema parameter in prestogres.conf.
  • presto_user: user name to run queries on Presto (X-Presto-User). By default, login user name is used. Following pg_user parameter doesn't affect this parameter.
  • pg_database: (advanced) Overwrite database name on PostgreSQL. By default, login database name is used as-is. If this database does not exist on PostgreSQL, Prestogres automatically creates it.
  • pg_user: (advanced) Overwrite user name connecting to PostgreSQL. This value should be prestogres in most of cases. If you create another superuser on PostgreSQL manually, you may use this parameter.

external method

This authentication method runs an external command to authentication an user.

  • Note: This method is still experimental. Interface could be changed.
  • Note: This method requires clients to send password in clear text. It's recommended to enable SSL in prestogres.conf.

You need to set presto_external_auth_prog parameter in prestogres.conf or auth_prog option in prestogres_hba.conf. Prestogres runs the program every time when an user connects. The program receives following data through STDIN:

user:USER_NAME
password:PASSWORD
database:DATABASE
address:IPADDR

If you want to allow this connection, the program optionally prints parameters as following to STDOUT, and exists with status code 0:

presto_server:PRESTO_SERVER_ADDRESS
presto_catalog:PRESTO_CATALOG_NAME
presto_schema:PRESTO_SCHEMA_NAME
presto_user:USER_NAME
pg_database:DATABASE
pg_user:USER_NAME

If you want to reject this connection, the program exists with non-0 status code.


FAQ

I can connect from localhost but cannot from remote host

Prestogres trusts connections from localhost and rejects any other connections by default. To connect Prestogres from a remote host, you need to edit prestogres_hba.conf file.

See Authentication section for details.

I can connect to Prestogres but cannot run any queries

Prestogres gets all table information from Presto when you run the first query for each connection. If this initialization fails, all queries fail.

Prestogres runs following SQL on Presto to get table information. If this query fails on your Presto, Prestogres doesn't work.

select table_schema, table_name, column_name, is_nullable, data_type
from information_schema.columns

All queries by JDBC or ODBC clients fail with "Prestogres doesn't support extended query"

PostgreSQL has 2 protocols to run a query: simple query and extended query.

Extended query is a new protocol to support prepared statements (server-side prepared statements). Prestogres supports only simple query.

Fortunately, JDBC and ODBC clients implement prepared statements at client-side to be complatible with old PostgreSQL. You need to disable server-side prepared statements and enable the client-side implementation.

See Limitation section for the parameters.

If you have interest in the detailed protocol specification: PostgreSQL Frontend/Backend Protocol.

Time zone of timestamp type is wrong

Prestogres checks timezone session variable and passes it to Presto (X-Presto-Time-Zone).

  • To check timezone session variable: SHOW timezone
  • To change the timezone on a session: SET timezone TO UTC
  • To change the default timezone, edit timezone parameter at postgresql.conf file located in PostgreSQL's data directory you created using prestogres-ctl create command.

Prestogres is licensed under Apache License, Version 2.0.
Copyright (C) 2014 Sadayuki Furuhashi

More Repositories

1

digdag

Workload Automation System
Java
1,305
star
2

serverengine

A framework to implement robust multiprocess servers like Unicorn
Ruby
758
star
3

chef-td-agent

Chef Cookbook for td-agent (Treasure Agent or Fluentd)
Ruby
127
star
4

perfectqueue

Highly available distributed queue built on RDBMS
Ruby
124
star
5

td-agent

This repository is OBSOLETE, check gh/treasure-data/omnibus-td-agent
Shell
109
star
6

treasure-boxes

Treasure Boxes - pre-built pieces of code for developing, optimizing, and analyzing your data.
Python
109
star
7

perfectsched

Highly available distributed cron built on RDBMS
Ruby
97
star
8

omnibus-td-agent

td-agent (Fluentd) Packaging Scripts
Shell
82
star
9

trino-client-ruby

Trino/Presto client library for Ruby
Ruby
70
star
10

td-js-sdk

JavaScript SDK for Treasure Data
JavaScript
70
star
11

digdag-docs

Documents for Digdag Workflow Engine
HTML
50
star
12

td

CUI Interface
Ruby
49
star
13

elastic-beanstalk-td-agent

Example of installing td-agent on AWS Elastic Beanstalk (see .ebextentions directory)
Ruby
49
star
14

td-client-python

Treasure Data API library for Python
Python
47
star
15

pandas-td

Interactive data analysis with Pandas and Treasure Data.
Python
38
star
16

angular-treasure-overlay-spinner

Add a spinner to an element when binding is truthy.
JavaScript
36
star
17

kafka-fluentd-consumer

Kafka Consumer for Fluentd
Java
32
star
18

td-logger-ruby

Treasure Data logging library for Ruby / Rails
Ruby
27
star
19

luigi-td-example

Example Repository for Building Complex Data Pipeline with Luigi +TD
Python
24
star
20

td-ios-sdk

iOS SDK for Treasure Data
Objective-C
23
star
21

td-client-ruby

Ruby Client Library for Treasure Data
Ruby
23
star
22

td-android-sdk

Android SDK for Treasure Data
Java
22
star
23

heroku-td-agent

Treasure Agent on Heroku platform (accept HTTP logging)
Ruby
20
star
24

pytd

Treasure Data Driver for Python
Jupyter Notebook
18
star
25

luigi-td

Luigi Workflow Engine integration for Treasure Data
Python
16
star
26

td-logger-java

Treasure Data Logging Library for Java
Java
12
star
27

fluent-plugin-metricsense

MetricSense - application metrics collection plugin
Ruby
12
star
28

td-client-java

Java Client Library for Treasure Data
Java
12
star
29

td-client-node

Node.js Client Library for Treasure Data
JavaScript
12
star
30

metricsense

MetricSense for Ruby - application metrics collection API
Ruby
11
star
31

td-jdbc

JDBC Driver for Treasure Data
Java
11
star
32

embulk-input-google_analytics

Embulk Input Plugin for Google Analytics
Ruby
11
star
33

td-client-go

Go Client Library for Treasure Data
Go
11
star
34

sqsrun

Generic Amazon SQS Worker Executor Service
Ruby
10
star
35

Lead-List-from-CrunchBase-

Python
10
star
36

embulk-output-td

Embulk output plugin for Treasure Data
Java
10
star
37

td-ue4-sdk

Treasure Data Unreal Engine 4 SDK
C++
10
star
38

fluent-plugin-td-monitoring

Fluentd Plugin for Treasure Agent Monitoring Service
Ruby
10
star
39

stdout-hook

Import your event logs from STDOUT to TD or Fluentd
Ruby
9
star
40

ipython-notebook-examples

iPython notebook examples for Treasure Data
9
star
41

treasuredata_fdw

PostgreSQL Foreign Data Wrapper for Treasure Data
C
8
star
42

embulk-input-zendesk

Embulk Input Plugin for Zendesk
Java
8
star
43

embulk-input-td

Treasure Data Input Plugin for Embulk
Java
8
star
44

embulk-input-mixpanel

Embulk Input Plugin for Mixpanel
Ruby
8
star
45

td-notebooks

Jupyter notebook examples for Treasure Data
Jupyter Notebook
8
star
46

lambda-local-proxy

Local API proxy that calls an AWS Lambda function
Go
7
star
47

embulk-input-marketo

Embulk Input Plugin for Marketo
Java
7
star
48

treasure-academy-cdp

Python
6
star
49

fluent-plugin-td

Fluentd plugin for Treasure Data Service
Ruby
6
star
50

embulk-output-mailchimp

Embulk output plugin for Mailchimp
Java
6
star
51

lda-board

Auto segmentation UI using LDA
Ruby
5
star
52

fluent-plugin-librato-metrics

Librato Metrics output plugin for Fluentd event collector
Ruby
5
star
53

td-logger-python

Python logging module for td-agent
Python
4
star
54

embulk-filter-add_time

Java
4
star
55

embulk-input-jira

Embulk Input Plugin for JIRA
Java
4
star
56

RTD

Simple R client for Treasure Data
HTML
4
star
57

td-unity-sdk-package

Unity SDK for TreasureData
C#
3
star
58

eslint-plugin-td

Stores td-console config so that it can be reused
JavaScript
3
star
59

embulk-parser-query_string

Embulk parser plugin for URL-encoded key value pairs
Ruby
3
star
60

react-treasure-preview-table

console preview table for user data, react component
JavaScript
3
star
61

facebook-open-academy-fluentd-2015

This is the "course page" for Facebook Open Academy 2015 (Winter) for Fluentd
3
star
62

td-cordova-sdk

Treasure Data SDK Cordova Plugin
JavaScript
3
star
63

rsched

Generic Reliable Scheduler
Ruby
3
star
64

js-examples

HTML
3
star
65

hive-udf-neologd

Hive Japanese NLP UDFs with NEologd
Java
2
star
66

angular-treasure-focus-class

Adds a class to an element on focus and removes it when focus is lost.
JavaScript
2
star
67

prestogres-odbc

Fork of PostgreSQL ODBC driver for Prestogres
C
2
star
68

dockerfiles

The collection of Dockerfile
Shell
2
star
69

td-js-consent

This repo is for Treasure Data JavaScript Consent Management UIs
JavaScript
2
star
70

td2slack

Treasure Data to Slack app
Ruby
2
star
71

td-import-java

Treasure Data Import Tool by Java
Java
2
star
72

underwrap

A very thin wrapper of Undertow and Resteasy
Java
2
star
73

TD-API-Documentation-postman-collections

1
star
74

embulk-reporter-fluentd

Java
1
star
75

juju-layer-td-agent

Shell
1
star
76

subtree-deploy

Ruby
1
star
77

heroku-td

Heroku CLI plugin for Treasure Data
Ruby
1
star
78

td-react-native-sdk

Treasure Data React Native SDK
JavaScript
1
star
79

td2email

td2email
Ruby
1
star
80

PodSpecs

Ruby
1
star
81

treasure-academy-sql

1
star
82

td-libyaml

Binary Packaging Scripts for td-libyaml (dependency of td-agent package)
1
star
83

pytd-legacy

[DEPRECATED] This repo is being deprecated. Please check out
Python
1
star