
shib

WebUI for query engines: Hive and Presto

DESCRIPTION

Shib is a web client application for SQL-like query engines, written in Node.js, supporting:

  • Hive (hiveserver, hiveserver2)
  • Facebook Presto
  • Google BigQuery

Once configured, you can switch query engines per execution.

Some extension features are supported (see the sketch after this list):

  • Setup queries: an option to specify queries executed before the main query, like 'create temporary function ...'
  • Default Database: an option to specify the default database (for Hive 0.6 or later)
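
A minimal sketch of where these options live in config.js, with placeholder values; the key names follow the full examples under "Configuration" below:

var servers = exports.servers = {
  listen: 3000,
  setup_queries: [
    // executed before each main query
    "create temporary function foofunc as 'package.of.udf.FooFunc'"
  ],
  storage: { datadir: './var' },
  engines: [
    { label: 'mycluster1',
      executer: {
        name: 'hiveserver2',
        host: 'hs2.example.local',    // placeholder host
        port: 10000,
        support_database: true,
        default_database: 'default'   // the "Default Database" option
      },
      monitor: null
    }
  ]
};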

Versions

The latest version of 'shib' is v1.0.2.

'shib' versions are:

  • v1.0 series
    • metadata of v1.0 is NOT compatible with v0.3, so migration is required
  • v0.3 series
    • use latest node (~> v0.10.26)
    • multi engines/databases support
    • presto support
    • storages of v0.3.x are compatible with v0.2
    • engine-wide database/table access controls
    • tagging for executed queries
    • authentication / logging for query execution
    • access controls based on authentication
  • v0.2 series
    • current status of master branch
    • uses local filesystem instead of KT, depends on latest node (v0.8.x, v0.10.x)
    • higher performance, a safer Web UI and updated features
    • storages of v0.2 are NOT compatible with v0.1
  • v0.1 series
    • uses KT, depends on node v0.6.x
    • see v0.1 tag

INSTALL

Hive/Presto

For Hive queries, shib requires HiveServer or HiveServer2. Set up and run one of these.

  • For HiveServer2
    • Configure hive.server2.authentication as NOSASL
      • Strongly recommended to configure hive.support.concurrency as false

For Presto, shib is tested with Presto version 0.57.

Node.js

To run shib, you must install node.js (v0.10.x recommended) and add the installed node to your PATH.
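
For example (the install prefix below is a placeholder for wherever node is installed):

$ export PATH=/opt/node-v0.10/bin:$PATH
$ node -v
v0.10.26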

shib

Clone shib code.

$ git clone git://github.com/tagomoris/shib.git

Install libraries, and configure the addresses of HiveServer (and other settings).

$ npm install
$ vi config.js

And run.

$ npm start

Shib listens on port 3000. See http://localhost:3000/

To switch environments for each shib instance, use the NODE_ENV environment variable (e.g. production.js will be used with NODE_ENV=production).

$ NODE_ENV=production NODE_PATH=lib node app.js

Migrate metadata database from v0 to v1

Migration is required to run shib v1 with data from the v0 era. Follow the steps below (see the command sketch after the list):

  1. Stop the shib process
  2. Update shib code to v1
  3. Execute npm run migrate (for the var/database.sqlite3 file)
    • This operation takes about 5 minutes 30 seconds for a 220MB database
    • The v0 database file is backed up as var/database.sqlite3.v0
  4. Start shib
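
A command-line sketch of steps 3 and 4 (paths as described above):

$ npm run migrate              # migrates var/database.sqlite3 in place
$ ls var/database.sqlite3.v0   # the v0 backup kept by the migration
$ npm start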

Configuration

Shib can have 2 or more query executor engines.

HiveServer2

Basic configuration with HiveServer2 in config.js (or production.js):

var servers = exports.servers = {
  listen: 3000,
  fetch_lines: 1000,   // lines per fetch in shib
  query_timeout: null, // shib waits queries forever
  setup_queries: [],
  storage: {
    datadir: './var'
  },
  engines: [
    { label: 'mycluster1',
      executer: {
        name: 'hiveserver2',
        host: 'hs2.mycluster1.local',
        port: 10000,
        username: 'hive',
        support_database: true
      },
      monitor: null
    },
  ],
};

username should be the same as the user name that the hive job will be executed as. (A password is not required for NOSASL transport.)

For UDFs, you can specify statements to run before query execution in setup_queries.

var servers = exports.servers = {
  listen: 3000,
  fetch_lines: 1000,
  query_timeout: null,
  setup_queries: [
    "add jar /path/to/jarfile/foo.jar",
    "create temporary function foofunc as 'package.of.udf.FooFunc'",
    "create temporary function barfunc as 'package.of.udf.BarFunc'"
  ],
  storage: {
    datadir: './var'
  },
  engines: [
    { label: 'mycluster1',
      executer: {
        name: 'hiveserver2',
        host: 'hs2.mycluster1.local',
        port: 10000,
        support_database: true
      },
      monitor: null
    },
  ],
};

HiveServer

Classic HiveServer(1) is also available, with database support, if you use it instead of HiveServer2.

var servers = exports.servers = {
  listen: 3000,
  fetch_lines: 1000,
  query_timeout: null,
  setup_queries: [],
  storage: {
    datadir: './var'
  },
  engines: [
    { label: 'mycluster1',
      executer: {
        name: 'hiveserver',  // HiveServer(1)
        host: 'hs1.mycluster1.local',
        port: 10000,
        support_database: true,
        default_database: 'mylogs1'
      },
      monitor: null
    },
  ],
};

Presto

For Presto, use the presto executer.

var servers = exports.servers = {
  listen: 3000,
  fetch_lines: 1000,
  query_timeout: 30, // Presto queries time out and fail after 30 seconds
  setup_queries: [],
  storage: {
    datadir: './var'
  },
  engines: [
    { label: 'prestocluster1',
      executer: {
        name: 'presto',
        host: 'coordinator.mycluster2.local',
        port: 8080,
        user: 'shib',
        catalog: 'hive',  // required configuration argument
        support_database: true,
        default_database: 'mylogs1'
      },
      monitor: null
    },
  ],
};

BigQuery

For BigQuery, use the bigquery executer.

var servers = exports.servers = {
  listen: 3000,
  fetch_lines: 1000,
  query_timeout: 30, // BigQuery queries time out and fail after 30 seconds
  storage: {
    datadir: './var'
  },
  engines: [
    { label: 'bigquery',
      executer: {
        name: 'bigquery',
        default_database: 'mylogs1',
        project_id: 'gcp-project-id',
        key_filename: '/path/to/keyfile.json'
      },
      monitor: null
    }
  ]
};

For more detail about project_id and key_filename config, see https://github.com/GoogleCloudPlatform/gcloud-node#authorization .

Multi clusters and engines

Shib supports 2 or more engines for a cluster, and 2 or more clusters with the same engines. These patterns are available:

  • HiveServer1, HiveServer2 and Presto for the same data source
  • 2 or more catalogs on the same Presto cluster
  • Many clusters, each with one of HiveServer, HiveServer2 or Presto

Write the configurations you want in engines. fetch_lines, query_timeout and setup_queries in each engine overwrite the global defaults of these configurations.

For example, here is a configuration for the following setup:

  • ClusterA has HiveServer2
    • listens on port 10000
    • customized UDFs in foo.jar are available
  • ClusterB has HiveServer
    • listens on port 10001
    • customized UDFs in foo.jar are available
  • The Presto cluster is configured with a hive catalog and a hive2 catalog
    • UDFs are not available

var servers = exports.servers = {
  listen: 3000,
  fetch_lines: 1000,
  query_timeout: null,
  setup_queries: [
    "add jar /path/to/jarfile/foo.jar",
    "create temporary function foofunc as 'package.of.udf.FooFunc'",
    "create temporary function barfunc as 'package.of.udf.BarFunc'"
  ],
  storage: {
    datadir: './var'
  },
  engines: [
    { label: 'myclusterA',
      executer: {
        name: 'hiveserver2',
        host: 'master.a.cluster.local',
        port: 10000,
        support_database: true
      },
      monitor: null
    },
    { label: 'myclusterB',
      executer: {
        name: 'hiveserver',
        host: 'master.b.cluster.local',
        port: 10001,
        support_database: true,
        default_database: 'mylogs1'
      },
      monitor: null
    },
    { label: 'prestocluster1',
      executer: {
        name: 'presto',
        host: 'coordinator.p.cluster.local',
        port: 8080,
        user: 'shib',
        catalog: 'hive',
        support_database: true,
        default_database: 'mylogs1',
        query_timeout: 30,  // overwrite global config
        setup_queries: []   // overwrite global config
      },
      monitor: null
    },
    { label: 'prestocluster2',
      executer: {
        name: 'presto',
        host: 'coordinator.p.cluster.local',
        port: 8080,
        user: 'shib',
        catalog: 'hive2',  // one engine config per catalog
        support_database: true,
        default_database: 'default',
        query_timeout: 30,  // overwrite global config
        setup_queries: []   // overwrite global config
      },
      monitor: null
    }
  ],
};

Access Control

Shib has an access control list for databases/tables. The default is 'allow' for all databases/tables.

Shib's access control rules are:

  • configured per executer
  • a database-level default ('allow' or 'deny') without any optional rules makes that database visible or invisible
  • a database-level default plus an allow/deny table list makes its tables visible/invisible
    • in this case, the 'database' itself is visible
  • the top-level default 'allow' or 'deny' decides the visibility of databases without any rules of their own

Invisible databases and tables:

  • are not shown in tables/partitions lists and schema lists
  • cannot be queried by users (such queries always fail)

Access control options are written in 'executer' like this:

executer: {
  name: 'presto',
  host: 'coordinator.p.cluster.local',
  port: 8080,
  catalog: 'hive',
  support_database: true,
  default_database: 'default',
  query_timeout: 30,
  setup_queries: [],
  access_control: {
    databases: {
      secret: { default: "deny" },
      member: { default: "deny", allow: ["users"] },
      test:   { default: "allow", deny: ["secretData", "userMaster"] },
    },
    default: "allow"
  }
},

In this example, the secret database is hidden entirely, only the users table is visible in member, and everything except secretData and userMaster is visible in test; any database without an entry is fully visible because the top-level default is "allow". For more details, see wiki: Access Control.

Monitors

The monitor configuration is used to get query status and to kill queries.

JobTracker (MRv1)

The jobtracker monitor is available in MRv1 environments (with both hiveserver and hiveserver2).

monitor: {
  name: 'jobtracker',
  host: 'jobtracker.hostname.local',
  port: 50030,
  mapred: '/usr/bin/mapred' // 'mapred' in PATH by default
}

For this feature, shib should be executed by a user who can run the command mapred job -kill JOB_ID.

YARN (MRv2)

The yarn monitor is available in MRv2 environments (with both hiveserver and hiveserver2).

monitor: {
  name: 'yarn',
  host: 'resourcemanager.hostname.local',
  port: 8088
}

In this case, shib kills queries via the ResourceManager REST API.
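
For reference, the equivalent ResourceManager REST call looks like this (a sketch only; APP_ID is a placeholder for the YARN application id):

curl -X PUT -H 'Content-Type: application/json' -d '{"state": "KILLED"}' http://resourcemanager.hostname.local:8088/ws/v1/cluster/apps/APP_ID/state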

If you specify the yarn command path, shib kills queries with yarn application -kill APP_ID.

monitor: {
  name: 'yarn',
  host: 'resourcemanager.hostname.local',
  port: 8088,
  yarn: '/usr/bin/yarn'
}

Huahin Manager (obsolete)

For monitors in CDH4 + MRv1 environments, Huahin Manager is available.

To show map/reduce status, and/or to kill the actual map/reduce jobs behind a hive query, shib can use Huahin Manager. The current version supports 'Huahin Manager CDH4 + MRv1' only.

http://huahinframework.org/huahin-manager/

Configure the monitor argument like below, instead of null.

monitor: {
  name : 'huahin_mrv1',
  host: 'localhost',
  port: 9010
}

Authentication

Shib has authentication to log who executes queries and to control access: the setup_queries_auth option specifies queries executed before the main query when authentication is required.

var servers = exports.servers = {
  listen: 3000,
  fetch_lines: 1000,   // lines per fetch in shib
  query_timeout: null, // shib waits queries forever
  setup_queries: [],
  setup_queries_auth: ["set hive.mapred.mode=strict"],
  storage: {
    datadir: './var'
  },
  auth: {
    type: 'http_basic_auth',
    url: 'http://your.internal.protected.service.example.com/',
    realm: '@your.service.example.com'
  },
  engines: [
    { label: 'mycluster1',
      executer: {
        name: 'hiveserver2',
        host: 'hs2.mycluster1.local',
        port: 10000,
        username: 'hive',
        support_database: true
      },
      monitor: null
    },
  ],
};

For more details, see wiki: Authentication.

Miscellaneous configurations

Disable "history" tab

Specify disable_history: true on servers.

var servers = exports.servers = {
  listen: 3000,
  fetch_lines: 1000,
  query_timeout: null, // seconds (null: shib will wait for query responses indefinitely)
  setup_queries: [],
  disable_history: true,
  storage: {
    datadir: './var'
  },
  // ... engines and other configurations as usual ...
};

As HTTP Proxy for query engines

POST a query string to /execute with some parameters.

curl -s -X POST -F 'querystring=SELECT COUNT(*) AS cnt FROM yourtable WHERE field="value"' http://shib.server.local:3000/execute | jq .
{
  "queryid": "69927e67c5b1d5f665697943cc4867ec",
  "results": [],
  "dbname": "default",
  "engine": "hiveserver",
  "querystring": "SELECT COUNT(*) AS cnt FROM yourtable WHERE field=\"value\""
}

Specify engineLabel and dbname for non-default query engines and databases:

curl -s -X POST -F "engineLabel=presto" -F "dbname=testing" -F "querystring=SELECT COUNT(*) AS cnt FROM yourtable WHERE field='value'" http://shib.server.local:3000/execute

If you do not want your query added to the history tab, specify 'scheduled':

curl -s -X POST -F "scheduled=true" -F "querystring=SELECT COUNT(*) AS cnt FROM yourtable WHERE field='value'" http://shib.server.local:3000/execute

Then, fetch the query's status whenever you want.

curl -s http://shib.server.local:3000/status/69927e67c5b1d5f665697943cc4867ec 
executed

Or get the whole query object.

curl -s http://shib.server.local:3000/query/69927e67c5b1d5f665697943cc4867ec | jq .
{
  "queryid": "69927e67c5b1d5f665697943cc4867ec",
  "results": [
    {
      "resultid": "969629614dff69411a2f4f1733c9616a",
      "executed_at": "Wed Feb 26 2014 16:02:00 GMT+0900 (JST)"
    }
  ],
  "dbname": "default",
  "engine": "hiveserver",
  "querystring": "SELECT COUNT(*) AS cnt FROM yourtable WHERE field=\"value\""
}

If this query object has executed status (that is, it has a member in results), you can fetch its result by resultid.

# if you want elapsed time, bytes, lines, etc.
curl -s http://shib.server.local:3000/result/969629614dff69411a2f4f1733c9616a | jq .
{
  "schema": [
    {
      "type": "bigint",
      "name": "cnt"
    }
  ],
  "completed_msec": 1393398893759,
  "completed_at": "Wed Feb 26 2014 16:14:53 GMT+0900 (JST)",
  "completed_time": null,
  "bytes": 6,
  "queryid": "69927e67c5b1d5f665697943cc4867ec",
  "executed_time": null,
  "executed_at": "Wed Feb 26 2014 16:14:52 GMT+0900 (JST)",
  "executed_msec": 1393398892752,
  "resultid": "969629614dff69411a2f4f1733c9616a",
  "state": "done",
  "error": "",
  "lines": 2
}
# raw result data as TSV (fast)
curl -s http://shib.server.local:3000/download/tsv/969629614dff69411a2f4f1733c9616a
CNT
1234567
# or CSV (slow)
curl -s http://shib.server.local:3000/download/csv/969629614dff69411a2f4f1733c9616a
"CNT"
"1234567"

These HTTP requests/responses are the same as the ones the JavaScript in the browser makes.
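
As an illustration, a small Node.js client over these endpoints might look like the sketch below. It uses the built-in fetch of Node 18+, so it would not run on the node versions shib itself targets; the server address, polling interval and query are placeholders.

// Sketch: submit a query, poll until it is executed, then download the TSV result.
// The endpoints are the ones documented above; the assumption that /status
// eventually returns 'executed' on success is ours.
const BASE = 'http://shib.server.local:3000';

async function runQuery(querystring) {
  const form = new FormData();
  form.append('querystring', querystring);
  form.append('scheduled', 'true'); // keep this query out of the "history" tab

  const res = await fetch(`${BASE}/execute`, { method: 'POST', body: form });
  const { queryid } = await res.json();

  // poll /status/:queryid until the query has been executed
  for (;;) {
    const status = (await (await fetch(`${BASE}/status/${queryid}`)).text()).trim();
    if (status === 'executed') break;
    await new Promise((resolve) => setTimeout(resolve, 3000)); // wait 3s between polls
  }

  // look up the resultid from the query object, then fetch the raw TSV
  const query = await (await fetch(`${BASE}/query/${queryid}`)).json();
  const resultid = query.results[0].resultid;
  return (await fetch(`${BASE}/download/tsv/${resultid}`)).text();
}

runQuery('SELECT COUNT(*) AS cnt FROM yourtable').then(console.log, console.error);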


TODO

  • Patches are welcome!

License

Copyright 2011- TAGOMORI Satoshi (tagomoris)

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
