• Stars
    star
    225
  • Rank 177,187 (Top 4 %)
  • Language
    Clojure
  • License
    Apache License 2.0
  • Created over 5 years ago
  • Updated almost 2 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

An Amazon Athena driver for Metabase 0.32 and later

Metabase Athena Driver

Note: As of Metabase v0.45, Metabase now supports Amazon Athena as an official data source! πŸŽ‰
This driver will no longer be actively maintained. Any future issues can be asked about on the Metabase forum or with a detailed bug report.


CircleCI Latest Release Tested with Metabase v0.44.6 GitHub license

Installation

Beginning with Metabase 0.32, drivers must be stored in a plugins directory in the same directory where metabase.jar is, or you can specify the directory by setting the environment variable MB_PLUGINS_DIR. There are a few options to get up and running with a custom driver.

Docker

This repository has an example Dockerfile you can use to build the Amazon Athena Metabase driver and run the most recent supported version of Metabase:

git clone https://github.com/dacort/metabase-athena-driver.git
cd metabase-athena-driver
docker build -t metabase/athena .
docker run --name metabase-athena -p 3000:3000 metabase/athena

Then open http://localhost:3000 and skip down to Configuring

Download Metabase Jar and Run

  1. Download a fairly recent Metabase binary release (jar file) from the Metabase distribution page.
  2. Download the Athena driver jar from this repository's "Releases" page
  3. Create a directory and copy the metabase.jar to it.
  4. In that directory create a sub-directory called plugins.
  5. Copy the Athena driver jar to the plugins directory.
  6. Make sure you are the in the directory where your metabase.jar lives.
  7. Run java -jar metabase.jar.

In either case, you should see a message on startup similar to:

04-15 06:14:08 DEBUG plugins.lazy-loaded-driver :: Registering lazy loading driver :athena...
04-15 06:14:08 INFO driver.impl :: Registered driver :athena (parents: [:sql-jdbc]) 🚚

Configuring

Once you've started up Metabase, go to add a database and select "Amazon Athena".

You'll need to provide the AWS region, an access key and secret key, and an S3 bucket and prefix where query results will be written to.

Please note:

  • The provided bucket must be in the same region you specify.
  • If you do not provide an access key, the default credentials chain will be used.
  • The initial sync can take some time depending on how many databases and tables you have.

If you need an example IAM policy for providing read-only access to your customer-base, check out the Example IAM Policy below.

You can provide additional options if necessary. For example, to disable result set streaming and enable TRACE-level debugging, use UseResultsetStreaming=0;LogLevel=6.

Result set streaming is a performance optimization that streams results from Athena rather than using pagination logic, however it requries outbound access to TCP port 444 and not all organizations allow that.

Other options can be found in the "Driver Configuration Options" section of the Athena JDBC Driver Installation and Configuration Guide.

Contributing

Prerequisites

Build from source

The entire jar can now be built from the included Dockerfile.

  1. Build the project and copy the jar from the export stage
docker build --output jars --target stg_export .

You should now have a athena.metabase-driver.jar file in the jars/ directory.

  1. Download a fairly recent Metabase binary release (jar file) from the Metabase distribution page.

  2. Let's assume we download metabase.jar to ~/metabae/ and we built the project above. Copy the built jar to the Metabase plugins directly and run Metabase from there!

    TARGET_DIR=~/metabae
    mkdir ${TARGET_DIR}/plugins/
    cp jars/athena.metabase-driver.jar ${TARGET_DIR}/plugins/
    cd ${TARGET_DIR}/
    java -jar metabase.jar

You should see a message on startup similar to:

2019-05-07 23:27:32 INFO plugins.lazy-loaded-driver :: Registering lazy loading driver :athena...
2019-05-07 23:27:32 INFO metabase.driver :: Registered driver :athena (parents: #{:sql-jdbc}) 🚚

Testing

There are two different sets of tests in the project.

  1. Unit tests, located in the test_unit/ directory
  2. Integration tests, located in the standard test/ directory

The reason they're split out is because the integration tests require us to link the driver into the core Metabase code and run the full suite of tests there. I wanted to be able to have some lightweight unit tests that could be run without that overhead, so those are split out into the test_unit/ directory.

Running the tests requires you to have the metabase source relevant to the version you're building against. To make this easier, you can also run tests from the Dockerfile.

docker build -t metabase/athena-test --target stg_test .
docker run --rm --name mb-test metabase/athena-test

Resources

Example IAM Policy

This policy provides read-only access. Note you need to specify any buckets you want the user to be able to query from as well as the S3 bucket provided as part of the configuration where results are written to.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Athena",
      "Effect": "Allow",
      "Action": [
        "athena:BatchGetNamedQuery",
        "athena:BatchGetQueryExecution",
        "athena:GetNamedQuery",
        "athena:GetQueryExecution",
        "athena:GetQueryResults",
        "athena:GetQueryResultsStream",
        "athena:GetWorkGroup",
        "athena:ListDatabases",
        "athena:ListDataCatalogs",
        "athena:ListNamedQueries",
        "athena:ListQueryExecutions",
        "athena:ListTagsForResource",
        "athena:ListWorkGroups",
        "athena:ListTableMetadata",
        "athena:StartQueryExecution",
        "athena:StopQueryExecution",
        "athena:CreatePreparedStatement",
        "athena:DeletePreparedStatement",
        "athena:GetPreparedStatement"
      ],
      "Resource": "*"
    },
    {
      "Sid": "Glue",
      "Effect": "Allow",
      "Action": [
        "glue:BatchGetPartition",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetTableVersion",
        "glue:GetTableVersions"
      ],
      "Resource": "*"
    },
    {
      "Sid": "S3ReadAccess",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": [
        "arn:aws:s3:::bucket1",
        "arn:aws:s3:::bucket1/*",
        "arn:aws:s3:::bucket2",
        "arn:aws:s3:::bucket2/*"
      ]
    },
    {
      "Sid": "AthenaResultsBucket",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:AbortMultipartUpload",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": ["arn:aws:s3:::bucket2", "arn:aws:s3:::bucket2/*"]
    }
  ]
}

If your customer-base needs access to create tables for whatever reason, they will need additional AWS Glue permissions. Here is an example policy granting that. Note that the Resource is * so this will give Delete/Update permissions to any table.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "glue:BatchCreatePartition",
        "glue:UpdateDatabase",
        "glue:DeleteDatabase",
        "glue:CreateTable",
        "glue:CreateDatabase",
        "glue:UpdateTable",
        "glue:BatchDeletePartition",
        "glue:BatchDeleteTable",
        "glue:DeleteTable",
        "glue:CreatePartition",
        "glue:DeletePartition",
        "glue:UpdatePartition"
      ],
      "Resource": "*"
    }
  ]
}

Known Issues

  • Cannot specify a single database to sync
  • Only native SQL queries are supported
    • Native SQL Queries must not end with a semi-colon (;)
    • Basic aggregations seem to work in the query builder
    • Parameterized queries are not supported
  • Sometimes, the initial database verification can time out
    • If this happens, configure a higher timeout value with the MB_DB_CONNECTION_TIMEOUT_MS environment variable
  • Heavily nested fields can result in a StackOverflowError
    • If this happens, increase the -Xss JVM parameter

Updated Dockerfile

Test image

docker build -t metabase/athena-test --target stg_test .
docker run -it --rm --name mb-test metabase/athena-test

Copy jars

docker build --output jars --target stg_export .

Run Metabase

docker build -t metabase/athena .
docker run --rm --name metabase-athena -p 3000:3000 metabase/athena

If you have an existing Metabase database you'd like to use, you can use the following command.

docker run --rm  -p 3000:3000 \
  -v ~/metabase-data:/metabase-data \
  -e MB_DB_FILE=/metabase-data/metabase.db \
  --name metabase-athena metabase/athena

More Repositories

1

athena-sqlite

A SQLite driver for S3 and Amazon Athena 😳
Python
96
star
2

mwhich

Generic API to search for movies or TV shows across Netflix, Hulu, iTunes, and Amazon Video on Demand
Ruby
73
star
3

faker-cli

Command-line interface to quickly generate fake CSV and JSON data
Python
72
star
4

duckdb-athena-extension

An experimental Athena extension for DuckDB 🐀
Rust
49
star
5

modern-data-lake-storage-layers

Jupyter notebooks and AWS CloudFormation template to show how Hudi, Iceberg, and Delta Lake work
Jupyter Notebook
47
star
6

demo-code

Bits of code I use during live demos
Jupyter Notebook
28
star
7

damons-data-lake

All the code related to building my own data lake
CSS
22
star
8

athena-federation-python-sdk

Unofficial Python SDK for Athena Federation
Python
16
star
9

golang-sse-demo

A brief demo of real-time plotting with Plotly, Go, and server-sent events
Go
15
star
10

ci-cd-serverless-spark

Demo for GitHub Universe 2022
Python
12
star
11

emr-serverless-sql-cli

An experimental tool for running SQL on EMR Serverless
Python
8
star
12

go-meerkat

Meerkat API documentation and Go client
Go
8
star
13

dm-whacker

A bookmarklet to automatically delete Twitter Direct Messages
JavaScript
8
star
14

s3-diff-uploader

Python code to demonstrate differential uploading of files to S3.
Python
6
star
15

zoomit

Launch Zoom meetings in a single click πŸ–±
Go
5
star
16

metabase-trino-driver

Trino Driver for Metabase
Clojure
5
star
17

firejab

A simple Campfire to Jabber bridge
Ruby
5
star
18

s3mpty

A batteries-included tool for deleting the contents of versioned S3 buckets.
Go
5
star
19

redpill

A simple script to get my base OS X system up and running
Ruby
5
star
20

emr-cli-examples

Varied ways of deploying PySpark code to EMR and how the EMR CLI can make it all as easy as a single command.
Python
5
star
21

macdownloads

Repository of Downloads for OS X
4
star
22

notatsxsw

A combination of jealousy and rage resulted in a Google AppEngine proxy that would filter out SxSW tweets.
Python
4
star
23

ideas

Damon's Ideas
3
star
24

is-remote

A journal of my adventures in remote work
3
star
25

tweepml

TweepML is an XML format used to represent a list of Tweeps (Twitter users)
Ruby
3
star
26

ugrep

Hacked up shell script to grep in UTF-16 files
Shell
3
star
27

emr-job-templates

A sample repository of production-ready Spark code for use with Amazon EMR.
Python
2
star
28

syslog-to-athena

Use Fluentd to send syslogs to Athena for great querying
Dockerfile
2
star
29

metabase-datasette-driver

A Datasette driver for Metabase
Clojure
2
star
30

athena-query-stats

Query your Athena query history using Athena πŸ™†β€β™‚οΈ
Python
2
star
31

slugplot

Weather visualization to show change in average temperature over time.
Jupyter Notebook
2
star
32

athena-excel

Python
2
star
33

jupyter-static-website

A way to continuously deploy Jupyter notebooks to a static website backed by S3.
Jupyter Notebook
2
star
34

emr-eks-airflow2-plugin

An experimental Airflow 2.0 plugin for EMR on EKS
Python
2
star
35

spark-local-environment

An example of using EMR Serverless container image for local environment
Dockerfile
2
star
36

log4j-us

Dynamic log4j generator
HTML
1
star
37

ziply-dsl-monitor

My DSL was severely broken...so I graphed it.
HTML
1
star
38

forklift

Forklift your cargo into different places 🚚
Go
1
star
39

sample-code

Various code bits I run into
Java
1
star
40

emr-eks-terraform

Example of deploying EMR on EKS with Terraform
HCL
1
star
41

choirmaster

Go-based poller for dynamic data sources to make them sing with choir.io
Go
1
star
42

airflow-example-dags

Example dags for airflow experimentation
Python
1
star
43

athena-gmail

Athena Gmail connector
Python
1
star
44

spark-tweeter

I know ... you always wanted your Spark jobs to be able to tweet, right?
Go
1
star
45

byteable-calc

A byte-size HTML/JS calculator for making big numbers human-readable
HTML
1
star
46

cargo-crates

An easy way to build data extractors in Docker.
Python
1
star