• Stars
    star
    249
  • Rank 162,987 (Top 4 %)
  • Language
    Python
  • License
    MIT License
  • Created about 3 years ago
  • Updated 8 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Python functions to query SQLite files stored on S3

sqlite-s3-query CircleCI Test Coverage

Python context managers to query a SQLite file stored on S3. It uses multiple HTTP range requests per query to avoid downloading the entire file, and so is suitable for large databases.

All queries using the same instance of the context will query the same version of the database object in S3. This means that a context is roughly equivalent to a REPEATABLE READ transaction, and queries should complete succesfully even if the database is replaced concurrently by another S3 client. Versioning must be enabled on the S3 bucket.

SQL statements that write to the database are not supported. If you're looking for a way to write to a SQLite database in S3, try sqlite-s3vfs.

Inspired by phiresky's sql.js-httpvfs, and dacort's Stack Overflow answer.

Installation

pip install sqlite_s3_query

The libsqlite3 binary library is also required, but this is typically already installed on most systems. The earliest version of libsqlite3 known to work is 2012-12-12 (3.7.15).

Usage

For single-statement queries, the sqlite_s3_query function can be used.

from sqlite_s3_query import sqlite_s3_query

with sqlite_s3_query(url='https://my-bucket.s3.eu-west-2.amazonaws.com/my-db.sqlite') as query:

    with query('SELECT * FROM my_table WHERE my_column = ?', params=('my-value',)) as (columns, rows):
        for row in rows:
            print(row)

    # Exactly the same results, even if the object in S3 was replaced
    with query('SELECT * FROM my_table WHERE my_column = ?', params=('my-value',)) as (columns, rows):
        for row in rows:
            print(row)

    # Or can use named parameters
    with query('SELECT * FROM my_table WHERE my_column = :my_param', named_params=((':my_param', 'my-value'),)) as (columns, rows):
        for row in rows:
            print(row)

For multi-statement queries, the sqlite_s3_query_multi function can be used.

from sqlite_s3_query import sqlite_s3_query_multi

with sqlite_s3_query_multi(url='https://my-bucket.s3.eu-west-2.amazonaws.com/my-db.sqlite') as query_multi:
    for (columns, rows) in query_multi('''
            SELECT * FROM my_table_a WHERE my_column_a = ?;
            SELECT * FROM my_table_b WHERE my_column_b = ?;
    ''', params=(('my-value-a',), ('my-value-b',)):
        for row in rows:
            print(row)

If in your project you query the same object from multiple places, functools.partial can be used to make an interface with less duplication.

from functools import partial
from sqlite_s3_query import sqlite_s3_query

query_my_db = partial(sqlite_s3_query,
    url='https://my-bucket.s3.eu-west-2.amazonaws.com/my-db.sqlite',
)

with \
        query_my_db() as query, \
        query('SELECT * FROM my_table WHERE my_col = ?', params=('my-value',)) as (columns, rows):

    for row in rows:
        print(row)

with \
        query_my_db() as query, \
        query('SELECT * FROM my_table_2 WHERE my_col = ?', params=('my-value',)) as (columns, rows):

    for row in rows:
        print(row)

Pandas DataFrame

You can create a Pandas DataFrame from query results by passing the rows iterable and columns tuple to the DataFrame constructor as below.

import pandas as pd
from sqlite_s3_query import sqlite_s3_query

with \
        sqlite_s3_query(url='https://my-bucket.s3.eu-west-2.amazonaws.com/my-db.sqlite') as query, \
        query('SELECT * FROM my_table WHERE my_column = ?', params=('my-value',)) as (columns, rows):

    df = pd.DataFrame(rows, columns=columns)

print(df)

Permissions

The AWS credentials must have both the s3:GetObject and s3:GetObjectVersion permissions on the database object. For example if the database is at the key my-db.sqlite in bucket my-bucket, then the minimal set of permissions are shown below.

{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:GetObjectVersion"],
        "Resource": "arn:aws:s3:::my-bucket/my-db.sqlite"
    }]
}

Credentials

The AWS region and the credentials are taken from environment variables, but this can be changed using the get_credentials parameter. Below shows the default implementation of this that can be overriden.

from sqlite_s3_query import sqlite_s3_query
import os

def get_credentials(_):
    return (
        os.environ['AWS_REGION'],
        os.environ['AWS_ACCESS_KEY_ID'],
        os.environ['AWS_SECRET_ACCESS_KEY'],
        os.environ.get('AWS_SESSION_TOKEN'),  # Only needed for temporary credentials
    )

query_my_db = partial(sqlite_s3_query,
    url='https://my-bucket.s3.eu-west-2.amazonaws.com/my-db.sqlite',
    get_credentials=get_credentials,
)

with \
        query_my_db() as query, \
        query('SELECT * FROM my_table_2 WHERE my_col = ?', params=('my-value',)) as (columns, rows):

    for row in rows:
        print(row)

sqlite-s3-query does not install or use boto3, but if you install it separately, you can use it to fetch credentials as in the below example. This can be useful when you want to use temporary credentials associated with an ECS or EC2 role, which boto3 fetches automatically.

import boto3
from sqlite_s3_query import sqlite_s3_query

def GetBoto3Credentials():
    session = boto3.Session()
    credentials = session.get_credentials()
    def get_credentials(_):
        return (session.region_name,) + credentials.get_frozen_credentials()

    return get_credentials

query_my_db = partial(sqlite_s3_query,
    url='https://my-bucket.s3.eu-west-2.amazonaws.com/my-db.sqlite',
    get_credentials=GetBoto3Credentials(),
)

with \
        query_my_db() as query, \
        query('SELECT * FROM my_table_2 WHERE my_col = ?', params=('my-value',)) as (columns, rows):

    for row in rows:
        print(row)

HTTP Client

The HTTP client can be changed by overriding the the default get_http_client parameter, which is shown below.

from functools import partial
import httpx
from sqlite_s3_query import sqlite_s3_query

query_my_db = partial(sqlite_s3_query,
    url='https://my-bucket.s3.eu-west-2.amazonaws.com/my-db.sqlite',
    get_http_client=lambda: httpx.Client(transport=httpx.HTTPTransport(retries=3)),
)

with \
        query_my_db() as query, \
        query('SELECT * FROM my_table WHERE my_col = ?', params=('my-value',)) as (columns, rows):

    for row in rows:
        print(row)

Location of libsqlite3

The location of the libsqlite3 library can be changed by overriding the get_libsqlite3 parameter.

from ctypes import cdll
from ctypes.util import find_library
from functools import partial
from sqlite_s3_query import sqlite_s3_query

query_my_db = partial(sqlite_s3_query,
    url='https://my-bucket.s3.eu-west-2.amazonaws.com/my-db.sqlite',
    get_libsqlite3=lambda: cdll.LoadLibrary(find_library('sqlite3'))
)

with \
        query_my_db() as query, \
        query('SELECT * FROM my_table WHERE my_col = ?', params=('my-value',)) as (columns, rows):

    for row in rows:
        print(row)

Multithreading

It is safe for multiple threads to call the same query function. Under the hood, each use of query uses a separate SQLite "connection" to the database combined with theSQLITE_OPEN_NOMUTEX flag, which makes this safe while not locking unnecessarily.

More Repositories

1

asyncio-buffered-pipeline

Utility function to parallelise pipelines of Python asyncio iterators/generators
Python
114
star
2

aiodnsresolver

Python asyncio DNS resolver
Python
62
star
3

fifolock

A flexible low-level tool to make synchronisation primitives in asyncio Python
Python
53
star
4

sqlite-memory-vfs

Python writable in-memory virtual filesystem for SQLite
Python
16
star
5

OpenTTDLab

A Python framework for running reproducible experiments using OpenTTD
Python
16
star
6

stream-inflate

Uncompress Deflate and Deflate64 streams in pure Python
Python
9
star
7

aiothrottler

Throttler for asyncio Python
Python
8
star
8

PDW-File-Browser

PHP
8
star
9

aiofastforward

Fast-forward time in asyncio Python by providing patched versions of loop.call_later, loop.call_at, loop.time, and asyncio.sleep.
Python
8
star
10

lowhaio

A lightweight Python asyncio HTTP client
Python
7
star
11

threaded-buffered-pipeline

Utility function to parallelise pipelines of Python iterables
Python
5
star
12

treelock

Fast read/write sub-tree locking for asyncio Python
Python
5
star
13

django-postgres-isolation-levels

A set of tests exploring PostgreSQL transactions and Django
Python
5
star
14

s3selectparser

Python parser for the S3 Select SQL language
Python
3
star
15

projections

Rotating the world before applying the Mercator projection in realtime in a browser
JavaScript
3
star
16

blog

My personal blog, powered by gulp, hosted on AWS S3 with Cloudfront as CDN
HTML
2
star
17

python-http-signature-client

Python implementation of the client side of the IETF draft "Signing HTTP Messages"
Python
2
star
18

python-http-signature-server

Python implementation of the server side of the IETF draft "Signing HTTP Messages"
Python
2
star
19

aiomemoize

Memoize ayncio Python function calls, with manual invalidation
Python
1
star
20

FullScreenImage

A MooTools plugin to show images in full screen, using a small Flash applet.
JavaScript
1
star
21

ffmpeg-tools

Couple of hacked-together scripts to join together jpgs frame-by-frame into a video with blackframes in between
Shell
1
star
22

ubuntu-ros-opencv

Docker image with ROS and OpenCV
Dockerfile
1
star
23

lowhaio-chunked

Chunked transfer request encoding for lowhaio
Python
1
star
24

lowhaio-aws-sigv4-unsigned-payload

AWS Signature Version 4 with unsigned payload signing for lowhaio
Python
1
star
25

aiomemoizettl

Memoize asyncio Python calls with a per-result TTL
Python
1
star
26

Fx.Presets

MooTools plugin that allows clear definition and use of groups of related animations on multiple elements.
JavaScript
1
star
27

aiotimeout

Timeout context manager for asyncio Python
Python
1
star
28

aiomemoizeconcurrent

Memoize concurrent asyncio Python function calls
Python
1
star
29

stream-parse-openttd

Python package to parse the contents of OpenTTD save games
Python
1
star