• Stars
    star
    426
  • Rank 101,884 (Top 3 %)
  • Language
    Python
  • License
    MIT License
  • Created over 7 years ago
  • Updated 3 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

SQLAlchemy dialect for BigQuery

SQLAlchemy Dialect for BigQuery

GA pypi versions

SQLALchemy Dialects

Quick Start

In order to use this library, you first need to go through the following steps:

  1. Select or create a Cloud Platform project.
  2. [Optional] Enable billing for your project.
  3. Enable the BigQuery Storage API.
  4. Setup Authentication.

Note

This library is only compatible with SQLAlchemy versions < 2.0.0

Installation

Install this library in a virtualenv using pip. virtualenv is a tool to create isolated Python environments. The basic problem it addresses is one of dependencies and versions, and indirectly permissions.

With virtualenv, it's possible to install this library without needing system install permissions, and without clashing with the installed system dependencies.

Supported Python Versions

Python >= 3.7

Unsupported Python Versions

Python <= 3.6.

Mac/Linux

pip install virtualenv
virtualenv <your-env>
source <your-env>/bin/activate
<your-env>/bin/pip install sqlalchemy-bigquery

Windows

pip install virtualenv
virtualenv <your-env>
<your-env>\Scripts\activate
<your-env>\Scripts\pip.exe install sqlalchemy-bigquery

Installations when processing large datasets

When handling large datasets, you may see speed increases by also installing the bqstorage dependencies. See the instructions above about creating a virtual environment and then install sqlalchemy-bigquery using the bqstorage extras:

source <your-env>/bin/activate
<your-env>/bin/pip install sqlalchemy-bigquery[bqstorage]

Usage

SQLAlchemy

from sqlalchemy import *
from sqlalchemy.engine import create_engine
from sqlalchemy.schema import *
engine = create_engine('bigquery://project')
table = Table('dataset.table', MetaData(bind=engine), autoload=True)
print(select([func.count('*')], from_obj=table).scalar())

Project

project in bigquery://project is used to instantiate BigQuery client with the specific project ID. To infer project from the environment, use bigquery:// – without project

Authentication

Follow the Google Cloud library guide for authentication.

Alternatively, you can choose either of the following approaches:

  • provide the path to a service account JSON file in create_engine() using the credentials_path parameter:
# provide the path to a service account JSON file
engine = create_engine('bigquery://', credentials_path='/path/to/keyfile.json')
  • pass the credentials in create_engine() as a Python dictionary using the credentials_info parameter:
# provide credentials as a Python dictionary
credentials_info = {
    "type": "service_account",
    "project_id": "your-service-account-project-id"
},
engine = create_engine('bigquery://', credentials_info=credentials_info)

Location

To specify location of your datasets pass location to create_engine():

engine = create_engine('bigquery://project', location="asia-northeast1")

Table names

To query tables from non-default projects or datasets, use the following format for the SQLAlchemy schema name: [project.]dataset, e.g.:

# If neither dataset nor project are the default
sample_table_1 = Table('natality', schema='bigquery-public-data.samples')
# If just dataset is not the default
sample_table_2 = Table('natality', schema='bigquery-public-data')

Batch size

By default, arraysize is set to 5000. arraysize is used to set the batch size for fetching results. To change it, pass arraysize to create_engine():

engine = create_engine('bigquery://project', arraysize=1000)

Page size for dataset.list_tables

By default, list_tables_page_size is set to 1000. list_tables_page_size is used to set the max_results for dataset.list_tables operation. To change it, pass list_tables_page_size to create_engine():

engine = create_engine('bigquery://project', list_tables_page_size=100)

Adding a Default Dataset

If you want to have the Client use a default dataset, specify it as the "database" portion of the connection string.

engine = create_engine('bigquery://project/dataset')

When using a default dataset, don't include the dataset name in the table name, e.g.:

table = Table('table_name')

Note that specifying a default dataset doesn't restrict execution of queries to that particular dataset when using raw queries, e.g.:

# Set default dataset to dataset_a
engine = create_engine('bigquery://project/dataset_a')

# This will still execute and return rows from dataset_b
engine.execute('SELECT * FROM dataset_b.table').fetchall()

Connection String Parameters

There are many situations where you can't call create_engine directly, such as when using tools like Flask SQLAlchemy. For situations like these, or for situations where you want the Client to have a default_query_job_config, you can pass many arguments in the query of the connection string.

The credentials_path, credentials_info, credentials_base64, location, arraysize and list_tables_page_size parameters are used by this library, and the rest are used to create a QueryJobConfig

Note that if you want to use query strings, it will be more reliable if you use three slashes, so 'bigquery:///?a=b' will work reliably, but 'bigquery://?a=b' might be interpreted as having a "database" of ?a=b, depending on the system being used to parse the connection string.

Here are examples of all the supported arguments. Any not present are either for legacy sql (which isn't supported by this library), or are too complex and are not implemented.

engine = create_engine(
    'bigquery://some-project/some-dataset' '?'
    'credentials_path=/some/path/to.json' '&'
    'location=some-location' '&'
    'arraysize=1000' '&'
    'list_tables_page_size=100' '&'
    'clustering_fields=a,b,c' '&'
    'create_disposition=CREATE_IF_NEEDED' '&'
    'destination=different-project.different-dataset.table' '&'
    'destination_encryption_configuration=some-configuration' '&'
    'dry_run=true' '&'
    'labels=a:b,c:d' '&'
    'maximum_bytes_billed=1000' '&'
    'priority=INTERACTIVE' '&'
    'schema_update_options=ALLOW_FIELD_ADDITION,ALLOW_FIELD_RELAXATION' '&'
    'use_query_cache=true' '&'
    'write_disposition=WRITE_APPEND'
)

In cases where you wish to include the full credentials in the connection URI you can base64 the credentials JSON file and supply the encoded string to the credentials_base64 parameter.

engine = create_engine(
    'bigquery://some-project/some-dataset' '?'
    'credentials_base64=eyJrZXkiOiJ2YWx1ZSJ9Cg==' '&'
    'location=some-location' '&'
    'arraysize=1000' '&'
    'list_tables_page_size=100' '&'
    'clustering_fields=a,b,c' '&'
    'create_disposition=CREATE_IF_NEEDED' '&'
    'destination=different-project.different-dataset.table' '&'
    'destination_encryption_configuration=some-configuration' '&'
    'dry_run=true' '&'
    'labels=a:b,c:d' '&'
    'maximum_bytes_billed=1000' '&'
    'priority=INTERACTIVE' '&'
    'schema_update_options=ALLOW_FIELD_ADDITION,ALLOW_FIELD_RELAXATION' '&'
    'use_query_cache=true' '&'
    'write_disposition=WRITE_APPEND'
)

To create the base64 encoded string you can use the command line tool base64, or openssl base64, or python -m base64.

Alternatively, you can use an online generator like www.base64encode.org <https://www.base64encode.org>_ to paste your credentials JSON file to be encoded.

Supplying Your Own BigQuery Client

The above connection string parameters allow you to influence how the BigQuery client used to execute your queries will be instantiated. If you need additional control, you can supply a BigQuery client of your own:

from google.cloud import bigquery

custom_bq_client = bigquery.Client(...)

engine = create_engine(
    'bigquery://some-project/some-dataset?user_supplied_client=True',
    connect_args={'client': custom_bq_client},
)

Creating tables

To add metadata to a table:

table = Table('mytable', ..., bigquery_description='my table description', bigquery_friendly_name='my table friendly name')

To add metadata to a column:

Column('mycolumn', doc='my column description')

Threading and Multiprocessing

Because this client uses the grpc library, it's safe to share instances across threads.

In multiprocessing scenarios, the best practice is to create client instances after the invocation of os.fork by multiprocessing.pool.Pool or multiprocessing.Process.

More Repositories

1

google-api-nodejs-client

Google's officially supported Node.js client library for accessing Google APIs. Support for authorization and authentication with OAuth 2.0, API Keys and JWT (Service Tokens) is included.
TypeScript
11,377
star
2

google-api-php-client

A PHP client library for accessing Google APIs
PHP
8,706
star
3

google-api-python-client

🐍 The official Python client library for Google's discovery based APIs.
Python
6,858
star
4

googleapis

Public interface definitions of Google APIs.
Starlark
6,797
star
5

release-please

generate release PRs based on the conventionalcommits.org spec
TypeScript
4,740
star
6

google-cloud-python

Google Cloud Client Library for Python
Python
4,324
star
7

google-api-go-client

Auto-generated Google APIs for Go.
Go
3,572
star
8

google-cloud-go

Google Cloud Client Libraries for Go.
Go
3,361
star
9

google-api-ruby-client

REST client for Google APIs
Ruby
2,679
star
10

google-cloud-node

Google Cloud Client Library for Node.js
TypeScript
2,654
star
11

google-cloud-java

Google Cloud Client Library for Java
Java
1,773
star
12

google-auth-library-nodejs

πŸ”‘ Google Auth Library for Node.js
TypeScript
1,549
star
13

google-http-java-client

Google HTTP Client Library for Java
Java
1,342
star
14

google-api-dotnet-client

Google APIs Client Library for .NET
C#
1,340
star
15

google-api-java-client

Google APIs Client Library for Java
Java
1,336
star
16

google-auth-library-php

Google Auth Library for PHP
PHP
1,323
star
17

google-cloud-ruby

Google Cloud Client Library for Ruby
Ruby
1,293
star
18

google-api-php-client-services

PHP
1,179
star
19

google-cloud-php

Google Cloud Client Library for PHP
PHP
1,085
star
20

elixir-google-api

Elixir client libraries for accessing Google APIs.
Elixir
1,011
star
21

google-cloud-dotnet

Google Cloud Client Libraries for .NET
C#
929
star
22

nodejs-storage

Node.js client for Google Cloud Storage: unified object storage for developers and enterprises, from live data serving to data analytics/ML to data archiving.
TypeScript
828
star
23

oauth2client

This is a Python library for accessing resources protected by OAuth 2.0.
Python
795
star
24

nodejs-dialogflow

Node.js client for Dialogflow: Design and integrate a conversational user interface into your applications and devices.
JavaScript
793
star
25

google-auth-library-python

Google Auth Python Library
Python
744
star
26

python-bigquery

Python
739
star
27

gaxios

An HTTP request client that provides an axios like interface over top of node-fetch. Super lightweight. Supports proxies and all sorts of other stuff.
TypeScript
692
star
28

nodejs-speech

This repository is deprecated. All of its content and history has been moved to googleapis/google-cloud-node.
684
star
29

python-aiplatform

A Python SDK for Vertex AI, a fully managed, end-to-end platform for data science and machine learning.
Python
626
star
30

repo-automation-bots

A collection of bots, based on probot, for performing common maintenance tasks across the open-source repos managed by Google on GitHub.
TypeScript
613
star
31

nodejs-firestore

Node.js client for Google Cloud Firestore: a NoSQL document database built for automatic scaling, high performance, and ease of application development.
JavaScript
612
star
32

google-oauth-java-client

Google OAuth Client Library for Java
Java
606
star
33

api-linter

A linter for APIs defined in protocol buffers.
Go
575
star
34

go-genproto

Generated code for Google Cloud client libraries.
Go
558
star
35

google-cloud-cpp

C++ Client Libraries for Google Cloud Services
C++
538
star
36

nodejs-pubsub

Node.js client for Google Cloud Pub/Sub: Ingest event streams from anywhere, at any scale, for simple, reliable, real-time stream analytics.
TypeScript
519
star
37

nodejs-translate

Node.js client for Google Cloud Translate: Dynamically translate text between thousands of language pairs.
JavaScript
514
star
38

nodejs-vision

Node.js client for Google Cloud Vision: Derive insight from images.
TypeScript
497
star
39

google-api-java-client-services

Generated Java code for Google APIs
497
star
40

python-bigquery-pandas

Google BigQuery connector for pandas
Python
447
star
41

nodejs-bigquery

Node.js client for Google Cloud BigQuery: A fast, economical and fully-managed enterprise data warehouse for large-scale data analytics.
TypeScript
420
star
42

google-auth-library-ruby

Google Auth Library for Ruby
Ruby
417
star
43

google-auth-library-java

Open source Auth client library for Java
Java
400
star
44

python-dialogflow

This library has moved to https://github.com/googleapis/google-cloud-python/tree/main/packages/google-cloud-dialogflow
397
star
45

python-pubsub

Python
390
star
46

signet

Signet is an OAuth 1.0 / OAuth 2.0 implementation.
Ruby
364
star
47

nodejs-text-to-speech

Node.js client for Google Cloud Text-to-Speech
JavaScript
355
star
48

python-speech

This library has moved to https://github.com/googleapis/google-cloud-python/tree/main/packages/google-cloud-speech
355
star
49

python-storage

Python
339
star
50

google-cloud-php-storage

PHP
322
star
51

google-cloud-php-core

PHP
319
star
52

gapic-generator

Tools for generating API client libraries from API Service Configuration descriptions.
Java
304
star
53

cloud-trace-nodejs

Node.js agent for Cloud Trace: automatically gather latency data about your application
TypeScript
272
star
54

gapic-generator-go

Generate Go API client libraries from Protocol Buffers.
Go
252
star
55

gax-php

Google API Extensions for PHP
PHP
232
star
56

api-common-protos

A standard library for use in specifying protocol buffer APIs.
Starlark
221
star
57

python-firestore

Python
214
star
58

google-cloud-datastore

Low-level, Protobuf-based Java and Python client libraries for Cloud Datastore. Check out google-cloud-java and google-cloud-python first!
Python
213
star
59

nodejs-datastore

Node.js client for Google Cloud Datastore: a highly-scalable NoSQL database for your web and mobile applications.
TypeScript
196
star
60

google-cloud-php-translate

PHP
194
star
61

python-bigquery-dataframes

BigQuery DataFrames
Python
186
star
62

google-cloud-rust

Rust
183
star
63

gapic-showcase

An API that demonstrates Generated API Client (GAPIC) features and common API patterns used by Google.
Go
174
star
64

github-repo-automation

A set of tools to automate multiple GitHub repository management.
TypeScript
174
star
65

google-cloud-php-firestore

PHP
170
star
66

cloud-debug-nodejs

Node.js agent for Google Cloud Debugger: investigate your code’s behavior in production
TypeScript
169
star
67

java-bigtable-hbase

Java libraries and HBase client extensions for accessing Google Cloud Bigtable
Java
165
star
68

gax-java

This library has moved to https://github.com/googleapis/sdk-platform-java/tree/main/gax-java.
162
star
69

python-vision

This library has moved to https://github.com/googleapis/google-cloud-python/tree/main/packages/google-cloud-vision
160
star
70

google-auth-library-python-oauthlib

Python
160
star
71

nodejs-logging

Node.js client for Stackdriver Logging: Store, search, analyze, monitor, and alert on log data and events from Google Cloud Platform and Amazon Web Services (AWS).
TypeScript
156
star
72

nodejs-tasks

Node.js client for Google Cloud Tasks: A fully managed service that allows you to manage the execution, dispatch and delivery of a large number of distributed tasks.
TypeScript
144
star
73

python-ndb

Python
144
star
74

google-cloudevents

Types for CloudEvents issued by Google
JavaScript
142
star
75

common-protos-php

PHP protocol buffer classes generated from https://github.com/googleapis/api-common-protos
PHP
132
star
76

artman

Artifact Manager, a build and packaging tool for Google API client libraries.
Python
132
star
77

proto-plus-python

Beautiful, idiomatic protocol buffers in Python
Python
132
star
78

googleapis.github.io

The GitHub pages site for the googleapis organization.
HTML
131
star
79

nodejs-language

Node.js client for Google Cloud Natural Language: Derive insights from unstructured text using Google machine learning.
JavaScript
131
star
80

java-pubsub

Java
126
star
81

python-analytics-data

Python
125
star
82

gapic-generator-python

Generate Python API client libraries from Protocol Buffers.
Python
122
star
83

google-auth-library-swift

Auth client library for Swift command-line tools and cloud services. Supports OAuth1, OAuth2, and Google Application Default Credentials.
Swift
122
star
84

python-api-core

Python
118
star
85

nodejs-compute

Node.js client for Google Compute Engine: Scalable, High-Performance Virtual Machines
JavaScript
115
star
86

python-texttospeech

Python
111
star
87

nodejs-spanner

Node.js client for Google Cloud Spanner: the world’s first fully managed relational database service to offer both strong consistency and horizontal scalability.
TypeScript
111
star
88

java-bigquery

Java
109
star
89

node-gtoken

πŸ”‘ Google Auth Service Account Tokens for Node.js
TypeScript
108
star
90

python-translate

This library has moved to https://github.com/googleapis/google-cloud-python/tree/main/packages/google-cloud-translate
108
star
91

java-storage

Java
104
star
92

go-sql-spanner

Google Cloud Spanner driver for Go's database/sql package.
Go
104
star
93

google-cloud-php-vision

PHP
103
star
94

gax-nodejs

Google API Extensions for Node.js
TypeScript
100
star
95

java-firestore

Java
100
star
96

nodejs-logging-winston

Node.js client integration between Stackdriver Logging and Winston.
TypeScript
100
star
97

python-logging

Python
99
star
98

nodejs-bigtable

Node.js client for Google Cloud Bigtable: Google's NoSQL Big Data database service.
TypeScript
91
star
99

nodejs-secret-manager

A cloud-hosted service that provides a secure and convenient tool for storing API keys, passwords, certificates, and other sensitive data.
JavaScript
89
star
100

synthtool

Python
87
star