
BigQuery DataFrames

BigQuery DataFrames provides a Pythonic DataFrame and machine learning (ML) API powered by the BigQuery engine.

  • bigframes.pandas provides a pandas-compatible API for analytics.
  • bigframes.ml provides a scikit-learn-like API for ML.

BigQuery DataFrames is an open-source package. You can run pip install --upgrade bigframes to install the latest version.

Documentation

Quickstart

Prerequisites

Code sample

Import bigframes.pandas for a pandas-like interface. The read_gbq method accepts either a fully-qualified table ID or a SQL query.

import bigframes.pandas as bpd

bpd.options.bigquery.project = your_gcp_project_id
df1 = bpd.read_gbq("project.dataset.table")
df2 = bpd.read_gbq("SELECT a, b, c FROM `project.dataset.table`")

Locations

BigQuery DataFrames uses a BigQuery session internally to manage metadata on the service side. This session is tied to a location. BigQuery DataFrames uses the US multi-region as the default location, but you can use session_options.location to set a different location. Every query in a session is executed in the location where the session was created. BigQuery DataFrames auto-populates bf.options.bigquery.location if the user starts with read_gbq/read_gbq_table/read_gbq_query() and specifies a table, either directly or in a SQL statement.

If you want to reset the location of the DataFrame or Series objects you create, close the session by executing bigframes.pandas.close_session(). After that, you can set bigframes.pandas.options.bigquery.location to specify another location.

read_gbq() requires you to specify a location if the dataset you are querying is not in the US multi-region. If you try to read a table from another location, you get a NotFound exception.
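Putting the steps above together, resetting the session location looks roughly like this (the table IDs and region here are hypothetical placeholders, not values from this document):

```python
import bigframes.pandas as bpd

# Read from a table in the default US multi-region.
df_us = bpd.read_gbq("my-project.us_dataset.my_table")

# Close the session to reset the location, then point the next
# session at a different region before reading again.
bpd.close_session()
bpd.options.bigquery.location = "europe-west4"
df_eu = bpd.read_gbq("my-project.eu_dataset.my_table")
```

This is option/session configuration and requires Google Cloud credentials and real tables to run.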

Project

If bf.options.bigquery.project is not set, the $GOOGLE_CLOUD_PROJECT environment variable is used; this variable is set automatically in the notebook runtimes serving BigQuery Studio and Vertex AI notebooks.
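The fallback order can be pictured in plain Python; a minimal sketch, where the helper name is ours and not part of the bigframes API:

```python
import os

def resolve_project(configured_project=None):
    """Mimic the project fallback: an explicitly configured project
    wins; otherwise fall back to $GOOGLE_CLOUD_PROJECT."""
    if configured_project:
        return configured_project
    project = os.environ.get("GOOGLE_CLOUD_PROJECT")
    if project is None:
        raise ValueError(
            "No project configured and $GOOGLE_CLOUD_PROJECT is unset"
        )
    return project

# An explicit setting takes precedence over the environment.
os.environ["GOOGLE_CLOUD_PROJECT"] = "notebook-project"
print(resolve_project("my-project"))  # my-project
print(resolve_project())              # notebook-project
```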

ML Capabilities

The ML capabilities in BigQuery DataFrames let you preprocess data, and then train models on that data. You can also chain these actions together to create data pipelines.

Preprocess data

Create transformers to prepare data for use in estimators (models) by using the bigframes.ml.preprocessing module and the bigframes.ml.compose module. BigQuery DataFrames offers the following transformations:

  • Use the KBinsDiscretizer class in the bigframes.ml.preprocessing module to bin continuous data into intervals.
  • Use the LabelEncoder class in the bigframes.ml.preprocessing module to normalize the target labels as integer values.
  • Use the MaxAbsScaler class in the bigframes.ml.preprocessing module to scale each feature to the range [-1, 1] by its maximum absolute value.
  • Use the MinMaxScaler class in the bigframes.ml.preprocessing module to standardize features by scaling each feature to the range [0, 1].
  • Use the StandardScaler class in the bigframes.ml.preprocessing module to standardize features by removing the mean and scaling to unit variance.
  • Use the OneHotEncoder class in the bigframes.ml.preprocessing module to transform categorical values into numeric format.
  • Use the ColumnTransformer class in the bigframes.ml.compose module to apply transformers to DataFrame columns.
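As a rough sketch of what two of these transformers compute (plain Python for illustration only; the bigframes classes follow the scikit-learn fit/transform pattern rather than these hypothetical helpers):

```python
def min_max_scale(values):
    """Scale values to the range [0, 1], as MinMaxScaler does per feature."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standard_scale(values):
    """Remove the mean and scale to unit variance, as StandardScaler does."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = variance ** 0.5
    return [(v - mean) / std for v in values]

print(min_max_scale([10.0, 20.0, 30.0]))   # [0.0, 0.5, 1.0]
print(standard_scale([10.0, 20.0, 30.0]))  # zero mean, unit variance
```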

Train models

Create estimators to train models in BigQuery DataFrames.

Clustering models

Create estimators for clustering models by using the bigframes.ml.cluster module.

  • Use the KMeans class to create K-means clustering models. Use these models for data segmentation, such as identifying customer segments. K-means is an unsupervised learning technique, so model training doesn't require labels, and the data doesn't need to be split into training and evaluation sets.

Decomposition models

Create estimators for decomposition models by using the bigframes.ml.decomposition module.

  • Use the PCA class to create principal component analysis (PCA) models. Use these models for computing principal components and using them to perform a change of basis on the data. This provides dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data's variation as possible.

Ensemble models

Create estimators for ensemble models by using the bigframes.ml.ensemble module.

  • Use the RandomForestClassifier class to create random forest classifier models. These models build an ensemble of decision trees for classification.
  • Use the RandomForestRegressor class to create random forest regression models. These models build an ensemble of decision trees for regression.
  • Use the XGBClassifier class to create gradient-boosted tree classifier models. These models additively build an ensemble of decision trees for classification.
  • Use the XGBRegressor class to create gradient-boosted tree regression models. These models additively build an ensemble of decision trees for regression.

Forecasting models

Create estimators for forecasting models by using the bigframes.ml.forecasting module.

Imported models

Create estimators for imported models by using the bigframes.ml.imported module.

Linear models

Create estimators for linear models by using the bigframes.ml.linear_model module.

  • Use the LinearRegression class to create linear regression models. Use these models for forecasting. For example, forecasting the sales of an item on a given day.
  • Use the LogisticRegression class to create logistic regression models. Use these models for the classification of two or more possible values such as whether an input is low-value, medium-value, or high-value.

Large language models

Create estimators for LLMs by using the bigframes.ml.llm module.

Create pipelines

Create ML pipelines by using the bigframes.ml.pipeline module. Pipelines let you assemble several ML steps to be cross-validated together while setting different parameters. This simplifies your code and allows you to deploy data preprocessing steps and an estimator together.

  • Use the Pipeline class to create a pipeline of transforms with a final estimator.
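Conceptually, a pipeline runs each transformer's fit/transform in order and hands the result to the final estimator. A minimal plain-Python sketch of that pattern (illustrative only; the class and step names below are ours, not the bigframes implementation):

```python
class ToyPipeline:
    """Chain transformers (objects with fit_transform/transform)
    with a final estimator (an object with fit/predict)."""

    def __init__(self, steps):
        # steps is a list of (name, object) pairs; the last is the estimator.
        *self.transformers, self.estimator = [obj for _, obj in steps]

    def fit(self, X, y):
        for t in self.transformers:
            X = t.fit_transform(X)
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        for t in self.transformers:
            X = t.transform(X)
        return self.estimator.predict(X)

class DoubleScaler:
    """Toy transformer: doubles every value."""
    def fit_transform(self, X):
        return self.transform(X)
    def transform(self, X):
        return [x * 2 for x in X]

class MeanModel:
    """Toy estimator: always predicts the mean of the training labels."""
    def fit(self, X, y):
        self.mean = sum(y) / len(y)
    def predict(self, X):
        return [self.mean for _ in X]

pipe = ToyPipeline([("scale", DoubleScaler()), ("model", MeanModel())])
pipe.fit([1, 2, 3], [4, 5, 6])
print(pipe.predict([7]))  # [5.0]
```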

ML remote models

Requirements

To use BigQuery DataFrames ML remote models (bigframes.ml.remote or bigframes.ml.llm), you must enable the following APIs:

  • The BigQuery API (bigquery.googleapis.com)
  • The BigQuery Connection API (bigqueryconnection.googleapis.com)
  • The Vertex AI API (aiplatform.googleapis.com)

and you must be granted the following IAM roles:

  • BigQuery Data Editor (roles/bigquery.dataEditor)
  • BigQuery Connection Admin (roles/bigquery.connectionAdmin)
  • Service Account User (roles/iam.serviceAccountUser) on the service account [email protected]
  • Vertex AI User (roles/aiplatform.user)
  • Project IAM Admin (roles/resourcemanager.projectIamAdmin) if using default BigQuery connection, or Browser (roles/browser) if using a pre-created connection

ML locations

bigframes.ml supports the same locations as BigQuery ML. BigQuery ML model prediction and other ML functions are supported in all BigQuery regions. Support for model training varies by region. For more information, see BigQuery ML locations.

Data types

BigQuery DataFrames supports the following numpy and pandas dtypes:

  • numpy.dtype("O")
  • pandas.BooleanDtype()
  • pandas.Float64Dtype()
  • pandas.Int64Dtype()
  • pandas.StringDtype(storage="pyarrow")
  • pandas.ArrowDtype(pa.date32())
  • pandas.ArrowDtype(pa.time64("us"))
  • pandas.ArrowDtype(pa.timestamp("us"))
  • pandas.ArrowDtype(pa.timestamp("us", tz="UTC"))
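For example, the pandas nullable dtypes in this list can be constructed directly in plain pandas (assuming pandas is installed; the pyarrow-backed dtypes additionally require pyarrow, so they are omitted here):

```python
import pandas as pd

# Nullable pandas extension dtypes from the supported list above.
df = pd.DataFrame(
    {
        "flag": pd.array([True, None], dtype=pd.BooleanDtype()),
        "count": pd.array([1, None], dtype=pd.Int64Dtype()),
        "ratio": pd.array([0.5, None], dtype=pd.Float64Dtype()),
    }
)
print(df.dtypes)  # boolean, Int64, Float64
```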

BigQuery DataFrames doesn’t support the following BigQuery data types:

  • ARRAY
  • NUMERIC
  • BIGNUMERIC
  • INTERVAL
  • STRUCT
  • JSON

All other BigQuery data types display as the object type.

Remote functions

BigQuery DataFrames gives you the ability to turn your custom scalar functions into BigQuery remote functions. Creating a remote function in BigQuery DataFrames (see code samples) creates a BigQuery remote function, a BigQuery connection, and a Cloud Functions (2nd gen) function.

BigQuery connections are created in the same location as the BigQuery DataFrames session, using the name you provide in the custom function definition. To view and manage connections, do the following:

  1. Go to BigQuery in the Google Cloud Console.
  2. Select the project in which you created the remote function.
  3. In the Explorer pane, expand that project and then expand External connections.

BigQuery remote functions are created in the dataset you specify, or in a special type of hidden dataset referred to as an anonymous dataset. To view and manage remote functions created in a user-provided dataset, do the following:

  1. Go to BigQuery in the Google Cloud Console.
  2. Select the project in which you created the remote function.
  3. In the Explorer pane, expand that project, expand the dataset in which you created the remote function, and then expand Routines.

To view and manage Cloud Functions functions, go to the Functions page and use the project picker to select the project in which you created the function. For easy identification, the names of the functions created by BigQuery DataFrames are prefixed with bigframes.

Requirements

To use BigQuery DataFrames remote functions, you must enable the following APIs:

  • The BigQuery API (bigquery.googleapis.com)
  • The BigQuery Connection API (bigqueryconnection.googleapis.com)
  • The Cloud Functions API (cloudfunctions.googleapis.com)
  • The Cloud Run API (run.googleapis.com)
  • The Artifact Registry API (artifactregistry.googleapis.com)
  • The Cloud Build API (cloudbuild.googleapis.com)
  • The Cloud Resource Manager API (cloudresourcemanager.googleapis.com)

To use BigQuery DataFrames remote functions, you must be granted the following IAM roles:

  • BigQuery Data Editor (roles/bigquery.dataEditor)
  • BigQuery Connection Admin (roles/bigquery.connectionAdmin)
  • Cloud Functions Developer (roles/cloudfunctions.developer)
  • Service Account User (roles/iam.serviceAccountUser) on the service account [email protected]
  • Storage Object Viewer (roles/storage.objectViewer)
  • Project IAM Admin (roles/resourcemanager.projectIamAdmin) if using default BigQuery connection, or Browser (roles/browser) if using a pre-created connection

Limitations

  • Remote functions take about 90 seconds to become available when you first create them.
  • Trivial changes in the notebook, such as inserting a new cell or renaming a variable, might cause the remote function to be re-created, even if these changes are unrelated to the remote function code.
  • BigQuery DataFrames does not treat any personal data you include in the remote function code differently from other code. The remote function code is serialized as an opaque box and deployed as a Cloud Functions function.
  • The Cloud Functions (2nd gen) functions, BigQuery connections, and BigQuery remote functions created by BigQuery DataFrames persist in Google Cloud. If you don’t want to keep these resources, you must delete them separately using an appropriate Cloud Functions or BigQuery interface.
  • A project can have up to 1000 Cloud Functions (2nd gen) functions at a time. See Cloud Functions quotas for all the limits.

Quotas and limits

BigQuery DataFrames is subject to BigQuery quotas, including limits on hardware, software, and network components.

Session termination

Each BigQuery DataFrames DataFrame or Series object is tied to a BigQuery DataFrames session, which is in turn based on a BigQuery session. BigQuery sessions auto-terminate; when this happens, you can’t use previously created DataFrame or Series objects and must re-create them using a new BigQuery DataFrames session. You can do this by running bigframes.pandas.close_session() and then re-running the BigQuery DataFrames expressions.

Data processing location

BigQuery DataFrames is designed for scale, which it achieves by keeping data and processing on the BigQuery service. However, you can bring data into the memory of your client machine by calling .to_pandas() on a DataFrame or Series object. If you choose to do this, the memory limitation of your client machine applies.

License

BigQuery DataFrames is distributed with the Apache-2.0 license.

It also contains code derived from third-party packages; for details, see the third_party directory.

Contact Us

For further help and to provide feedback, you can email us at [email protected].
