
2022-12-01: VerticaPy secures 100 stars.

📢 2020-06-27: Vertica-ML-Python has been renamed to VerticaPy.

⚠️ The following README is for VerticaPy 1.0.0-beta and onwards; some elements may not be present in previous versions.

VerticaPy


VerticaPy is a Python library with scikit-like functionality used to conduct data science projects on data stored in Vertica, taking advantage of Vertica’s speed and built-in analytics and machine learning features. VerticaPy offers robust support for the entire data science life cycle, uses a 'pipeline' mechanism to sequentialize data transformation operations, and offers beautiful graphical options.
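The 'pipeline' mechanism can be pictured with a tiny, hypothetical sketch (not VerticaPy API): each transformation step consumes the previous step's output.

```python
def run_pipeline(data, steps):
    # Apply each transformation in order, as VerticaPy chains data operations
    for step in steps:
        data = step(data)
    return data

# Double every value, then sum the result
total = run_pipeline([1, 2, 3], [lambda xs: [x * 2 for x in xs], sum])
print(total)  # 12
```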

Table of Contents


Introduction

Vertica was the first real analytic columnar database and is still the fastest in the market. However, SQL alone isn't flexible enough to meet the needs of data scientists.

Python has quickly become the most popular tool in this domain, owing much of its flexibility to its high level of abstraction and an impressively large, ever-growing set of libraries. Its accessibility has led to the development of popular and performant APIs, like pandas and scikit-learn, and a dedicated community of data scientists. Unfortunately, Python only works in-memory as a single-node process. This limitation has driven the rise of distributed programming frameworks, but they, too, run as in-memory processes and will never be able to process all of your data, and moving data for processing is prohibitively expensive. On top of all of this, data scientists must also find convenient ways to deploy their data and models. The whole process is time-consuming.

VerticaPy aims to solve all of these problems. The idea is simple: instead of moving data around for processing, VerticaPy brings the logic to the data.

3+ years in the making, we're proud to bring you VerticaPy.

Main Advantages:

  • Easy Data Exploration.
  • Fast Data Preparation.
  • In-Database Machine Learning.
  • Easy Model Evaluation.
  • Easy Model Deployment.
  • Flexibility of using either Python or SQL.

⬆️ Back to TOC

Installation

To install VerticaPy with pip:

# Latest release version
root@ubuntu:~$ pip3 install verticapy[all]

# Latest commit on master branch
root@ubuntu:~$ pip3 install git+https://github.com/vertica/verticapy.git@master

To install VerticaPy from source, run the following command from the root directory:

root@ubuntu:~$ python3 setup.py install

A detailed installation guide is available at:

https://www.vertica.com/python/installation.php

⬆️ Back to TOC

Connecting to the Database

VerticaPy is compatible with several clients. For details, see the connection page.

⬆️ Back to TOC

Documentation

The easiest and most accurate way to find documentation for a particular function is to use the help function:

import verticapy as vp
help(vp.vDataFrame)

Official documentation is available at:

https://www.vertica.com/python/documentation_last/

Note that the official documentation has not yet been updated for VerticaPy 1.0.0-beta; an update is coming soon.

⬆️ Back to TOC

Use-cases

Examples and case-studies:

https://www.vertica.com/python/examples/

⬆️ Back to TOC

Highlighted Features

SQL Magic

You can use VerticaPy to execute SQL queries directly from a Jupyter notebook. For details, see SQL Magic:

Example

Load the SQL extension.

%load_ext verticapy.sql

Execute your SQL queries.

%%sql
SELECT version();

# Output
# Vertica Analytic Database v11.0.1-0

⬆️ Back to TOC

SQL Plots

You can create interactive, professional plots directly from SQL.

To create plots, simply provide the type of plot along with the SQL command.

Example

%load_ext verticapy.jupyter.extensions.chart_magic
%chart -k pie -c "SELECT pclass, AVG(age) AS av_avg FROM titanic GROUP BY 1;"
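The chart magic pushes the aggregation into Vertica; the underlying GROUP BY computes one average per passenger class, equivalent to this plain-Python sketch:

```python
from collections import defaultdict

def group_avg(rows):
    # rows: (pclass, age) pairs; returns the average age per pclass,
    # like SELECT pclass, AVG(age) ... GROUP BY 1
    acc = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        acc[key][0] += value
        acc[key][1] += 1
    return {key: total / count for key, (total, count) in acc.items()}

print(group_avg([(1, 30.0), (1, 50.0), (2, 20.0)]))  # {1: 40.0, 2: 20.0}
```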

⬆️ Back to TOC

Multiple Database Connection using DBLINK

Using DBLINK, you can access multiple databases (e.g., PostgreSQL, Vertica, MySQL, in-memory) from a single platform with SQL and Python.

Example

%%sql
/* Fetch TAIL_NUMBER and CITY after Joining the flight_vertica table with airports table in MySQL database. */
SELECT flight_vertica.TAIL_NUMBER, airports.CITY AS Departing_City
FROM flight_vertica
INNER JOIN &&& airports &&&
ON flight_vertica.ORIGIN_AIRPORT = airports.IATA_CODE;

In the example above, the flight_vertica table is stored in Vertica, whereas the airports table is stored in MySQL. We can wrap table names in the special symbol "&&&" to indicate that they live in another database. The best part is that all the aggregation is pushed down to the databases (i.e., it is not done in memory)!

For more details on how to set up DBLINK, please visit the GitHub repo. To learn about using DBLINK in VerticaPy, check out the documentation page.

⬆️ Back to TOC

Python and SQL Combo

VerticaPy has a unique place in the market because it allows users to use Python and SQL in the same environment.

Example

import verticapy as vp
selected_titanic = vp.vDataFrame(
    "(SELECT pclass, embarked, AVG(survived) FROM public.titanic GROUP BY 1, 2) x"
)
selected_titanic.groupby(columns=["pclass"], expr=["AVG(AVG)"])

⬆️ Back to TOC

Charts

VerticaPy comes integrated with three popular plotting libraries: Matplotlib, Highcharts, and Plotly.

A gallery of VerticaPy-generated charts is available at:

https://www.vertica.com/python/gallery/

⬆️ Back to TOC

Complete Machine Learning Pipeline

  • Data Ingestion

    VerticaPy allows users to ingest data from a diverse range of sources, such as AVRO, Parquet, CSV, and JSON. With the single command read_file, VerticaPy automatically infers both the source type and the data types.

    import verticapy as vp
    vp.read_file(
        "/home/laliga/2012.json",
        table_name="laliga",
    )

read_file creates a nested (ROW) structure for the complex fields in the file. We can see the SQL underneath every VerticaPy command by turning on the genSQL option:

  import verticapy as vp
  vp.read_file("/home/laliga/2012.json", table_name="laliga", genSQL=True)

  # Output
  CREATE LOCAL TEMPORARY TABLE "laliga"
    ("away_score" INT, 
     "away_team" ROW("away_team_gender" VARCHAR, 
                     "away_team_group"  VARCHAR, 
                     "away_team_id"     INT, ... 
                                        ROW("id"   INT, 
                                            "name" VARCHAR)), 
     "competition" ROW("competition_id"   INT, 
                       "competition_name" VARCHAR, 
                       "country_name"     VARCHAR), 
     "competition_stage" ROW("id"   INT, 
                             "name" VARCHAR), 
     "home_score" INT, 
     "home_team" ROW("country" ROW("id"   INT, 
                                   "name" VARCHAR), 
                     "home_team_gender" VARCHAR, 
                     "home_team_group"  VARCHAR, 
                     "home_team_id"     INT, ...), 
     "kick_off"     TIME, 
     "last_updated" DATE, 
     "match_DATE"   DATE, 
     "match_id"     INT, ... 
                    ROW("data_version"          DATE, 
                        "shot_fidelity_version" INT, 
                        "xy_fidelity_version"   INT), 
     "season" ROW("season_id"   INT, 
                  "season_name" VARCHAR)) 
     ON COMMIT PRESERVE ROWS
     COPY "v_temp_schema"."laliga" 
     FROM '/home/laliga/2012.json' 
     PARSER FJsonParser()

VerticaPy provides functions for importing specific file types, such as read_json and read_csv (https://www.vertica.com/python/documentation_last/utilities/read_csv/). Since these functions focus on a particular file type, they offer more options for handling the data. For example, read_json has a "flatten_arrays" parameter that allows you to flatten nested JSON arrays.
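Conceptually, dispatching on file type resembles the toy sketch below; the real read_file also inspects file contents, and only read_json and read_csv are named above (the dispatch table here is illustrative):

```python
import os

# Illustrative extension-to-reader table; read_json and read_csv are the
# documented specialized readers mentioned above.
READERS = {".json": "read_json", ".csv": "read_csv"}

def infer_reader(path):
    # Fall back to the generic read_file when the extension is not recognized
    ext = os.path.splitext(path)[1].lower()
    return READERS.get(ext, "read_file")

print(infer_reader("/home/laliga/2012.json"))  # read_json
```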

  • Data Exploration

    There are many options for descriptive and visual exploration.

from verticapy.datasets import load_iris
iris_data = load_iris()
iris_data.scatter(
    ["SepalWidthCm", "SepalLengthCm", "PetalLengthCm"], 
    by="Species", 
    max_nb_points=30
)

The correlation matrix is also very fast and convenient to compute. Users can choose from a wide variety of correlation methods, including Cramér's V, Spearman, and Pearson.

from verticapy.datasets import load_titanic
titanic = load_titanic()
titanic.corr(method="spearman")

By turning on the SQL print option, users can see and copy SQL queries:

from verticapy import set_option
set_option("sql_on", True)
titanic.corr(method="spearman")

# Output
SELECT
    /*+LABEL('vDataframe._aggregate_matrix')*/ CORR_MATRIX("pclass", "survived", "age", "sibsp", "parch", "fare", "body") OVER ()  
  FROM
(
  SELECT
    RANK() OVER (ORDER BY "pclass") AS "pclass",
    RANK() OVER (ORDER BY "survived") AS "survived",
    RANK() OVER (ORDER BY "age") AS "age",
    RANK() OVER (ORDER BY "sibsp") AS "sibsp",
    RANK() OVER (ORDER BY "parch") AS "parch",
    RANK() OVER (ORDER BY "fare") AS "fare",
    RANK() OVER (ORDER BY "body") AS "body"  
  FROM
"public"."titanic") spearman_table
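The generated SQL mirrors the textbook definition: Spearman correlation is just Pearson correlation computed on ranks (the RANK() calls above). A minimal pure-Python sketch, ignoring tie handling:

```python
def rank(values):
    # 1-based ranks; ties are not averaged in this simplified version
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for position, index in enumerate(order):
        ranks[index] = position + 1.0
    return ranks

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)

def spearman(x, y):
    # Spearman = Pearson on the ranked data
    return pearson(rank(x), rank(y))

print(round(spearman([1, 5, 2, 8], [10, 40, 25, 90]), 6))  # 1.0 (perfectly monotone)
```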

VerticaPy allows users to calculate a focused correlation using the "focus" parameter:

titanic.corr(method="spearman", focus="survived")

Outliers can also be plotted directly:

import random
import verticapy as vp
data = vp.vDataFrame({"Heights": [random.randint(10, 60) for _ in range(40)] + [100]})
data.outliers_plot(columns="Heights")
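outliers_plot highlights extreme values; a common, simplified criterion is the z-score (this sketch is illustrative, not VerticaPy's exact method):

```python
def zscore_outliers(values, threshold=2.0):
    # Flag values whose distance from the mean exceeds `threshold` standard deviations
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) / std > threshold]

print(zscore_outliers([10, 12, 11, 13, 12, 11, 100]))  # [100]
```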

  • Machine Learning

    ML is the strongest suit of VerticaPy, as it capitalizes on the speed of in-database training and prediction, using SQL in the background to interact with the database. VerticaPy's ML toolset covers a vast array of techniques, including time series forecasting, clustering, and classification.

# titanic_vd and a Logistic Regression model are already loaded
from verticapy.machine_learning.model_selection import stepwise
stepwise_result = stepwise(
    model,
    input_relation=titanic_vd,
    X=[
        "age",
        "fare",
        "parch",
        "pclass",
    ],
    y="survived",
    direction="backward",
    height=600,
    width=800,
)
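Backward stepwise selection starts from the full feature set and greedily drops features while doing so improves a score. A toy, model-free sketch (the scoring function here is hypothetical):

```python
def backward_stepwise(features, score):
    # Greedy backward elimination: drop one feature at a time while the score improves
    current = list(features)
    best = score(tuple(current))
    improved = True
    while improved and len(current) > 1:
        improved = False
        for feature in list(current):
            candidate = tuple(f for f in current if f != feature)
            s = score(candidate)
            if s > best:
                best, current, improved = s, list(candidate), True
                break
    return current, best

# Hypothetical score: reward informative features, penalize model size
useful = {"age", "sex"}
score = lambda feats: sum(f in useful for f in feats) - 0.2 * len(feats)
selected, _ = backward_stepwise(["age", "fare", "parch", "sex"], score)
print(selected)  # ['age', 'sex']
```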

⬆️ Back to TOC

Loading Predefined Datasets

VerticaPy provides some predefined datasets that can be easily loaded. These datasets include iris, titanic, amazon, and more.

There are two ways to access the provided datasets:

(1) Use the standard Python method:

from verticapy.datasets import load_iris
iris_data = load_iris()

(2) Use the standard name of the dataset from the public schema:

iris_data = vp.vDataFrame(input_relation="public.iris")

⬆️ Back to TOC

Quickstart

The following example follows the VerticaPy quickstart guide.

Install the library with pip.

root@ubuntu:~$ pip3 install verticapy[all]

Create a new Vertica connection:

import verticapy as vp
vp.new_connection({
    "host": "10.211.55.14", 
    "port": "5433", 
    "database": "testdb", 
    "password": "XxX", 
    "user": "dbadmin"},
    name="Vertica_New_Connection")

Use the newly created connection:

vp.connect("Vertica_New_Connection")

Create a VerticaPy schema for native VerticaPy models (that is, models available in VerticaPy, but not Vertica itself):

vp.create_verticapy_schema()

Create a vDataFrame of your relation:

from verticapy import vDataFrame
vdf = vDataFrame("my_relation")

Load a sample dataset:

from verticapy.datasets import load_titanic
vdf = load_titanic()

Examine your data:

vdf.describe()

Print the SQL query with set_option:

set_option("sql_on", True)
vdf.describe()

# Output
## Compute the descriptive statistics of all the numerical columns ##

SELECT 
  SUMMARIZE_NUMCOL("pclass", "survived", "age", "sibsp", "parch", "fare", "body") OVER ()
FROM public.titanic
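SUMMARIZE_NUMCOL computes standard descriptive statistics in the database; the usual definitions, in a small sketch:

```python
def describe(values):
    # Count, mean, sample standard deviation, min and max of a numeric column
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return {
        "count": n,
        "mean": mean,
        "std": variance ** 0.5,
        "min": min(values),
        "max": max(values),
    }

stats = describe([1.0, 2.0, 3.0, 4.0])
print(stats["mean"], stats["min"], stats["max"])  # 2.5 1.0 4.0
```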

With VerticaPy, it is now possible to solve an ML problem with just a few lines of code.

from verticapy.machine_learning.model_selection.model_validation import cross_validate
from verticapy.machine_learning.vertica import RandomForestClassifier

# Data Preparation
vdf["sex"].label_encode()["boat"].fillna(method="0ifnull")["name"].str_extract(
    " ([A-Za-z]+)\."
).eval("family_size", expr="parch + sibsp + 1").drop(
    columns=["cabin", "body", "ticket", "home.dest"]
)[
    "fare"
].fill_outliers().fillna()
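The chain above label-encodes the "sex" column, i.e., replaces each category with an integer code. Conceptually (the sorted-order mapping here is an assumption for illustration):

```python
def label_encode(values):
    # Map each distinct category to an integer code (sorted for determinism)
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

codes, mapping = label_encode(["female", "male", "male", "female"])
print(codes)  # [0, 1, 1, 0]
```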

# Model Evaluation
cross_validate(
    RandomForestClassifier("rf_titanic", max_leaf_nodes=100, n_estimators=30),
    vdf,
    ["age", "family_size", "sex", "pclass", "fare", "boat"],
    "survived",
    cutoff=0.35,
)
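The cutoff parameter sets the probability threshold above which an observation is classified as positive; in effect:

```python
def apply_cutoff(probabilities, cutoff=0.35):
    # Predict class 1 when the estimated probability exceeds the cutoff
    return [1 if p > cutoff else 0 for p in probabilities]

print(apply_cutoff([0.1, 0.34, 0.36, 0.9]))  # [0, 0, 1, 1]
```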

# Model Creation
model = RandomForestClassifier(
    name="public.RF_titanic",
    n_estimators=20,
    max_features="auto",
    max_leaf_nodes=32,
    sample=0.7,
    max_depth=3,
    min_samples_leaf=5,
    min_info_gain=0.0,
    nbins=32,
)
model.fit(
    "public.titanic",        # input relation
    ["age", "fare", "sex"],  # predictors
    "survived",              # response
)

# Feature Importance
model.features_importance()

# ROC Curve
model.roc_curve()
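A ROC curve plots the true-positive rate against the false-positive rate as the decision threshold sweeps over the scores; a compact sketch:

```python
def roc_points(labels, scores):
    # One (FPR, TPR) point per distinct threshold, highest threshold first
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

print(roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
# [(0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
```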

Enjoy!

⬆️ Back to TOC

Help and Support

Contributing

For a short guide on contribution standards, see CONTRIBUTING.md.

Communication

⬆️ Back to TOC
