• Stars
    star
    327
  • Rank 128,686 (Top 3 %)
  • Language
    Jupyter Notebook
  • License
    MIT License
  • Created about 6 years ago
  • Updated about 2 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Fundamentals of Spark with Python (using PySpark), code examples

Spark with Python

Apache Spark

Apache Spark is one of the hottest new trends in the technology domain. It is the framework with probably the highest potential to realize the fruit of the marriage between Big Data and Machine Learning. It runs fast (up to 100x faster than traditional Hadoop MapReduce due to in-memory operation, offers robust, distributed, fault-tolerant data objects (called RDD), and integrates beautifully with the world of machine learning and graph analytics through supplementary packages like Mlib and GraphX.

Spark is implemented on Hadoop/HDFS and written mostly in Scala, a functional programming language, similar to Java. In fact, Scala needs the latest Java installation on your system and runs on JVM. However, for most of the beginners, Scala is not a language that they learn first to venture into the world of data science. Fortunately, Spark provides a wonderful Python integration, called PySpark, which lets Python programmers to interface with the Spark framework and learn how to manipulate data at scale and work with objects and algorithms over a distributed file system.

Notebooks

RDD and basics

Dataframe

Setting up Apache Spark with Python 3 and Jupyter notebook

Unlike most Python libraries, getting PySpark to start working properly is not as straightforward as pip install ... and import ... Most of us with Python-based data science and Jupyter/IPython background take this workflow as granted for all popular Python packages. We tend to just head over to our CMD or BASH shell, type the pip install command, launch a Jupyter notebook and import the library to start practicing.

But, PySpark+Jupyter combo needs a little bit more love :-)


Check which version of Python is running. Python 3.4+ is needed.

python3 --version

Update apt-get

sudo apt-get update

Install pip3 (or pip for Python3)

sudo apt install python3-pip

Install Jupyter for Python3

pip3 install jupyter

Augment the PATH variable to launch Jupyter notebook

export PATH=$PATH:~/.local/bin

Java 8 is shown to work with UBUNTU 18.04 LTS/SPARK-2.3.1-BIN-HADOOP2.7

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get install oracle-java8-installer
sudo apt-get install oracle-java8-set-default

Set Java related PATH variables

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export JRE_HOME=/usr/lib/jvm/java-8-oracle/jre

Install Scala

sudo apt-get install scala

Install py4j for Python-Java integration

pip3 install py4j

Download latest Apache Spark (with pre-built Hadoop) from Apache download server. Unpack Apache Spark after downloading

sudo tar -zxvf spark-2.3.1-bin-hadoop2.7.tgz

Set variables to launch PySpark with Python3 and enable it to be called from Jupyter notebook. Add all the following lines to the end of your .bashrc file

export SPARK_HOME='/home/tirtha/Spark/spark-2.3.1-bin-hadoop2.7'
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
export PATH=$SPARK_HOME:$PATH:~/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin

Source .bashrc

source .bashrc

Basics of RDD

Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.

Spark makes use of the concept of RDD to achieve faster and efficient MapReduce operations.

Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations on either data on stable storage or other RDDs. RDD is a fault-tolerant collection of elements that can be operated on in parallel.

There are two ways to create RDDs,

  • parallelizing an existing collection in your driver program,
  • referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop Input Format.

Basics of the Dataframe

DataFrame

In Apache Spark, a DataFrame is a distributed collection of rows under named columns. It is conceptually equivalent to a table in a relational database, an Excel sheet with Column headers, or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. It also shares some common characteristics with RDD:

  • Immutable in nature : We can create DataFrame / RDD once but can’t change it. And we can transform a DataFrame / RDD after applying transformations.
  • Lazy Evaluations: Which means that a task is not executed until an action is performed.
  • Distributed: RDD and DataFrame both are distributed in nature.

Advantages of the Dataframe

  • DataFrames are designed for processing large collection of structured or semi-structured data.
  • Observations in Spark DataFrame are organised under named columns, which helps Apache Spark to understand the schema of a DataFrame. This helps Spark optimize execution plan on these queries.
  • DataFrame in Apache Spark has the ability to handle petabytes of data.
  • DataFrame has a support for wide range of data format and sources.
  • It has API support for different languages like Python, R, Scala, Java.

Spark SQL

Spark SQL provides a DataFrame API that can perform relational operations on both external data sources and Spark's built-in distributed collections—at scale!

To support a wide variety of diverse data sources and algorithms in Big Data, Spark SQL introduces a novel extensible optimizer called Catalyst, which makes it easy to add data sources, optimization rules, and data types for advanced analytics such as machine learning. Essentially, Spark SQL leverages the power of Spark to perform distributed, robust, in-memory computations at massive scale on Big Data.

Spark SQL provides state-of-the-art SQL performance and also maintains compatibility with all existing structures and components supported by Apache Hive (a popular Big Data warehouse framework) including data formats, user-defined functions (UDFs), and the metastore. Besides this, it also helps in ingesting a wide variety of data formats from Big Data sources and enterprise data warehouses like JSON, Hive, Parquet, and so on, and performing a combination of relational and procedural operations for more complex, advanced analytics.

Spark-2

Speed of Spark SQL

Spark SQL has been shown to be extremely fast, even comparable to C++ based engines such as Impala.

spark_speed

Following graph shows a nice benchmark result of DataFrames vs. RDDs in different languages, which gives an interesting perspective on how optimized DataFrames can be.

spark-speed-2

Why is Spark SQL so fast and optimized? The reason is because of a new extensible optimizer, Catalyst, based on functional programming constructs in Scala.

Catalyst's extensible design has two purposes.

  • Makes it easy to add new optimization techniques and features to Spark SQL, especially to tackle diverse problems around Big Data, semi-structured data, and advanced analytics
  • Ease of being able to extend the optimizer—for example, by adding data source-specific rules that can push filtering or aggregation into external storage systems or support for new data types

More Repositories

1

Machine-Learning-with-Python

Practice and tutorial-style notebooks covering wide variety of machine learning techniques
Jupyter Notebook
3,065
star
2

Data-science-best-resources

Carefully curated resource links for data science in one place
2,858
star
3

Papers-Literature-ML-DL-RL-AI

Highly cited and useful papers related to machine learning, deep learning, AI, game theory, reinforcement learning
2,320
star
4

Stats-Maths-with-Python

General statistics, mathematical programming, and numerical/scientific computing scripts and notebooks in Python
Jupyter Notebook
852
star
5

Deep-learning-with-Python

Deep learning codes and projects using Python
Jupyter Notebook
346
star
6

pydbgen

Random dataframe and database table generator
Python
299
star
7

Web-Database-Analytics

Web scrapping and related analytics using Python tools
Jupyter Notebook
269
star
8

UCI-ML-API

Simple API for UCI Machine Learning Dataset Repository (search, download, analyze)
Python
246
star
9

Design-of-experiment-Python

Design-of-experiment (DOE) generator for science, engineering, and statistics
Jupyter Notebook
244
star
10

Optimization-Python

General optimization (LP, MIP, QP, continuous and discrete optimization etc.) using Python
Jupyter Notebook
222
star
11

DS-with-PySimpleGUI

Data science and Machine Learning GUI programs/ desktop apps with PySimpleGUI package
Jupyter Notebook
167
star
12

Interactive_Machine_Learning

IPython widgets, interactive plots, interactive machine learning
Jupyter Notebook
150
star
13

doepy

Design of Experiment Generator. Read the docs at: https://doepy.readthedocs.io/en/latest/
Python
143
star
14

PyTorch_Machine_Learning

Machine learning, Deep Learning, CNN with PyTorch
Jupyter Notebook
80
star
15

Synthetic-data-gen

Various methods for generating synthetic data for data science and ML
Jupyter Notebook
75
star
16

Finance-with-Python

Financial data analytics with Python
Jupyter Notebook
72
star
17

Covid-19-analysis

Analysis with Covid-19 data
Jupyter Notebook
60
star
18

Julia-data-science

Data science and numerical computing with Julia
Jupyter Notebook
57
star
19

R-stats-machine-learning

Misc Statistics and Machine Learning codes in R
R
40
star
20

Algorithm-Data-Structures-Python

Various useful data structures in Python
Jupyter Notebook
37
star
21

TensorFlow_Basics

Basic TensorFlow mechanics, operations, class definitions, and neural networks building. Examples from deeplearning.ai Tensorflow course using Google Colab platform.
Jupyter Notebook
35
star
22

Scikit-image-processing

Image processing examples with Numpy, Scipy, and Scikit-image
Jupyter Notebook
34
star
23

Digital-Twin

Digital twin with Python
Jupyter Notebook
33
star
24

mlr

Multiple linear regression with statistical inference, residual analysis, direct CSV loading, and other features
Python
31
star
25

Packt-Data_Wrangling

Code repo for Packt course I developed, "Beginning Data Wrangling with Python"
Jupyter Notebook
28
star
26

ML-apps-with-Streamlit

Building simple ML apps with Streamlit
Python
24
star
27

PyScript-examples

Examples of web pages developed with PyScript framework
23
star
28

tirthajyoti.github.io

Tirthajyoti's Home Page about machine learning, statistics, analytics
HTML
22
star
29

Algorithm_Maths_Python

General math scripts and important algorithms' implementation in Python 3
Jupyter Notebook
21
star
30

Symbolic-computation-Python

Symbolic computation using SymPy and various applications
Jupyter Notebook
20
star
31

RL_basics

Basic Reinforcement Learning algorithms
Jupyter Notebook
17
star
32

GradDescent

MATLAB implementation of Gradient Descent algorithm for Multivariate Linear Regression
MATLAB
16
star
33

Convolutional-Networks

Various conv nets using TensorFlow, Keras, or other tools
Jupyter Notebook
14
star
34

Dask-analytics-ML

Data science and ML with Dask
Jupyter Notebook
13
star
35

Magnimind-Stats-Bootcamp-Jan-2020

Magnimind Bootcamp Stats for Data Science
Jupyter Notebook
12
star
36

PyWebIO

Web apps generated by pure Python script using PyWebIO
Python
11
star
37

Scikit-image-book

Scikit-image-book-built-with-Jupyter-book
Jupyter Notebook
11
star
38

Stats_data_science_ValleyML

Notebooks for the ValleyML Bootcamp (Aug 2019) "Statistical methods for data science"
Jupyter Notebook
10
star
39

Randomized_Optimization

Randomized optimization techniques for NN and other problems
HTML
8
star
40

HyperparameterLearningTF

Learning the impact of Hyperparameters in a deep learning model
Jupyter Notebook
7
star
41

D3.js-examples

Simple D3.js code examples
JavaScript
6
star
42

MNIST_digit_recognition

MNIST hand-written digit recognition by fully-connected and convolutional neural networks - boiler plate code for easy reproduction and tutorial purpose.
Jupyter Notebook
6
star
43

tirthajyoti

5
star
44

DeepNetworksR

Multi-layer neural networks code examples in R
R
5
star
45

Random_Function_Generator

Random function generator, with generation by symbolic input
Jupyter Notebook
4
star
46

Stanford-SCI-52

Jupyter Notebook
4
star
47

Gradio-apps

Python web apps built with Gradio
3
star
48

mldsutils

My own ml and ds utils package
Jupyter Notebook
3
star
49

ghPage-test

test for gh pages
2
star
50

FunnyWordGen

Funny word (random) generator using Python 3
Python
2
star
51

Saturn-cloud

Write-ups for Saturn-cloud
1
star