  • Stars: 151
  • Rank: 246,057 (Top 5%)
  • Language: Scala
  • License: Apache License 2.0
  • Created: about 7 years ago
  • Updated: about 2 years ago


Repository Details

Apache Spark on AWS Lambda

Note: "This repo contains Vulnerable Code, as such should not be used for any purpose whatsoever."

Spark on Lambda - README


AWS Lambda is a serverless Function-as-a-Service offering that scales up quickly and bills usage at 100 ms granularity. We thought it would be interesting to see whether we could get Apache Spark to run on Lambda. To validate the idea, we hacked together a prototype, and we were able to make it work by changing Spark's scheduler and shuffle code. Because AWS Lambda imposes a 5-minute maximum run time, shuffle data cannot live on the executors for the duration of a job, so we modified the shuffle parts of Spark to shuffle over an external store such as S3.
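The shuffle-over-external-storage idea can be sketched in a few lines. This is not the Scala implementation in this repo; it is a minimal Python illustration in which a shared local directory stands in for S3: map tasks write one partitioned output object per reducer to the shared store, and reduce tasks later read them back, so no executor has to stay alive or accept connections from its peers.

```python
import os
import pickle
import tempfile

def map_task(shuffle_dir, map_id, records, num_reducers):
    """Partition this map task's output by key and write one object per
    reducer to the shared store (a local dir here; S3 in Spark on Lambda)."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in records:
        buckets[hash(key) % num_reducers].append((key, value))
    for reduce_id, bucket in enumerate(buckets):
        path = os.path.join(shuffle_dir, f"shuffle_{map_id}_{reduce_id}")
        with open(path, "wb") as f:
            pickle.dump(bucket, f)

def reduce_task(shuffle_dir, reduce_id, num_mappers):
    """Fetch this reducer's partition from every map output in the store."""
    merged = {}
    for map_id in range(num_mappers):
        path = os.path.join(shuffle_dir, f"shuffle_{map_id}_{reduce_id}")
        with open(path, "rb") as f:
            for key, value in pickle.load(f):
                merged.setdefault(key, []).append(value)
    return merged

shuffle_dir = tempfile.mkdtemp()
map_task(shuffle_dir, 0, [("a", 1), ("b", 2)], num_reducers=2)
map_task(shuffle_dir, 1, [("a", 3), ("c", 4)], num_reducers=2)
# Keys are hash-partitioned, so the two reducers hold disjoint key sets.
groups = {**reduce_task(shuffle_dir, 0, 2), **reduce_task(shuffle_dir, 1, 2)}
```

Spark on Lambda does roughly this at the shuffle layer, with objects under the bucket named by spark.shuffle.s3.bucket taking the place of the local paths above.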

This is a prototype; it is not battle tested and may have bugs. The changes are made against open-source Apache Spark 2.1.0. We also have a fork of Spark 2.2.0 with a few remaining bugs, which will be pushed here soon. We welcome contributions from developers.

For users who want to try it out:

Bring up an EC2 machine in a VPC, with AWS credentials (~/.aws/credentials) that can invoke the Lambda function. Right now the credentials file is the only supported way of loading Lambda credentials with AWSLambdaClient. The Spark driver will run on this machine, so also configure a security group for it.

Spark on Lambda package for the driver [s3://public-qubole/lambda/spark-2.1.0-bin-spark-lambda-2.1.0.tgz] - download this to the EC2 instance where the driver will be launched; since the driver is generally long running, it needs to run inside an EC2 instance.

Create a Lambda function named spark-lambda from the AWS console using (https://github.com/qubole/spark-on-lambda/bin/lambda/spark-lambda-os.py), and configure the function's VPC and subnet to be the same as those of the EC2 machine. Right now executors register with the Spark driver using private IPs, but this could be changed to use public IPs, in which case the driver could even run on a Mac, a PC, or any VM. It would also be nice to have a Docker container with the package that works out of the box.

Also configure the security group of the Lambda function to be the same as that of the EC2 machine. Note: the Lambda role should have access to [s3://public-qubole/]. If you want to copy the packages to your own bucket, use:

aws s3 cp s3://public-qubole/lambda/spark-lambda-149.zip s3://YOUR_BUCKET/
aws s3 cp s3://public-qubole/lambda/spark-2.1.0-bin-spark-lambda-2.1.0.tgz s3://YOUR_BUCKET/

Spark on Lambda package for executors, launched inside Lambda [s3://public-qubole/lambda/spark-lambda-149.zip] - this is used on the Lambda (executor) side. To use this package on the Lambda side, pass Spark configs like the ones below:

    1. spark.lambda.s3.bucket s3://public-qubole/
    2. spark.lambda.function.name spark-lambda
    3. spark.lambda.spark.software.version 149
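These settings can also be passed on the command line as --conf flags. A tiny helper (hypothetical, not part of this repo) that renders a settings dict into the flag list expected by spark-shell or spark-submit:

```python
def conf_flags(settings):
    """Render Spark settings as a flat list of --conf key=value flags."""
    return [arg for key, value in settings.items()
            for arg in ("--conf", f"{key}={value}")]

# The three Lambda-side settings from the README:
lambda_settings = {
    "spark.lambda.s3.bucket": "s3://public-qubole/",
    "spark.lambda.function.name": "spark-lambda",
    "spark.lambda.spark.software.version": "149",
}
flags = conf_flags(lambda_settings)
# e.g. subprocess.run(["/usr/lib/spark/bin/spark-shell", *flags])
```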

Launch spark-shell, filling in your AWS keys:

/usr/lib/spark/bin/spark-shell --conf spark.hadoop.fs.s3n.awsAccessKeyId= --conf spark.hadoop.fs.s3n.awsSecretAccessKey=

Spark on Lambda configs (spark-defaults.conf)

spark.shuffle.s3.enabled true
spark.shuffle.s3.bucket s3://  -- Bucket to write shuffle (intermediate) data
spark.lambda.s3.bucket s3://public-qubole/  
spark.lambda.concurrent.requests.max 50
spark.lambda.function.name spark-lambda
spark.lambda.spark.software.version 149
spark.hadoop.fs.s3n.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.AbstractFileSystem.s3.impl org.apache.hadoop.fs.s3a.S3A
spark.hadoop.fs.AbstractFileSystem.s3n.impl org.apache.hadoop.fs.s3a.S3A
spark.hadoop.fs.AbstractFileSystem.s3a.impl org.apache.hadoop.fs.s3a.S3A
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 2

For developers who want to make changes:

To compile

./dev/make-distribution.sh --name spark-lambda-2.1.0 --tgz -Phive -Phadoop-2.7 -Dhadoop.version=2.6.0-qds-0.4.13 -DskipTests 

The aws-java-sdk-1.7.4.jar used by hadoop-aws.jar and aws-java-sdk-core-1.1.0.jar have compatibility issues, so as of now we have to compile against the Qubole-shaded hadoop-aws-2.6.0-qds-0.4.13.jar.

To create lambda package for executors

bash -x bin/lambda/spark-lambda 149 spark-2.1.0-bin-spark-lambda-2.1.0.tgz s3://public-qubole/

Here 149 maps to the config value of spark.lambda.spark.software.version and s3://public-qubole/ maps to the config value of spark.lambda.s3.bucket.

(spark/bin/lambda/spark-lambda-os.py) is the helper Lambda function used to bootstrap the Lambda environment with the Spark packages needed to run executors.
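Conceptually, the packaging step just zips up a Spark distribution so the helper function can unpack it inside the Lambda container. A rough Python sketch of that step, using a stand-in local "distribution" directory (the real bin/lambda/spark-lambda script also uploads the archive to the spark.lambda.s3.bucket; all paths here are illustrative):

```python
import os
import tempfile
import zipfile

def package_for_lambda(dist_dir, out_zip):
    """Zip a Spark distribution directory into an executor package,
    preserving paths relative to the distribution root."""
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(dist_dir):
            for name in files:
                path = os.path.join(root, name)
                zf.write(path, os.path.relpath(path, dist_dir))
    return out_zip

# Demo with a fake one-file distribution:
dist = tempfile.mkdtemp()
os.makedirs(os.path.join(dist, "bin"))
with open(os.path.join(dist, "bin", "spark-class"), "w") as f:
    f.write("#!/bin/sh\n")
archive = package_for_lambda(
    dist, os.path.join(tempfile.mkdtemp(), "spark-lambda-149.zip"))
```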

The Lambda function above has to be created inside the same VPC as the EC2 instance where the driver is brought up, so that the driver and the executors (Lambda functions) can communicate.

References

  1. http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/

More Repositories

1. sparklens - Qubole Sparklens tool for performance tuning Apache Spark (Scala, 562 stars)
2. rubix - Cache File System optimized for columnar formats and object stores (Java, 182 stars)
3. kinesis-sql - Kinesis Connector for Structured Streaming (Scala, 137 stars)
4. afctl - afctl helps to manage and deploy Apache Airflow projects faster and smoother. (Python, 130 stars)
5. presto-udfs - Plugin for Presto to allow addition of user functions easily (Java, 115 stars)
6. quark - Quark is a data virtualization engine over analytic databases. (Java, 98 stars)
7. streamx - kafka-connect-s3: Ingest data from Kafka to object stores (S3) (Java, 97 stars)
8. spark-acid - ACID Data Source for Apache Spark based on Hive ACID (Scala, 96 stars)
9. qds-sdk-py - Python SDK for accessing Qubole Data Service (Python, 51 stars)
10. uchit (Python, 29 stars)
11. streaminglens - Qubole Streaminglens tool for tuning Spark Structured Streaming pipelines (Scala, 17 stars)
12. s3-sqs-connector - A library for reading data from Amazon S3 with optimized listing using Amazon SQS, for Spark SQL Streaming (Structured Streaming) (Scala, 17 stars)
13. spark-state-store - RocksDB state storage implementation for Structured Streaming (Scala, 16 stars)
14. presto-kinesis - Presto connector to Amazon Kinesis service (Java, 14 stars)
15. kinesis-storage-handler - Hive Storage Handler for Kinesis (Java, 11 stars)
16. qds-sdk-java - A Java library that provides the tools you need to authenticate with, and use, the Qubole Data Service API (Java, 7 stars)
17. demotrends - Code required to set up the demo trends website (http://demotrends.qubole.com) (Ruby, 6 stars)
18. qubole-terraform (HCL, 6 stars)
19. space-ui - UI Ember components based on Space design specs (JavaScript, 5 stars)
20. caching-metastore-client - A metastore client that caches objects (Java, 5 stars)
21. rubix-admin - Admin scripts for Rubix (Python, 5 stars)
22. tco (Python, 4 stars)
23. qds-sdk-R - R extension to execute Hive commands through the Qubole Data Service Python SDK (Python, 4 stars)
24. docker-images - Qubole Docker images (Dockerfile, 4 stars)
25. tableau-qubole-connector (JavaScript, 3 stars)
26. metriks-addons - Utilities for collecting metrics in a Rails application (Ruby, 3 stars)
27. qds-sdk-ruby - Ruby SDK for Qubole API (Ruby, 3 stars)
28. qubole-log-datasets (3 stars)
29. hubot-qubole - Interaction with Qubole Data Services APIs via the Hubot framework (CoffeeScript, 3 stars)
30. customer-success (HCL, 2 stars)
31. bootstrap-functions - Useful functions for Qubole cluster bootstraps (Shell, 2 stars)
32. qubole-jar-test - A Maven project to test that Qubole jars can be listed as dependencies (Java, 2 stars)
33. etl-examples (Scala, 2 stars)
34. perf-kit-queries (2 stars)
35. tuning-paper (TeX, 2 stars)
36. blogs (1 star)
37. jupyter (1 star)
38. presto-event-listeners (1 star)
39. qubole-rstudio-example (1 star)
40. presto - Presto (Java, 1 star)
41. qubole.github.io - Qubole OSS Page (1 star)
42. quboletsdb - Setup OpenTSDB using Qubole (Python, 1 star)