# Spark Performance Tests

This is a performance testing framework for Apache Spark 1.0+.
## Features
- Suites of performance tests for Spark, PySpark, Spark Streaming, and MLlib.
- Parameterized test configurations:
  - Sweeps sets of parameters to test against multiple Spark and test configurations.
- Automatically downloads and builds Spark:
  - Maintains a cache of successful builds to enable rapid testing against multiple Spark versions.
- [...]
For questions, bug reports, or feature requests, please open an issue on GitHub.
## Coverage
- Spark Core RDD
  - list coming soon
- SQL and DataFrames
  - coming soon
- Machine Learning
  - glm-regression: Generalized Linear Regression Model
  - glm-classification: Generalized Linear Classification Model
  - naive-bayes: Naive Bayes
  - naive-bayes-bernoulli: Bernoulli Naive Bayes
  - decision-tree: Decision Tree
  - als: Alternating Least Squares
  - kmeans: K-Means clustering
  - gmm: Gaussian Mixture Model
  - svd: Singular Value Decomposition
  - pca: Principal Component Analysis
  - summary-statistics: Summary Statistics (min, max, ...)
  - block-matrix-mult: Matrix Multiplication
  - pearson: Pearson's Correlation
  - spearman: Spearman's Correlation
  - chi-sq-feature/gof/mat: Chi-square Tests
  - word2vec: Word2Vec distributed representation of words
  - fp-growth: FP-growth frequent item sets
  - python-glm-classification: Generalized Linear Classification Model
  - python-glm-regression: Generalized Linear Regression Model
  - python-naive-bayes: Naive Bayes
  - python-als: Alternating Least Squares
  - python-kmeans: K-Means clustering
  - python-pearson: Pearson's Correlation
  - python-spearman: Spearman's Correlation
## Dependencies

The `spark-perf` scripts require Python 2.7+. If you're using an earlier version of Python, you may need to install the `argparse` library using `easy_install argparse`.

Support for automatically building Spark requires Maven. On `spark-ec2` clusters, Maven can be installed using the `./bin/spark-ec2/install-maven` script from this project.
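As a quick sanity check, a shell session along these lines (hypothetical; the exact commands depend on your environment) verifies the prerequisites before running any tests:

```bash
# Check that Python is 2.7 or newer.
python --version

# Only on Python < 2.7: install the argparse backport.
easy_install argparse

# Only on a spark-ec2 cluster: install Maven (needed for automatic Spark builds).
./bin/spark-ec2/install-maven

# Confirm Maven is on the PATH.
mvn --version
```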
## Configuration

To configure `spark-perf`, copy `config/config.py.template` to `config/config.py` and edit that file. See `config.py.template` for detailed configuration instructions. After editing `config.py`, execute `./bin/run` to run the performance tests. You can pass the `--config` option to use a custom configuration file.
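In shell terms, the basic workflow looks roughly like this (the custom config path below is a placeholder):

```bash
# Create an editable config from the template.
cp config/config.py.template config/config.py

# Edit config/config.py to taste, then run the test suites.
./bin/run

# Or point the runner at a custom configuration file instead.
./bin/run --config /path/to/my/config.py
```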
The following sections describe some additional settings to change for certain test environments:
### Running locally

- Set up a local SSH server/keys such that `ssh localhost` works on your machine without a password (one way to do this is sketched after this list).
- Set `config.py` options that are friendly for local execution:

  ```
  SPARK_HOME_DIR = /path/to/your/spark
  SPARK_CLUSTER_URL = "spark://%s:7077" % socket.gethostname()
  SCALE_FACTOR = .05
  SPARK_DRIVER_MEMORY = 512m
  spark.executor.memory = 2g
  ```

- Uncomment at least one `SPARK_TESTS` entry.
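For the passwordless `ssh localhost` requirement, a minimal sketch assuming OpenSSH and no existing key you'd rather reuse:

```bash
# Generate a passphrase-less key pair (skip if you already have one).
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# Authorize that key for logins to this machine.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys

# This should now run without prompting for a password.
ssh localhost true
```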
### Running on an existing Spark cluster

- SSH into the machine hosting the standalone master.
- Set `config.py` options:

  ```
  SPARK_HOME_DIR = /path/to/your/spark/install
  SPARK_CLUSTER_URL = "spark://<your-master-hostname>:7077"
  SCALE_FACTOR = <depends on your hardware>
  SPARK_DRIVER_MEMORY = <depends on your hardware>
  spark.executor.memory = <depends on your hardware>
  ```

- Uncomment at least one `SPARK_TESTS` entry.
### Running on a spark-ec2 cluster with a custom Spark version

- Launch an EC2 cluster with Spark's EC2 scripts (an example invocation is sketched after this list).
- Set `config.py` options:

  ```
  USE_CLUSTER_SPARK = False
  SPARK_COMMIT_ID = <what you want to test>
  SCALE_FACTOR = <depends on your hardware>
  SPARK_DRIVER_MEMORY = <depends on your hardware>
  spark.executor.memory = <depends on your hardware>
  ```

- Uncomment at least one `SPARK_TESTS` entry.
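For illustration, launching a cluster with the `spark-ec2` script that ships in the `ec2/` directory of a Spark 1.x checkout looked roughly like this (the key pair, identity file, cluster size, and cluster name are all placeholders):

```bash
./spark-ec2 --key-pair=my-keypair \
            --identity-file=/path/to/my-keypair.pem \
            --slaves=4 \
            launch spark-perf-cluster
```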
## License

This project is licensed under the Apache 2.0 License. See LICENSE for the full license text.