  • Stars: 260
  • Rank: 157,189 (Top 4%)
  • Language: Python
  • License: Other
  • Created: over 9 years ago
  • Updated: about 2 years ago


Repository Details

A pure Python implementation of Apache Spark's RDD and DStream interfaces.
https://raw.githubusercontent.com/svenkreiss/pysparkling/master/logo/logo-w100.png

pysparkling

Pysparkling provides a faster, more responsive way to develop programs for PySpark. It enables code intended for Spark applications to execute entirely in Python, without the overhead of initializing the JVM and Hadoop and passing data through them. The focus is on a lightweight, fast implementation for small datasets, at the expense of some data-resilience and parallel-processing features.

How does it work? To switch execution of a script from PySpark to pysparkling, have the code initialize a pysparkling Context instead of a SparkContext, and use the pysparkling Context to set up your RDDs. The beauty is you don't have to change a single line of code after the Context initialization, because pysparkling's API is (almost) exactly the same as PySpark's. Since it's so easy to switch between PySpark and pysparkling, you can choose the right tool for your use case.

When would I use it? Say you are writing a Spark application because you need robust computation on huge datasets, but you also want the same application to provide fast answers on a small dataset. You find that Spark is not responsive enough for your needs, but you don't want to rewrite an entire separate application for the small-answers-fast problem. You'd rather reuse your Spark code but somehow get it to run fast. Pysparkling bypasses the JVM and Hadoop machinery that causes Spark's long startup times and less responsive feel.

Here are a few areas where pysparkling excels:

  • Small to medium-scale exploratory data analysis
  • Application prototyping
  • Low-latency web deployments
  • Unit tests

Install

python3 -m pip install "pysparkling[s3,hdfs,http,streaming]"

Documentation:

https://raw.githubusercontent.com/svenkreiss/pysparkling/master/docs/readthedocs.png

Other links: GitHub, PyPI, test status, Documentation Status

Features

  • Supports the URI schemes s3://, hdfs://, gs://, http:// and file:// for Amazon S3, HDFS, Google Cloud Storage, web and local file access. Multiple files can be specified as a comma-separated list, and * and ? wildcards are resolved.
  • Handles .gz, .zip, .lzma, .xz, .bz2, .tar, .tar.gz and .tar.bz2 compressed files, and supports reading .7z files.
  • Parallelization via multiprocessing.Pool, concurrent.futures.ThreadPoolExecutor, or any other Pool-like object with a map(func, iterable) method.
  • Plain pysparkling has no dependencies (use pip install pysparkling). Some file access methods have optional dependencies: boto for AWS S3, requests for HTTP, and hdfs for HDFS.

Examples

Some demos are in the notebooks docs/demo.ipynb and docs/iris.ipynb.

Word Count

from pysparkling import Context

counts = (
    Context()
    .textFile('README.rst')
    .map(lambda line: ''.join(ch if ch.isalnum() else ' ' for ch in line))
    .flatMap(lambda line: line.split(' '))
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)
print(counts.collect())

which prints a long list of pairs of words and their counts.

More Repositories

  1. html5validator (Python, 309 stars): Command line tool to validate HTML5 files. Great for continuous integration.
  2. unicodeit (Python, 248 stars): Converts LaTeX tags to unicode: \mathcal{H} → ℋ. Available on the web or as an Automator script for the Mac.
  3. socialforce (Jupyter Notebook, 118 stars): Differentiable Social Force simulation with universal interaction potentials.
  4. databench (Python, 83 stars): Data analysis tool.
  5. databench_examples (Python, 28 stars): Example analyses for Databench.
  6. localcrawl (Python, 8 stars): Crawl and render JavaScript templates.
  7. pelican-jsmath (Python, 7 stars): Pass math to JavaScript renderers.
  8. pelican-theme-validator (Python, 4 stars): Automatically create git branches with the output of Pelican builds. Connect to Travis CI and show an overview of the status.
  9. databench_spark_test (Python, 4 stars): Demo of using PySpark and Databench together.
  10. decouple (Python, 4 stars): Decouple and recouple.
  11. databench_go (Go, 3 stars): Go language kernel for Databench. Write your data analysis in Go and visualize and interact with it in the browser.
  12. decoupledDemo (Python, 2 stars): Demo of recoupling a decoupled project. Effective likelihoods and template parametrizations are hosted on the web.
  13. dvds-js (JavaScript, 2 stars): Distributed versioned data structures implemented in JavaScript for browsers and Node.js.
  14. svenkreiss.github.io (HTML, 2 stars): Personal website.
  15. PyROOTUtils (Python, 2 stars): Python utilities for ROOT.
  16. dockbroker (Go, 2 stars): Experimental. Clients ask dockbrokers for an "offer" stating how much it would cost to execute a job, then pick the cheapest. Brokers that have the data locally available or have already cached part of the Docker image are cheaper and therefore preferred. Estimated time to completion also affects the price. Clients build reputation with brokers when they deliver on time, and brokers build reputation with clients when their job descriptions and estimated run-times are accurate.
  17. LHCHiggsCouplings (Python, 1 star): Python interface to the numbers published by the LHC Higgs Cross Section Working Group in Yellow Report 3.
  18. databench_examples_viewer (Python, 1 star): Runs on Heroku.
  19. BatchLikelihoodScan (Python, 1 star): Creates (profile) likelihood scans of RooFit/RooStats models in any dimension, locally or on batch systems.
  20. docker-flask-gevent (Shell, 1 star): Ubuntu base; an image with python-dev and all 174 dependencies already built.