• This repository has been archived on 08/May/2024
  • Stars: 2,691
  • Rank: 16,967 (top 0.4%)
  • Language: Python
  • License: BSD 3-Clause "New...
  • Created over 12 years ago
  • Updated almost 4 years ago

Repository Details

A Python clone of Spark, a MapReduce-like framework.

DPark

Join the chat at https://gitter.im/douban/dpark

DPark is a Python clone of Spark: a MapReduce(R)-like computing framework that supports iterative computation.

Installation

## DPark uses C extensions, so some libraries need to be installed first.

$ sudo apt-get install libtool pkg-config build-essential autoconf automake
$ sudo apt-get install python-dev
$ sudo apt-get install libzmq-dev

## Then install DPark with pip (``sudo`` may be needed if you encounter a permission problem).

$ pip install dpark

Example

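Word counting (wc.py):

from dpark import DparkContext

ctx = DparkContext()
lines = ctx.textFile("/tmp/words.txt")
# Split each line into words and pair every word with a count of 1.
words = lines.flatMap(lambda x: x.split()).map(lambda x: (x, 1))
# Sum the counts per word and collect the result as a dict.
wc = words.reduceByKey(lambda x, y: x + y).collectAsMap()
print(wc)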

This script can run locally or on a Mesos cluster without modification; only the command-line arguments differ:

$ python wc.py
$ python wc.py -m process
$ python wc.py -m host[:port]

See examples/ for more use cases.
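To make the data flow of wc.py concrete, here is a minimal pure-Python sketch of what the flatMap, map, and reduceByKey steps compute on a single machine. It does not use DPark at all; the helper name word_count_local and the sample lines are hypothetical and exist only for illustration.

```python
from itertools import chain


def word_count_local(lines):
    """Mimic flatMap -> map -> reduceByKey on an in-memory list of lines."""
    # flatMap: split every line into words and flatten the result.
    words = chain.from_iterable(line.split() for line in lines)
    # map: pair each word with an initial count of 1.
    pairs = ((word, 1) for word in words)
    # reduceByKey: combine the counts of equal keys with x + y.
    counts = {}
    for word, n in pairs:
        counts[word] = counts[word] + n if word in counts else n
    return counts


if __name__ == "__main__":
    print(word_count_local(["a b a", "b c"]))
```

In DPark the same three steps are distributed across the splits of the input; the sketch only shows the per-key arithmetic.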

Configuration

DPark can run with Mesos 0.9 or higher.

If the $MESOS_MASTER environment variable is set, you can use a shortcut and run DPark on Mesos by typing:

$ python wc.py -m mesos

$MESOS_MASTER can use any Mesos master URL scheme, such as:

$ export MESOS_MASTER=zk://zk1:2181,zk2:2181,zk3:2181/mesos_master
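As an illustration of how a zk:// master URL like the one above is structured, the following sketch splits it into its ZooKeeper hosts and znode path. This is not the parser used by DPark or Mesos; parse_zk_master is a hypothetical helper written only to show the layout of the URL.

```python
def parse_zk_master(url):
    """Split a zk:// master URL into (list of host:port strings, znode path)."""
    prefix = "zk://"
    if not url.startswith(prefix):
        raise ValueError("not a zk:// URL: %r" % url)
    rest = url[len(prefix):]
    # Everything before the first "/" is the comma-separated host list;
    # the remainder (including the "/") is the znode path.
    hosts, sep, path = rest.partition("/")
    return hosts.split(","), sep + path


if __name__ == "__main__":
    print(parse_zk_master("zk://zk1:2181,zk2:2181,zk3:2181/mesos_master"))
```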

To speed up shuffling, deploy Nginx on port 5055 to serve the data in DPARK_WORK_DIR (default: /tmp/dpark), for example:

server {
        listen 5055;
        server_name localhost;
        root /tmp/dpark/;
}

UI

The UI shows two DAGs:

  1. Stage graph: a stage is a unit of execution that contains a set of tasks, each of which runs the same operations on one split of an RDD.
  2. User API call-site graph.
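The relationship between a stage, its tasks, and the splits of an RDD can be sketched in plain Python as follows. The Stage and Task classes here are hypothetical toy models, not DPark's own classes; they only illustrate "one task per split, same operations in every task".

```python
class Task:
    """One task: applies the stage's ops, in order, to a single split."""

    def __init__(self, split, ops):
        self.split = split
        self.ops = ops

    def run(self):
        data = self.split
        for op in self.ops:
            data = op(data)
        return list(data)


class Stage:
    """A stage: one task per split, all sharing the same list of ops."""

    def __init__(self, splits, ops):
        self.tasks = [Task(split, ops) for split in splits]

    def run(self):
        return [task.run() for task in self.tasks]


if __name__ == "__main__":
    # Two ops pipelined inside the stage: double every value, then filter.
    ops = [
        lambda it: (x * 2 for x in it),
        lambda it: (x for x in it if x > 2),
    ]
    print(Stage([[1, 2], [3, 4]], ops).run())
```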

UI when running

Open the URL printed in the log, in a line such as "start listening on Web UI http://server_01:40812".

UI after running

  1. Before running, configure LOGHUB and LOGHUB_PATH_FORMAT in dpark.conf, and pre-create LOGHUB_DIR.
  2. Find the loghub directory in the log, in a line such as "logging/prof to LOGHUB_DIR/2018/09/27/16/b2e3349b-9858-4153-b491-80699c757485-8754"; the path includes the Mesos framework ID.
  3. Run dpark_web.py -p 9999 -l LOGHUB_DIR/2018/09/27/16/b2e3349b-9858-4153-b491-80699c757485-8728/ (dpark_web.py is in tools/).

UI examples for features

Showing shared shuffle map output:

rdd = DparkContext().makeRDD([(1,1)]).map(m).groupByKey()
rdd.map(m).collect()
rdd.map(m).collect()

images/share_mapoutput.png
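One way to picture what "sharing" means here: the map-side output of the groupByKey is produced once and then reused by both downstream jobs. The toy class below illustrates that idea under this assumption; GroupedRDD and its map_runs counter are hypothetical and do not reflect DPark's internals.

```python
class GroupedRDD:
    """Toy stand-in for a groupByKey result whose map-side output is cached."""

    def __init__(self, pairs):
        self.pairs = pairs
        self.map_runs = 0        # how many times the map side was computed
        self._map_output = None  # cached shuffle map output

    def _shuffle_map_output(self):
        # Compute the grouped output only on first use, then reuse it.
        if self._map_output is None:
            self.map_runs += 1
            grouped = {}
            for key, value in self.pairs:
                grouped.setdefault(key, []).append(value)
            self._map_output = grouped
        return self._map_output

    def map(self, func):
        return [func(item) for item in self._shuffle_map_output().items()]


if __name__ == "__main__":
    rdd = GroupedRDD([(1, 1), (1, 2), (2, 3)])
    rdd.map(lambda kv: (kv[0], sum(kv[1])))
    rdd.map(lambda kv: (kv[0], len(kv[1])))
    print(rdd.map_runs)
```

Both calls to map go through the same cached grouping, so the map side runs only once.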

Nodes are combined if and only if they share the same lineage, forming a logical tree inside a stage; each node then contains a PIPELINE of RDDs.

rdd1 = get_rdd()
rdd2 = dc.union([get_rdd() for i in range(2)])
rdd3 = get_rdd().groupByKey()
dc.union([rdd1, rdd2, rdd3]).collect()

images/unions.png

More docs (in Chinese)

https://dpark.readthedocs.io/zh_CN/latest/

https://github.com/jackfengji/test_pro/wiki

Mailing list: http://groups.google.com/group/dpark-users

More Repositories

  1. DOUAudioStreamer (Objective-C, 2,768 stars): A Core Audio based streaming audio player for iOS and macOS
  2. code (CSS, 1,811 stars): [DEPRECATED]Douban CODE
  3. beansdb (C, 870 stars): Archived, see GoBeansDB instead.
  4. douban-client (Python, 744 stars): Python client library for Douban APIs (OAuth 2.0)
  5. rexxar-android (Java, 667 stars): Mobile Hybrid Framework Rexxar Android Container
  6. rexxar-ios (Objective-C, 578 stars): Mobile Hybrid Framework Rexxar iOS Container
  7. FRDIntent (Swift, 492 stars): A framework for handle the call between view controllers in iOS
  8. gobeansdb (Go, 451 stars): Distributed object storage server from Douban Inc.
  9. libmc (C++, 442 stars): Fast and light-weight memcached client for C++ / #python / #golang #libmc
  10. greenify (C, 427 stars): Make blocking C library work with gevent
  11. ynm3k (JavaScript, 410 stars): UI Automation + YUItest driven acceptance tests that can be hooked into Jenkins
  12. paracel (C++, 337 stars): Distributed training framework with parameter server
  13. douban-objc-client (Objective-C, 254 stars): Objective-C client library for Douban APIs (OAuth 2.0)
  14. beanseye (Go, 233 stars): Proxy and monitor for beansdb in Go
  15. rexxar-web (JavaScript, 206 stars): Mobile Hybrid Framework Rexxar Web SDK
  16. Kenshin (Python, 206 stars): Kenshin: A time-series database alternative to Graphite Whisper with 40x improvement in IOPS
  17. tfmesos (Python, 191 stars): Tensorflow in Docker on Mesos #tfmesos #tensorflow #mesos
  18. pymesos (Python, 163 stars): A pure python implementation of Mesos scheduler and executor
  19. brownant (Python, 159 stars): Brownant is a web data extracting framework.
  20. linguist (Python, 153 stars): Language Savant, Python clone of github/linguist.
  21. graph-index (Python, 129 stars): index of Graphite & Diamond
  22. CaoE (Python, 101 stars): Kill all children processes when the parent dies
  23. douban-quixote (Python, 82 stars): Douban's Quixote
  24. douban-utils (Python, 59 stars): Douban's Utils
  25. python-libmemcached (Python, 57 stars): DEPRECATED, use https://github.com/douban/libmc instead. python-libmemcached is a python extention for libmemcached
  26. PyCharlockHolmes (Common Lisp, 50 stars): Character encoding detecting library for Python using ICU and libmagic.
  27. DOUSNSSharing (Objective-C, 47 stars): SNS OAuth 2 binding and sharing
  28. ellen (Python, 41 stars): Ellen is a wrapper of pygit2 and git command.
  29. Polymorph (Objective-C, 40 stars): Transform value of dictionary to property of Objective-C class, by using a `dynamic` like directive.
  30. douban-sqlstore (Python, 31 stars): Douban's MySQL lib.
  31. gpack (Python, 30 stars): GIT Smart HTTP Server Rack Implementation, Python clone of https://github.com/schacon/grack
  32. douban-orz (Python, 29 stars): The Missing Data Manager In Douban
  33. douban-mc (Python, 27 stars): Douban's Memcached lib for python.
  34. charts (Smarty, 24 stars): Helm charts from douban
  35. helpdesk (Python, 22 stars): Yet another helpdesk based on multiple providers
  36. sina (Python, 21 stars): A GIT Smart HTTP Server WSGI Implementation.
  37. sa-tools-core (Python, 18 stars): Handy tools for sysadmin.
  38. graphite-kenshin (Python, 16 stars): A plugin for using graphite-web with the kenshin-based storage backend.
  39. gobeansproxy (Go, 13 stars): A proxy for GoBeansDB
  40. beansdbadmin (Python, 9 stars): GoBeansDB Admin UI
  41. redarrow-rs (Rust, 6 stars): A command dispatcher to run executables remotely and safely.
  42. MTURLProtocol (Objective-C, 4 stars): Multiple NSURLProtocol subclasses alternative solution.
  43. python-libmagic (Python, 3 stars): A wrapper for libmagic with static build.
  44. qiniu-exporter (Go, 2 stars)
  45. aliyun-exporter (Go, 2 stars)
  46. pyquicklz (C, 1 star)
  47. upyun-exporter (Go, 1 star)
  48. sa-tools-go (Go, 1 star): go version for sa-tools