r³ is a map-reduce engine written in Python that uses Redis as its backend. Its purpose is to be simple.

r³ has only three concepts to grasp: input streams, mappers and reducers.

The diagram below relates how they interact:

[Diagram: r³ components interaction]

If the diagram above is a little too much to grasp right now, don't worry. Keep reading and use this diagram later for reference.

A fairly simple map-reduce example to solve is counting the number of occurrences of each word in an extensive document. We'll use this scenario as our example.
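
Before looking at r³ itself, the word-counting scenario can be sketched in plain Python. This is a minimal in-memory illustration of the three roles (input stream, mapper, reducer), not r³'s implementation; the function names and the sample text are made up for this example.

```python
from collections import defaultdict

# Hypothetical sample text standing in for an "extensive document".
DOCUMENT = """The quick brown fox
jumps over the lazy dog
The dog barks"""


def input_stream(document):
    """Input stream: split the document into lowercase lines."""
    return [line.lower() for line in document.splitlines()]


def map_lines(lines):
    """Mapper: emit a (word, 1) pair for every word in the given lines."""
    return [(word, 1) for line in lines for word in line.split()]


def reduce_pairs(mapped_groups):
    """Reducer: sum the counts of each word across all mapped groups."""
    totals = defaultdict(int)
    for group in mapped_groups:
        for word, count in group:
            totals[word] += count
    return dict(totals)


lines = input_stream(DOCUMENT)
# Pretend two mappers each handled a slice of the lines.
mapped = [map_lines(lines[:2]), map_lines(lines[2:])]
word_counts = reduce_pairs(mapped)
print(word_counts["the"], word_counts["dog"])  # prints: 3 2
```

The rest of this document shows how r³ spreads the same three steps across a web service, Redis and independent mapper processes.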

Installing

Installing r³ is as easy as:

pip install r3

After successful installation, you'll have three new commands: r3-app, r3-map and r3-web.

Running the App

To use r³ you need a running Redis database. Setting one up on your system is beyond the scope of this document.

We'll assume you have one running at 127.0.0.1, port 7778 and configured to require the password 'r3' using database 0.

The service that is at the heart of r³ is r3-app. It is the web-server that will receive requests for map-reduce jobs and return the results.

To run r3-app, given the above redis back-end, type:

r3-app --redis-port=7778 --redis-pass=r3 -c config.py

We'll learn more about the configuration file below.

Given that you have a proper configuration file, your r3 service will be available at http://localhost:9999.

We'll see how to actually perform a map-reduce operation after the Running Mappers section.

App Configuration

In the above section we specified a file called config.py as configuration. Now we'll see what that file contains.

The configuration file that we pass to the r3-app command is responsible for specifying input stream processors and reducers that should be enabled.

Let's see a sample configuration file:

INPUT_STREAMS = [
    'test.count_words_stream.CountWordsStream'
]

REDUCERS = [
    'test.count_words_reducer.CountWordsReducer'
]

This configuration specifies that there should be a CountWordsStream input stream processor and a CountWordsReducer reducer. Both will be used by the stream service to perform a map-reduce operation.
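
Each entry in INPUT_STREAMS and REDUCERS is a dotted path to a class. How r3-app resolves these strings is not shown in this document; the sketch below is one common way to do it with importlib, and the helper name load_class is hypothetical.

```python
from importlib import import_module


def load_class(dotted_path):
    """Resolve 'package.module.ClassName' to the class object it names."""
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError("expected 'module.ClassName', got %r" % dotted_path)
    module = import_module(module_path)
    return getattr(module, class_name)


# Demonstrated with a standard-library class so the example runs anywhere.
# With the sample config you would pass
# 'test.count_words_stream.CountWordsStream' instead.
ordered_dict_class = load_class("collections.OrderedDict")
print(ordered_dict_class.__name__)  # prints: OrderedDict
```

One consequence of this style of configuration is that the modules named in the config file must be importable from wherever r3-app is started.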

We'll learn more about input streams and reducers in the sections below.

The input stream

The input stream processor is the class responsible for creating the input streams upon which the mapping will occur.

In our word-counting example, the input stream processor class should open the document, read its lines and return each line to r3-app.

Let's see a possible implementation:

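from os.path import abspath, dirname, join

class CountWordsStream:
    # Ties this input stream to the mappers and reducer of the same job type.
    job_type = 'count-words'
    # Number of items r³ batches together for each mapper.
    group_size = 1000

    def process(self, app, arguments):
        # Read the sample document that sits next to this module.
        with open(abspath(join(dirname(__file__), 'chekhov.txt'))) as f:
            contents = f.readlines()

        # Lowercase each line so that counting is case-insensitive.
        return [line.lower() for line in contents]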

The job_type property is required and specifies the relationship that this input stream has with mappers and with a specific reducer.

The group_size property specifies the size of each input stream. In the above example, our input stream processor returns all the lines in the document, but r³ will group the resulting lines into batches of 1000 to be processed by each mapper. The right group size varies widely depending on what your mapping does.
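
To make the effect of group_size concrete, here is a sketch of how a list of lines could be split into batches. The helper group_items is hypothetical and only illustrates the batching idea; r³'s internal grouping may differ.

```python
def group_items(items, group_size):
    """Split items into consecutive batches of at most group_size elements."""
    if group_size < 1:
        raise ValueError("group_size must be a positive integer")
    return [items[i:i + group_size] for i in range(0, len(items), group_size)]


# 2500 fake lines with group_size = 1000 give three batches.
lines = ["line %d" % n for n in range(2500)]
batches = group_items(lines, 1000)
print([len(batch) for batch in batches])  # prints: [1000, 1000, 500]
```

A smaller group size produces more, lighter units of work for the mappers; a larger one produces fewer, heavier units.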

Running Mappers

Input stream processors and reducers are sequential and thus run in-process in the r³ app. Mappers, on the other hand, are inherently parallel and are run on their own as independent worker units.

Considering the above example of input stream and reducer, we'll use a CountWordsMapper class to run our mapper.

We can easily start the mapper with:

r3-map --redis-port=7778 --redis-pass=r3 --mapper-key=mapper-1 --mapper-class="test.count_words_mapper.CountWordsMapper"

The redis-port and redis-pass arguments require no further explanation.

The mapper-key argument specifies a unique key for this mapper. The key should stay the same when this mapper restarts.

The mapper-class is the class r³ will use to map input streams.

Let's see what this map class looks like. Since we are mapping lines (what we got out of the input stream step), we should return each word and how many times it occurs.

from r3.worker.mapper import Mapper

class CountWordsMapper(Mapper):
    job_type = 'count-words'

    def map(self, lines):
        return list(self.split_words(lines))

    def split_words(self, lines):
        for line in lines:
            for word in line.split():
                yield word, 1

The job_type property is required and specifies the relationship that this mapper has with a specific input stream and with a specific reducer.

Reducing

After all input streams have been mapped, it is time to reduce our data to one coherent value. This is what the reducer does.

In the case of counting word occurrences, a sample implementation is as follows:

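from collections import defaultdict

class CountWordsReducer:
    job_type = 'count-words'

    def reduce(self, app, items):
        # Missing words start at 0, so we can add to them directly.
        word_freq = defaultdict(int)
        # Each item is the list of (word, frequency) pairs from one mapped group.
        for line in items:
            for word, frequency in line:
                word_freq[word] += frequency

        return word_freq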

The job_type property is required and specifies the relationship that this reducer has with mappers and with a specific input stream.

This reducer will return a dictionary that contains all the words and the frequency with which they occur in the given file.
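
The reducer can be exercised on its own, without a running r³ service. In the sketch below, the shape of items (one list of (word, count) pairs per mapped group) is inferred from the mapper above, and passing None for app is an assumption that only works because this reducer never uses it.

```python
from collections import defaultdict


class CountWordsReducer:
    job_type = 'count-words'

    def reduce(self, app, items):
        word_freq = defaultdict(int)
        for line in items:
            for word, frequency in line:
                word_freq[word] += frequency
        return word_freq


# Hand-written stand-ins for what two mappers might have returned.
mapped_groups = [
    [("the", 1), ("cherry", 1), ("orchard", 1)],
    [("the", 1), ("seagull", 1)],
]

result = CountWordsReducer().reduce(None, mapped_groups)
print(dict(result))
# prints: {'the': 2, 'cherry': 1, 'orchard': 1, 'seagull': 1}
```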

Testing our Solution

To test the above solution, just clone r³'s repository and run the commands from the directory you just cloned.

Given that we have the above working, we should have r3-app running at http://localhost:9999. In order to access our count-words job we'll point our browser to:

http://localhost:9999/count-words

This should return a JSON document with the resulting occurrences of words in the sample document.
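
The same request can be made from a script. This sketch assumes that the response body is a JSON object mapping each word to its count, as the reducer above suggests; the exact response format is not confirmed in this document, and the helper names are hypothetical.

```python
import json
from urllib.request import urlopen


def top_words(word_counts, limit=10):
    """Return the `limit` most frequent (word, count) pairs, highest first."""
    ranked = sorted(word_counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:limit]


def fetch_word_counts(url="http://localhost:9999/count-words"):
    """Request the job result from r3-app and decode the JSON body."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


# Offline demonstration with a hand-written payload of the assumed shape.
sample_payload = '{"the": 12, "of": 7, "and": 7, "seagull": 2}'
print(top_words(json.loads(sample_payload), limit=3))
# prints: [('the', 12), ('and', 7), ('of', 7)]
```

With r3-app and at least one mapper running, top_words(fetch_word_counts()) would list the most frequent words in the sample document.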

Creating my own Reducers

As you have probably guessed, creating new jobs of mapping and reducing is as simple as implementing your own input stream processor, mapper and reducer.

After they are implemented, just include the processor and reducer in the config file and fire up as many mappers as you want.
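
Starting several mappers amounts to running r3-map several times with distinct keys. The sketch below builds the command lines using only the flags shown earlier in this document; the helper names and the mapper-N key scheme are assumptions.

```python
import subprocess


def build_mapper_commands(count, mapper_class, redis_port=7778, redis_pass="r3"):
    """Build one r3-map argument list per mapper, each with a unique key."""
    return [
        [
            "r3-map",
            "--redis-port=%d" % redis_port,
            "--redis-pass=%s" % redis_pass,
            "--mapper-key=mapper-%d" % number,
            "--mapper-class=%s" % mapper_class,
        ]
        for number in range(1, count + 1)
    ]


def start_mappers(commands):
    """Launch every command as its own process (requires r3 to be installed)."""
    return [subprocess.Popen(command) for command in commands]


commands = build_mapper_commands(3, "test.count_words_mapper.CountWordsMapper")
for command in commands:
    print(" ".join(command))
```

Calling start_mappers(commands) would then launch the three workers. Keeping the key scheme deterministic means a restarted mapper comes back with the same key.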

Monitoring r³

We talked about three available commands: r3-app, r3-map and r3-web.

The last one fires up a monitoring interface that helps you in understanding how your r³ farm is working.

Some screenshots of the monitoring application:

[Screenshot: r³ web monitoring interface]

Failed jobs monitoring:

[Screenshot: r³ web monitoring interface, failed jobs]

Stats:

[Screenshot: r³ web monitoring interface, stats]

License

r³ is licensed under the MIT License:

The MIT License

Copyright (c) 2012 Bernardo Heynemann

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
