• Stars
    star
    692
  • Rank 65,341 (Top 2 %)
  • Language
    Go
  • License
    MIT License
  • Created over 6 years ago
  • Updated about 3 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

๐ŸŽ A serverless MapReduce framework written for AWS Lambda

๐ŸŽ corral

Serverless MapReduce

Build Status Go Report Card codecov GoDoc

Corral is a MapReduce framework designed to be deployed to serverless platforms, like AWS Lambda. It presents a lightweight alternative to Hadoop MapReduce. Much of the design philosophy was inspired by Yelp's mrjob -- corral retains mrjob's ease-of-use while gaining the type safety and speed of Go.

Corral's runtime model consists of stateless, transient executors controlled by a central driver. Currently, the best environment for deployment is AWS Lambda, but corral is modular enough that support for other serverless platforms can be added as support for Go in cloud functions improves.

Corral is best suited for data-intensive but computationally inexpensive tasks, such as ETL jobs.

More details about corral's internals can be found in this blog post.

Contents:

Examples

Every good MapReduce framework needs a WordCountโ„ข example. Here's how to write a "word count" in corral:

type wordCount struct{}

func (w wordCount) Map(key, value string, emitter corral.Emitter) {
	for _, word := range strings.Fields(value) {
		emitter.Emit(word, "")
	}
}

func (w wordCount) Reduce(key string, values corral.ValueIterator, emitter corral.Emitter) {
	count := 0
	for range values.Iter() {
		count++
	}
	emitter.Emit(key, strconv.Itoa(count))
}

func main() {
	wc := wordCount{}
	job := corral.NewJob(wc, wc)

	driver := corral.NewDriver(job)
	driver.Main()
}

This can be invoked locally by building/running the above source and adding input files as arguments:

go run word_count.go /path/to/some_file.txt

By default, job output will be stored relative to the current directory.

We can also input/output to S3 by pointing to an S3 bucket/files for input/output:

go run word_count.go --out s3://my-output-bucket/ s3://my-input-bucket/*

More comprehensive examples can be found in the examples folder.

Deploying in Lambda

No formal deployment step needs run to deploy a corral application to Lambda. Instead, add the --lambda flag to an invocation of a corral app, and the project code will be automatically recompiled for Lambda and uploaded.

For example,

./word_count --lambda s3://my-input-bucket/* --out s3://my-output-bucket

Note that you must use s3 for input/output directories, as local data files will not be present in the Lambda environment.

NOTE: Due to the fact that corral recompiles application code to target Lambda, invocation of the command with the --lambda flag must be done in the root directory of your application's source code.

AWS Credentials

AWS credentials are automatically loaded from the environment. See this page for details.

As per the AWS documentation, AWS credentials are loaded in order from:

  1. Environment variables
  2. Shared credentials file
  3. IAM role (if executing in AWS Lambda or EC2)

In short, setup credentials in .aws/credentials as one would with any other AWS powered service. If you have more than one profile in .aws/credentials, make sure to set the AWS_PROFILE environment variable to select the profile to be used.

Configuration

There are a number of ways to specify configuraiton for corral applications. To hard-code configuration, there are a variety of Options that may be used when instantiating a Job.

Configuration values are used in the order, with priority given to whichever location is set first:

  1. Hard-coded job Options.
  2. Command line flags
  3. Environment variables
  4. Configuration file
  5. Default values

Configuration Settings

Below are the config settings that may be changed.

Framework Settings

  • splitSize (int64) - The maximum size (in bytes) of any single file input split. (Default: 100Mb)
  • mapBinSize (int64) - The maximum size (in bytes) of the combined input size to a mapper. (Default: 512Mb)
  • reduceBinSize (int64) - The maximum size (in bytes) of the combined input size to a reducer. This is an "expected" maximum, assuming uniform key distribution. (Default: 512Mb)
  • maxConcurrency (int) - The maximum number of executors (local, Lambda, or otherwise) that may run concurrently. (Default: 100)
  • workingLocation (string) - The location (local or S3) to use for writing intermediate and output data.
  • verbose (bool) - Enables debug logging if set to true

Lambda Settings

  • lambdaFunctionName (string) - The name to use for created Lambda functions. (Default: corral_function)
  • lambdaManageRole (bool) - Whether corral should manage creating an IAM role for Lambda execution. (Default: true)
  • lambdaRoleARN (string) - If lambdaManageRole is disabled, the ARN specified in lambdaRoleARN is used as the Lambda function's executor role.
  • lambdaTimeout (int64) - The timeout (maximum function duration) in seconds of created Lambda functions. See AWS lambda docs for details. (Default: 180)
  • lambdaMemory (int64) - The maximum memory that a Lambda function may use. See AWS lambda docs for details. (Default: 1500)

Command Line Flags

The following flags are available at runtime as command-line flags:

      --lambda            Use lambda backend
      --memprofile file   Write memory profile to file
  -o, --out directory     Output directory (can be local or in S3)
      --undeploy          Undeploy the Lambda function and IAM permissions without running the driver
  -v, --verbose           Output verbose logs

Environment Variables

Corral leverages Viper for specifying config. Any of the above configuration settings can be set as environment variables by upper-casing the setting name, and prepending CORRAL_.

For example, lambdaFunctionName can be configured using an env var by setting CORRAL_LAMBDAFUNCTIONNAME.

Config Files

Corral will read settings from a file called corralrc. Corral checks to see if this file exists in the current directory (.). It can also read global settings from $HOME/.corral/corralrc.

Reference the "Configuration Settings" section for the configuration keys that may be used.

Config files can be in JSON, YAML, or TOML format. See Viper for more details.

Architecture

Below is a high-level diagram describing the MapReduce architecture corral uses.

Input Files / Splits

Input files are split byte-wise into contiguous chunks of maximum size splitSize. These splits are packed into "input bins" of maximum size mapBinSize. The bin packing algorithm tries to assign contiguous chunks of a single file to the same mapper, but this behavior is not guaranteed.

There is a one-to-one correspondance between an "input bin" and the data that a mapper reads. i.e. Each mapper is assigned to process exactly 1 input bin. For jobs that run on Lambda, you should tune mapBinSize, splitSize, and lambdaTimeout accordingly so that mappers are able to process their entire input before timing out.

Input data is stramed into the mapper, so the entire input data needn't fit in memory.

Mappers

Input data is fed into the map function line-by-line. Input splits are calculated byte-wise, but this is rectified during the Map phase into a logical split "by line" (to prevent partial reads, or the loss of records that span input splits).

Mappers may maintain state if desired (though not encouraged).

Partition / Shuffle

Key/value pairs emitted during the map stage are written to intermediate files. Keys are partitioned into one N buckets, where N is the number of reducers. As a result, each mapper may write to as many as N separate files.

This results in a set of files labeled map-binX-Y where X is a number between 0 and N-1, and Y is the mapper's ID (a number between 0 and the number of mappers).

Reducers / Output

Currently, reducer input must be able to fit in memory. This is because keys are only partitioned, not sorted. The reducer performs an in-memory per-key partition.

Reducers receive per-key values in an arbitrary order. It is guaranteed that all values for a given key will be provided in a single call to Reduce by-key.

Values emitted from a reducer will be stored in tab separated format (i.e. KEY\tVALUE) in files labeled output-X where X is the reducer's ID (a number between 0 and the number of reducers).

Reducers may maintain state if desired (though not encouraged).

Contributing

Contributions to corral are more than welcomed! In general, the preference is to discuss potential changes in the issues before changes are made.

More information is included in the CONTRIBUTING.md

Running Tests

To run tests, run the following command in the root project directory:

go test ./...

Note that some tests (i.e. the tests of corfs) require AWS credentials to be present.

The main corral has TravisCI setup. If you fork this repo, you can enable TravisCI on your fork. You will need to set the following environment variables for all the tests to work:

  • AWS_ACCESS_KEY_ID: Credentials access key
  • AWS_SECRET_ACCESS_KEY: Credentials secret key
  • AWS_DEFAULT_REGION: Region to use for S3 tests
  • AWS_TEST_BUCKET: The S3 bucket to use for tests (just the name; i.e. testBucket instead of s3://testBucket)

License

This project is licensed under the MIT License - see the LICENSE file for details

Previous Work / Attributions

  • lambda-refarch-mapreduce - Python/Node.JS reference MapReduce Architecture
    • Uses a "recursive" style reducer instead of parallel reducers
    • Requires that all reducer output can fit in memory of a single lambda function
  • mrjob
    • Excellent Python library for writing MapReduce jobs for Hadoop, EMR/Dataproc, and others
  • dmrgo
    • mrjob-inspired Go MapReduce library
  • Zappa
    • Serverless Python toolkit. Inspired much of the way that corral does automatic Lambda deployment
  • Logo: Fence by Vitaliy Gorbachev from the Noun Project

More Repositories

1

awesome-lightning-network

โšก A curated list of awesome Lightning Network projects for developers and crypto enthusiasts
980
star
2

git-trophy

๐Ÿ† Create a 3D Printed Model of Your Github Contributions
JavaScript
87
star
3

bpy_lambda

๐ŸŽฅ A compiled binary of Blender-as-a-Python-Module (bpy) for use in AWS Lambda
Dockerfile
66
star
4

python-emojipedia

๐Ÿ˜Ž Emoji data from Emojipedia ๐ŸŽ‰
Python
44
star
5

rssfilter

๐Ÿ” Web service for filtering RSS articles
TypeScript
39
star
6

gdq-stats

๐Ÿ‘พ Live Data Visualizations for GamesDoneQuick Streams
HTML
31
star
7

ep

โ› A CLI Emoji Picker
Go
25
star
8

google-sheets-shortcut

Append to a Google sheet directly from iOS shortcuts
JavaScript
25
star
9

bolero

๐Ÿ’ƒ Construct your personal API
Python
18
star
10

colly-example

An example web crawler written with Colly and Goquery
Go
18
star
11

todoist-to-sqlite

Export your Todoist data to SQLite
Python
16
star
12

PyManifold

Python API client for Manifold Markets
Python
14
star
13

jupyterhub-sqlauthenticator

๐Ÿ” Authenticate Jupyterhub with a MySQL user DB
Python
14
star
14

gdq-collector

๐Ÿ“ฅ Data Collection Utilities for GamesDoneQuick
Python
11
star
15

EachDay

๐Ÿ“ A tiny micro-journaling service
JavaScript
10
star
16

git_lambda

๐Ÿ™ Git binary package for the Python AWS Lambda runtime
Python
9
star
17

PoliticalHeatMap-SIGIR

๐ŸŒ Data visualization of current Twitter sentiment towards 2016 US Presidential Candidates
Python
9
star
18

wayback-archiver

๐Ÿ—„ CLI archival tool for the Wayback Machine
Rust
8
star
19

generative-doodles

Experiments with p5.js and other graphics libraries
Go
7
star
20

github_contributions

:octocat: A Python interface for Github's contribution system
Python
7
star
21

Watch-Later-Tweaks

๐Ÿ•“ Chrome Extension for Adding UI Tweaks to Youtube's Watch Later System
JavaScript
6
star
22

gitfolio

:octocat: A Github client written in React Native
JavaScript
6
star
23

miniflux-substack-filter

Filter paywalled substack posts from miniflux
Go
6
star
24

aqi-esphome

5
star
25

AnyoneHere

๐Ÿ‘€ See who's home with a simple Flask API
JavaScript
4
star
26

TubeSync

๐Ÿ“บ A small Menu Bar OSX application to sync Youtube Playlists to your Mac for offline viewing
Swift
4
star
27

instapaper-to-sqlite

๐Ÿ“‘ Export your Instapaper bookmarks to SQLite
Python
4
star
28

leetcode

Leetcode Solutions
Python
3
star
29

overcast-to-sqlite

๐ŸŽง Download your Overcast listening history to sqlite
Rust
3
star
30

kindle-snippets

A small in-browser Kindle Clippings viewer
TypeScript
3
star
31

fn

Golang library for generating (mostly) unique date-sortable filenames
Go
3
star
32

agdq-2017-schedule-analysis

๐Ÿ“Š Visualizations of the AGDQ 2017 Marathon Schedule
Jupyter Notebook
2
star
33

travis-build-repeat

๐Ÿ‘ท Repeat TravisCI builds to avoid stale test results
Python
2
star
34

trek-limerick-bot

๐Ÿ–– Tweeting limericks from Star Trek dialogue
JavaScript
2
star
35

dashing-wunderlist-stats

โ˜‘๏ธ Wunderlist Stats in your Dashing Dashboard
Ruby
2
star
36

cs410-project

Python
2
star
37

instapaper-to-mdlog

Small CLI tool to keep a running log of archived Instapaper articles
Go
2
star
38

optimal_balancer

โš–๏ธ A simple tool for calculating the optimal number of shares to buy to maintain a proportional portfolio
Rust
2
star
39

beeminder-to-sqlite

Export your Beeminder data to SQLite
Python
2
star
40

advent_of_code_2017

๐ŸŽ… My solutions to the 3rd "Advent of Code"
Rust
2
star
41

many-subpackages-blueprint-example

1
star
42

advent_of_code_2019

๐ŸŽ… My solutions to the 2019 "Advent of Code"
Rust
1
star
43

feedly-to-sqlite

Export your Feedly data to SQLite
Python
1
star
44

f5tools

Automation and Configuration Orchestration for F5 Load Balancers
Ruby
1
star
45

advent_of_code_2020

๐ŸŽ… My solutions to the 2020 "Advent of Code"
Rust
1
star
46

HackerRank

๐Ÿ’ป My HackerRank solutions
Python
1
star
47

synacor_rs

โš™ Implementation of the Synacor Challenge virtual machine in Rust
Rust
1
star
48

sgdq-2016-schedule-analysis

๐Ÿ“Š Visualizations of the SGDQ 2016 Marathon
Jupyter Notebook
1
star
49

getting-started-with-golang-cloud-functions

Examples for using Cloud Functions w/ Golang
Go
1
star
50

advent_of_code_2021

About ๐ŸŽ… My solutions to the 2021 "Advent of Code"
Rust
1
star
51

Data-Science-Projects

๐Ÿ“ˆ Miscellaneous endeavours into Data Science
Jupyter Notebook
1
star
52

advent_of_code_2016

๐ŸŽ… My solutions to the second "Advent of Code"
Python
1
star
53

pyvethirtyeight

๐Ÿ‡บ๐Ÿ‡ธ Python wrapper for Fivethirtyeight's Election Forecasts
Python
1
star
54

CrowdShout

โ— A realtime tool for Twitch streamers to gauge the mood of their chat.
Python
1
star
55

archillect-frontend

A mini archillect.com friendend hosted at Glitch
JavaScript
1
star
56

youtube-patreon-finder

[WIP]
Go
1
star
57

bitbar-plugins

My bitbar plugins
Python
1
star
58

tampermonkey

๐Ÿ’ Tampermonkey Scripts
JavaScript
1
star
59

wordscapes

๐Ÿงฉ Experimental solvers for "Wordscape" puzzles
Rust
1
star
60

WunderSchedule

๐Ÿ•’ Enhanced Wunderlist task scheduling for power users
JavaScript
1
star
61

emoji-ordering

๐Ÿ“ถ Utilities for ordering emojis according to the official Unicode emoji sort order
Go
1
star
62

advent_of_code_2015

๐ŸŽ… My solutions to the first "Advent of Code"
Python
1
star