  • Language: Clojure
  • License: Apache License 2.0

Repository Details

Map-Reduce for Clojure

PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig or Cascading but you don't need to know much about either of them to use it.

Getting Started, Tutorials & Documentation

Getting started with Clojure and PigPen is really easy.

Note: It is strongly recommended to familiarize yourself with Clojure before using PigPen.

Note: PigPen is not a Clojure wrapper for writing Pig scripts you can hand edit. While hand editing is entirely possible, the generated scripts are not intended for human consumption.

Questions & Complaints

Artifacts

PigPen is available from Maven:

With Leiningen:

;; core library
[com.netflix.pigpen/pigpen "0.3.3"]

;; pig support
[com.netflix.pigpen/pigpen-pig "0.3.3"]

;; cascading support
[com.netflix.pigpen/pigpen-cascading "0.3.3"]

;; rx support
[com.netflix.pigpen/pigpen-rx "0.3.3"]

The platform libraries all reference the core library, so you only need to reference the platform-specific library that you require; the core library is included transitively.
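
As an illustration of the transitive dependency, a Leiningen project that targets Pig might list only the Pig platform library. This is a minimal sketch: the project name `my-pigpen-job`, its version, and the Clojure version shown are placeholders chosen for the example, not values prescribed by PigPen.

```clojure
;; project.clj (hypothetical project name and version)
(defproject my-pigpen-job "0.1.0-SNAPSHOT"
  :dependencies [;; Clojure itself is supplied by the project
                 [org.clojure/clojure "1.6.0"]
                 ;; the platform library; the core library
                 ;; com.netflix.pigpen/pigpen should come in transitively
                 [com.netflix.pigpen/pigpen-pig "0.3.3"]])
```

Running `lein deps :tree` in such a project is one way to confirm which version of the core library was resolved.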

Note: PigPen requires Clojure 1.5.1 or greater

Parquet

To use the parquet loader, add this to your dependencies:

[com.netflix.pigpen/pigpen-parquet-pig "0.3.3"]

Here is an example of how to write Parquet data.

(require '[pigpen.core :as pig])
(require '[pigpen.parquet :as pqt])

;;
;; assuming that `data` is in tuples
;;
;; [["John" "Smith" 28]
;;  ["Jane" "Doe"   21]]

(defn save-to-parquet
  [output-file data]
  (->> data
       ;; turning tuples into a map
       (pig/map (partial zipmap [:firstname :lastname :age]))
       ;; then storing to Parquet files
       (pqt/store-parquet
        output-file
        (pqt/message "test-schema"
                     ;; the field names here MUST match the map's keys
                     (pqt/binary "firstname")
                     (pqt/binary "lastname")
                     (pqt/int64  "age")))))
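
The tuple-to-map step above is ordinary Clojure, so it can be tried at a REPL without PigPen. The sketch below applies the same `(partial zipmap ...)` function to the sample tuples, with `clojure.core/map` standing in for `pig/map`; the names `tuple->record` and `records` are illustrative only and do not come from PigPen.

```clojure
;; The same function that save-to-parquet passes to pig/map.
(def tuple->record
  (partial zipmap [:firstname :lastname :age]))

;; clojure.core/map stands in for pig/map here (an assumption made for
;; illustration; no PigPen or Parquet code is involved).
(def records
  (map tuple->record
       [["John" "Smith" 28]
        ["Jane" "Doe"   21]]))

;; (first records) is equal to
;; {:firstname "John" :lastname "Smith" :age 28}
(prn (first records))
```

The keywords produced here correspond to the "firstname", "lastname", and "age" fields declared in the schema, which is why the field names in the schema must match the map's keys.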

And how to load the records back:

(defn load-from-parquet
  [input-file]
  ;; the output will be a sequence of maps
  (pqt/load-parquet
   input-file
   (pqt/message "test-schema"
                (pqt/binary "firstname")
                (pqt/binary "lastname")
                (pqt/int64  "age"))))

See the pigpen.parquet namespace for further usage.

Note: Parquet is currently only supported by Pig

Avro

To use the avro loader (alpha), add this to your dependencies:

[com.netflix.pigpen/pigpen-avro-pig "0.3.3"]

See the pigpen.avro namespace for usage.

Note: Avro is currently only supported by Pig

Release Notes

  • 0.3.3 - 5/19/16

    • Explicitly disable *print-length* and *print-level* when generating scripts
    • Add a better error message for storage types that expect a map with keywords
  • 0.3.2 - 1/12/16

    • Allow more types in generated pig scripts
  • 0.3.1 - 10/19/15

    • Update cascading version to 2.7.0
    • Report correct pigpen version to concurrent
    • Update nippy to 2.10.0 & tune performance
  • 0.3.0 - 5/18/15

    • No changes
  • 0.3.0-rc.7 - 4/29/15

    • Fixed bug in local mode where nils weren't handled consistently
  • 0.3.0-rc.6 - 4/14/15

    • Add local mode code eval memoization to avoid thrashing permgen
    • Added pigpen.pig/set-options command to explicitly set pig options in a script. This was previously available (though undocumented) by setting {:pig-options {...}} in any options block, but is now official.
  • 0.3.0-rc.5 - 4/9/15

    • Update core.async version
  • 0.3.0-rc.4 - 4/8/15

    • Memoize code evaluation when run in the cluster
  • 0.3.0-rc.3 - 4/2/15

    • Bugfixes
  • 0.3.0-rc.2 - 3/30/15

    • Parquet refactor. Local parquet loading no longer depends on Pig. Parquet schemas are now defined using Parquet classes.
  • 0.3.0-rc.1 - 3/23/15

    • Added Cascading support
    • Added pigpen.core/keys-fn, a new convenience macro to support named anonymous functions. Like keys destructuring, but less verbose.
    • New function based operators to build more dynamic scripts. These are function versions of all the core pigpen macros, but you have to handle quoting user code manually. These were previously available, but not officially supported. Now they're alpha, but supported and documented. See pigpen.core.fn
    • New lower-level operators to build custom storage and commands. These were previously available, but not officially supported. Now they're alpha, but supported and documented. See pigpen.core.op
    • *** Breaking Changes ***
      • pigpen.core/script is now pigpen.core/store-many
      • pigpen.core/generate-script is now pigpen.pig/generate-script
      • pigpen.core/write-script is now pigpen.pig/write-script
      • pigpen.core/show is now pigpen.viz/show (requires dependency [com.netflix.pigpen/pigpen-viz "..."])
      • pig/dump has changed. The old version was based on rx-java, and still exists as pigpen.rx/dump. The replacement for pigpen.core/dump is now entirely Clojure based. The Clojure version is better for unit tests and small data. All stages are evaluated eagerly, so the stack traces are simpler to read. The rx version is lazy, including the load-* commands. This means that you can load a large file, take a few rows, and process them without loading the entire file into memory. The downside is confusing stack traces and extra dependencies. See here for more details.
      • The interface for building custom loaders and storage has changed. See here for more details. Please email [email protected] with any questions.
  • 0.2.15 - 2/20/15

    • Include sources in jars
  • 0.2.14 - 2/18/15

    • Avro updates
  • 0.2.13 - 1/19/15

    • Added load-avro in the pigpen-avro project: http://avro.apache.org/
    • Fixed the nREPL configuration; use gradlew nRepl to start an nREPL
    • Exclude nested relations from closures
  • 0.2.12 - 12/16/14

    • Added load-csv, which allows for quoting per RFC 4180
  • 0.2.11 - 10/24/14

    • Fixed a bug (feature?) introduced by new rx version. Also upgraded to rc7. This would have only affected local mode where the data being read was faster than the code consuming it.
  • 0.2.10 - 9/21/14

    • Removed load-pig and store-pig. The pig data format is very bad and should not be used. If you used these and want them back, email [email protected] and we'll put it into a separate jar. The jars required for this feature were causing conflicts elsewhere.
    • Upgraded the following dependencies:
      • org.clojure/clojure 1.5.1 -> 1.6.0 - this was also changed to a provided dependency, so you should be able to use any version greater than 1.5.1
      • org.clojure/data.json 0.2.2 -> 0.2.5
      • com.taoensso/nippy 2.6.0-RC1 -> 2.6.3
      • clj-time 0.5.0 - no longer needed
      • joda-time 2.2 -> 2.4 - pig needs this to run locally
      • instaparse 1.2.14 - no longer needed
      • io.reactivex/rxjava 0.9.2 -> 1.0.0-rc.1
    • Fixed the rx limit bug. pigpen.local/*max-load-records* is no longer required.
  • 0.2.9 - 9/16/14

    • Fix a local-mode bug in pigpen.fold/avg where some collections would produce an NPE.
    • Change fake pig delimiter to \n instead of \0. Allows for \0 to exist in input data.
    • Remove 1000 record limit for local-mode. This was originally introduced to mitigate an rx bug. Until #61 is fixed, bind pigpen.local/*max-load-records* to the maximum number of records you want to read locally when reading large files. This now defaults to nil (no limit).
    • Fix a local dispatch bug that would prevent loading folders locally
  • 0.2.8 - 7/31/14

    • Fix a bug in load-tsv and load-lazy
  • 0.2.7 - 7/31/14 *** Don't use ***

  • 0.2.6 - 6/17/14

    • Minor optimization for local mode. The creation of a UDF was occurring for every value processed, causing it to run out of perm-gen space when processing large collections locally.
    • Fix (pig/return [])
    • Fix (pig/dump (pig/reduce + (pig/return [])))
    • Fix Longs in scripts that are larger than an Integer
    • Memoize local UDF instances per use of pig/dump
    • The jar location in the generated script is now configurable. Use the :pigpen-jar-location option with pig/generate-script or pig/write-script.
  • 0.2.5 - 4/9/14

    • Remove dump&show and dump&show+ in favor of pigpen.oven/bake. Call bake once and pass the result to as many outputs as you want. This is a breaking change, but I didn't increment the version because dump&show was just a tool to be used in the REPL. No scripts should break because of this change.
    • Remove dump-async. It appeared to be broken and was a bad idea from the start.
    • Fix self-joins. This was a rare issue as a self join (with the same key) just duplicates data in a very expensive way.
    • Clean up functional tests
    • Fix pigpen.oven/clean. When it was pruning the graph, it was also removing REGISTER commands.
  • 0.2.4 - 4/2/14

    • Fix arity checking bug (affected varargs fns)
    • Fix cases where an Algebraic fold function was falling back to the Accumulator interface, which was not supported. This affected using cogroup with fold over multiple relations.
    • Fix debug mode (broken in 0.1.5)
    • Change UDF initialization to not rely on memoization (caused stale data in REPL)
    • Enable AOT. Improves cluster perf
    • Add :partition-by option to distinct
  • 0.2.3 - 3/27/14

    • Added load-json, store-json, load-string, store-string
    • Added filter-by, and remove-by
  • 0.2.2 - 3/25/14

    • Fixed bug in pigpen.fold/vec. This would also cause fold/map and fold/filter to not work when run in the cluster.
  • 0.2.1 - 3/24/14

    • Fixed bug when using for to generate scripts
    • Fixed local mode bug with map followed by reduce or fold
  • 0.2.0 - 3/3/14

    • Added pigpen.fold - Note: this includes a breaking change in the join and cogroup syntax as follows:
    ; before
    (pig/join (foo on :f)
              (bar on :b optional)
              (fn [f b] ...))
    
    ; after
    (pig/join [(foo :on :f)
               (bar :on :b :type :optional)]
              (fn [f b] ...))

    Each of the select clauses must now be wrapped in a vector - there is no longer a varargs overload to either of these forms. Within each of the select clauses, :on is now a keyword instead of a symbol, but a symbol will still work if used. If optional or required were used, they must be updated to :type :optional and :type :required, respectively.

  • 0.1.5 - 2/17/14

    • Performance improvements
      • Implemented Pig's Accumulator interface
      • Tuned nippy
      • Reduced number of times data is serialized
  • 0.1.4 - 1/31/14

    • Fix sort bug in local mode
  • 0.1.3 - 1/30/14

    • Change Pig & Hadoop to be transitive dependencies
    • Add support for consuming user code via closure
  • 0.1.2 - 1/3/14

    • Upgrade instaparse to 1.2.14
  • 0.1.1 - 1/3/14

    • Initial Release
