• This repository has been archived on 14/Feb/2018
  • Language
    Java
  • License
    Apache License 2.0
  • Created about 9 years ago
  • Updated over 7 years ago

Serving system for batch generated data sets

Note: This project is no longer actively maintained by Pinterest.


Terrapin: Serving system for Hadoop generated data

Terrapin is a low latency serving system providing random access over large data sets, generated by Hadoop jobs and stored on HDFS clusters.

Terrapin can ingest data from S3, HDFS or directly from a mapreduce job. Terrapin is elastic, fault tolerant and performant enough to be used for various web scale applications (such as serving personalized recommendations on a website). Terrapin exposes a key-value data model.

Terrapin achieves these goals by storing the output of mapreduce jobs on HDFS in a file format that allows fast random access. A Terrapin server process runs on every data node and serves the files stored on that data node. This design provides the scalability of HDFS and Hadoop while keeping latencies low, since the data is served from local disk. HDFS optimizations such as short-circuit local reads, the OS page cache and possibly mmap reduce tail latency by avoiding round trips over a TCP socket or the network for HDFS reads. A Terrapin controller is responsible for ensuring data locality.

If you already have an HDFS cluster running, very little needs to be done to set up Terrapin. If you are interested in the detailed design, check out DESIGN.md.

Key Features

  • Filesets: Data on Terrapin is namespaced in filesets. New data is loaded/reloaded into a fileset. A Terrapin cluster can host multiple filesets.
  • Live Swap and Multiple versions: New data is loaded into an existing fileset with a live swap, so there is no disruption of service. We also support keeping multiple versions for critical filesets. This allows a quick rollback in case of a bad data load. Old versions are garbage collected.
  • S3/HDFS/Hadoop/Hive: A Hadoop job can directly write data to Terrapin. Otherwise, the Hadoop job can write data to HDFS/S3 and it can be ingested by Terrapin in a subsequent step. Terrapin can also ingest tables on Hive and provide random access based on a certain column, marked as the key.
  • Easy to change number of output shards: It is easy to change the number of output shards across different versions of data loaded for the same fileset. This gives developers the flexibility of tuning their mapreduce job by tweaking the number of reducers.
  • Extensible serving/storage formats: It is possible to plug in other (more efficient) serving formats such as RocksDB .sst files. Currently, Terrapin uses HFiles as the serving file format.
  • Monitoring: Terrapin exports latency and value size quantiles as well as cluster health stats through an HTTP interface. The metrics are exported through Ostrich.
  • Speculative Execution: Terrapin comes with a client abstraction which can issue concurrent requests against two Terrapin clusters serving the same fileset and pick the one that is satisfied earlier. This functionality is useful for increased availability and lower latency.
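
The speculative execution idea in the last bullet can be illustrated with a small, self-contained sketch. This is not the Terrapin client API: the class `SpeculativeLookup`, the helper `firstSuccessful` and the simulated cluster calls are all hypothetical names, and the sketch uses Java 8 `CompletableFuture` purely for brevity. It sends the same lookup to two simulated clusters and returns whichever answer arrives first, failing only if every attempt fails.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names come from the Terrapin client.
final class SpeculativeLookup {

    // Completes with the first successful result, or fails once every attempt has failed.
    static <T> CompletableFuture<T> firstSuccessful(List<Supplier<CompletableFuture<T>>> attempts) {
        CompletableFuture<T> result = new CompletableFuture<>();
        if (attempts.isEmpty()) {
            result.completeExceptionally(new IllegalArgumentException("no attempts given"));
            return result;
        }
        AtomicInteger remaining = new AtomicInteger(attempts.size());
        for (Supplier<CompletableFuture<T>> attempt : attempts) {
            attempt.get().whenComplete((value, error) -> {
                if (error == null) {
                    result.complete(value); // the first success wins; later calls are no-ops
                } else if (remaining.decrementAndGet() == 0) {
                    result.completeExceptionally(error); // the last attempt has failed too
                }
            });
        }
        return result;
    }

    // Stand-in for a lookup against one cluster; the delay simulates network latency.
    static CompletableFuture<String> simulatedGet(String cluster, String key, long delayMillis) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            return cluster + ":" + key;
        });
    }

    public static void main(String[] args) {
        List<Supplier<CompletableFuture<String>>> attempts = Arrays.asList(
                () -> simulatedGet("primary", "user42", 200),
                () -> simulatedGet("secondary", "user42", 20));
        // Prints the answer from whichever simulated cluster responds first.
        System.out.println(firstSuccessful(attempts).join());
    }
}
```

Waiting for the first success rather than the first completion means a fast failure on one cluster does not mask a slower but valid answer from the other, which is where the availability benefit comes from.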

Getting Started

Java 7 is required to build Terrapin. Currently Terrapin supports Hadoop 2. To build, run the following commands from the root of the git repository (note that HBase compiled with Hadoop 2 is not available in the central Maven repo but is required for using HFiles).

git clone [terrapin-repo-url]
cd terrapin

# Install HBase 0.94 artifacts compiled against Hadoop 2.
mvn install:install-file \
  -Dfile=thirdparty/hbase-hadoop2-0.94.7.jar \
  -DgroupId=org.apache.hbase \
  -DartifactId=hbase-hadoop2 \
  -Dversion=0.94.7 \
  -Dpackaging=jar

# Building against default Hadoop version - 2.7.1
mvn package

# Building against custom Hadoop version you are using (if different from 2.7.1)
mvn [-Dhadoop.version=X -Dhadoop.client.version=X] package

To set up a Terrapin cluster, follow the instructions at SETUP.md.

Usage

Terrapin can ingest data written to S3/HDFS or it can directly ingest data from a MapReduce job.

Once you have your cluster up and running, you can find several usage examples at USAGE.md.

Tools

Terrapin has tools for querying filesets and for performing administrative operations such as deleting or rolling back a fileset and diagnosing cluster health.

Querying a Fileset

Run the following command from the root of your git repo.

java -cp client/target/*:client/target/lib/*        \
    -Dterrapin.config={properties_file}             \
    com.pinterest.terrapin.client.ClientTool {fileset} {key}

Deleting a Fileset

ssh ${CONTROLLER_HOST}
cd ${TERRAPIN_HOME_DIR}
scripts/terrapin-admin.sh deleteFileSet {PROPERTIES_FILE} {FILE_SET}

Note that the deletion of a Fileset is asynchronous. The Fileset is marked for deletion and is later garbage collected by the controller.

Rolling back a Fileset

ssh ${CONTROLLER_HOST}
cd ${TERRAPIN_HOME_DIR}
scripts/terrapin-admin.sh rollbackFileSet {PROPERTIES_FILE} {FILE_SET}

The tool will display the versions (as HDFS directories) that you can roll back to. Select the appropriate version and confirm. To use this functionality, the fileset must have been uploaded with multiple versions as described in USAGE.md.
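
As a rough mental model of multiple versions and rollback, the sketch below keeps a bounded history of version directories for one fileset and a pointer to the version being served. It is only an illustration under stated assumptions: `VersionedFileset`, its methods and the example paths are hypothetical, and this is not how the Terrapin controller is implemented.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical model of one fileset with a bounded version history.
final class VersionedFileset {
    private final int maxVersions;
    private final Deque<String> versions = new ArrayDeque<>(); // newest first
    private String serving;

    VersionedFileset(int maxVersions) {
        if (maxVersions < 1) {
            throw new IllegalArgumentException("maxVersions must be at least 1");
        }
        this.maxVersions = maxVersions;
    }

    // Loads a new version and swaps serving to it.
    // Returns the versions that fell out of the history and can be garbage collected.
    synchronized List<String> load(String hdfsDir) {
        versions.addFirst(hdfsDir);
        serving = hdfsDir;
        List<String> evicted = new ArrayList<>();
        while (versions.size() > maxVersions) {
            evicted.add(versions.removeLast());
        }
        return evicted;
    }

    // Versions that are still retained and can therefore be rolled back to.
    synchronized List<String> availableVersions() {
        return new ArrayList<>(versions);
    }

    // Points serving back at a retained version; unknown versions are rejected.
    synchronized void rollbackTo(String hdfsDir) {
        if (!versions.contains(hdfsDir)) {
            throw new IllegalArgumentException("version not retained: " + hdfsDir);
        }
        serving = hdfsDir;
    }

    synchronized String serving() {
        return serving;
    }

    public static void main(String[] args) {
        VersionedFileset fileset = new VersionedFileset(2);
        fileset.load("/terrapin/data/demo/v1");
        fileset.load("/terrapin/data/demo/v2");
        System.out.println("evicted: " + fileset.load("/terrapin/data/demo/v3"));
        fileset.rollbackTo("/terrapin/data/demo/v2");
        System.out.println("serving: " + fileset.serving());
    }
}
```

The model also shows why rollback requires the fileset to have been uploaded with multiple versions: with a history of one, there is nothing older left to point back to.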

Checking health of a cluster

ssh ${CONTROLLER_HOST}
cd ${TERRAPIN_HOME_DIR}
scripts/terrapin-admin.sh {PROPERTIES_FILE} checkClusterHealth

The tool will display any inconsistencies in ZooKeeper state or any filesets not serving properly.

Monitoring/Diagnostics

You can access the controller's web UI at http://{controller_host}:50030/status. The port can be modified by setting the "status_server_binding_port" property. It shows all the filesets on the cluster and their serving health. You can click a fileset to get more information about it (such as the current serving version and partition assignment).

You can also retrieve detailed metrics by running curl localhost:9999/stats.txt on the controller or the server. These metrics are exported using Twitter's Ostrich library and are easy to parse. The port can be modified by setting the "ostrich_metrics_port" property. The controller will export serving health across the whole cluster (percentage of online shards for each fileset) amongst other useful metrics. The server will export latency and value size percentiles for each fileset.
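
To illustrate what parsing such a text dump might look like, here is a minimal sketch. It assumes the output consists of unindented section headers (for example `counters:` or `gauges:`) followed by indented `name: value` lines; that layout, the class `StatsTextParser` and the metric names in the sample are assumptions for illustration, not Terrapin's actual output. Non-numeric values, such as distribution summaries, are skipped. Fetching the text over HTTP is left to the caller so the example runs offline.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for a stats text dump; the input layout is an assumption.
final class StatsTextParser {

    // Maps "section/name" to its numeric value, skipping entries that are not plain numbers.
    static Map<String, Double> parse(String text) {
        Map<String, Double> values = new LinkedHashMap<>();
        String section = "";
        for (String line : text.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            boolean indented = Character.isWhitespace(line.charAt(0));
            if (!indented && trimmed.endsWith(":")) {
                section = trimmed.substring(0, trimmed.length() - 1); // new section header
                continue;
            }
            int separator = trimmed.lastIndexOf(": ");
            if (separator < 0) {
                continue; // not a "name: value" line
            }
            String name = trimmed.substring(0, separator);
            String rawValue = trimmed.substring(separator + 2).trim();
            try {
                values.put(section + "/" + name, Double.parseDouble(rawValue));
            } catch (NumberFormatException e) {
                // Non-numeric value (for example a distribution summary); ignored in this sketch.
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Sample input with made-up metric names.
        String sample = "counters:\n"
                + "  demo_fileset-lookups: 1200\n"
                + "gauges:\n"
                + "  demo_fileset-online-pct: 100.0\n"
                + "metrics:\n"
                + "  demo_fileset-latency: (average=3, count=1200, p99=12)\n";
        for (Map.Entry<String, Double> entry : parse(sample).entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```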

Maintainers

Help

If you have any questions or comments, you can reach us at [email protected]
