• This repository has been archived on 16/Dec/2021
  • Language
    Java
  • License
    Apache License 2.0

ReAir

ReAir is a collection of easy-to-use tools for replicating tables and partitions between Hive data warehouses. These tools are targeted at developers that already have some familiarity with operating warehouses based on Hadoop and Hive.

Overview

The replication features in ReAir are useful for the following use cases:

  • Migration of a Hive warehouse
  • Incremental replication between two warehouses
  • Disaster recovery

When migrating a Hive warehouse, ReAir can be used to copy over existing data to the new warehouse. Because ReAir copies both data and metadata, datasets are ready to query as soon as the copy completes.

While many organizations start out with a single Hive warehouse, they often want better isolation between production and ad hoc workloads. Two isolated Hive warehouses accommodate this need well, and with two warehouses, there is a need to replicate evolving datasets. ReAir can be used to replicate data from one warehouse to another and propagate updates incrementally as they occur.

Lastly, ReAir can be used to replicate datasets to a hot-standby warehouse for fast failover in disaster recovery scenarios.

To accommodate these use cases, ReAir includes both batch and incremental replication tools. Batch replication executes a one-time copy of a list of tables. Incremental replication is a long-running process that copies objects as they are created or changed on the source warehouse.

Additional Documentation

Batch Replication

Prerequisites:

  • Hadoop (most versions should work, but tested with 2.5.0)
  • Hive (most versions should work, but tested with 0.13)

Run Batch Replication

  • Read through and fill out the configuration from the template.
  • Switch to the repo directory and build the JAR.
cd reair
./gradlew shadowjar -p main -x test
  • Create a local text file containing the tables that you want to copy. A row in the text file should consist of the DB name and the table name separated by a period. e.g.
my_db1.my_table1
my_db2.my_table2
  • Launch the job using the hadoop jar command on the destination, specifying the config file and the list of tables to copy. A larger heap for the client may be needed for large batches, so set HADOOP_HEAPSIZE appropriately. Also, depending on how the warehouse is set up, you may need to run the process as a different user (e.g. hive).
export HADOOP_OPTS="-Dlog4j.configuration=file://<path to log4j.properties>"
export HADOOP_HEAPSIZE=8096
sudo -u hive hadoop jar main/build/libs/airbnb-reair-main-1.0.0-all.jar com.airbnb.reair.batch.hive.MetastoreReplicationJob --config-files my_config_file.xml --table-list my_tables_to_copy.txt
  • Additional CLI options: --step, --override-input. These arguments are useful if you want to run one of the three MR jobs individually for faster failure recovery. --step indicates which step to run. --override-input provides the input path when running the second and third stage MR jobs. The input path will usually be the output of the first stage MR job.
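
Because a malformed row in the table list is easy to miss before a long batch run, a small pre-flight check can help. The following is a minimal sketch, not part of ReAir: the function name validate_table_list is hypothetical, and it assumes that database and table names contain only letters, digits, and underscores, and that the file has no blank lines.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check, not shipped with ReAir.
# Prints every line that is not of the form db_name.table_name (prefixed with
# its line number) and returns 1 if any such line exists, 0 otherwise.
validate_table_list() {
  local file="$1"
  if [ ! -r "$file" ]; then
    echo "cannot read table list: $file" >&2
    return 2
  fi
  # Assumption: names use only letters, digits, and underscores.
  # grep -v selects the lines that do NOT match; -n adds line numbers.
  if grep -nEv '^[A-Za-z0-9_]+\.[A-Za-z0-9_]+$' "$file"; then
    return 1
  fi
  return 0
}
```

For example, running validate_table_list my_tables_to_copy.txt before the hadoop jar command would list any offending rows so they can be fixed first.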

Incremental Replication

Prerequisites:

  • Hadoop (most versions should work, but tested with 2.5.0)
  • Hive (most versions should work, but tested with 0.13)

Audit Log Hook Setup

Incremental replication relies on recording changes in the source Hive warehouse to figure out what needs to be replicated. These changes can be recorded in two different ways. In the first method, the hook is added to the Hive CLI and runs after a query is successful. In the other method, the hook is added as a listener in the Hive remote metastore server. This method requires that you have the metastore server deployed and used by Hive, but it will work when systems other than Hive (e.g. Spark) make calls to the metastore server to create tables. The steps to deploy either hook are similar:

Build and deploy the JAR containing the audit log hook

  • Switch to the repository directory and build the JAR.
cd reair
./gradlew shadowjar -p hive-hooks -x test
  • Once built, the JAR for the audit log hook can be found under hive-hooks/build/libs/airbnb-reair-hive-hooks-1.0.0-all.jar.

  • Copy the JAR to the Hive auxiliary library path. The specifics of the path depend on your setup. Generally, the auxiliary library path can be configured using the configuration parameter hive.aux.jars.path. If you're deploying the hook for the CLI, you only have to deploy the JAR on the hosts where the CLI will be run. Likewise, if you're deploying the hook for the metastore server, you only have to deploy the JAR on the server host.

  • Create and set up the tables on MySQL required for the audit log. You can create the tables by running the create table commands in all of the .sql files here. If you're planning to use the same DB to store the tables for incremental replication, also run the create table commands here.

  • If you want to add the hook for the Hive CLI, change the configuration for the Hive CLI (in the source warehouse) to use the audit log hook: add the sections from the audit log configuration template to hive-site.xml, replacing the placeholder values with appropriate ones.

  • If you want to add the hook for the metastore server, change the configuration for the Hive metastore server (in the source warehouse) to use the hook: add the sections from the metastore audit log configuration template to hive-site.xml, replacing the placeholder values with appropriate ones.

  • Run a test query and verify that you see the appropriate rows in the audit_log and audit_objects tables.

Process Setup

  • If the MySQL tables for incremental replication were not set up while setting up the audit log, create the state tables for incremental replication on the desired MySQL instance by running the create table commands listed here.

  • Read through and fill out the configuration from the template. You might want to deploy the file to a widely accessible location.

  • Switch to the repo directory and build the JAR. You can skip the unit tests if no changes have been made (via the '-x test' flag).

cd reair
./gradlew shadowjar -p main -x test

Once the build finishes, the JAR to run the incremental replication process can be found under main/build/libs/airbnb-reair-main-1.0.0-all.jar

  • To start replicating, set options to point to the appropriate logging configuration and kick off the replication launcher by using the hadoop jar command on the destination cluster. An example log4j.properties file is provided here. Be sure to specify the configuration file that was filled out in the prior step. As with batch replication, you may need to run the process as a different user.
export HADOOP_OPTS="-Dlog4j.configuration=file://<path to log4j.properties>"
sudo -u hive hadoop jar airbnb-reair-main-1.0.0-all.jar com.airbnb.reair.incremental.deploy.ReplicationLauncher --config-files my_config_file.xml

If you use the recommended log4j.properties file that is shipped with the tool, messages with the INFO level will be printed to stderr, but more detailed logging messages with >= DEBUG logging level will be recorded to a log file in the current working directory.

When the incremental replication process is launched for the first time, it will start replicating entries after the highest numbered ID in the audit log. Because the process periodically checkpoints progress to the DB, it can be killed and will resume from where it left off when restarted. To override this behavior, please see the additional options section.

  • Verify that entries are replicated properly by creating a test table on the source warehouse and checking to see if it appears on the destination warehouse.

For production deployment, an external process should monitor the replication process and restart it if it exits. The replication process will exit if the number of consecutive failures while making RPCs or DB queries exceeds the configured number of retries.
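
As an illustration of such an external process, here is a minimal sketch of a restart loop. It is not part of ReAir: the function name supervise and its max_runs and delay_seconds parameters are hypothetical, and a real deployment might rely on an existing process supervisor instead.

```shell
#!/usr/bin/env bash
# Hypothetical supervisor loop; ReAir does not ship one.
# Usage: supervise <max_runs> <delay_seconds> <command> [args...]
# Re-runs the command each time it exits. max_runs=0 means no limit.
# Returns the exit status of the last run.
supervise() {
  local max_runs="$1" delay="$2"
  shift 2
  local runs=0 status=0
  while [ "$max_runs" -eq 0 ] || [ "$runs" -lt "$max_runs" ]; do
    status=0
    "$@" || status=$?
    runs=$((runs + 1))
    echo "$(date '+%Y-%m-%dT%H:%M:%S') run ${runs} exited with status ${status}" >&2
    # Stop without sleeping once the run limit is reached.
    if [ "$max_runs" -ne 0 ] && [ "$runs" -ge "$max_runs" ]; then
      break
    fi
    sleep "$delay"
  done
  return "$status"
}
```

With this sketch, the launcher could be kept running with, for example, supervise 0 30 sudo -u hive hadoop jar airbnb-reair-main-1.0.0-all.jar com.airbnb.reair.incremental.deploy.ReplicationLauncher --config-files my_config_file.xml. Because the process checkpoints its progress to the DB, each restart resumes from where the previous run left off.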

Additional CLI options:

To force the process to start replicating entries after a particular audit log ID, you can pass the --start-after-id parameter:

export HADOOP_OPTS="-Dlog4j.configuration=file://<path to log4j.properties>"
hadoop jar main/build/libs/airbnb-reair-main-1.0.0-all.jar com.airbnb.reair.incremental.deploy.ReplicationLauncher --config-files my_config_file.xml --start-after-id 123456

Replication entries that were started but not completed on the last invocation will be marked as aborted when you use --start-after-id to restart the process.

Monitoring / Web UI:

The incremental replication process starts a Thrift server that can be used to get metrics and view progress. The Thrift definition is provided here. A simple web server that displays progress has been included in the web-server module. To run the web server:

  • Switch to the repo directory and build the JARs. You can skip the unit tests if no changes have been made.
cd reair
./gradlew shadowjar -p web-server -x test
  • The JAR containing the web server can be found at
web-server/build/libs/airbnb-reair-web-server-1.0.0-all.jar
  • Start the web server, specifying the appropriate Thrift host and port where the incremental replication process is running.
java -jar airbnb-reair-web-server-1.0.0-all.jar --thrift-host localhost --thrift-port 9996 --http-port 8080
  • Point your browser to the appropriate URL (e.g. http://localhost:8080) to view the active and retired replication jobs.

Discussion Group

A discussion group is available here.

In the wild

If you find ReAir useful, please list yourself on this page!
