  • Stars 1,915
  • Rank 23,208 (Top 0.5 %)
  • Language Java
  • License Apache License 2.0
  • Created about 11 years ago
  • Updated 7 months ago


Repository Details

🐘 Elasticsearch real-time search and analytics natively integrated with Hadoop

Elasticsearch Hadoop

Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Apache Hive, Apache Pig, Apache Spark and Apache Storm.

See project page and documentation for detailed information.

Requirements

An Elasticsearch cluster (1.x or higher; 2.x highly recommended) accessible through REST. That's it! Significant effort has been invested to create a small, self-contained jar that can be downloaded and put to use without any dependencies. Simply make it available to your job classpath and you're set. For library-specific requirements, see the dedicated chapter.
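Since the only requirement is a cluster reachable over REST, a quick connectivity check before submitting a job can save a failed run. The following is a minimal sketch using only the JDK; the EsRestCheck class and its methods are hypothetical helpers (not part of ES-Hadoop), and the check assumes plain HTTP without authentication, which may not match a secured cluster:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper for illustration; not part of ES-Hadoop.
class EsRestCheck {

    // Builds the root REST URL for a node, e.g. http://localhost:9200/
    public static String rootUrl(String host, int port) {
        return "http://" + host + ":" + port + "/";
    }

    // Returns true if the node answers an HTTP GET on its root endpoint with 200.
    // Assumption: plain HTTP and no authentication.
    public static boolean isReachable(String host, int port) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(rootUrl(host, port)).toURL().openConnection();
            conn.setConnectTimeout(2000);
            conn.setReadTimeout(2000);
            conn.setRequestMethod("GET");
            int status = conn.getResponseCode();
            conn.disconnect();
            return status == 200;
        } catch (IOException e) {
            // Connection refused, timeout, malformed URL, etc.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(rootUrl("localhost", 9200)
            + " reachable: " + isReachable("localhost", 9200));
    }
}
```

The host and port used in main are the documented defaults for es.nodes and es.port; substitute the values of your own cluster.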

ES-Hadoop 6.x and higher are compatible with Elasticsearch 1.X, 2.X, 5.X, and 6.X

ES-Hadoop 5.x and higher are compatible with Elasticsearch 1.X, 2.X and 5.X

ES-Hadoop 2.2.x and higher are compatible with Elasticsearch 1.X and 2.X

ES-Hadoop 2.0.x and 2.1.x are compatible with Elasticsearch 1.X only
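The compatibility list above can be read as a small lookup table. The sketch below encodes exactly those four rules, for illustration only: the EsHadoopCompat class and its methods are hypothetical and not part of ES-Hadoop, and since the list says nothing about Elasticsearch 7.x or later, the sketch does not either.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of ES-Hadoop.
class EsHadoopCompat {

    // Elasticsearch major versions listed as compatible with the given
    // ES-Hadoop major.minor version, per the list in this README.
    public static Set<Integer> supportedEsMajors(int major, int minor) {
        if (major >= 6) return majors(1, 2, 5, 6);
        if (major == 5) return majors(1, 2, 5);
        if (major > 2 || (major == 2 && minor >= 2)) return majors(1, 2);
        if (major == 2) return majors(1);          // 2.0.x and 2.1.x
        return Collections.emptySet();             // not covered by the list
    }

    public static boolean isCompatible(int major, int minor, int esMajor) {
        return supportedEsMajors(major, minor).contains(esMajor);
    }

    private static Set<Integer> majors(Integer... values) {
        return new HashSet<>(Arrays.asList(values));
    }

    public static void main(String[] args) {
        System.out.println(isCompatible(5, 5, 2)); // ES-Hadoop 5.5 with ES 2.x
        System.out.println(isCompatible(2, 1, 2)); // ES-Hadoop 2.1 with ES 2.x
    }
}
```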

Installation

Stable Release (currently 8.4.0)

Available through any Maven-compatible tool:

<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-hadoop</artifactId>
  <version>8.4.0</version>
</dependency>

or as a stand-alone ZIP.

Development Snapshot

Grab the latest nightly build from the repository again through Maven:

<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch-hadoop</artifactId>
  <version>8.9.0-SNAPSHOT</version>
</dependency>
<repositories>
  <repository>
    <id>sonatype-oss</id>
    <url>http://oss.sonatype.org/content/repositories/snapshots</url>
    <snapshots><enabled>true</enabled></snapshots>
  </repository>
</repositories>

or build the project yourself.

We do build and test the code on each commit.

Supported Hadoop Versions

Running against Hadoop 1.x is deprecated in 5.5 and will no longer be tested against in 6.0. ES-Hadoop is developed for and tested against Hadoop 2.x and YARN. More information in this section.

Feedback / Q&A

We're interested in your feedback! You can find us on the User mailing list - please append [Hadoop] to the post subject to filter it out. For more details, see the community page.

Online Documentation

The latest reference documentation is available online on the project home page. The rest of this README contains basic usage instructions at a glance.

Usage

Configuration Properties

All configuration properties start with the es prefix. Note that the es.internal namespace is reserved for the library's internal use and should not be used by the user at any point. The properties are read mainly from the Hadoop configuration, but the user can specify (some of) them directly depending on the library used.

Required

es.resource=<ES resource location, relative to the host/port specified above>

Essential

es.query=<uri or query dsl query>              # defaults to {"query":{"match_all":{}}}
es.nodes=<ES host address>                     # defaults to localhost
es.port=<ES REST port>                         # defaults to 9200

The full list is available here
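To make the required/essential split concrete, here is a minimal sketch that applies the documented defaults and rejects a configuration without es.resource. EsSettingsSketch is a hypothetical helper written for this README, not an ES-Hadoop class, and the library's own validation may behave differently:

```java
import java.util.Properties;

// Hypothetical helper for illustration; not part of ES-Hadoop.
class EsSettingsSketch {

    // Starts from the documented defaults, overlays the user's settings,
    // and enforces the two rules stated above: es.resource is required and
    // the es.internal namespace is off limits.
    public static Properties withDefaults(Properties user) {
        Properties p = new Properties();
        p.setProperty("es.query", "{\"query\":{\"match_all\":{}}}");
        p.setProperty("es.nodes", "localhost");
        p.setProperty("es.port", "9200");
        for (String name : user.stringPropertyNames()) {
            if (name.startsWith("es.internal.")) {
                throw new IllegalArgumentException(name + " is reserved for internal use");
            }
            p.setProperty(name, user.getProperty(name));
        }
        if (p.getProperty("es.resource") == null) {
            throw new IllegalArgumentException("es.resource is required");
        }
        return p;
    }

    public static void main(String[] args) {
        Properties user = new Properties();
        user.setProperty("es.resource", "radio/artists");
        user.setProperty("es.query", "?q=me*");
        Properties effective = withDefaults(user);
        // prints "localhost:9200"
        System.out.println(effective.getProperty("es.nodes") + ":" + effective.getProperty("es.port"));
    }
}
```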

Map/Reduce

For basic, low-level or performance-sensitive environments, ES-Hadoop provides dedicated InputFormat and OutputFormat that read and write data to Elasticsearch. To use them, add the es-hadoop jar to your job classpath, either by bundling the library along (it's ~300kB and has no dependencies), by using the DistributedCache, or by provisioning the cluster manually. See the documentation for more information.

Note that es-hadoop supports both the so-called 'old' and the 'new' API through its EsInputFormat and EsOutputFormat classes.

'Old' (org.apache.hadoop.mapred) API

Reading

To read data from ES, configure the EsInputFormat on your job configuration along with the relevant properties:

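import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.elasticsearch.hadoop.mr.EsInputFormat;

JobConf conf = new JobConf();
conf.setInputFormat(EsInputFormat.class);
conf.set("es.resource", "radio/artists");
conf.set("es.query", "?q=me*");             // replace this with the relevant query
...
JobClient.runJob(conf);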
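The es.query property accepts either a URI query (as in the example above) or a query DSL document (as in its default value). A minimal sketch of how one might tell the two forms apart before setting the property follows; the leading-character rule and the EsQueryKind class are assumptions made for illustration, not the library's documented behavior:

```java
// Hypothetical helper for illustration; not part of ES-Hadoop.
class EsQueryKind {

    enum Kind { URI, DSL, UNKNOWN }

    // Assumption: URI queries start with '?' and query DSL documents with '{'.
    public static Kind classify(String query) {
        String q = query == null ? "" : query.trim();
        if (q.startsWith("?")) return Kind.URI;
        if (q.startsWith("{")) return Kind.DSL;
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(classify("?q=me*"));
        System.out.println(classify("{\"query\":{\"match_all\":{}}}"));
    }
}
```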

Writing

The same configuration template can be used for writing, but with EsOutputFormat:

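import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.elasticsearch.hadoop.mr.EsOutputFormat;

JobConf conf = new JobConf();
conf.setOutputFormat(EsOutputFormat.class);
conf.set("es.resource", "radio/artists"); // index or indices used for storing data
...
JobClient.runJob(conf);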

'New' (org.apache.hadoop.mapreduce) API

Reading

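import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.elasticsearch.hadoop.mr.EsInputFormat;

Configuration conf = new Configuration();
conf.set("es.resource", "radio/artists");
conf.set("es.query", "?q=me*");             // replace this with the relevant query
Job job = new Job(conf);
job.setInputFormatClass(EsInputFormat.class);
...
job.waitForCompletion(true);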

Writing

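import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.elasticsearch.hadoop.mr.EsOutputFormat;

Configuration conf = new Configuration();
conf.set("es.resource", "radio/artists"); // index or indices used for storing data
Job job = new Job(conf);
job.setOutputFormatClass(EsOutputFormat.class);
...
job.waitForCompletion(true);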

Apache Hive

ES-Hadoop provides a Hive storage handler for Elasticsearch, meaning one can define an external table on top of ES.

Add es-hadoop-<version>.jar to hive.aux.jars.path or register it manually in your Hive script (recommended):

ADD JAR /path_to_jar/es-hadoop-<version>.jar;

Reading

To read data from ES, define a table backed by the desired index:

CREATE EXTERNAL TABLE artists (
    id      BIGINT,
    name    STRING,
    links   STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists', 'es.query' = '?q=me*');

The fields defined in the table are mapped to the JSON when communicating with Elasticsearch. Notice the use of TBLPROPERTIES to define the location, that is the query used for reading from this table.

Once defined, the table can be used just like any other:

SELECT * FROM artists;

Writing

To write data, a similar definition is used but with a different es.resource:

CREATE EXTERNAL TABLE artists (
    id      BIGINT,
    name    STRING,
    links   STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists');

Any data passed to the table is then passed down to Elasticsearch; for example, given a table source (aliased as s below) mapped to a TSV/CSV file, one can index it to Elasticsearch like this:

INSERT OVERWRITE TABLE artists
    SELECT NULL, s.name, named_struct('url', s.url, 'picture', s.picture) FROM source s;

As one can note, currently the reading and writing are treated separately but we're working on unifying the two and automatically translating HiveQL to Elasticsearch queries.

Apache Pig

ES-Hadoop provides both read and write functions for Pig so you can access Elasticsearch from Pig scripts.

Register ES-Hadoop jar into your script or add it to your Pig classpath:

REGISTER /path_to_jar/es-hadoop-<version>.jar;

Additionally one can define an alias to save some chars:

%define ESSTORAGE org.elasticsearch.hadoop.pig.EsStorage()

and use $ESSTORAGE for storage definition.

Reading

To read data from ES, use EsStorage and specify the query through the LOAD function:

A = LOAD 'radio/artists' USING org.elasticsearch.hadoop.pig.EsStorage('es.query=?q=me*');
DUMP A;

Writing

Use the same Storage to write data to Elasticsearch:

A = LOAD 'src/artists.dat' USING PigStorage() AS (id:long, name, url:chararray, picture: chararray);
B = FOREACH A GENERATE name, TOTUPLE(url, picture) AS links;
STORE B INTO 'radio/artists' USING org.elasticsearch.hadoop.pig.EsStorage();

Apache Spark

ES-Hadoop provides native (Java and Scala) integration with Spark: a dedicated RDD for reading and, for writing, methods that work on any RDD. Spark SQL is also supported.

Scala

Reading

To read data from ES, create a dedicated RDD and specify the query as an argument:

import org.elasticsearch.spark._

..
val conf = ...
val sc = new SparkContext(conf)
sc.esRDD("radio/artists", "?q=me*")

Spark SQL

import org.elasticsearch.spark.sql._

// DataFrame schema automatically inferred
val df = sqlContext.read.format("es").load("buckethead/albums")

// operations get pushed down and translated at runtime to Elasticsearch QueryDSL
val playlist = df.filter(df("category").equalTo("pikes").and(df("year").geq(2016)))

Writing

Import the org.elasticsearch.spark._ package to gain saveToEs methods on your RDDs:

import org.elasticsearch.spark._

val conf = ...
val sc = new SparkContext(conf)

val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
val airports = Map("OTP" -> "Otopeni", "SFO" -> "San Fran")

sc.makeRDD(Seq(numbers, airports)).saveToEs("spark/docs")

Spark SQL

import org.elasticsearch.spark.sql._

val df = sqlContext.read.json("examples/people.json")
df.saveToEs("spark/people")

Java

In a Java environment, use the org.elasticsearch.spark.rdd.api.java package, in particular the JavaEsSpark class.

Reading

To read data from ES, create a dedicated RDD and specify the query as an argument.

import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

SparkConf conf = ...
JavaSparkContext jsc = new JavaSparkContext(conf);

JavaPairRDD<String, Map<String, Object>> esRDD = JavaEsSpark.esRDD(jsc, "radio/artists");

Spark SQL

SQLContext sql = new SQLContext(sc);
DataFrame df = sql.read().format("es").load("buckethead/albums");
DataFrame playlist = df.filter(df.col("category").equalTo("pikes").and(df.col("year").geq(2016)));

Writing

Use JavaEsSpark to index any RDD to Elasticsearch:

import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

SparkConf conf = ...
JavaSparkContext jsc = new JavaSparkContext(conf);

Map<String, ?> numbers = ImmutableMap.of("one", 1, "two", 2);
Map<String, ?> airports = ImmutableMap.of("OTP", "Otopeni", "SFO", "San Fran");

JavaRDD<Map<String, ?>> javaRDD = jsc.parallelize(ImmutableList.of(numbers, airports));
JavaEsSpark.saveToEs(javaRDD, "spark/docs");

Spark SQL

import org.elasticsearch.spark.sql.api.java.JavaEsSparkSQL;

DataFrame df = sqlContext.read().json("examples/people.json");
JavaEsSparkSQL.saveToEs(df, "spark/docs");

Apache Storm

ES-Hadoop provides native integration with Storm: a dedicated Spout for reading and a specialized Bolt for writing.

Reading

To read data from ES, use EsSpout:

import org.elasticsearch.storm.EsSpout;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("es-spout", new EsSpout("storm/docs", "?q=me*"), 5);
builder.setBolt("bolt", new PrinterBolt()).shuffleGrouping("es-spout");

Writing

To index data to ES, use EsBolt:

import org.elasticsearch.storm.EsBolt;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new RandomSentenceSpout(), 10);
builder.setBolt("es-bolt", new EsBolt("storm/docs"), 5).shuffleGrouping("spout");

Building the source

Elasticsearch Hadoop uses Gradle for its build system and it is not required to have it installed on your machine. By default (gradlew), it automatically builds the package and runs the unit tests. For integration testing, use the integrationTests task. See gradlew tasks for more information.

To create a distributable zip, run gradlew distZip from the command line; once completed you will find the jar in build/libs.

To build the project, JVM 8 (Oracle one is recommended) or higher is required.

License

This project is released under version 2.0 of the Apache License.

Licensed to Elasticsearch under one or more contributor
license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright
ownership. Elasticsearch licenses this file to you under
the Apache License, Version 2.0 (the "License"); you may
not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.

More Repositories

1

elasticsearch

Free and Open, Distributed, RESTful Search Engine
Java
65,029
star
2

kibana

Your window into the Elastic Stack
TypeScript
19,124
star
3

logstash

Logstash - transport and process your logs, events, or other data
Java
13,615
star
4

beats

🐠 Beats - Lightweight shippers for Elasticsearch & Logstash
Go
11,967
star
5

elasticsearch-php

Official PHP client for Elasticsearch.
PHP
5,190
star
6

elasticsearch-js

Official Elasticsearch client library for Node.js
TypeScript
5,174
star
7

go-elasticsearch

The official Go client for Elasticsearch
Go
4,933
star
8

elasticsearch-py

Official Python client for Elasticsearch
Python
4,034
star
9

elasticsearch-dsl-py

High level Python client for Elasticsearch
Python
3,695
star
10

elasticsearch-definitive-guide

The Definitive Guide to Elasticsearch
HTML
3,521
star
11

elasticsearch-net

This strongly-typed, client library enables working with Elasticsearch. It is the official client maintained and supported by Elastic.
C#
3,469
star
12

curator

Curator: Tending your Elasticsearch indices
Python
3,020
star
13

elasticsearch-rails

Elasticsearch integrations for ActiveModel/Record and Ruby on Rails
Ruby
3,017
star
14

examples

Home for Elasticsearch examples available to everyone. It's a great way to get started.
Jupyter Notebook
2,587
star
15

cloud-on-k8s

Elastic Cloud on Kubernetes
Go
2,461
star
16

elasticsearch-ruby

Ruby integrations for Elasticsearch
Ruby
1,928
star
17

helm-charts

You know, for Kubernetes
Python
1,807
star
18

search-ui

Search UI. Libraries for the fast development of modern, engaging search experiences.
TypeScript
1,796
star
19

logstash-forwarder

An experiment to cut logs in preparation for processing elsewhere. Replaced by Filebeat: https://github.com/elastic/beats/tree/master/filebeat
Go
1,788
star
20

detection-rules

Python
1,751
star
21

ansible-elasticsearch

Ansible playbook for Elasticsearch
Ruby
1,567
star
22

otel-profiling-agent

The production-scale datacenter profiler
Go
1,231
star
23

stack-docker

Project no longer maintained.
Shell
1,189
star
24

apm-server

APM Server
Go
1,100
star
25

ecs

Elastic Common Schema
Python
920
star
26

protections-artifacts

Elastic Security detection content for Endpoint
YARA
848
star
27

ember

Elastic Malware Benchmark for Empowering Researchers
Jupyter Notebook
799
star
28

elasticsearch-docker

Official Elasticsearch Docker image
Python
790
star
29

elasticsearch-rs

Official Elasticsearch Rust Client
Rust
612
star
30

elasticsearch-cloud-aws

AWS Cloud Plugin for Elasticsearch
580
star
31

apm-agent-dotnet

Elastic APM .NET Agent
C#
540
star
32

apm-agent-nodejs

Elastic APM Node.js Agent
JavaScript
540
star
33

apm-agent-java

Elastic APM Java Agent
Java
536
star
34

eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
Python
516
star
35

elasticsearch-mapper-attachments

Mapper Attachments Type plugin for Elasticsearch
Java
503
star
36

elasticsearch-servicewrapper

A service wrapper on top of elasticsearch
Shell
489
star
37

apm-agent-go

Official Go agent for Elastic APM
Go
390
star
38

sense

A JSON aware developer's interface to Elasticsearch. Comes with handy machinery such as syntax highlighting, autocomplete, formatting and code folding.
JavaScript
382
star
39

apm-agent-python

Official Python agent for Elastic APM
Python
381
star
40

elastic-charts

📊 Elastic Charts library
TypeScript
362
star
41

stream2es

Stream data into ES (Wikipedia, Twitter, stdin, or other ESes)
Clojure
356
star
42

timelion

Timelion was absorbed into Kibana 5. Don't use this. Time series composer for Elasticsearch and beyond.
JavaScript
347
star
43

elasticsearch-labs

Notebooks & Example Apps for Search & AI Applications with Elasticsearch
Jupyter Notebook
341
star
44

apm

Elastic Application Performance Monitoring - resources and general issue tracking for Elastic APM.
Gherkin
317
star
45

elasticsearch-net-example

A tutorial repository for Elasticsearch and NEST
305
star
46

elasticsearch-migration

This plugin will help you to check whether you can upgrade directly to the next major version of Elasticsearch, or whether you need to make changes to your data and cluster before doing so.
291
star
47

logstash-docker

Official Logstash Docker image
Python
286
star
48

elasticsearch-py-async

Backend for elasticsearch-py based on python's asyncio module.
Python
283
star
49

support-diagnostics

Support diagnostics utility for elasticsearch and logstash
Java
278
star
50

elasticsearch-java

Official Elasticsearch Java Client
Java
274
star
51

es2unix

Command-line ES
Clojure
274
star
52

elasticsearch-analysis-smartcn

Smart Chinese Analysis Plugin for Elasticsearch
268
star
53

dockerfiles

Dockerfiles for the official Elastic Stack images
Shell
253
star
54

go-sysinfo

go-sysinfo is a library for collecting system information.
Go
249
star
55

kibana-docker

Official Kibana Docker image
Python
243
star
56

elasticsearch-metrics-reporter-java

Metrics reporter, which reports to elasticsearch
Java
232
star
57

apm-agent-php

Elastic APM PHP Agent
PHP
229
star
58

docs

Ruby
229
star
59

elasticsearch-river-twitter

Twitter River Plugin for elasticsearch (STOPPED)
Java
202
star
60

elasticsearch-formal-models

Formal models of core Elasticsearch algorithms
Isabelle
200
star
61

rally-tracks

Track specifications for the Elasticsearch benchmarking tool Rally
Python
197
star
62

beats-dashboards

DEPRECATED. Moved to https://github.com/elastic/beats. Please use the new repository to add new issues.
Shell
192
star
63

elasticsearch-analysis-icu

ICU Analysis plugin for Elasticsearch
189
star
64

elasticsearch-river-rabbitmq

RabbitMQ River Plugin for elasticsearch (STOPPED)
Java
173
star
65

elasticsearch-analysis-kuromoji

Japanese (kuromoji) Analysis Plugin
168
star
66

terraform-provider-ec

Terraform provider for the Elasticsearch Service and Elastic Cloud Enterprise
Go
165
star
67

beats-docker

Official Beats Docker images
Python
165
star
68

elasticsearch-river-couchdb

CouchDB River Plugin for elasticsearch (STOPPED)
Java
163
star
69

apm-agent-ruby

Elastic APM agent for Ruby
Ruby
156
star
70

integrations

Elastic Integrations
Handlebars
155
star
71

require-in-the-middle

Module to hook into the Node.js require function
JavaScript
149
star
72

harp

Secret management by contract toolchain
Go
143
star
73

dorothy

Dorothy is a tool to test security monitoring and detection for Okta environments
Python
141
star
74

ml-cpp

Machine learning C++ code
C++
139
star
75

ecs-logging-java

Centralized logging for Java applications with the Elastic stack made easy
Java
137
star
76

SWAT

Simple Workspace Attack Tool (SWAT) is a tool for simulating malicious behavior against Google Workspace in reference to the MITRE ATT&CK framework.
Python
135
star
77

go-libaudit

go-libaudit is a library for communicating with the Linux Audit Framework.
Go
133
star
78

ansible-beats

Ansible Beats Role
Ruby
131
star
79

logstash-contrib

THIS REPOSITORY IS NO LONGER USED.
Ruby
128
star
80

elasticsearch-analysis-phonetic

Phonetic Analysis Plugin for Elasticsearch
127
star
81

azure-marketplace

Elasticsearch Azure Marketplace offering + ARM template
Shell
122
star
82

bpfcov

Source-code based coverage for eBPF programs actually running in the Linux kernel
C
115
star
83

anonymize-it

a general utility for anonymizing data
Python
114
star
84

windows-installers

Windows installers for the Elastic stack
C#
113
star
85

terraform-provider-elasticstack

Terraform provider for Elastic Stack
Go
111
star
86

makelogs

JavaScript
108
star
87

golang-crossbuild

Shell
107
star
88

elasticsearch-lang-python

Python language Plugin for elasticsearch
104
star
89

elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.
Go
102
star
90

go-freelru

GC-less, fast and generic LRU hashmap library for Go
Go
101
star
91

elasticsearch-lang-javascript

JavaScript language Plugin for elasticsearch
93
star
92

stack-docs

Elastic Stack Documentation
Java
92
star
93

elasticsearch-specification

Elasticsearch full specification
TypeScript
89
star
94

elasticsearch-perl

Official Perl low-level client for Elasticsearch.
Perl
87
star
95

next-eui-starter

Start building Kibana prototypes quickly with the Next.js EUI Starter
TypeScript
87
star
96

vue-search-ui-demo

A demo of implementing Elastic's Search UI and App Search using Vue.js
Vue
87
star
97

elasticsearch-transport-thrift

Thrift Transport for elasticsearch (STOPPED)
Java
84
star
98

ecs-dotnet

.NET integrations that use the Elastic Common Schema (ECS)
HTML
82
star
99

generator-kibana-plugin

DEPRECATED Yeoman Generator for Kibana Plugins, please use https://github.com/elastic/template-kibana-plugin/
JavaScript
79
star
100

hipio

A DNS server that parses a domain for an IPv4 Address
Haskell
76
star