• Stars: 105
• Rank: 328,196 (Top 7%)
• Language: Shell
• License: Apache License 2.0
• Created: about 9 years ago
• Updated: 12 months ago


Repository Details

Dockerfiles for StreamSets Data Collector

Data Collector Splash Image

StreamSets Data Collector allows building dataflows quickly and easily, spanning on-premises, multi-cloud and edge infrastructure.

It has an advanced, easy-to-use user interface that lets data scientists, developers and data infrastructure teams create data pipelines in a fraction of the time typically required for complex ingest scenarios.

To learn more, check out http://streamsets.com

You must accept the Oracle Binary Code License Agreement for Java SE to use this image.

Getting Help

Connect with the StreamSets Community to discover ways to reach the team.

If you need help with production systems, you can check out the variety of support options offered on our support page.

Basic Usage

docker run --restart on-failure -p 18630:18630 -d --name streamsets-dc streamsets/datacollector

The default login is: admin / admin.
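Once the container is running, you can follow its startup and check that the UI answers on the mapped port. This is a quick sanity check, not part of the image itself; it assumes the container name and port mapping from the command above:

```shell
# Tail the Data Collector logs until startup completes (Ctrl-C to stop)
docker logs -f streamsets-dc

# In another shell, confirm the UI responds on the mapped port
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:18630
```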

Detailed Usage

  • You can supply custom configuration files by mounting them as a volume to /etc/sdc or /etc/sdc/<configuration file>
  • Configuration properties in sdc.properties and dpm.properties can also be overridden at runtime by specifying them as environment variables prefixed with SDC_CONF or DPM_CONF
    • For example, http.port would be set as SDC_CONF_HTTP_PORT=12345
  • You should at a minimum specify a data volume for the data directory unless running as a stateless service integrated with StreamSets Control Hub. The default configured location for SDC_DATA is /data. You can override this location by passing a different value to the environment variable SDC_DATA.
  • You can also specify your own explicit port mappings, or arguments to the streamsets command.
  • When building the image yourself, files or directories placed in the "resources" directory at the project root will be copied to the image's SDC_RESOURCES directory.
  • When building the image yourself, files or directories placed in the "sdc-extras" directory at the project root will be copied to the image's STREAMSETS_LIBRARIES_EXTRA_DIR. See the Dockerfile for details.
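Putting several of the bullets above together, here is a sketch of a run command that overrides http.port through the SDC_CONF_ prefix and relocates the data directory with SDC_DATA. The port value and host path are illustrative, not defaults:

```shell
# Example only: port 12345 and the host path are placeholders you would
# replace with your own values.
docker run -d --name streamsets-dc \
  -e SDC_CONF_HTTP_PORT=12345 \
  -e SDC_DATA=/sdc/data \
  -v $PWD/sdc-data:/sdc/data:rw \
  -p 12345:12345 \
  streamsets/datacollector
```

Note that the published port must match the overridden http.port, since the UI now listens on 12345 inside the container.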

For example, to run with a customized sdc.properties file, a local filesystem path to store pipelines, and the default UI port statically mapped, you could use the following:

docker run --restart on-failure -v $PWD/sdc.properties:/etc/sdc/sdc.properties:ro -v $PWD/sdc-data:/data:rw -p 18630:18630 -d streamsets/datacollector

Creating Data Volumes

To create a dedicated data volume for the pipeline store, issue the following command:

docker volume create --name sdc-data

You can then use the -v (volume) argument to mount it when you start the data collector.

docker run -v sdc-data:/data -P -d streamsets/datacollector

Note: There are two different methods for managing data in Docker. The above is using data volumes which are empty when created. You can also use data containers which are derived from an image. These are useful when you want to modify and persist a path starting with existing files from a base container, such as for configuration files. We'll use both in the example below. See Manage data in containers for more detailed documentation.

Pre-configuring Data Collector

The simplest and recommended way is to derive your own custom image.

For example, create a new file named Dockerfile with the following contents:

ARG SDC_VERSION=3.9.1
FROM streamsets/datacollector:${SDC_VERSION}

ARG SDC_LIBS
RUN "${SDC_DIST}/bin/streamsets" stagelibs -install="${SDC_LIBS}"

To create a derived image that includes the Jython stage library for SDC version 3.9.1, you can run the following command:

docker build -t mycompany/datacollector:3.9.1 --build-arg SDC_VERSION=3.9.1 --build-arg SDC_LIBS=streamsets-datacollector-jython_2_7-lib .
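Assuming the build succeeds, the derived image runs exactly like the base image, with the Jython stage library already baked in. For example:

```shell
# Run the derived image built above; no stage-lib install needed at runtime
docker run --restart on-failure -p 18630:18630 -d --name streamsets-dc \
  mycompany/datacollector:3.9.1
```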

Option 2 - Volumes

First, we create a data container for our configuration; we'll call ours sdc-conf. Then we start a temporary container with that volume attached so we can edit the files:

docker create -v /etc/sdc --name sdc-conf streamsets/datacollector
docker run --rm -it --volumes-from sdc-conf ubuntu bash

Tip: You can substitute ubuntu for your favorite base image. This is only a temporary container for editing the base configuration files.

Edit the configuration of SDC to your liking by modifying the files in /etc/sdc

You can choose to create separate data containers using the above procedure for $SDC_DATA (/data) and other locations, or you can add all of the volumes to the same container. For multiple volumes in a single data container you could use the following syntax:

docker create -v /etc/sdc -v /data --name sdc-volumes streamsets/datacollector

If you find it easier to edit the configuration files locally you can, instead of starting the temporary container above, use the docker cp command to copy the configuration files back and forth from the data container.
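For example, the round trip with docker cp might look like this, using the sdc-conf data container created above:

```shell
# Copy sdc.properties out of the data container to the current directory
docker cp sdc-conf:/etc/sdc/sdc.properties .

# ... edit sdc.properties locally with your editor of choice ...

# Copy the edited file back into the container's /etc/sdc volume
docker cp sdc.properties sdc-conf:/etc/sdc/sdc.properties
```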

To install stage libs using the CLI or Package Manager UI you'll need to create a volume for the stage libs directory. It's also recommended to use a volume for the data directory at a minimum.

docker volume create --name sdc-stagelibs

If you didn't create a data container for /data, then also run:

docker volume create --name sdc-data

The volume then needs to be mounted to the correct directory when launching the container. The example below is for Data Collector version 3.9.1.

docker run --name sdc -d -v sdc-stagelibs:/opt/streamsets-datacollector-3.9.1/streamsets-libs -v sdc-data:/data -P streamsets/datacollector dc -verbose

To get a list of available libs you could do:

docker run --rm streamsets/datacollector:3.9.1 stagelibs -list

For example, to install the JDBC lib into the sdc-stagelibs volume you created above, you would run:

docker run --rm -v sdc-stagelibs:/opt/streamsets-datacollector-3.9.1/streamsets-libs streamsets/datacollector:3.9.1 stagelibs -install=streamsets-datacollector-jdbc-lib
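To confirm the library landed in the volume, you can list the volume's contents with a throwaway container. Since arguments passed to this image normally go to the streamsets command (as in the stagelibs examples above), this sketch overrides the entrypoint to run plain ls:

```shell
# List installed stage libraries in the sdc-stagelibs volume;
# --entrypoint bypasses the image's streamsets launcher
docker run --rm --entrypoint ls \
  -v sdc-stagelibs:/opt/streamsets-datacollector-3.9.1/streamsets-libs \
  streamsets/datacollector:3.9.1 \
  /opt/streamsets-datacollector-3.9.1/streamsets-libs
```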

More Repositories

 1. tutorials: StreamSets Tutorials (Java, 345 stars)
 2. datacollector-oss (Java, 87 stars)
 3. pipeline-library: Pipeline library for StreamSets Data Collector and Transformer (29 stars)
 4. datacollector-kubernetes (Shell, 17 stars)
 5. datacollector-tests: StreamSets Test Framework-based tests for StreamSets Data Collector (Python, 17 stars)
 6. helm-charts: Helm Charts (Mustache, 11 stars)
 7. control-agent-quickstart (Shell, 8 stars)
 8. ansible-datacollector (Shell, 7 stars)
 9. support: Support Tooling / Example Pipelines and more (Shell, 5 stars)
10. datacollector-edge-oss (Go, 5 stars)
11. transformer-sample-processor: Sample custom processor tutorial for StreamSets Transformer (Java, 4 stars)
12. sample-pipelines (4 stars)
13. ansible-datacollector-sample-playbook: Sample Ansible playbook for use with StreamSets Data Collector (Python, 4 stars)
14. academy-devops (Shell, 2 stars)
15. datacollector-plugin-api-oss (Java, 2 stars)
16. streamsetsday (Shell, 2 stars)
17. slot_car_demo (TypeScript, 2 stars)
18. PythonSequenceFile: Python library to read Hadoop Sequence Files (Python, 2 stars)
19. streamsets-cloud-pipelines: Library of pipelines exported from StreamSets Cloud (1 star)
20. streamsets.github.io (HTML, 1 star)
21. test (Python, 1 star)
22. clabot-config: cla-bot configuration (1 star)
23. streamsets-scripts: Support and maintenance scripts for StreamSets products (Shell, 1 star)
24. adls-gen2-python (Python, 1 star)
25. Pipeline-Examples (Python, 1 star)
26. java-utils: Repository for independent, lightweight utility libraries to be used for any Java product (Java, 1 star)
27. datacollector-api-oss (Java, 1 star)
28. streamsets-sdk-k8s-deployment-with-ingress: Example of how to deploy one or more SDCs on k8s using the StreamSets SDK when ingress is needed (Python, 1 star)