Apache Beam Example Code

An example Apache Beam project.

Description

This example can be used with conference talks and self-study. The base of the examples are taken from Beam's example directory. They are modified to use Beam as a dependency in the pom.xml instead of being compiled together. The example code is changed to output to local directories.

How to clone and run

Open a terminal window.
Run git clone [email protected]:eljefe6a/beamexample.git
Run cd beamexample/BeamTutorial
Run mvn compile
Create local output directory: mkdir output
Run mvn compile exec:java -Dexec.mainClass="org.apache.beam.examples.tutorial.game.solution.Exercise1" -Pdirect-runner
Run cat output/user_score to verify the program ran correctly and the output file was created.

Using a Java IDE

Follow the IDE Setup instructions on the Apache Beam Contribution Guide.

Other Runners

Apache Flink

Follow the first steps from Flink's Quickstart to download Flink.
Create the output directory.
To run on a JVM-local cluster: mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.tutorial.game.solution.Exercise1 -Dexec.args='--runner=FlinkRunner --flinkMaster=[local]' -Pflink-runner
To run on an out-of-process local cluster (note that the steps below should also work on a real cluster if you have one running):
1. Start a local Flink cluster.
2. Navigate to the WebUI (typically http://localhost:8081), click JobManager, and note the value of jobmanager.rpc.port. The default is probably 6123.
3. Run mvn package -Pflink-runner to generate a JAR file. Note the location of the generated JAR (probably ./target/BeamTutorial-bundled-flink.jar)
4. Run mvn -X -e compile exec:java -Dexec.mainClass=org.apache.beam.examples.tutorial.game.solution.Exercise1 -Dexec.args='--runner=FlinkRunner --flinkMaster=localhost:6123 --filesToStage=./target/BeamTutorial-bundled-flink.jar' -Pflink-runner, replacing the defaults for port and JAR file if they differ.
5. Check in the WebUI to see the job listed.
Run cat output/user_score to verify the pipeline ran correctly and the output file was created.

Apache Spark

Create the output directory.
Allow all users (Spark may run as a different user) to write to the output directory. chmod 1777 output.
Change the output file to a fully-qualified path. For example, this("output/user_score"); to this("/home/vmuser/output/user_score");
Run mvn package -Pspark-runner
Run spark-submit --jars ./target/BeamTutorial-bundled-spark.jar --class org.apache.beam.examples.tutorial.game.solution.Exercise2 --master yarn-client ./target/BeamTutorial-bundled-spark.jar --runner=SparkRunner

Google Cloud Dataflow

Follow the steps in either of the Java quickstarts for Cloud Dataflow to initialize your Google Cloud setup.
Create a bucket on Google Cloud Storage for staging and output.
Run mvn -X compile exec:java -Dexec.mainClass="org.apache.beam.examples.tutorial.game.solution.Exercise1" -Dexec.args='--runner=DataflowRunner --project=<YOUR-GOOGLE-CLOUD-PROJECT> --gcpTempLocation=gs://<YOUR-BUCKET-NAME> --outputPrefix=gs://<YOUR-BUCKET-NAME>/output/' -Pdataflow-runner, after replacing <YOUR-GCP-PROJECT> and <YOUR-BUCKET-NAME> with the appropriate values.
Check the Cloud Dataflow Console to see the job running.
Check the output bucket to see the generated output: https://console.cloud.google.com/storage/browser/<YOUR-BUCKET-NAME>/

eljefe6a/beamexample

eljefe6a

Reviews

Repository Details