Apache Beam Example Code
An example Apache Beam project.
Description
This example can be used with conference talks and for self-study. The examples are based on those in Beam's examples
directory. They are modified to use Beam as a dependency in the pom.xml
instead of being compiled together with Beam. The example code is also changed to write its output to local directories.
How to clone and run
- Open a terminal window.
- Run
git clone git@github.com:eljefe6a/beamexample.git
- Run
cd beamexample/BeamTutorial
- Run
mvn compile
- Create a local output directory:
mkdir output
- Run
mvn compile exec:java -Dexec.mainClass="org.apache.beam.examples.tutorial.game.solution.Exercise1" -Pdirect-runner
- Run
cat output/user_score
to verify the program ran correctly and the output file was created.
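The verification step above can be sketched as a small shell function. This is only an illustration: the function name check_output is hypothetical, and it assumes the pipeline writes a single file at output/user_score, as the steps above describe.

```shell
# Hypothetical helper (not part of the project): fail if the output file
# is missing or empty, otherwise print its first few lines.
check_output() {
  file="${1:-output/user_score}"
  # -s is true only when the file exists and has a size greater than zero.
  if [ ! -s "$file" ]; then
    echo "missing or empty: $file" >&2
    return 1
  fi
  head -n 5 "$file"
}
```

Calling check_output with no argument checks output/user_score; pass a different path to check another file.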
Using a Java IDE
- Follow the IDE Setup instructions on the Apache Beam Contribution Guide.
Other Runners
Apache Flink
- Follow the first steps from Flink's Quickstart to download Flink.
- Create the
output
directory.
- To run on a JVM-local cluster:
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.tutorial.game.solution.Exercise1 -Dexec.args='--runner=FlinkRunner --flinkMaster=[local]' -Pflink-runner
- To run on an out-of-process local cluster (note that the steps below should also work on a real cluster if you have one running):
- Start a local Flink cluster.
- Navigate to the WebUI (typically http://localhost:8081), click JobManager, and note the value of
jobmanager.rpc.port
(the default is probably 6123).
- Run
mvn package -Pflink-runner
to generate a JAR file. Note the location of the generated JAR (probably
./target/BeamTutorial-bundled-flink.jar
).
- Run
mvn -X -e compile exec:java -Dexec.mainClass=org.apache.beam.examples.tutorial.game.solution.Exercise1 -Dexec.args='--runner=FlinkRunner --flinkMaster=localhost:6123 --filesToStage=./target/BeamTutorial-bundled-flink.jar' -Pflink-runner
, replacing the defaults for port and JAR file if they differ.
- Check in the WebUI to see the job listed.
- Run
cat output/user_score
to verify the pipeline ran correctly and the output file was created.
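If the RPC port or JAR location differs from the defaults, one way to avoid hand-editing the long command is to generate it. The sketch below only prints the command and does not run Maven. The function name flink_run_command is hypothetical, the debug flags (-X -e) are left out, and the defaults are the values quoted in the steps above.

```shell
# Hypothetical helper: print the Maven command for an out-of-process
# Flink cluster, substituting the RPC port and the bundled JAR path.
flink_run_command() {
  port="${1:-6123}"                                   # jobmanager.rpc.port
  jar="${2:-./target/BeamTutorial-bundled-flink.jar}" # JAR built by mvn package
  printf '%s\n' "mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.tutorial.game.solution.Exercise1 -Dexec.args='--runner=FlinkRunner --flinkMaster=localhost:${port} --filesToStage=${jar}' -Pflink-runner"
}
```

For example, flink_run_command 6124 prints the command with --flinkMaster=localhost:6124, ready to paste into the terminal.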
Apache Spark
- Create the
output
directory.
- Allow all users (Spark may run as a different user) to write to the
output
directory:
chmod 1777 output
- Change the output file to a fully-qualified path. For example, change
this("output/user_score");
to
this("/home/vmuser/output/user_score");
- Run
mvn package -Pspark-runner
- Run
spark-submit --jars ./target/BeamTutorial-bundled-spark.jar --class org.apache.beam.examples.tutorial.game.solution.Exercise2 --master yarn-client ./target/BeamTutorial-bundled-spark.jar --runner=SparkRunner
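The directory preparation for Spark can be sketched as one shell function that creates the directory, applies the permissions, and prints the fully-qualified path to put in the source code. The function name prepare_output_dir is hypothetical, and the sketch assumes the output file is named user_score, as in the example above.

```shell
# Hypothetical helper: create a world-writable output directory and print
# the absolute path to use in place of the relative "output/user_score".
prepare_output_dir() {
  dir="${1:-output}"
  mkdir -p "$dir" || return 1
  # 1777 = sticky bit (1) plus read/write/execute for owner, group, others (777).
  # The sticky bit lets every user create files but delete only their own.
  chmod 1777 "$dir" || return 1
  # Resolve the absolute path with cd and pwd, which are available in any POSIX shell.
  abs_dir="$(cd "$dir" && pwd)" || return 1
  printf '%s/user_score\n' "$abs_dir"
}
```

Running prepare_output_dir output from the project directory prints the absolute path of output/user_score on the current machine.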
Google Cloud Dataflow
- Follow the steps in either of the Java quickstarts for Cloud Dataflow to initialize your Google Cloud setup.
- Create a bucket on Google Cloud Storage for staging and output.
- Run
mvn -X compile exec:java -Dexec.mainClass="org.apache.beam.examples.tutorial.game.solution.Exercise1" -Dexec.args='--runner=DataflowRunner --project=<YOUR-GOOGLE-CLOUD-PROJECT> --gcpTempLocation=gs://<YOUR-BUCKET-NAME> --outputPrefix=gs://<YOUR-BUCKET-NAME>/output/' -Pdataflow-runner
, after replacing
<YOUR-GOOGLE-CLOUD-PROJECT>
and
<YOUR-BUCKET-NAME>
with the appropriate values.
- Check the Cloud Dataflow Console to see the job running.
- Check the output bucket to see the generated output:
https://console.cloud.google.com/storage/browser/<YOUR-BUCKET-NAME>/
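The placeholder replacement can also be done mechanically. The sketch below fills the two placeholders in the exec.args string from the command above and prints the result. It does not contact Google Cloud, and the function name fill_dataflow_args is hypothetical.

```shell
# Hypothetical helper: substitute the project and bucket placeholders in
# the Dataflow arguments and print the resulting string.
fill_dataflow_args() {
  project="${1:-}"
  bucket="${2:-}"
  if [ -z "$project" ] || [ -z "$bucket" ]; then
    echo "usage: fill_dataflow_args <project> <bucket>" >&2
    return 1
  fi
  template='--runner=DataflowRunner --project=<YOUR-GOOGLE-CLOUD-PROJECT> --gcpTempLocation=gs://<YOUR-BUCKET-NAME> --outputPrefix=gs://<YOUR-BUCKET-NAME>/output/'
  # Use | as the sed delimiter so the slashes in gs:// paths need no escaping.
  printf '%s\n' "$template" |
    sed -e "s|<YOUR-GOOGLE-CLOUD-PROJECT>|${project}|g" \
        -e "s|<YOUR-BUCKET-NAME>|${bucket}|g"
}
```

The printed string is what goes inside the single quotes of -Dexec.args in the Maven command.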
Further Reading
- The World Beyond Batch Streaming Part 1 and Part 2
- Future-Proof Your Big Data Processing with Apache Beam
- Future-proof and scale-proof your code
- Question and Answers with the Apache Beam Team