AWS Glue ETL Code Samples
This repository has samples that demonstrate various aspects of the new AWS Glue service, as well as various AWS Glue utilities.
You can find the AWS Glue open-source Python libraries in a separate repository at: awslabs/aws-glue-libs.
Content
-
Helps you get started using the many ETL capabilities of AWS Glue, and answers some of the more common questions people have.
Examples
You can run these sample job scripts in any AWS Glue ETL job, container, or local environment.
-
Join and Relationalize Data in S3
This sample ETL script shows you how to use AWS Glue to load, transform, and rewrite data in Amazon S3 so that it can be queried and analyzed easily and efficiently.
-
This sample ETL script shows you how to take advantage of both Spark and AWS Glue features to clean and transform data for efficient analysis.
-
This sample explores all four of the ways you can resolve choice types in a dataset using DynamicFrame's resolveChoice method.
-
This sample ETL script shows you how to use an AWS Glue job to convert character encoding.
-
Notebook using open data lake formats
The sample iPython notebook files show you how to use open data lake formats (Apache Hudi, Delta Lake, and Apache Iceberg) on AWS Glue Interactive Sessions and AWS Glue Studio Notebook.
-
The sample Glue Blueprints show you how to implement blueprints addressing common use cases in ETL. The samples are located in the aws-glue-blueprint-libs repository.
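To give a feel for what the character encoding sample above is about, here is a minimal plain-Python sketch of converting bytes from one encoding to another. It is not the sample script itself and does not use the AWS Glue libraries; the function name `convert_encoding` and the example input are assumptions made for illustration only.

```python
def convert_encoding(data: bytes, source_encoding: str,
                     target_encoding: str = "utf-8") -> bytes:
    """Re-encode raw bytes from source_encoding to target_encoding.

    Hypothetical helper for illustration; not part of AWS Glue.
    """
    # Decode the raw bytes into a Python string using the source encoding.
    text = data.decode(source_encoding)
    # Encode the string back into bytes using the target encoding.
    return text.encode(target_encoding)


# Example input: "café" stored as Latin-1 (ISO-8859-1) bytes.
latin1_bytes = "café".encode("latin-1")
utf8_bytes = convert_encoding(latin1_bytes, "latin-1")
print(utf8_bytes.decode("utf-8"))
```

In a Glue job the same decode-then-encode idea would be applied while reading and writing the dataset; see the sample script for how it handles that.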
Utilities
-
This utility can help you migrate your Hive metastore to the AWS Glue Data Catalog.
-
These scripts can undo or redo the results of a crawl under some circumstances.
-
You can use this Dockerfile to run the Spark history server in your container. For details, see: Launching the Spark History Server and Viewing the Spark UI Using Docker
-
AWS Lake Formation applies its own permission model when you access data in Amazon S3 and metadata in the AWS Glue Data Catalog through services such as Amazon EMR and Amazon Athena. If you currently use Lake Formation and would instead like to use only IAM access controls, this tool helps you make that change.
-
This utility enables you to synchronize your AWS Glue resources (jobs, databases, tables, and partitions) from one environment (region, account) to another.
-
Glue Job Version Deprecation Checker
This command line utility helps you identify the Glue jobs that run on a Glue version due to be deprecated under the AWS Glue version support policy.
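The idea behind the Glue Job Version Deprecation Checker can be sketched in a few lines of plain Python. This is not the utility's actual implementation: the function name `find_jobs_on_deprecated_versions` is hypothetical, the set of deprecated versions is an assumed input that you would take from the current version support policy, and the job records are assumed to be dictionaries with `Name` and `GlueVersion` keys, like the job entries returned by the Glue `GetJobs` API.

```python
from typing import Dict, Iterable, List, Set


def find_jobs_on_deprecated_versions(jobs: Iterable[Dict[str, str]],
                                     deprecated_versions: Set[str]) -> List[str]:
    """Return the names of jobs whose GlueVersion is in deprecated_versions.

    Hypothetical helper for illustration; jobs without a GlueVersion
    key are skipped rather than guessed at.
    """
    flagged = []
    for job in jobs:
        version = job.get("GlueVersion")
        if version in deprecated_versions:
            flagged.append(job["Name"])
    return flagged


# Made-up job records and an assumed set of deprecated versions.
sample_jobs = [
    {"Name": "nightly-load", "GlueVersion": "1.0"},
    {"Name": "hourly-sync", "GlueVersion": "4.0"},
    {"Name": "legacy-export", "GlueVersion": "0.9"},
]
print(find_jobs_on_deprecated_versions(sample_jobs, {"0.9", "1.0"}))
```

Passing the deprecated versions in as a parameter keeps the check independent of any particular policy date; the real utility should be consulted for how it determines them.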
GlueCustomConnectors
AWS Glue provides built-in support for the most commonly used data stores, such as Amazon Redshift, MySQL, and MongoDB. With Glue ETL Custom Connectors, you can subscribe to a third-party connector from AWS Marketplace or build your own connector to connect to data stores that are not natively supported.
-
Development guide with examples of connectors with simple, intermediate, and advanced functionalities. These examples demonstrate how to implement Glue Custom Connectors based on Spark Data Source or Amazon Athena Federated Query interfaces and plug them into Glue Spark runtime.
-
This user guide describes validation tests that you can run locally on your laptop to integrate your connector with Glue Spark runtime.
-
This user guide shows how to validate connectors with Glue Spark runtime in a Glue job system before deploying them for your workloads.
-
Python script examples that use Spark, Amazon Athena, and JDBC connectors with the Glue Spark runtime.
-
Create and Publish Glue Connector to AWS Marketplace
If you would like to partner or publish your Glue custom connector to AWS Marketplace, please refer to this guide and reach out to us at [email protected] for further details on your connector.
License Summary
This sample code is made available under the MIT-0 license. See the LICENSE file.