• Stars
    star
    1,199
  • Rank 37,416 (Top 0.8 %)
  • Language
    Python
  • License
    Apache License 2.0
  • Created over 6 years ago
  • Updated 10 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

A registry of publicly available datasets on AWS

Registry of Open Data on AWS

A repository of publicly available datasets that are available for access from AWS resources. Note that datasets in this registry are available via AWS resources, but they are not provided by AWS; these datasets are owned and maintained by a variety of government organizations, researchers, businesses, and individuals.

What is this for?

When data is shared on AWS, anyone can analyze it and build services on top of it using a broad range of compute and data analytics products, including Amazon EC2, Amazon Athena, AWS Lambda, and Amazon EMR. Sharing data in the cloud lets data users spend more time on data analysis rather than data acquisition. This repository exists to help people promote and discover datasets that are available via AWS resources.

How are datasets added to the registry?

Each dataset in this repository is described with metadata saved in a YAML file in the /datasets directory. We use these YAML files to provide three services:

The YAML files use this structure:

Name:
Description:
Documentation:
Contact:
ManagedBy:
UpdateFrequency:
Tags:
  -
License:
Citation:
Resources:
  - Description:
    ARN:
    Region:
    Type:
    Explore:
DataAtWork:
  Tutorials:
    - Title:
      URL:
      NotebookURL:
      AuthorName:
      AuthorURL:
      Services:
  Tools & Applications:
    - Title:
      URL:
      AuthorName:
      AuthorURL:
  Publications:
    - Title:
      URL:
      AuthorName:
      AuthorURL:
DeprecatedNotice:
ADXCategories:
  -    

The metadata required for each dataset entry is as follows:

Field Type Description & Style
Name String The public facing name of the dataset. Spell out acronyms and abbreviations. We do not require "AWS" or "Open Data" to be in the dataset name. Must be between 5 and 130 characters.
Description String A high-level description of the dataset. Only the first 600 characters will be displayed on the homepage of the Registry of Open Data on AWS
Documentation URL A link to documentation of the dataset, preferably hosted on the data provider's website or Github repository.
Contact String May be an email address, a link to contact form, a link to GitHub issues page, or any other instructions to contact the producer of the dataset
ManagedBy String The name of the laboratory, institution, or organization who is responsible for the data ingest process. Avoid using individuals. If your institution manages several datasets hosted by the Public Dataset Program, please list the managing institution identically. For an example why, check out the Managed By section of the TARGET dataset
UpdateFrequency String An explanation of how frequently the dataset is updated
Tags List of strings Select tags that are related to an intrinsic property or descriptor of the dataset. A list of supported tags is maintained in the tags.yaml file in this repo. If you want to recommend a tag that is not included in tags.yaml, please submit a pull request to add it to that file.
License String An explanation of the dataset license and/or a URL to more information about data terms of use of the dataset
Citation (Optional) String Custom citation language to be used when citing this dataset, which will be appended to the default citation used for all datasets. Default citation language is as follows: "[DATASET NAME] was accessed on [DATE] at registry.opendata.aws/[dataset]"
Resources List of lists A list of AWS resources that users can use to consume the data. Each resource entry requires the metadata below:
Resources > Description String A technical description of the data available within the AWS resource, including information about file formats and scope.
Resources > ARN String Amazon Resource Name for resource, e.g. arn:aws:s3:::commoncrawl
Resources > Region String AWS region unique identifier, e.g. us-east-1
Resources > Type String Can be CloudFront Distribution, DB Snapshot, S3 Bucket, or SNS Topic. A list of supported resources is maintained in the resources.yaml file in this repo. If you want to recommend a resource that is not included in resources.yaml, please submit a pull request to add it to that file.
Resources > RequesterPays (Optional) Boolean Only appropriate for Amazon S3 buckets, indicates whether the bucket has Requester Pays enabled or not.
Resources > AccountRequired (Optional) String Is an AWS account required to access this data? Note that while Requester Pays means you will need an account, this is meant for cases where an account is required outside of that scenario.
Resources > ControlledAccess (Optional) String Only appropriate for Amazon S3 buckets with controlled access. Please provide a URL to instructions on how to request and gain access to the S3 bucket.
Resources > Explore (Optional) List of strings Additional links that can be used to explore the bucket resource, i.e. links to S3 JS Explorer index.html for the bucket or the AWS S3 console.
DataAtWork [> Tutorials, Tools & Applications, Publications] (Optional) List of lists A list of links to example tutorials, tools & applications, publications that use the data.
DataAtWork [> Tutorials, Tools & Applications, Publications] > Title String The title of the tutorial, tool, application, or publication that uses the data.
DataAtWork [> Tutorials, Tools & Applications, Publications] > URL URL A link to the tutorial, tool, application, or publication that uses the data.
DataAtWork [> Tutorials, Tools & Applications, Publications] > AuthorName String Name(s) of person or entity that created the tutorial, tool, application, or publication. Limit scientific publication author lists to the first six authors in the format Last Name First Initial, followed by 'et al'.
DataAtWork [> Tutorials, Tools & Applications, Publications] > AuthorURL (Optional) String URL for person or entity that created the tutorial, tool, application, or publication.
DataAtWork [> Tutorials] > NotebookURL (Optional) URL A link to a Jupyter notebook (.ipynb) on GitHub that shows how this data can be used.
DataAtWork [> Tutorials] > Services (Optional) String For tutorials only. List AWS Services applied in your tutorial. A list of supported AWS services is maintained in the services.yaml file in this repo. If you want to recommend a resource that is not included in services.yaml, please submit a pull request to add it to that file.
DeprecatedNotice (Optional) String Only appropriate for datasets that are being retired, indicates to users that the dataset will soon be deprecated and should include the date that the dataset will no longer be available.
ADXCategories List of strings Allowed categories can be found in adx_categories.yaml, at most, 2 can be added. Adding categories to your listing will improve searchability within the AWS Data Exchange.

Note also that we use the name of each YAML file as the URL slug for each dataset on the Registry of Open Data on AWS website. E.g. the metadata from 1000-genomes.yaml is listed at https://registry.opendata.aws/1000-genomes/

Example entry

Here is an example of the metadata behind this dataset registration: https://registry.opendata.aws/noaa-nexrad/

Name: NEXRAD on AWS
Description: Real-time and archival data from the Next Generation Weather Radar (NEXRAD) network.
Documentation: https://github.com/awslabs/open-data-docs/tree/main/docs/noaa/noaa-nexrad
Contact: [email protected]
ManagedBy: "[NOAA](http://www.noaa.gov/)"
UpdateFrequency: New Level II data is added as soon as it is available.
Tags:
  - aws-pds
  - earth observation
  - natural resource
  - weather
  - meteorological
  - sustainability
License: There are no restrictions on the use of this data.
Resources:
  - Description: NEXRAD Level II archive data
    ARN: arn:aws:s3:::noaa-nexrad-level2
    Region: us-east-1
    Type: S3 Bucket
    Explore:
    - '[Browse Bucket](https://noaa-nexrad-level2.s3.amazonaws.com/index.html)'
  - Description: NEXRAD Level II real-time data
    ARN: arn:aws:s3:::unidata-nexrad-level2-chunks
    Region: us-east-1
    Type: S3 Bucket
  - Description: "[Rich notifications](https://github.com/awslabs/open-data-docs/tree/main/docs/noaa/noaa-nexrad#subscribing-to-nexrad-data-notifications) for real-time data with filterable fields"
    ARN: arn:aws:sns:us-east-1:684042711724:NewNEXRADLevel2ObjectFilterable
    Region: us-east-1
    Type: SNS Topic
  - Description: Notifications for archival data
    ARN: arn:aws:sns:us-east-1:811054952067:NewNEXRADLevel2Archive
    Region: us-east-1
    Type: SNS Topic
DataAtWork:
  Tutorials:
    - Title: NEXRAD on EC2 tutorial
      URL: https://github.com/openradar/AMS_radar_in_the_cloud
      Services: EC2
      AuthorName: openradar
      AuthorURL: https://github.com/openradar
    - Title: Using Python to Access NCEI Archived NEXRAD Level 2 Data (Jupyter notebook)
      URL: http://nbviewer.jupyter.org/gist/dopplershift/356f2e14832e9b676207
      AuthorName: Ryan May
      AuthorURL: http://dopplershift.github.io
    - Title: Mapping Noaa Nexrad Radar Data With CARTO
      URL: https://carto.com/blog/mapping-nexrad-radar-data/
      AuthorName: Stuart Lynn
      AuthorURL: https://carto.com/blog/author/stuart-lynn/
  Tools & Applications:
    - Title: nexradaws on pypi.python.org - python module to query and download Nexrad data from Amazon S3
      URL: https://pypi.org/project/nexradaws/
      AuthorName: Aaron Anderson
      AuthorURL: https://github.com/aarande
    - Title: WeatherPipe - Amazon EMR based analysis tool for NEXRAD data stored on Amazon S3
      URL: https://github.com/stephenlienharrell/WeatherPipe
      AuthorName: Stephen Lien Harrell
      AuthorURL: https://github.com/stephenlienharrell
  Publications:
    - Title: Seasonal abundance and survival of North America’s migratory avifauna determined by weather radar
      URL: https://www.nature.com/articles/s41559-018-0666-4
      AuthorName: Adriaan M. Dokter, Andrew Farnsworth, Daniel Fink, Viviana Ruiz-Gutierrez, Wesley M. Hochachka, Frank A. La Sorte, Orin J. Robinson, Kenneth V. Rosenberg & Steve Kelling
    - Title: Unlocking the Potential of NEXRAD Data through NOAA’s Big Data Partnership
      URL: https://journals.ametsoc.org/doi/full/10.1175/BAMS-D-16-0021.1
      AuthorName: Steve Ansari and Stephen Del Greco

How can I contribute?

You are welcome to contribute dataset entries or usage examples to the Registry of Open Data on AWS. Please review our contribution guidelines.

New to Github and contributing via pull requests?

In addition to Github's getting started documentation, this video tutorial shows the complete end-to-end process of contributing to this repository.

More Repositories

1

git-secrets

Prevents you from committing secrets and credentials into git repositories
Shell
11,616
star
2

llrt

LLRT (Low Latency Runtime) is an experimental, lightweight JavaScript runtime designed to address the growing demand for fast and efficient Serverless applications.
JavaScript
7,555
star
3

aws-shell

An integrated shell for working with the AWS CLI.
Python
7,116
star
4

autogluon

AutoGluon: AutoML for Image, Text, and Tabular Data
Python
4,348
star
5

aws-cloudformation-templates

A collection of useful CloudFormation templates
Python
4,302
star
6

mountpoint-s3

A simple, high-throughput file client for mounting an Amazon S3 bucket as a local file system.
Rust
3,986
star
7

gluonts

Probabilistic time series modeling in Python
Python
3,686
star
8

deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
Scala
2,871
star
9

aws-lambda-rust-runtime

A Rust runtime for AWS Lambda
Rust
2,829
star
10

aws-sdk-rust

AWS SDK for the Rust Programming Language
2,754
star
11

amazon-redshift-utils

Amazon Redshift Utils contains utilities, scripts and view which are useful in a Redshift environment
Python
2,643
star
12

diagram-maker

A library to display an interactive editor for any graph-like data.
TypeScript
2,359
star
13

amazon-ecr-credential-helper

Automatically gets credentials for Amazon ECR on docker push/docker pull
Go
2,261
star
14

amazon-eks-ami

Packer configuration for building a custom EKS AMI
Shell
2,164
star
15

aws-lambda-powertools-python

A developer toolkit to implement Serverless best practices and increase developer velocity.
Python
2,148
star
16

aws-well-architected-labs

Hands on labs and code to help you learn, measure, and build using architectural best practices.
Python
1,834
star
17

aws-config-rules

[Node, Python, Java] Repository of sample Custom Rules for AWS Config.
Python
1,473
star
18

smithy

Smithy is a protocol-agnostic interface definition language and set of tools for generating clients, servers, and documentation for any programming language.
Java
1,356
star
19

aws-support-tools

Tools and sample code provided by AWS Premium Support.
Python
1,290
star
20

sockeye

Sequence-to-sequence framework with a focus on Neural Machine Translation based on PyTorch
Python
1,181
star
21

aws-lambda-powertools-typescript

Powertools is a developer toolkit to implement Serverless best practices and increase developer velocity.
TypeScript
1,179
star
22

dgl-ke

High performance, easy-to-use, and scalable package for learning large-scale knowledge graph embeddings.
Python
1,144
star
23

aws-sdk-ios-samples

This repository has samples that demonstrate various aspects of the AWS SDK for iOS, you can get the SDK source on Github https://github.com/aws-amplify/aws-sdk-ios/
Swift
1,038
star
24

aws-sdk-android-samples

This repository has samples that demonstrate various aspects of the AWS SDK for Android, you can get the SDK source on Github https://github.com/aws-amplify/aws-sdk-android/
Java
1,018
star
25

aws-solutions-constructs

The AWS Solutions Constructs Library is an open-source extension of the AWS Cloud Development Kit (AWS CDK) that provides multi-service, well-architected patterns for quickly defining solutions
TypeScript
1,013
star
26

aws-cfn-template-flip

Tool for converting AWS CloudFormation templates between JSON and YAML formats.
Python
981
star
27

amazon-kinesis-video-streams-webrtc-sdk-c

Amazon Kinesis Video Streams Webrtc SDK is for developers to install and customize realtime communication between devices and enable secure streaming of video, audio to Kinesis Video Streams.
C
975
star
28

aws-lambda-go-api-proxy

lambda-go-api-proxy makes it easy to port APIs written with Go frameworks such as Gin (https://gin-gonic.github.io/gin/ ) to AWS Lambda and Amazon API Gateway.
Go
967
star
29

eks-node-viewer

EKS Node Viewer
Go
947
star
30

multi-model-server

Multi Model Server is a tool for serving neural net models for inference
Java
936
star
31

ec2-spot-labs

Collection of tools and code examples to demonstrate best practices in using Amazon EC2 Spot Instances.
Jupyter Notebook
905
star
32

aws-mobile-appsync-sdk-js

JavaScript library files for Offline, Sync, Sigv4. includes support for React Native
TypeScript
902
star
33

aws-saas-boost

AWS SaaS Boost is a ready-to-use toolset that removes the complexity of successfully running SaaS workloads in the AWS cloud.
Java
901
star
34

fargatecli

CLI for AWS Fargate
Go
891
star
35

aws-api-gateway-developer-portal

A Serverless Developer Portal for easily publishing and cataloging APIs
JavaScript
879
star
36

ecs-refarch-continuous-deployment

ECS Reference Architecture for creating a flexible and scalable deployment pipeline to Amazon ECS using AWS CodePipeline
Shell
842
star
37

fortuna

A Library for Uncertainty Quantification.
Python
836
star
38

dynamodb-data-mapper-js

A schema-based data mapper for Amazon DynamoDB.
TypeScript
818
star
39

goformation

GoFormation is a Go library for working with CloudFormation templates.
Go
812
star
40

flowgger

A fast data collector in Rust
Rust
796
star
41

aws-js-s3-explorer

AWS JavaScript S3 Explorer is a JavaScript application that uses AWS's JavaScript SDK and S3 APIs to make the contents of an S3 bucket easy to browse via a web browser.
HTML
771
star
42

aws-icons-for-plantuml

PlantUML sprites, macros, and other includes for Amazon Web Services services and resources
Python
737
star
43

aws-devops-essential

In few hours, quickly learn how to effectively leverage various AWS services to improve developer productivity and reduce the overall time to market for new product capabilities.
Shell
674
star
44

aws-apigateway-lambda-authorizer-blueprints

Blueprints and examples for Lambda-based custom Authorizers for use in API Gateway.
C#
660
star
45

amazon-ecs-nodejs-microservices

Reference architecture that shows how to take a Node.js application, containerize it, and deploy it as microservices on Amazon Elastic Container Service.
Shell
650
star
46

amazon-kinesis-client

Client library for Amazon Kinesis
Java
621
star
47

aws-deployment-framework

The AWS Deployment Framework (ADF) is an extensive and flexible framework to manage and deploy resources across multiple AWS accounts and regions based on AWS Organizations.
Python
617
star
48

aws-lambda-web-adapter

Run web applications on AWS Lambda
Rust
610
star
49

dgl-lifesci

Python package for graph neural networks in chemistry and biology
Python
594
star
50

aws-security-automation

Collection of scripts and resources for DevSecOps and Automated Incident Response Security
Python
585
star
51

aws-glue-libs

AWS Glue Libraries are additions and enhancements to Spark for ETL operations.
Python
565
star
52

python-deequ

Python API for Deequ
Python
535
star
53

aws-athena-query-federation

The Amazon Athena Query Federation SDK allows you to customize Amazon Athena with your own data sources and code.
Java
507
star
54

data-on-eks

DoEKS is a tool to build, deploy and scale Data & ML Platforms on Amazon EKS
HCL
504
star
55

shuttle

Shuttle is a library for testing concurrent Rust code
Rust
465
star
56

ami-builder-packer

An example of an AMI Builder using CI/CD with AWS CodePipeline, AWS CodeBuild, Hashicorp Packer and Ansible.
465
star
57

route53-dynamic-dns-with-lambda

A Dynamic DNS system built with API Gateway, Lambda & Route 53.
Python
461
star
58

aws-servicebroker

AWS Service Broker
Python
461
star
59

amazon-ecs-local-container-endpoints

A container that provides local versions of the ECS Task Metadata Endpoint and ECS Task IAM Roles Endpoint.
Go
456
star
60

datawig

Imputation of missing values in tables.
JavaScript
454
star
61

aws-jwt-verify

JS library for verifying JWTs signed by Amazon Cognito, and any OIDC-compatible IDP that signs JWTs with RS256, RS384, and RS512
TypeScript
452
star
62

amazon-dynamodb-lock-client

The AmazonDynamoDBLockClient is a general purpose distributed locking library built on top of DynamoDB. It supports both coarse-grained and fine-grained locking.
Java
447
star
63

ecs-refarch-service-discovery

An EC2 Container Service Reference Architecture for providing Service Discovery to containers using CloudWatch Events, Lambda and Route 53 private hosted zones.
Go
444
star
64

ssosync

Populate AWS SSO directly with your G Suite users and groups using either a CLI or AWS Lambda
Go
443
star
65

handwritten-text-recognition-for-apache-mxnet

This repository lets you train neural networks models for performing end-to-end full-page handwriting recognition using the Apache MXNet deep learning frameworks on the IAM Dataset.
Jupyter Notebook
442
star
66

awscli-aliases

Repository for AWS CLI aliases.
437
star
67

aws-config-rdk

The AWS Config Rules Development Kit helps developers set up, author and test custom Config rules. It contains scripts to enable AWS Config, create a Config rule and test it with sample ConfigurationItems.
Python
436
star
68

snapchange

Lightweight fuzzing of a memory snapshot using KVM
Rust
427
star
69

aws-security-assessment-solution

An AWS tool to help you create a point in time assessment of your AWS account using Prowler and Scout as well as optional AWS developed ransomware checks.
423
star
70

lambda-refarch-mapreduce

This repo presents a reference architecture for running serverless MapReduce jobs. This has been implemented using AWS Lambda and Amazon S3.
JavaScript
422
star
71

aws-lambda-cpp

C++ implementation of the AWS Lambda runtime
C++
409
star
72

aws-cloudsaga

AWS CloudSaga - Simulate security events in AWS
Python
389
star
73

amazon-kinesis-producer

Amazon Kinesis Producer Library
C++
385
star
74

soci-snapshotter

Go
383
star
75

pgbouncer-fast-switchover

Adds query routing and rewriting extensions to pgbouncer
C
381
star
76

serverless-photo-recognition

A collection of 3 lambda functions that are invoked by Amazon S3 or Amazon API Gateway to analyze uploaded images with Amazon Rekognition and save picture labels to ElasticSearch (written in Kotlin)
Kotlin
378
star
77

amazon-sagemaker-workshop

Amazon SageMaker workshops: Introduction, TensorFlow in SageMaker, and more
Jupyter Notebook
378
star
78

serverless-rules

Compilation of rules to validate infrastructure-as-code templates against recommended practices for serverless applications.
Go
378
star
79

logstash-output-amazon_es

Logstash output plugin to sign and export logstash events to Amazon Elasticsearch Service
Ruby
374
star
80

kinesis-aggregation

AWS libraries/modules for working with Kinesis aggregated record data
Java
370
star
81

smithy-rs

Code generation for the AWS SDK for Rust, as well as server and generic smithy client generation.
Rust
369
star
82

syne-tune

Large scale and asynchronous Hyperparameter and Architecture Optimization at your fingertips.
Python
363
star
83

aws-sdk-kotlin

Multiplatform AWS SDK for Kotlin
Kotlin
359
star
84

dynamodb-transactions

Java
354
star
85

amazon-kinesis-client-python

Amazon Kinesis Client Library for Python
Python
354
star
86

aws-serverless-data-lake-framework

Enterprise-grade, production-hardened, serverless data lake on AWS
Python
349
star
87

threat-composer

A simple threat modeling tool to help humans to reduce time-to-value when threat modeling
TypeScript
346
star
88

amazon-kinesis-agent

Continuously monitors a set of log files and sends new data to the Amazon Kinesis Stream and Amazon Kinesis Firehose in near-real-time.
Java
342
star
89

rds-snapshot-tool

The Snapshot Tool for Amazon RDS automates the task of creating manual snapshots, copying them into a different account and a different region, and deleting them after a specified number of days
Python
337
star
90

amazon-kinesis-scaling-utils

The Kinesis Scaling Utility is designed to give you the ability to scale Amazon Kinesis Streams in the same way that you scale EC2 Auto Scaling groups – up or down by a count or as a percentage of the total fleet. You can also simply scale to an exact number of Shards. There is no requirement for you to manage the allocation of the keyspace to Shards when using this API, as it is done automatically.
Java
333
star
91

amazon-kinesis-video-streams-producer-sdk-cpp

Amazon Kinesis Video Streams Producer SDK for C++ is for developers to install and customize for their connected camera and other devices to securely stream video, audio, and time-encoded data to Kinesis Video Streams.
C++
332
star
92

landing-zone-accelerator-on-aws

Deploy a multi-account cloud foundation to support highly-regulated workloads and complex compliance requirements.
TypeScript
330
star
93

route53-infima

Library for managing service-level fault isolation using Amazon Route 53.
Java
326
star
94

aws-automated-incident-response-and-forensics

326
star
95

mxboard

Logging MXNet data for visualization in TensorBoard.
Python
326
star
96

aws-sigv4-proxy

This project signs and proxies HTTP requests with Sigv4
Go
325
star
97

statelint

A Ruby gem that provides a command-line validator for Amazon States Language JSON files.
Ruby
324
star
98

graphstorm

Enterprise graph machine learning framework for billion-scale graphs for ML scientists and data scientists.
Python
317
star
99

ecs-nginx-reverse-proxy

Reference architecture for deploying Nginx on ECS, both as a basic static resource server, and as a reverse proxy in front of a dynamic application server.
Nginx
317
star
100

simplebeerservice

Simple Beer Service (SBS) is a cloud-connected kegerator that streams live sensor data to AWS.
JavaScript
316
star