Almost a decade ago, my colleague Deepak Singh introduced the AWS Public Datasets in his post Paging Researchers, Analysts, and Developers. I’m happy to report that Deepak is still an important part of the AWS team and that the Public Datasets program is still going strong!
Today we are announcing a new take on open and public data, the Registry of Open Data on AWS, or RODA. This registry includes existing Public Datasets and allows anyone to add their own datasets so that they can be accessed and analyzed on AWS.
Inside the Registry
The home page lists all of the datasets in the registry.
Each dataset has an associated detail page, including usage examples, license info, and the information needed to locate and access the dataset on AWS.
In this case, I can access the data with a simple CLI command.
I could also access it programmatically, or download data to my EC2 instance.
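As a rough illustration of programmatic access, here is a minimal Python sketch that lists a few object keys from a publicly readable S3 bucket over plain HTTPS, without credentials. It assumes the bucket allows anonymous listing, and the bucket name in the usage comment is a placeholder rather than a real registry entry; the helper names (`build_list_url`, `parse_keys`, `list_public_keys`) are my own and not part of any AWS SDK.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 list responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def build_list_url(bucket, prefix="", max_keys=10):
    """Build an unsigned ListObjectsV2 URL for a publicly readable bucket."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_keys(xml_text):
    """Extract the object keys from a ListObjectsV2 XML response."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(f"{S3_NS}Key")]


def list_public_keys(bucket, prefix="", max_keys=10):
    """Fetch and return up to max_keys object keys (needs network access)."""
    url = build_list_url(bucket, prefix, max_keys)
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_keys(response.read())


# Usage (hypothetical bucket name; substitute the one shown on the
# dataset's detail page):
#   for key in list_public_keys("example-open-data-bucket", prefix="2018/"):
#       print(key)
```

For anything beyond a quick look, an AWS SDK is likely the better choice; this sketch only shows that a public dataset can be reached with an ordinary HTTP request.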
Adding to the Repository
If you have a dataset that is publicly available and would like to add it to RODA, you can simply send us a pull request. Head over to the open-data-registry repo, read the CONTRIBUTING document, and create a YAML file that describes your dataset, using one of the existing files in the datasets directory as a model.
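Before writing the YAML file, it can help to draft the description as a plain data structure and check that nothing is missing. The Python sketch below does that. The field names and every value in `draft` are illustrative assumptions, not the registry's authoritative schema; the CONTRIBUTING document and the existing files in the datasets directory define what is actually required.

```python
# Field names assumed for illustration only; confirm them against the
# CONTRIBUTING document before opening a pull request.
REQUIRED_FIELDS = (
    "Name",
    "Description",
    "Documentation",
    "Contact",
    "UpdateFrequency",
    "License",
    "Resources",
)


def missing_fields(entry):
    """Return the assumed-required fields that are absent or empty."""
    return [field for field in REQUIRED_FIELDS if not entry.get(field)]


# A hypothetical draft entry; every value here is a placeholder.
draft = {
    "Name": "Example Weather Observations",
    "Description": "Hourly observations from a fictional sensor network.",
    "Documentation": "https://example.com/docs",
    "Contact": "data@example.com",
    "UpdateFrequency": "Daily",
    "License": "Placeholder license text",
    "Resources": [
        {
            "Description": "Observation files",
            "ARN": "arn:aws:s3:::example-open-data-bucket",
            "Region": "us-east-1",
            "Type": "S3 Bucket",
        }
    ],
}

problems = missing_fields(draft)
print("Missing fields:", problems if problems else "none")
```

Once the draft passes, the same keys and values can be written out by hand as YAML, following the layout of an existing file in the repo.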
We’ll review pull requests regularly; you can “star” or watch the repo in order to track additions and changes.
I am looking forward to an inrush of new datasets, along with some blog posts and apps that show how to use the data in powerful and interesting ways. Let me know what you come up with.