Open Data Resources For Data Science Research: Part-1
One of the hardest problems to solve in machine learning, deep learning and data science has nothing to do with network architecture (i.e. neural nets); it is the problem of getting the right data in the right format. In that sense, data is the fuel of this data science era.
Datasets are an integral part of the field of machine learning, deep learning and data science. Major advances in these fields can result from advances in learning
algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets.
High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and
expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled,
high-quality datasets for unsupervised learning can also be difficult and costly to produce.
Deep learning, and machine learning more generally, needs a good training dataset to work properly. Collecting and constructing the training dataset – a sizable body of known data – takes time and domain-specific knowledge of where and how to gather relevant information. The training dataset acts as the benchmark against which machine learning and deep-learning algorithms are trained. That is what they learn to reconstruct before they’re allowed to run free on data they haven’t seen before.
At this stage, people with domain knowledge need to find the right raw data and transform it into a numerical representation that the deep-learning or machine learning algorithm can understand. It is usually given to the learner as a vector or a matrix.
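As a rough illustration of that transformation, here is a minimal sketch in plain Python. The records, field names (`city`, `bedrooms`, `price`) and the one-hot encoding choice are all assumptions made up for this example; real projects would normally rely on a library for this step.

```python
# Hypothetical raw records; the field names and values are invented for illustration.
raw_records = [
    {"city": "Chicago", "bedrooms": 3, "price": 250000.0},
    {"city": "New York", "bedrooms": 1, "price": 480000.0},
    {"city": "Chicago", "bedrooms": 2, "price": 199000.0},
]

def build_vocabulary(records, field):
    """Map each distinct categorical value to a column index, in sorted order."""
    values = sorted({record[field] for record in records})
    return {value: index for index, value in enumerate(values)}

def vectorize(record, city_index):
    """Turn one record into a flat list of floats: one-hot city, then bedrooms."""
    one_hot = [0.0] * len(city_index)
    one_hot[city_index[record["city"]]] = 1.0
    return one_hot + [float(record["bedrooms"])]

city_index = build_vocabulary(raw_records, "city")
# One row per record: together the rows form the feature matrix.
feature_matrix = [vectorize(record, city_index) for record in raw_records]
# The value the learner should predict, kept separate from the features.
targets = [record["price"] for record in raw_records]
```

Each row of `feature_matrix` is the vector form of one record, and the list of rows is the matrix form handed to the learner.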
Training datasets that require much time or expertise can serve as a proprietary edge in the world of data science, machine learning and deep learning. The nature of the expertise is largely in telling your algorithm what matters to you by selecting what goes into the training dataset. It involves telling a story about the data distribution – through the initial data you select – that will guide your deep-learning nets and machine learning algorithms as they generalize the significant features, both in the training set and in the raw data they’ve been created to study. To create a useful training set, you have to understand the problem you’re solving; i.e. what you want your machine learning algorithms and deep-learning nets to pay attention to.
Machine learning and deep learning algorithms typically work with three data sets, known as the training set, the dev set and the test set.
All three should be randomly selected from a larger body of data, i.e. the population.
The first set you use is the training set, which is the largest of the three sets. Running a training
dataset through a machine learning or deep learning algorithm teaches the learner how to weigh different features,
assigning them coefficients according to their likelihood of minimizing errors in your results. Those coefficients,
also known as weights or parameters, will be contained in vectors or matrices.
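To make the idea of coefficients chosen to minimize error concrete, here is a small sketch of a linear model fitted by batch gradient descent on mean squared error. The function name `fit_linear`, the toy data and the learning-rate and epoch values are assumptions picked for this illustration, not a recommended recipe.

```python
def fit_linear(features, targets, learning_rate=0.05, epochs=2000):
    """Fit targets ~ weights . x + bias by batch gradient descent on mean squared error."""
    n_samples = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, targets):
            prediction = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = prediction - y
            # Accumulate the gradient of the mean squared error.
            for j in range(n_features):
                grad_w[j] += 2.0 * error * x[j] / n_samples
            grad_b += 2.0 * error / n_samples
        # Move each coefficient a small step against its gradient.
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias

# Toy data generated exactly from y = 2*x1 - 1*x2 + 0.5 (an invented example).
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
targets = [0.5, 2.5, -0.5, 1.5, 3.5]
weights, bias = fit_linear(features, targets)
```

On this toy data the learned `weights` and `bias` should approach the values used to generate the targets; the `weights` list is the coefficient vector the text refers to.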
The second set is your dev set, which is much smaller than the training set. It’s used for tuning
hyper-parameters and checking how well the machine learning model generalizes.
The third set is your test set. It functions as the final check on the model, and you don’t use it while training
the model. After you’ve trained and optimized your model, you test your learning algorithm against this test set.
The results it produces should confirm that your algorithm accurately recognizes the given test instances.
If you don’t get accurate predictions, go back to the training set and look at the hyper-parameters you used to tune
the learner, the quality of your data, and your pre-processing techniques.
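The random three-way split described above can be sketched as follows. The helper name `split_dataset`, the 80/10/10 proportions and the fixed seed are assumptions for this example; appropriate proportions depend on how much data you have.

```python
import random

def split_dataset(examples, dev_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle a copy of the examples and cut it into train, dev and test lists."""
    if not 0.0 <= dev_fraction + test_fraction < 1.0:
        raise ValueError("dev_fraction + test_fraction must be in [0, 1)")
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    n_test = int(len(shuffled) * test_fraction)
    dev_set = shuffled[:n_dev]
    test_set = shuffled[n_dev:n_dev + n_test]
    train_set = shuffled[n_dev + n_test:]  # everything left over, the largest part
    return train_set, dev_set, test_set

# Stand-in population: 1000 example ids instead of real records.
examples = list(range(1000))
train_set, dev_set, test_set = split_dataset(examples)
```

Because every example lands in exactly one of the three lists, the test set stays unseen during training and tuning.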
So, this blog post introduces various datasets across different domains to encourage data science, machine learning and deep learning research. Here are a few cool public open data sources you can use for your projects related to Data Science, Machine Learning and Deep Learning research:
- Kaggle becomes the place for Open Data – I think this is big news! Kaggle just announced Kaggle Datasets, which aims to be a repository for publicly available datasets. This is great for organizations that want to release data, but do not necessarily want the overhead of running an open data portal. Hopefully it will gain some traction and become an exceptional resource for open data.
- NASA Opens Research – NASA just announced all research papers funded by NASA will be publicly available. It appears the research articles will all be available at PubMed Central, and the data available at NASA’s Data Portal.
- Google Robotics Data – Google continues to do interesting things, and this topic is definitely that. It is a dataset about how robots grasp objects (Google Brain Robot Data). I am not overly familiar with this topic, so if you want to know more, see their blog post, Deep Learning for Robots.
- Economic Data:
- Publicly Traded Market Data: Quandl is an amazing source of finance data. Google Finance and Yahoo Finance are additional good sources of data. Corporate filings with the SEC are available on Edgar.
- Housing Price Data: You can use the Trulia API or the Zillow API.
- Lending data: You can find student loan defaults by university and the complete collection of peer-to-peer loans from Lending Club and Prosper, the two largest platforms in the space.
- Home mortgage data: There is data made available by the Home Mortgage Disclosure Act and there’s a lot of data from the Federal Housing Finance Agency available here.
- Content Data:
- Review Content: You can get reviews of restaurants and physical venues from Foursquare and Yelp (see geodata). Amazon has a large repository of Product Reviews. Beer reviews from Beer Advocate can be found here. Rotten Tomatoes Movie Reviews are available from Kaggle.
- Web Content: Looking for web content? Wikipedia provides dumps of their articles. Common Crawl has a large corpus of the internet available. ArXiv maintains all their data available via Bulk Download from AWS S3. Want to know which URLs are malicious? There’s a dataset for that. Music data is available from the Million Songs Database. You can analyze the Q&A patterns on sites like Stack Exchange (including Stack Overflow).
- Media Data: There are open annotated articles from the New York Times, the Reuters Dataset, and the GDELT project (a consolidation of many different news sources). Google Books has published NGrams for books going back past 1800.
- Communications Data: There’s access to public messages of the Apache Software Foundation and communications amongst former execs at Enron.
- Government Data:
- Municipal Data: Crime Data is available for City of Chicago, and Washington DC. Restaurant Inspection Data is available for Chicago and New York City.
- Transportation Data: NYC Taxi Trips in 2013 are available courtesy of the Freedom of Information Act. There’s bikesharing data from NYC, Washington DC, and SF. There’s also Flight Delay Data from the FAA.
- Census Data: Japanese Census Data. US Census data from 2010, 2000, 1990. From census data, the government has also derived time use data. EU Census Data. Check out popular male / female baby names going back to the 19th Century from the Social Security Administration.
- World Bank: they have a lot of data available on their website.
- Election Data: Political contribution data for the last few US elections can be downloaded from the FEC here and here. Polling data is available from Real Clear Politics.
- Data With a Cause:
- Environmental Data: Data on household energy usage is available as well as NASA Climate Data.
- Medical and biological Data: You can get anything from anonymous medical records, to remote sensor readings for individuals, to data on the Genomes of 1000 individuals.
- Miscellaneous:
- Geo Data: Try looking at these Yelp Datasets for venues near major universities and one for major cities in the Southwest. The Foursquare API is another good source. Open Street Map has open data on venues as well.
- Twitter Data: you can get access to Twitter Data used for sentiment analysis, network Twitter Data, social Twitter data, on top of their API.
- Games Data: Datasets for games, including a large dataset of Poker hands, a dataset of online Dominion Games, and datasets of Chess Games are available.
- Web Usage Data: Web usage data is a common dataset that companies look at to understand engagement. Available datasets include Anonymous usage data for MSNBC, Amazon purchase history (also anonymized), and Wikipedia traffic.
- Awesome Public Datasets:
- Awesome Public Datasets: An awesome list of high-quality open datasets in public domains.
- Metasources: these are great sources for other web pages.
- Stanford Network Data: http://snap.stanford.edu/index.html
- Every year, the ACM holds a competition for machine learning called the KDD Cup. Their data is available online.
- UCI maintains archives of data for machine learning.
- US Census Data
- Amazon is hosting Public Datasets on s3
- Kaggle hosts machine-learning challenges and many of their datasets are publicly available
- The cities of Chicago, New York, Washington DC, and SF maintain public data warehouses.
- Yahoo maintains a lot of data on its web properties which can be obtained by writing to them.
- BigML is a blog that maintains a list of public datasets for the machine learning community.
- Finally, if there’s a website with data you are interested in, crawl for it!
- Great statistical analysis: forecasting meteorite hits
- 53.5 Billion Clicks Dataset
- 53.5 billion clicks dataset: 53.5 billion clicks dataset available for benchmarking and testing
- 3.5 Billion Web Pages- Big Dataset:
- 3.5 Billion Web Pages- Big Dataset: made available for all of us
- 125 Years of Public Health Dataset:
- 125 Years of Public Health Dataset: 125 Years of Public Health Data Available for Download to help fight contagious Diseases
- Over 5,000,000 financial, economic and social datasets:
- Two big datasets to challenge your data science expertise:
- Data Sources for Cool Data Science Projects
- YouTube-8M Dataset:
- YouTube-8M Dataset: YouTube-8M is a large-scale labeled video dataset that consists of 8 million YouTube video IDs and associated labels from a diverse vocabulary of 4800 visual entities.
- Pakistan Suicide Bombing Attacks Dataset:
- Pakistan Suicide Bombing Attacks (1995-2016): The dataset contains detailed information of 475 suicide bombing attacks in Pakistan that killed an estimated 6,982 and injured 17,624 people.
- KDnuggets Dataset:
- List of Public Data Sources Fit for Machine Learning:
- 19 Free Public Data Sets For Your First Data Science Project:
- 19 Free Public Data Sets: These datasets cover a variety of sources, including demographic data, economic data, text data, and corporate data.
- List of datasets for machine learning research:
- Datasets for ML research: These datasets are used for machine learning research and have been cited in peer-reviewed academic journals and other publications.
- Open Images dataset:
- Open Images dataset: Open Images is a dataset of ~9 million URLs to images that have been annotated with labels spanning over 6000 categories.
- Google releases massive visual databases for machine learning:
- Visual databases for machine learning: Joining other high-quality datasets, Open Images and YouTube-8M provide millions of annotated links for researchers to train their processes on.
- Fun Datasets:
- Fun Datasets: Some fun datasets collected by Mr. Vincent are available for practicing analysis skills.
- TriviaQA: A Large Scale Dataset for Reading Comprehension and Question Answering
- TriviaQA: TriviaQA is a reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions.
- Financial data for studies and research
- LIBSVM Datasets: Classification, Regression, and Multi-label
- LIBSVM Datasets: Classification, Regression, and Multi-label: This page contains many classification, regression, multi-label and string data sets stored in LIBSVM format. Many are from UCI, Statlog, StatLib and other collections.
- Facebook Shares Large Data Sets to Help Improve its AI and Data Science Algorithms
- Facebook Datasets: Facebook published resources related to its AI Research project, organized towards the goal of automatic text understanding and reasoning.
- Public Bioinformatics Datasets
- Public Bioinformatics Datasets: Publicly available datasets for bioinformatics and integrative -omics research.
- Bioinformatics Datasets
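Some of the collections listed above, such as the LIBSVM datasets, are stored in the sparse LIBSVM text format, where each line holds a label followed by `index:value` pairs. The sketch below shows one way to read such lines in plain Python. The function name `parse_libsvm_line` and the sample lines are invented for illustration, and it assumes a single numeric label per line, so multi-label files with comma-separated labels would need extra handling.

```python
def parse_libsvm_line(line):
    """Parse one 'label index:value ...' line into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])  # assumes one numeric label, not a comma-separated list
    features = {}
    for item in parts[1:]:
        index_text, value_text = item.split(":", 1)
        # Indices not present on the line are implicitly zero (sparse format).
        features[int(index_text)] = float(value_text)
    return label, features

# Made-up sample lines in the 'label index:value' layout.
sample_lines = [
    "+1 1:0.5 3:1.2",
    "-1 2:4.0 3:0.1",
]
parsed = [parse_libsvm_line(line) for line in sample_lines]
```

Keeping the features as a dictionary preserves the sparsity of the file; converting to a dense vector only requires knowing the largest feature index.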
Some initial links are directly taken from the blog post on The Data Science 101, which can be found here.
Note: If you know of any other dataset useful for data science, machine learning and deep learning research that is not listed here, please feel free to comment below with a downloadable link. I would like to add it to the list above.