Open Data Resources For Data Science Research: Part-2

This blog post continues the previous post on open datasets. Here are more cool public open data sources you can use for projects related to data science, machine learning, and deep learning research:

ImageNet Dataset
- ImageNet Dataset: ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images.
Open Data for Deep Learning
- Open Data for Deep Learning: Here you’ll find an organized list of interesting, high-quality datasets for machine learning and deep learning research.
List of datasets for machine learning research
- List of datasets for machine learning research: These datasets are used for machine learning research and have been cited in peer-reviewed academic journals and other publications.
NEXET Dataset
- NEXET Dataset: The Nexar dataset is a massive set consisting of 50,000 images from all over the world with bounding box annotations of the rear of vehicles collected from a variety of locations, lighting, and weather conditions. We are releasing this dataset to you, our challengers, to empower you to build a truly smart collision prevention system that can work extremely well anywhere and at any time.
Kinect Gesture Data Set
- Kinect Gesture Data Set: The Microsoft Research Cambridge-12 Kinect gesture data set consists of sequences of human movements, represented as body-part locations, and the associated gesture to be recognized by the system.
MSRA-CFW: Data Set of Celebrity Faces on the Web
- Data Set of Celebrity Faces on the Web: The dataset includes image URLs for 202792 faces. The labels of the faces are automatically generated by the algorithm, with high accuracy. To facilitate downloading the images, we provide a number of URLs for the near-duplicates of each face. Besides, the thumbnail images and facial features (LBP) are also provided for visualization and benchmarking purposes.
Image Cropping Dataset
- Image Cropping Dataset: The Image Cropping Dataset contains the cropping parameters for 1000 images that were manually cropped by an experienced photographer.
Visual Question Generation dataset
- Visual Question Generation dataset: We introduce this dataset in order to support the novel task of Visual Question Generation (VQG), where, given an image, the system should ‘ask a natural and engaging question’. This dataset can be used to support research on common sense reasoning and computer-human conversational systems.
Smart Selection Dataset
- Smart Selection Dataset: Smart selection is the task of predicting the span of text that a user intended to select after they touched on a single word on a touch-enabled device.
MSR Demosaicing Dataset
- MSR Demosaicing Dataset: The Microsoft Research Cambridge demosaicing data set consists of a set of raw images and their downscaled versions, which can be used for learning and evaluating demosaicing (and possibly other tasks like denoising), both in linear-space and color-space.
Abstract Scene Dataset
- Abstract Scene Dataset: This dataset contains clip art related to the academic paper Bringing Semantics Into Focus Using Visual Abstraction.
FingerPaint Dataset
- FingerPaint Dataset: The FingerPaint Dataset contains video-sequences of several individuals performing hand gestures, as captured by a depth camera.
Microsoft Document Aboutness Dataset
- Microsoft Document Aboutness Dataset: The Microsoft Document Aboutness Dataset consists of randomly sampled URLs (from a HEAD and TAIL distribution), all entities recognized in those documents, and a relevance assessment for each entity/URL pair as to whether or not the entity is salient to the content of the URL.
MSR Abstractive Text Compression Dataset
- MSR Abstractive Text Compression Dataset: This dataset contains sentences and short paragraphs with corresponding shorter (compressed) versions. There are up to five compressions for each input text, together with quality judgements of their meaning preservation and grammaticality.
WebQuestions Semantic Parses Dataset
- WebQuestions Semantic Parses Dataset: The WebQuestionsSP dataset is released as part of our ACL-2016 paper “The Value of Semantic Parse Labeling for Knowledge Base Question Answering” [Yih, Richardson, Meek, Chang & Suh, 2016], in which we evaluated the value of gathering semantic parses, vs. answers, for a set of questions that originally comes from WebQuestions [Berant et al., 2013].
Election 2012 Tweet ID dataset
- Election 2012 Tweet ID dataset: This data set identifies 38M tweets collected for the analysis of social media messages related to the 2012 U.S.
MSR 3D Video Dataset
- MSR 3D Video Dataset: This data includes a sequence of 100 images captured from 8 cameras showing the breakdancing and ballet scenes from the paper “High-quality video view interpolation using a layered representation”, Zitnick et al., SIGGRAPH 2004.
MSR GPS Privacy Dataset 2009
- MSR GPS Privacy Dataset 2009: The table below contains pointers to text files with GPS data taken in the region of Seattle, Washington USA. Each file contains data from one of 21 volunteers who carried a GPS logger with them for approximately eight weeks in the fall of 2009. This set of 21 volunteers is a subset of 37 people who participated in the survey.
FB15K-237 Knowledge Base Completion Dataset
- FB15K-237 Knowledge Base Completion Dataset: This dataset contains knowledge base relation triples and textual mentions of Freebase entity pairs, as used in the work published in (Toutanova and Chen CVSM-2015) and (Toutanova et al).
Avatar Dataset
- Avatar Dataset: MSR introduces a new corpus of descriptions of Xbox avatars created by actual gamers.
Diverse Algebra Word Problem Dataset with Derivation Annotations
- Diverse Algebra Word Problem Dataset with Derivation Annotations: This dataset provides training and testing examples for solving algebra word problems automatically.
NCI-PID-PubMed Genomics Knowledge Base Completion Dataset
- NCI-PID-PubMed Genomics Knowledge Base Completion Dataset: This dataset includes a database of regulation relationships among genes and corresponding textual mentions of pairs of genes in PubMed article abstracts.
Longitudinal Tweet ID dataset for a selection of Health, Social, and Business Experiences
- Longitudinal Tweet ID dataset for a selection of Health, Social, and Business Experiences: This data set consists of the tweet IDs collected for the propensity-score analysis of longitudinal social media messages posted by people who mention specific health, social and business domains. This data set accompanies the paper “Distilling the Outcomes of Personal Experiences: A Propensity-scored Analysis of Social Media.”
Dataset for Inferring Missing Entity Type Instances for Knowledge Base Completion
- Dataset for Inferring Missing Entity Type Instances for Knowledge Base Completion: This is a dataset that can be used for training and evaluating knowledge base completion approaches for inferring missing entity type instances.
Tweet Entity Linking Dataset: IE-driven and IR-driven sets
- Tweet Entity Linking Dataset: IE-driven and IR-driven sets: In this dataset, we release the labeled data for people to evaluate and compare entity linking systems on tweets.
Learning from Everyday Analog Pen Use to Improve Digital Ink Experiences Dataset
- Learning from Everyday Analog Pen Use to Improve Digital Ink Experiences Dataset: This is the data released with the CHI 2017 paper: As We May Ink? Learning from Everyday Analog Pen Use to Improve Digital Ink Experiences. It contains the 493 entries of a diary study with 26 participants on their use of analog pen and the 178 entries of a follow-up diary study with 30 participants on their use of digital pen.
Optical Data
- Optical Data: This dataset includes 14 months of optical data from Microsoft’s wide-area backbone network in North America.
Learning from Explicit and Implicit Supervision Jointly For Algebra Word Problems
- Learning from Explicit and Implicit Supervision Jointly For Algebra Word Problems: This is a public release of the dataset corresponding to the paper "Learning from Explicit and Implicit Supervision Jointly For Algebra Word Problems" that will appear in EMNLP 2016.
Microsoft Research Sequential Question Answering (SQA) Dataset
- Microsoft Research Sequential Question Answering (SQA) Dataset: The SQA dataset was created to explore the task of answering sequences of inter-related questions on HTML tables. It has 6,066 sequences with 17,553 questions in total.
Microsoft Cognitive Toolkit Dataset
- Microsoft Cognitive Toolkit Dataset: The Microsoft Cognitive Toolkit empowers you to harness the intelligence within massive datasets through deep learning by providing uncompromised scaling, speed and accuracy with commercial-grade quality and compatibility with the programming languages and algorithms you already use.
GeoLife GPS Trajectories Dataset
- GeoLife GPS Trajectories Dataset: This GPS trajectory dataset was collected in the (Microsoft Research Asia) GeoLife project by 182 users in a period of over five years (from April 2007 to August 2012).
Wikipedia Biography Dataset
- Wikipedia Biography Dataset: This dataset gathers 728,321 biographies from Wikipedia. It aims at evaluating text generation algorithms.
SpineWeb Dataset
- SpineWeb Dataset: This dataset was collected for research on spine imaging and image analysis.
Stack Exchange Dataset
- Stack Exchange Dataset: This is an anonymized dump of all user-contributed content on the Stack Exchange network. Each site is formatted as a separate archive consisting of XML files zipped via 7-zip using bzip2 compression. Each site archive includes Posts, Users, Votes, Comments, PostHistory and PostLinks.
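Since each site archive unpacks to plain XML files, a streaming parser is usually enough to get started. The snippet below is a minimal sketch, assuming the archive has already been extracted and that each record is a `<row>` element whose fields are stored as attributes; the attribute names and the sample content are illustrative assumptions, not taken from the dump's documentation.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source):
    """Yield one attribute dict per <row> element without loading the whole file."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)
            elem.clear()  # drop the parsed element's data to limit memory use


# Tiny made-up sample standing in for an extracted Posts.xml file (assumed layout).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Score="5" Title="How do I read a dump?" />
  <row Id="2" PostTypeId="2" Score="3" ParentId="1" />
</posts>"""

rows = list(iter_rows(io.BytesIO(SAMPLE)))
# Assumption: PostTypeId "1" marks a question.
questions = [r for r in rows if r.get("PostTypeId") == "1"]
print(len(rows), "rows,", len(questions), "question(s)")
```

For a real dump, pass the path of the extracted XML file to `iter_rows` instead of the in-memory sample.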
The Twitter Stream Grab Dataset
- The Twitter Stream Grab Dataset: A simple collection of JSON grabbed from the general twitter stream, for the purposes of research, history, testing and memory.
StarData
- StarData Dataset: We release the largest StarCraft: Brood War replay dataset yet, with 65646 games. The full dataset after compression is 365 GB, 1535 million frames, and 496 million player actions. The entire frame data was dumped out at 8 frames per second. We made a big effort to ensure this dataset is clean and has mostly high quality replays. You can access it with TorchCraft in C++, Python, and Lua. The replays are in an AWS S3 bucket at s3://stardata.
Phrasal Recognition Dataset
- Phrasal Recognition Dataset: This dataset contains 8 object categories from Pascal VOC that are suitable for studying the interactions between objects.
UIUC Pascal Sentence Dataset
- UIUC Pascal Sentence Dataset
Cross Category Object Recognition Dataset (CORE)
- Cross Category Object Recognition Dataset: The CORE dataset is intended to help learn more detailed models and for exploring cross-category generalization in object recognition.
Attribute Dataset (aPascal, aYahoo)
- Attribute Dataset (aPascal, aYahoo): There are three components to the dataset:
  Annotations: The attribute annotations for the aPascal train and test sets, and aYahoo test set.
  aYahoo images: Our images collected from Yahoo
  aPascal images: These are the images from the Pascal VOC 2008.
MIMIC Critical Care Dataset
- MIMIC Critical Care Dataset: MIMIC is an openly available dataset developed by the MIT Lab for Computational Physiology, comprising deidentified health data associated with ~40,000 critical care patients. It includes demographics, vital signs, laboratory tests, medications, and more.
Fashion-mnist Dataset
- Fashion-mnist Dataset: A dataset of Zalando's article images consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. Fashion-MNIST is intended to serve as a direct drop-in replacement of the original MNIST dataset for benchmarking machine learning algorithms.
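Because Fashion-MNIST is meant as a drop-in replacement for MNIST, a loader written for one should work for the other. The snippet below is a rough sketch, assuming the files use the gzip-compressed IDX layout of the original MNIST (big-endian header with magic number 2051 for images and 2049 for labels, followed by unsigned bytes). The function names are hypothetical, and the data is synthetic so the example runs without downloading anything.

```python
import gzip
import struct


def parse_idx_images(raw):
    """Decode IDX image bytes into a list of images (each a list of pixel rows)."""
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:
        raise ValueError(f"unexpected image magic number {magic}")
    size = rows * cols
    images = []
    for i in range(count):
        flat = raw[16 + i * size : 16 + (i + 1) * size]
        images.append([list(flat[r * cols : (r + 1) * cols]) for r in range(rows)])
    return images


def parse_idx_labels(raw):
    """Decode IDX label bytes into a list of integer class labels."""
    magic, count = struct.unpack(">II", raw[:8])
    if magic != 2049:
        raise ValueError(f"unexpected label magic number {magic}")
    return list(raw[8 : 8 + count])


# Synthetic stand-ins for the real files: two 28x28 images and two labels.
image_bytes = struct.pack(">IIII", 2051, 2, 28, 28) + bytes([0]) * 784 + bytes([255]) * 784
label_bytes = struct.pack(">II", 2049, 2) + bytes([9, 0])

# The real files would be read with gzip.open(path, "rb").read(); here we
# round-trip through gzip in memory to mimic that step.
images = parse_idx_images(gzip.decompress(gzip.compress(image_bytes)))
labels = parse_idx_labels(gzip.decompress(gzip.compress(label_bytes)))
print(len(images), "images of", len(images[0]), "x", len(images[0][0]), "labels:", labels)
```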
Rdatasets
- Rdatasets: Rdatasets is a collection of 1173 datasets that were originally distributed alongside the statistical software environment R and some of its add-on packages. The goal is to make these data more broadly accessible for teaching and statistical software development.
bAbI Project Datasets
- bAbI Project Datasets: The bAbI project of Facebook AI Research provides datasets organized towards the goal of automatic text understanding and reasoning.
Reddit Datasets
- Reddit Datasets: Reddit datasets repository.
Awesome Data Science repository Datasets
- Awesome Data Science repository Datasets
DiaRetDB1 Datasets
- DiaRetDB1 Datasets: The DiaRetDB1 is a public database for evaluating and benchmarking diabetic retinopathy detection algorithms. The database contains digital images of eye fundus and expert annotated ground truth for several well-known diabetic fundus lesions (hard exudates, soft exudates, microaneurysms and hemorrhages). The original images and the raw ground truth are both available.
SBM-RGBD dataset
- SBM-RGBD dataset: The SBM-RGBD dataset has been created in order to evaluate and compare scene background modelling methods for moving object detection on RGBD videos. It provides all facilities (data, ground truths, and evaluation scripts) for the SBM-RGBD Challenge.
Classification datasets
- Classification datasets: Discover the current state of the art in object classification datasets.
MIT Places datasets
- MIT Places datasets: The Places database contains 205 scene categories and 2.5 million images.
3D IKEA dataset
- 3D IKEA dataset: The dataset contains about 759 images and 219 3D models. All 759 images are annotated using available models (about 90 different models).
SUN Database
- SUN Database: Scene UNderstanding Database. A database for scene recognition (900 scene categories) and multiclass object detection (>15000 fully segmented images).
360-SUN Database
- 360-SUN Database: A database of 360-degree panoramas organized along the SUN categories.
Out of context objects dataset
- Out of context objects dataset: The database contains 218 fully annotated images with at least one object out-of-context. Can you detect the out of context object?
Tiny Images Dataset
- Tiny Images Dataset: The dataset consists of 79,302,017 images, each being a 32x32 color image.
Indoor Scene Recognition Database
- Indoor Scene Recognition Database: The database contains 67 indoor categories and a total of 15620 images. The number of images varies across categories, but there are at least 100 images per category. All images are in jpg format.
LabelMe Database
- LabelMe Database: Each database is composed of a few hundred images of scenes belonging to the same semantic category. All of the images are in color, in jpeg format, and are 256 x 256 pixels. The sources of the images vary (from commercial databases, websites, digital cameras).
8 Scene Categories Dataset
- 8 Scene Categories Dataset: This dataset contains 8 outdoor scene categories: coast, mountain, forest, open country, street, inside city, tall buildings and highways.
GazeFollow Dataset
- GazeFollow Dataset
Full-sized images and segmentations Dataset
- Full-sized images and segmentations Dataset: All images are fully annotated with objects, and many of the images have parts too.
MovieBook Dataset
- MovieBook Dataset: Ground-truth alignments for 11 movie/book pairs, with shot, subtitle and book data.
Medical Data for Machine Learning
- Medical Datasets: This is a curated list of medical data for machine learning.
Land Cover Datasets
- Land Cover Datasets: This is a curated list of land cover datasets.
EuroSAT: A Novel Dataset
- EuroSAT Dataset: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification.
Data Set Repositories
- Data Set Repositories: These 19 'sets of data sets' cover free or public data from various industries, including small and large, structured and unstructured data sets.
DSC Datasets
- DSC Datasets links
Sebastian Raschka Datasets list
- Sebastian Raschka Datasets list: A collection of links to various free and open-source datasets.
Dataset of 230,000 3D facial landmarks
- Facial landmarks Dataset: LS3D-W is a large-scale 3D face alignment dataset constructed by annotating the images from AFLW, 300VW, 300W and FDDB in a consistent manner with 68 points using the automatic method.
Deep learning Publicly available Datasets
- Deep learning Publicly available Datasets: A collection of datasets collected by LISA lab.
MILA PUBLIC DATASETS
- PUBLIC DATASETS: A collection of datasets collected by MILA lab.
10 Great Healthcare Data Sets
- 10 Great Healthcare Data Sets: Here are 10 great data sets to start playing around with & improve your healthcare data analytics chops.
The Holy Quran Dataset
- The Holy Quran Dataset: The data contains the complete Holy Quran in 21 different languages.
Victoria's Datasets
- Victoria's Datasets Collection
ML-friendly Public Datasets
- ML-friendly Public Datasets Collection: There are lots of machine-learning-ready datasets available to use for fun or practice on Kaggle's Public Datasets platform.
Chest X-ray dataset
- Chest X-ray dataset: The dataset, released by the NIH, contains 112,120 frontal-view X-ray images of 30,805 unique patients, annotated with up to 14 different thoracic pathology labels using NLP methods on radiology reports.
UT-Austin Computer Vision Group Datasets
- UT-Austin Computer Vision Group Datasets
Lymph Node Detection and Segmentation datasets
- Lymph Node Detection and Segmentation datasets: This collection consists of Computed Tomography (CT) images of the mediastinum and abdomen in which lymph node positions are marked by radiologists at the National Institutes of Health, Clinical Center. Radiologists at the Imaging Biomarkers and Computer-Aided Diagnosis Laboratory labeled a total of 388 mediastinal lymph nodes in CT images of 90 patients and a total of 595 abdominal lymph nodes in 86 patients.
Pancreas Segmentation datasets
- Pancreas Segmentation datasets: The National Institutes of Health Clinical Center performed 82 abdominal contrast enhanced 3D CT scans (~70 seconds after intravenous contrast injection in portal-venous) from 53 male and 27 female subjects. Seventeen of the subjects are healthy kidney donors scanned prior to nephrectomy. The remaining 65 patients were selected by a radiologist from patients who neither had major abdominal pathologies nor pancreatic cancer lesions. Subjects' ages range from 18 to 76 years with a mean age of 46.8 ± 16.7. The CT scans have resolutions of 512x512 pixels with varying pixel sizes and slice thickness between 1.5 − 2.5 mm, acquired on Philips and Siemens MDCT scanners (120 kVp tube voltage).
MLDATA repository
- MLDATA repository datasets collection: This web site is a repository for your machine learning data.
RENOIR - A Dataset of Real Low-Light Images
- RENOIR - A Dataset of Real Low-Light Images: The first publicly available dataset of images corrupted by real low-light noise together with pixel and intensity aligned clean images. It contains about 500 images of 120 scenes that have been collected in low-light settings using three cameras: Canon T3i, Canon S90 and a Xiaomi MI3 mobile phone. The dataset is quite large since the images have the original sensor resolution, so each image has about 10 megapixels.
Pakistan Intellectual Capital dataset
- Pakistan Intellectual Capital dataset: The dataset contains a list of computer science/IT professors from 89 different universities of Pakistan.
DeepMind Open Source Datasets
- DeepMind Open Source Datasets
AVA Dataset
- AVA Dataset: The AVA dataset densely annotates 80 atomic visual actions in 57.6k movie clips with actions localized in space and time, resulting in 210k action labels with multiple labels per human occurring frequently.
The NSynth Dataset
- The NSynth Dataset: A large-scale and high-quality dataset of annotated musical notes.
Google Speech Commands Dataset
- Google Speech Commands Dataset: The dataset has 65,000 one-second long utterances of 30 short words, by thousands of different people, contributed by members of the public through the AIY website.
70 Amazing Free Data Sources by KDnuggets
- 70 Amazing Free Data Sources by KDnuggets: 70 free data sources for 2017 on government, crime, health, financial and economic data, marketing and social media, journalism and media, real estate, company directory and review, and more to start working on your data projects.
List of datasets for machine learning research
- List of datasets for machine learning research: This list aggregates high-quality datasets that have been shown to be of value to the machine learning research community from multiple different data repositories to provide greater coverage of the topic than is otherwise available.
data.world
- data.world datasets collection: data.world is designed for data and the people who work with data. From professional projects to open data, data.world helps you host and share your data, collaborate with your team, and capture context and conclusions as you work.
Openml data repositories
- Openml datasets collection
25 Datasets for Deep Learning in IoT
- 25 Datasets for Deep Learning in IoT
Computer Vision Datasets
- Computer Vision Datasets
deeplearning.net Data collections
- deeplearning.net Data collections
Medical Segmentation dataset
- Medical Segmentation dataset
Berkeley Deep Drive dataset
- Berkeley Deep Drive dataset: Explore 100,000 HD video sequences covering over 1,100 hours of driving experience across many different times of day, weather conditions, and driving scenarios. Our video sequences also include GPS locations, IMU data, and timestamps.
Predict Pakistan Elections 2018 dataset
- Predict Pakistan Elections 2018 dataset: The dataset contains complete election results for the national assembly of Pakistan for 2002, 2008 and 2013. The file contains Seat, Constituency, Candidates Name, Party Affiliation, Votes, TotalValidVotes, TotalRejectedVotes, TotalVotes, TotalRegisteredVoters and Turnout variables for each seat.
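With one row per candidate and the variables listed above, a first exploratory step might be to find the leading candidate in each seat. The snippet below is only a sketch: it assumes the file is a CSV whose header uses exactly the variable names given in the description, and the rows are invented placeholders rather than real results.

```python
import csv
import io

# Invented placeholder rows; the column names are assumed to match the description.
SAMPLE = """Seat,Constituency,Candidates Name,Party Affiliation,Votes,TotalValidVotes,TotalRejectedVotes,TotalVotes,TotalRegisteredVoters,Turnout
X-1,Example City I,Candidate A,Party P,600,1000,20,1020,2000,51.0
X-1,Example City I,Candidate B,Party Q,400,1000,20,1020,2000,51.0
X-2,Example City II,Candidate C,Party P,300,1000,10,1010,2500,40.4
X-2,Example City II,Candidate D,Party Q,700,1000,10,1010,2500,40.4
"""


def winners_by_seat(csv_text):
    """Return, for each seat, the candidate with the most votes and their vote share."""
    best = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        votes = int(row["Votes"])
        seat = row["Seat"]
        if seat not in best or votes > best[seat]["votes"]:
            best[seat] = {
                "candidate": row["Candidates Name"],
                "party": row["Party Affiliation"],
                "votes": votes,
                # Share of valid votes, a definition chosen for this sketch.
                "share": votes / int(row["TotalValidVotes"]),
            }
    return best


winners = winners_by_seat(SAMPLE)
for seat, info in sorted(winners.items()):
    print(seat, info["candidate"], info["party"], f"{info['share']:.0%}")
```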
Medical imaging datasets
- Medical imaging datasets
Open Images Dataset
- Open Images Dataset
DeepLesion Dataset
- DeepLesion: a large-scale and diverse CT lesion dataset.
Deep Learning Datasets
- Deep Learning Datasets: a curated list of deep learning datasets.
10 Open-Sourced AI Datasets
- 10 Open-Sourced AI Datasets: Ten large open-source datasets from popular AI research projects.
Fruit Images Dataset
- Fruit Images Dataset: A dataset of images containing fruits.

Note: If you know of any other dataset useful for data science, machine learning, or deep learning research that is not listed here, please feel free to comment below with a downloadable link. I would like to add it to the list above.
