Open Data Resources For Data Science Research: Part-1

One of the hardest problems in machine learning, deep learning and data science has nothing to do with network architecture (for example, neural nets). It is the problem of getting the right data in the right format. In that sense, data is the fuel of this data science era.
Datasets are an integral part of the field of machine learning, deep learning and data science. Major advances in these fields can result from advances in learning algorithms (such as deep learning), computer hardware, and, less intuitively, the availability of high-quality training datasets. High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality datasets for unsupervised learning can also be difficult and costly to produce.

Deep learning, and machine learning more generally, needs a good training dataset to work properly. Collecting and constructing the training dataset, a sizable body of known data, takes time and domain-specific knowledge of where and how to gather relevant information. The training dataset acts as the benchmark against which machine learning and deep-learning algorithms are trained. It is what they learn to reconstruct before they are allowed to run free on data they haven’t seen before.

At this stage, people with domain knowledge need to find the right raw data and transform it into a numerical representation that the machine learning or deep-learning algorithm can work with. The data is usually given to the learner in the form of a vector or a matrix.
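As a rough illustration of that transformation, here is a minimal sketch in plain Python. The records, the field names (`age`, `income`, `city`) and the helper functions are all made up for this example; it simply copies the numeric fields through and one-hot encodes the single categorical field, which is one common choice among many.

```python
# Hypothetical raw records: field names and values are invented for illustration.
raw_records = [
    {"age": 34, "income": 52000.0, "city": "Pune"},
    {"age": 27, "income": 61000.0, "city": "Delhi"},
    {"age": 45, "income": 38000.0, "city": "Pune"},
]


def build_vocabulary(records, field):
    """Map each distinct category in `field` to a column index (sorted for stability)."""
    categories = sorted({record[field] for record in records})
    return {category: index for index, category in enumerate(categories)}


def encode_record(record, city_index):
    """Turn one record into a flat list of floats: numeric fields, then one-hot city."""
    one_hot = [0.0] * len(city_index)
    one_hot[city_index[record["city"]]] = 1.0
    return [float(record["age"]), float(record["income"])] + one_hot


city_index = build_vocabulary(raw_records, "city")

# Each row is one example as a vector; the list of rows is the feature matrix.
feature_matrix = [encode_record(record, city_index) for record in raw_records]

for row in feature_matrix:
    print(row)
```

In practice you would usually also scale the numeric columns, and a library would do most of this for you, but the idea is the same: every example ends up as a vector of numbers, and the dataset as a matrix.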

Training datasets that take a lot of time or expertise to build can serve as a proprietary edge in the world of data science, machine learning and deep learning. The expertise lies largely in telling your algorithm what matters to you by selecting what goes into the training dataset. Through the initial data you select, you tell a story and define a data distribution that will guide your deep-learning nets and machine learning algorithms as they generalize the significant features, both in the training set and in the raw data they were created to study. To create a useful training set, you have to understand the problem you’re solving, that is, what you want your machine learning algorithms and deep-learning nets to pay attention to.

Machine learning and deep learning algorithms typically work with three datasets, known as the training set, the dev set and the test set. All three should be randomly selected from a larger body of data, i.e. the population.
The first set you use is the training set, which is the largest of the three. Running a training dataset through a machine learning or deep learning algorithm teaches the learner how to weigh different features, assigning them coefficients chosen to minimize the errors in your results. Those coefficients, also known as weights or parameters, will be contained in vectors or matrices.
The second set is your dev set, which is much smaller than the training set. It is used for tuning hyper-parameters and for checking how well the model generalizes.
The third set is your test set. It serves as the final check of the model, and you don’t use it while training the model. After you’ve trained and tuned your model, you test it against this test set. The results should confirm that your algorithm makes accurate predictions on test instances it has not seen before. If you don’t get accurate predictions, go back to the training set and review the hyper-parameters you used to tune the learner, the quality of your data, and your pre-processing techniques.
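The three-way split described above can be sketched as follows. This is only a minimal example using the Python standard library; the function name `split_dataset`, the 10% dev and 10% test fractions, and the fixed seed are assumptions made for illustration, not recommendations.

```python
import random


def split_dataset(examples, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle a copy of `examples` and cut it into train, dev and test lists."""
    if not 0 <= dev_fraction + test_fraction < 1:
        raise ValueError("dev_fraction + test_fraction must be in [0, 1)")

    # Shuffle a copy so the caller's data is untouched; the seed makes it repeatable.
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)

    n_dev = int(len(shuffled) * dev_fraction)
    n_test = int(len(shuffled) * test_fraction)

    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]  # everything left over is the training set
    return train, dev, test


# A stand-in "population": in a real project these would be your examples.
population = list(range(1000))
train_set, dev_set, test_set = split_dataset(population)
print(len(train_set), len(dev_set), len(test_set))
```

Shuffling before cutting is what makes each of the three sets a random sample of the same population. Keeping the test set out of every training and tuning step is up to you; the code only hands it back separately.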

This blog post introduces datasets from a range of domains to help kick-start data science, machine learning and deep learning research. Here are a few cool public open data sources you can use for your projects in these areas:

Some initial links are directly taken from the blog post on The Data Science 101, which can be found here.

Note: If you know of any other dataset useful for data science, machine learning and deep learning research that is not listed here, please feel free to comment below with a downloadable link. I would like to add it to the list.

