Machine Learning Basics

Understanding Machine Learning:

Machine Learning is about developing algorithms that learn from data in order to make predictions or identify patterns, so that a machine can perform tasks that humans are able to carry out. A model is fitted to the provided dataset by tuning its parameters and hyperparameters, but explaining that dataset well is not enough: the model must also generalize, meaning it should still perform well when new data is added.

The parameters of these algorithms depend heavily on the provided dataset. As a result, an algorithm may explain the current dataset fairly well and yet perform poorly on another dataset from the same context. This is known as “over-fitting”. It can be detected and mitigated by splitting the dataset into three subsets that serve different purposes during the learning process.

The first set of data is called the training dataset. As the name suggests, this dataset is used to “train” the algorithm, which means tuning the parameters of the algorithm to the given data. Because this step determines the fitted model that will be used to solve the task, it is typical to allocate over 50% of the data to training. Hyperparameters, which are set by the practitioner rather than learned from the data, decide how the algorithm will learn.

The second set of data serves as a verifier and is therefore called the validation dataset. To avoid over-fitting, this set acts as unseen yet reusable data for checking how well the fitted model generalizes. Each time a model is trained, it is evaluated on the validation dataset so that the Machine Learning practitioner knows its performance. Performance metrics vary (e.g. a simple one is the proportion of correct predictions), so it is important to choose a metric that is appropriate for the given task. Ideally, we want the model with the minimum error on the validation set while keeping the error on the training set relatively low.
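The “proportion of correct predictions” metric mentioned above can be sketched in a few lines. This is only an illustration: the function name accuracy and the fruit labels are hypothetical, and a real task may call for a different metric.

```python
def accuracy(predictions, targets):
    """Return the fraction of predictions that match their targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(1 for p, t in zip(predictions, targets) if p == t)
    return correct / len(targets)


# Hypothetical validation labels and model outputs, for illustration only.
validation_targets = ["apple", "orange", "apple", "orange"]
validation_predictions = ["apple", "orange", "orange", "orange"]
validation_accuracy = accuracy(validation_predictions, validation_targets)
```

Here three of the four predictions match, so the validation accuracy is 0.75. The corresponding error rate is one minus this value.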

Image Credit: The Elements of Statistical Learning

The last set of data is used for testing. Like the validation dataset, it serves as unseen data on which the model is evaluated, but it should be used only once, at the final testing stage. Its purpose is to demonstrate how the algorithm performs in general.
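As a rough sketch, the three-way split described above could look like the following. The 60/20/20 proportions, the fixed seed, and the function name split_dataset are assumptions made for illustration; the text only says that the training share is typically over 50%.

```python
import random


def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test lists."""
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty share for the test set")
    shuffled = list(rows)
    # Shuffle with a seeded generator so the split is reproducible.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out for the final test
    return train, validation, test


# Hypothetical dataset of 100 row identifiers.
train, validation, test = split_dataset(range(100))
```

Shuffling before splitting assumes the rows are independent of one another. Data with a time order or grouped records would usually need a different splitting strategy.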

A large dataset is clearly essential for developing Machine Learning algorithms for a given problem. While the definition of “large” depends on the context and on the algorithms being implemented, more data usually results in a better fitted and better generalized model. Because the algorithm must go through this “large” dataset to be trained, training is computationally heavy and may not be feasible in certain situations. As a Machine Learning practitioner and a software developer, it is important to identify when to use these expensive processes and when not to. This primarily depends on the task that the application needs to solve.

Basic Machine Learning Paradigms

Supervised Learning

Supervised Learning involves predicting or classifying a label using provided information. A simple example would be identifying whether a given object is an orange or an apple based on qualitative or quantitative information such as dimensions, shape, and colour, or predicting a person’s height based on other personal traits. It is important to note that the dataset contains the target variable for the task, so that the machine can learn the relationship between the predictors (variables used for prediction or classification) and the targets (labels or outcomes) that need to be predicted.

Algorithms may include: k-Nearest Neighbours, Decision Tree, Logistic Regression, Neural Network, etc.
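To make the predictor/target relationship concrete, here is a minimal k-Nearest Neighbours sketch for the orange-versus-apple example. The measurements (diameter in cm, weight in g), the function name knn_predict, and the choice of k=3 are all hypothetical.

```python
import math
from collections import Counter


def knn_predict(training_rows, query, k=3):
    """Classify query by majority vote among its k nearest training rows.

    training_rows is a list of (features, label) pairs.
    """
    by_distance = sorted(training_rows, key=lambda row: math.dist(row[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical measurements: (diameter in cm, weight in g) -> fruit label.
fruit = [
    ((7.0, 150.0), "apple"),
    ((7.5, 160.0), "apple"),
    ((6.8, 140.0), "apple"),
    ((8.0, 200.0), "orange"),
    ((8.5, 210.0), "orange"),
    ((8.2, 190.0), "orange"),
]
prediction = knn_predict(fruit, (8.1, 195.0), k=3)
```

The features are the predictors and the fruit names are the targets. Note that this sketch uses raw Euclidean distance, so the weight column dominates; in practice the features would normally be scaled first.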

Unsupervised Learning

If the task involves a dataset that does not contain target variables (that is, we do not know beforehand what the data is supposed to represent), it is impossible to “supervise” the machine’s training process. However, an algorithm may still be able to identify patterns that exist in the dataset. For example, suppose we want to determine a user’s movie preferences. Given a dataset containing the user’s Netflix history, an algorithm may be able to determine the categories of movies that the user prefers.

Algorithms may include: k-means, Principal Component Analysis
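As an illustration of finding patterns without labels, the sketch below runs a basic k-means loop on two-dimensional points. The points, the initial centroids, and the fixed iteration count are assumptions for the example; they are not derived from any real viewing history.

```python
import math


def k_means(points, initial_centroids, iterations=10):
    """Run a fixed number of k-means iterations.

    Returns (centroids, labels), where labels come from the last assignment step.
    """
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for i in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old centroid if no point was assigned to it
                centroids[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels


# Hypothetical unlabeled observations forming two loose groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, initial_centroids=[points[0], points[3]])
```

No target labels are supplied: the algorithm only groups points that lie close together. Interpreting what each group means, such as a movie category, is left to the practitioner.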

Reinforcement Learning

Reinforcement Learning involves tasks in which past actions may affect future actions. Unlike supervised learning, it does not require a dataset of predictor-target pairs. Instead, it requires a defined reward signal that indicates how beneficial performing a certain action is compared with the others. An example would be predicting the best move for a given chess board.

Algorithms may include: Q-learning, Actor-Critic methods
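Chess is far too large for a short example, so the sketch below applies tabular Q-learning to a hypothetical five-state corridor in which only reaching the last state pays a reward. The environment, the function name train_q_table, and all numeric settings (learning rate, discount, exploration rate, episode count) are assumptions chosen for illustration.

```python
import random


def train_q_table(n_states=5, episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values for a corridor where only the last state pays a reward."""
    rng = random.Random(seed)
    actions = (-1, +1)  # index 0 steps left, index 1 steps right
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        while state != goal:
            # Epsilon-greedy: mostly exploit current estimates, sometimes explore.
            # Ties are also broken randomly so early episodes do not get stuck.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            next_state = min(max(state + actions[a], 0), goal)
            reward = 1.0 if next_state == goal else 0.0
            # Q-learning update: move the estimate toward
            # reward + discounted value of the best next action.
            target = reward + gamma * max(q[next_state])
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


q_table = train_q_table()
# Read off the greedy action for every non-terminal state.
greedy_policy = ["left" if left > right else "right" for left, right in q_table[:-1]]
```

There are no predictor-target pairs here. The only feedback is the reward at the goal, and the update rule propagates that reward backwards so that earlier states learn which action eventually leads to it.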

Implementation in Software Engineering

The biggest drawbacks of implementing Machine Learning algorithms are the need for large datasets and the expensive computation time. If these two requirements cannot be met, an implementation may not be feasible, because the accuracy may be poor and the model may not represent the data correctly. As a developer, one should consider collecting data with the user’s consent if the quality or volume of the data becomes a problem.

Another major issue in implementing Machine Learning algorithms lies in ethics. Because of bias that may exist in the provided dataset, there have been numerous incidents in which an algorithm produced discriminatory behaviour. To prevent this, a developer should always know how the fitted model behaves and make sure that the quality of the provided data is adequate. How data is collected and how it is then used is another ethical concern that a developer should keep in mind. It is important to note that obtaining the users’ consent is a must, not an option, when using their data.

However, if a task involves non-deterministic behaviour, Machine Learning algorithms can be extremely powerful. That is, if no feasible rule-based algorithm can correctly solve the task, it is worth considering a Machine Learning algorithm, since no deterministic algorithm is likely to do a better job.

If implemented correctly, a Machine Learning algorithm can yield effective results in both generalization and specification, which may be necessary to make better software.

Appendix:

References