General introduction

Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to effectively perform a specific task without using explicit instructions, relying on patterns and inference instead. Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.

Machine learning algorithms are used in a wide variety of applications, such as email filtering, detection of network intruders and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers.

The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. An important subfield of machine learning is data mining, which focuses on exploratory analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics.

Types of ML tasks

  1. Unsupervised learning: In unsupervised learning, the algorithm builds a mathematical model from a set of unlabeled input data, meaning there is no knowledge about a desired target variable. Unsupervised learning algorithms are used to find structure in the data, for example by grouping or clustering data points. In this way, unsupervised learning can discover patterns in the data and can group the inputs into categories, as in feature learning. Feature learning can be done through dimensionality reduction, the process of reducing the number of ‘features’, or inputs, in a set of data.

  2. Supervised learning: In supervised learning, the algorithm builds a mathematical model from a set of data that contains both the inputs (predictors) and the desired outputs (target variables). For example, when trying to determine whether an image contains a certain object, the training data for a supervised learning algorithm would include images with and without that object (the input), and each image would have a label (the desired output or target variable) designating whether it contains the object. In special cases, the input may be only partially available or restricted to special feedback. Supervised learning falls into two sub-categories:

    • Classification: Classification algorithms are used when the outputs are restricted to a limited set of values. An example is a classification algorithm that filters emails, in which case the input would be an incoming email and the output would be the name of the folder in which to file the email. For an algorithm that identifies spam emails, the output would be the prediction of either "spam" or "not spam", represented by the Boolean values true and false.
    • Regression: Regression algorithms are used when the outputs are continuous, meaning they may take any value, possibly within a range. Examples of continuous values are the temperature, length or price of an object.
  3. Reinforcement learning: Reinforcement learning is an area of ML concerned with how software agents ought to take actions in an environment to maximize some notion of cumulative reward. It differs from supervised learning in that correct input/output pairs need not be presented and sub-optimal actions need not be explicitly corrected. A typical example of these types of problems is teaching an agent how to play games like chess or Tetris, where it is difficult to say which action influenced the outcome of the game in what manner (therefore the "correct" output labels are missing).
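
The difference between the supervised and unsupervised settings described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a recommended implementation: the toy data, the function names (nearest_neighbor_classify, fit_line, k_means) and the choice of algorithms (1-nearest-neighbor, ordinary least squares, plain k-means) are all assumptions made for the example.

```python
import math

# Toy inputs: 2-D points. The labels are used only by the supervised example.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
labels = ["not spam", "not spam", "not spam", "spam", "spam", "spam"]


def nearest_neighbor_classify(train_x, train_y, query):
    """Supervised classification: return the label of the closest training point."""
    distances = [math.dist(x, query) for x in train_x]
    return train_y[distances.index(min(distances))]


def fit_line(xs, ys):
    """Supervised regression: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def k_means(data, centroids, iterations=10):
    """Unsupervised clustering: no labels, only the inputs and initial centroids."""
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in data:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids, clusters


# Classification: the output is one of a limited set of values.
label = nearest_neighbor_classify(points, labels, (7.5, 8.0))

# Regression: the output is a continuous value.
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])

# Clustering: the labels list is never consulted.
centroids, clusters = k_means(points, [points[0], points[3]])

print(label, slope, intercept, centroids)
```

Both supervised functions need the desired outputs at training time, whereas k_means receives only the inputs and recovers the two groups from their geometry. Reinforcement learning is not shown here, since it requires an environment that returns rewards.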


  1. Time series forecasting: Time series forecasting covers a subset of regression problems in which the ordering of the data is important, as each data sample is tied to a specific date and time. A time series is a sequence of data points relating to successive (usually equally spaced) points in time. Examples of time series are heights of ocean tides, closing electricity prices and counts of sunspots. Time series problems usually require a different algorithmic approach than standard regression problems, as it is necessary to exploit the correlation between successive data points. Unlike in other regression problems, the data in a time series problem cannot be reordered arbitrarily without losing information.

  2. Anomaly detection: Anomaly detection (also called outlier detection) is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data. Typically, the anomalous events will translate to some kind of problem such as bank fraud, a structural defect, medical problems or errors. Anomalies are also referred to as outliers, novelties, noise, deviations and exceptions. Three broad categories of anomaly detection techniques exist:

  • Unsupervised anomaly detection techniques detect anomalies in an unlabeled data set under the assumption that the majority of the instances in the data set are normal. They do so by looking for the instances that fit the remainder of the data set least well and are in this sense the most abnormal.
  • Supervised anomaly detection techniques require a data set in which all observations have been labeled as either "normal" or "abnormal". This involves training a classifier. (The key difference from many other classification problems is the inherently imbalanced nature of outlier detection: abnormal observations are rare.)
  • Semi-supervised anomaly detection techniques construct a model representing normal behavior from a training data set that contains only normal instances and then estimate the likelihood that a test instance was generated by the learned model.
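
The semi-supervised and unsupervised variants above can be sketched as follows. This is a simplified illustration under stated assumptions, not a production detector: the data are made up, the function names (fit_normal_model, is_anomaly, unsupervised_outliers) are hypothetical, the "model of normal behavior" is assumed to be a single mean and standard deviation, and the unsupervised variant uses the modified z-score based on the median absolute deviation (MAD) with the commonly used constant 0.6745 and cutoff 3.5.

```python
import statistics


def fit_normal_model(normal_values):
    """Semi-supervised: learn mean and standard deviation from data known to be normal."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)


def is_anomaly(value, model, threshold=3.0):
    """Flag a test value that lies more than `threshold` standard deviations from the mean."""
    mean, std = model
    return abs(value - mean) / std > threshold


def unsupervised_outliers(values, threshold=3.5):
    """Unsupervised: flag values that fit the rest of the data least well.

    Uses the modified z-score, which relies on the median and the MAD so that
    the outliers themselves barely influence the reference statistics.
    Assumes the MAD is non-zero (i.e. the data are not mostly identical).
    """
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]


# Semi-supervised: the training set contains only normal readings.
normal_training = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 9.7, 10.0]
model = fit_normal_model(normal_training)
semi_supervised_flags = [is_anomaly(v, model) for v in (10.4, 12.0)]

# Unsupervised: one unlabeled data set, assumed to be mostly normal.
unlabeled = [10.1, 9.8, 10.0, 10.3, 9.9, 15.0, 10.2, 9.7, 10.0]
outliers = unsupervised_outliers(unlabeled)

print(semi_supervised_flags, outliers)
```

The supervised variant is omitted because it reduces to training an ordinary classifier on labeled "normal" and "abnormal" observations, with extra care for the class imbalance.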