Correlation between variables

After the expansion process discussed previously, many new predictors have been created. Retaining all of them in the final model is not optimal, because many of them will be highly correlated. Highly correlated predictors make a model unstable: a small change in one predictor can cause a structural change in the whole model, because the model's results are influenced by many similar variables (those correlated with the one that changed). To avoid this, only the most important subset of variables is retained for the final model, which removes the high correlation between variables.
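
The instability described above can be illustrated with a small simulation. The following is a minimal sketch in Python with NumPy, not part of TIM: the function names, sample sizes and noise levels are assumptions chosen for illustration. It refits an ordinary least-squares model many times on freshly drawn data and measures how much the first coefficient varies, once with uncorrelated predictors and once with strongly correlated ones.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for y ~ X @ beta."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def coefficient_spread(correlation, n_samples=200, n_trials=200, seed=0):
    """Standard deviation of the first coefficient across repeated refits."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_trials):
        x1 = rng.standard_normal(n_samples)
        # x2 has unit variance and the requested correlation with x1
        x2 = correlation * x1 + np.sqrt(1 - correlation**2) * rng.standard_normal(n_samples)
        # the true relationship uses both predictors with weight 1
        y = x1 + x2 + 0.5 * rng.standard_normal(n_samples)
        estimates.append(fit_ols(np.column_stack([x1, x2]), y)[0])
    return float(np.std(estimates))

spread_low = coefficient_spread(correlation=0.0)
spread_high = coefficient_spread(correlation=0.99)
print("spread with uncorrelated predictors:", spread_low)
print("spread with correlated predictors:  ", spread_high)
```

With the correlated pair, the estimated coefficient swings far more from one refit to the next, even though the underlying relationship is the same in both cases.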


Increasing model stability by eliminating internal correlation is similar to fighting a phenomenon called overfitting. A model that overfits tries to memorize individual data points instead of recognizing patterns in the data. Such a model produces worse results, i.e. worse forecasts, than a model that does not overfit, because new data points are often slightly different from those previously seen in the dataset, so the model cannot have memorized them. An overfitted model produces great results on the data used for training (the training set) but performs significantly worse on new data (the test set).
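
The gap between training and test performance can be made concrete with a toy example. The sketch below is an illustration only and is unrelated to how TIM builds models: the sine-shaped signal, the noise level and the polynomial degrees are all assumptions. It fits a low-degree and a high-degree polynomial to the same 15 noisy points and compares the errors on the training points with the errors on fresh points.

```python
import numpy as np
from numpy.polynomial import Chebyshev

rng = np.random.default_rng(0)

def noisy_sine(x):
    """Hypothetical data source: one sine period plus Gaussian noise."""
    return np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.shape)

# small training set, larger test set drawn from the same process
x_train = np.linspace(0.0, 1.0, 15)
y_train = noisy_sine(x_train)
x_test = np.linspace(0.0, 1.0, 200)
y_test = noisy_sine(x_test)

def mse(model, x, y):
    """Mean squared error of a fitted model on the points (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

# the degree-12 polynomial has almost as many parameters as training points
simple = Chebyshev.fit(x_train, y_train, deg=3)
flexible = Chebyshev.fit(x_train, y_train, deg=12)

for name, model in [("degree 3", simple), ("degree 12", flexible)]:
    print(name,
          "train MSE:", mse(model, x_train, y_train),
          "test MSE:", mse(model, x_test, y_test))
```

The flexible model always fits the training points at least as closely as the simple one, because it can bend to follow the noise. That closeness does not carry over to the test points, which is the signature of overfitting.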



Choosing a smaller subset of a set of variables, i.e. the process of reduction, is a widely researched topic; algorithms like LASSO, PCA and forward regression are well known. TIM uses a similar technique that relies heavily on a geometrical perspective and incorporates a tweaked Bayesian Information Criterion.
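
To give a feel for what reduction looks like in practice, the sketch below implements plain forward selection scored with the standard Bayesian Information Criterion. This is not TIM's algorithm: neither the geometrical perspective nor the tweaked criterion is specified here, so the textbook form n·ln(RSS/n) + k·ln(n) is swapped in, and the helper names and synthetic data are assumptions made for illustration.

```python
import numpy as np

def bic(X, y):
    """Standard BIC of a least-squares fit, up to an additive constant."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + k * np.log(n)

def forward_select(X, y):
    """Greedily add the column that lowers BIC most; stop when none does."""
    n, p = X.shape
    intercept = np.ones((n, 1))
    selected = []
    best = bic(intercept, y)  # start from the intercept-only model
    while len(selected) < p:
        candidates = [j for j in range(p) if j not in selected]
        scores = {
            j: bic(np.hstack([intercept, X[:, selected + [j]]]), y)
            for j in candidates
        }
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:
            break  # no remaining column improves the criterion
        selected.append(j_best)
        best = scores[j_best]
    return selected

# synthetic data: columns 3 and 4 are noisy copies of columns 0 and 1
rng = np.random.default_rng(0)
n = 1000
base = rng.standard_normal((n, 3))
copies = base[:, :2] + 0.3 * rng.standard_normal((n, 2))
X = np.hstack([base, copies])
y = 2.0 * base[:, 0] - 1.5 * base[:, 1] + 0.5 * rng.standard_normal(n)

selected = forward_select(X, y)
print("selected columns:", selected)
```

The penalty term k·ln(n) is what keeps the model small: once the informative columns are in, a near-duplicate rarely reduces the residual error enough to pay for the extra parameter.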