M3 competition¶

Problem description¶

The so called M-competitions are among the most popular forecasting competitions in the world, bringing together the best forecasters from both industry and academia. Their popularity stems from the proper comparison of many different state of the art algorithms, be it neural networks or classical statistical methods like ARIMA and from the fact, that there are usually thousands of different datasets which prevents contestants to fine-tune their algorithms for a specific case.

These competitions also tend to cause a lot of controversy in the forecasting world because most of the time, the simpler statistical methods outperform the more sophisticated ones, that are hot in the academia. As you can read in this paper, "After comparing the post-sample accuracy of popular ML methods with that of eight traditional statistical ones, we found that the former are dominated across both accuracy measures used and for all forecasting horizons examined. Moreover, we observed that their computational requirements are considerably greater than those of statistical methods".

TIM is built with deep statistical foundations focusing more on the fundamentals than the algorithm choice. See How it works section to learn more about the TIM Forecasting. This template will guide you through the third M-competition M3 with 3003 different datasets and will show you how to use TIM to outperform competing methods in a fully automatic mode.

Data recommendation template¶

Download the data by clicking on the download link. It consists of many different time series chosen across sampling rates and industries.

Sampling Period	Micro	Industry	Macro	Finance	Demographic	Other	Total
Yearly	146	102	83	58	245	11	645
Quarterly	204	83	336	76	57		756
Monthly	474	334	312	145	111	52	1428
Other	4			29		141	174
Total	828	519	731	308	413	204	3003

The task was to use all the data available but the last n observations and forecast those subsequently. n differs across the sampling rates.

We will describe in more detail the modeling of one specific dataset, namely N 2792.This dataset is sampled monthly and the task is to forecast 18 samples ahead.

TIM setup¶

We will run TIM in a fully automatic mode.

Demo using Python API Client¶

Set up Python Libraries¶

import logging
import pandas as pd
import plotly as plt
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
import json

import tim_client

Credentials and logging¶

(Do not forget to fill in your credentials in the credentials.json file)¶

with open('credentials.json') as f:
    credentials_json = json.load(f)                     # loading the credentials from credentials.json

TIM_URL = 'https://timws.tangent.works/v4/api'          # URL to which the requests are sent

SAVE_JSON = False                                       # if True - JSON requests and responses are saved to JSON_SAVING_FOLDER
JSON_SAVING_FOLDER = 'logs/'                            # folder where the requests and responses are stored

LOGGING_LEVEL = 'INFO'

level = logging.getLevelName(LOGGING_LEVEL)
logging.basicConfig(level=level, format='[%(levelname)s] %(asctime)s - %(name)s:%(funcName)s:%(lineno)s - %(message)s')
logger = logging.getLogger(__name__)

credentials = tim_client.Credentials(credentials_json['license_key'], credentials_json['email'], credentials_json['password'], tim_url=TIM_URL)
api_client = tim_client.ApiClient(credentials)

api_client.save_json = SAVE_JSON
api_client.json_saving_folder_path = JSON_SAVING_FOLDER

[INFO] 2020-11-06 13:06:57,197 - tim_client.api_client:save_json:74 - Saving JSONs functionality has been disabled
[INFO] 2020-11-06 13:06:57,201 - tim_client.api_client:json_saving_folder_path:89 - JSON destination folder changed to logs

Specify configuration¶

To let TIM know that we want to forecast 18 samples ahead we can set the "prediction to" to 18 samples. Model will be built using a range between 1921-01 – 1927-06. Out-of-sample forecasts are made on the rest - from 1927-07 – 1928-12. (the last 18 samples). To get better insights from our model we will also want extended importance.

configuration_backtest = {
    'usage': {                                 
        'predictionTo': { 
            'baseUnit': 'Sample',              # units that are used for specifying the prediction horizon length (one of 'Day', 'Hour', 'QuarterHour', 'Sample')
            'offset': 18                       # number of units we want to predict into the future (24 hours in this case)
        },
        'backtestLength': 18                   # number of samples that are used for backtesting (note that these samples are excluded from model building period)
    },
    "predictionIntervals": {
        "confidenceLevel": 90                  # confidence level of the prediction intervals (in %)
    },
    'extendedOutputConfiguration': {
        'returnExtendedImportances': True      # flag that specifies if the importances of features are returned in the response
    }
}

Data description¶

Dataset used in this example has monthly sampling period and contains data from 1921-01 to 1928-12.

Target¶

Target variable are anonymized values and are categorized as other in the M3 competition.

Predictor candidates¶

No predictors included.

Timestamp¶

Timestamp is the first column and each value of the timestamp is the beginning of the period it corresponds to i.e. row with timestamp 1921-01 corresponds to the whole period between 1921-01 and 1921-02.

Forecasting scenario¶

In this example we will simulate an 18 samples-ahead scenario.

data = tim_client.load_dataset_from_csv_file('N2792.csv', sep=';')                                 # loading data from data.csv
data                                                                                               # quick look at the data

Run TIM¶

backtest = api_client.prediction_build_model_predict(data, configuration_backtest)                 # running the RTInstantML forecasting using data and defined configuration
backtest.status                                                                                    # status of the job

'Finished'

Visualize backtesting¶

fig = plt.subplots.make_subplots(rows=1, cols=1, shared_xaxes=True, vertical_spacing=0.02)      # plot initialization

fig.add_trace(go.Scatter(x = data.loc[:, "timestamp"], y=data.loc[:, "target"],
                         name = "target", line=dict(color='black')), row=1, col=1)              # plotting the target variable

fig.add_trace(go.Scatter(x = backtest.prediction.index, 
                         y = backtest.prediction.loc[:, 'Prediction'],
                         name = "production forecast", 
                         line = dict(color='purple')), row=1, col=1)                            # plotting production prediction

fig.add_trace(go.Scatter(x = backtest.aggregated_predictions[0]['values'].index, 
                         y = backtest.aggregated_predictions[0]['values'].loc[:, 'Prediction'],
                         name = "in-sample MAPE: " + str(round(backtest.aggregated_predictions[0]['accuracyMetrics']['MAPE'], 2)), 
                         line=dict(color='goldenrod')), row=1, col=1)                           # plotting in-sample prediction

fig.add_trace(go.Scatter(x = backtest.aggregated_predictions[1]['values'].index, 
                         y = backtest.aggregated_predictions[1]['values'].loc[:, 'Prediction'],
                         name = "out-of-sample MAPE: " + str(round(backtest.aggregated_predictions[1]['accuracyMetrics']['MAPE'], 2)), 
                         line = dict(color='red')), row=1, col=1)                               # plotting out-of-sample-sample prediction

fig.update_layout(height=600, width=1000, 
                  title_text="Backtesting, modelling difficulty: " 
                  + str(round(backtest.data_difficulty, 2)) + "%" )                             # update size and title of the plot

fig.show()

Visualize predictor and feature importances¶

simple_importances = backtest.predictors_importances['simpleImportances']                                                                # get predictor importances
simple_importances = sorted(simple_importances, key = lambda i: i['importance'], reverse=True)                                           # sort by importance
extended_importances = backtest.predictors_importances['extendedImportances']                                                            # get feature importances
extended_importances = sorted(extended_importances, key = lambda i: i['importance'], reverse=True)                                       # sort by importance

si_df = pd.DataFrame(index=np.arange(len(simple_importances)), columns = ['predictor name', 'predictor importance (%)'])                 # initialize predictor importances dataframe
ei_df = pd.DataFrame(index=np.arange(len(extended_importances)), columns = ['feature name', 'feature importance (%)', 'time', 'type'])   # initialize feature importances dataframe

for (i, si) in enumerate(simple_importances):
    si_df.loc[i, 'predictor name'] = si['predictorName']                   # get predictor name
    si_df.loc[i, 'predictor importance (%)'] = si['importance']            # get importance of the predictor
    
for (i, ei) in enumerate(extended_importances):
    ei_df.loc[i, 'feature name'] = ei['termName']                          # get feature name
    ei_df.loc[i, 'feature importance (%)'] = ei['importance']              # get importance of the feature
    ei_df.loc[i, 'time'] = ei['time']                                      # get time of the day to which the feature corresponds
    ei_df.loc[i, 'type'] = ei['type']                                      # get type of the feature

si_df.head()                                                               # predictor importances data frame

fig = go.Figure(go.Bar(x=si_df['predictor name'], y=si_df['predictor importance (%)']))      # plot the bar chart
fig.update_layout(height=400,                                                                # update size, title and axis titles of the chart
                  width=600, 
                  title_text="Importances of predictors",
                  xaxis_title="Predictor name",
                  yaxis_title="Predictor importance (%)")
fig.show()

ei_df.head()                                                               # first few of the feature importances

time = '[1]'                                                                            # time for which the feature importances are visualized
fig = go.Figure(go.Bar(x=ei_df[ei_df['time'] == time]['feature name'],                       # plot the bar chart
                       y=ei_df[ei_df['time'] == time]['feature importance (%)']))
fig.update_layout(height=700,                                                                # update size, title and axis titles of the chart
                  width=1000,
                  title_text="Importances of features (for {}-sample ahead model)".format(time),
                  xaxis_title="Feature name",
                  yaxis_title="Feature importance (%)")
fig.show()

	timestamp	target
0	1921-01	4800
1	1921-02	3640
2	1921-03	5000
3	1921-04	5960
4	1921-05	7760
...	...	...
91	1928-08	9480
92	1928-09	8080
93	1928-10	6400
94	1928-11	5400
95	1928-12	3680

	feature name	feature importance (%)	time	type
0	target(t-18)	100	[13]	TargetAndTargetTransformation
1	target(t-18)	100	[14]	TargetAndTargetTransformation
2	target(t-18)	100	[15]	TargetAndTargetTransformation
3	target(t-18)	100	[16]	TargetAndTargetTransformation
4	target(t-18)	100	[17]	TargetAndTargetTransformation