
Overview

We have put a lot of effort into creating a fully automatic model building engine. Still, despite our best efforts, some models do not reach the highest possible accuracy out of the box. However, by adjusting the algorithm's exposed parameters, users can ensure that even the toughest dataset can be modeled properly.

The following subsections go through all of the available settings of TIM Forecasting. The table below shows all configuration parameters available for different job types.

Configuration

| Configuration parameter | Default |
| --- | --- |
| Prediction to | Sample + 1 |
| Prediction from | Sample + 1 |
| Model quality | Combined (Very High for D+0 and D+1, High otherwise) |
| Normalization | true |
| Model complexity | automatic |
| Features | Polynomial, Time offsets, Identity, Intercept, Rest of week, Piecewise linear, Exponential moving average, Periodic |
| Daily cycle | automatic |
| Allow offsets | true |
| Offset limit | automatic |
| Memory limit check | true |
| Rebuilding policy | New situations |
| Prediction intervals | 90% |
| Prediction boundaries | automatic |
| Rolling window | 1 day (daily cycle) / Prediction to (nondaily cycle) |
| Backtest | All |

Data

| Configuration parameter | Default |
| --- | --- |
| In-sample rows | All records except out-of-sample |
| Out-of-sample rows | No records |
| Imputation | Linear for gaps no longer than 6 samples |
| Columns | All |
| Target column | First non-timestamp column |
| Holiday column | None |
| Time scale | Originally estimated from the dataset |
| Aggregation | Mean (numerical variables) / Maximum (boolean variables) |
| Alignment | Determined from dataset end |
| Preprocessors | No preprocessors |

Preprocessors

| Type | Default |
| --- | --- |
| CategoryFilter | All records |

Prediction to

This setting defines the forecasting horizon. It consists of a baseUnit (one of Month, Day, Hour, Minute, Second and Sample) and a value (a non-negative integer). If not set, TIM defaults to one Sample ahead.

"predictionTo": {
  "baseUnit": "Day",
  "value": 2
}

Defining PredictionTo with Samples

This is the easiest way to define the forecasting horizon. TIM will forecast value samples ahead, starting from the last target observation in the dataset and using a step size equal to the sampling period estimated from the dataset (or stored in the model).
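For example, a horizon of three samples ahead (an arbitrary illustrative value) would be configured as:

"predictionTo": {
  "baseUnit": "Sample",
  "value": 3
}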

Defining PredictionTo with Month, Day, Hour, Minute and Second

Often, a user wishes to forecast the entire following day, but does not want to count how many samples this represents (it changes based on where the last target observation currently lies). This notation works relative to the last target observation. Suppose the user sets "predictionTo" to Day + 1. In that case, TIM will recognize that it should forecast up until the last observation of the following day, ignoring where within the current day the target currently ends (parts of the datetime of the target end that are measured at a smaller granularity than baseUnit are ignored). The same logic applies to the other base units - see the table below with examples (D = Day, H = Hour, Q = quarter-hour).

| PredictionTo | Last target observation | Denotes all samples up until |
| --- | --- | --- |
| D+1 | 28-01-2012 22:13:56 | 29-01-2012 23:59:59 |
| D+0 | 28-01-2012 22:13:56 | 28-01-2012 23:59:59 |
| H+1 | 28-01-2012 22:13:56 | 28-01-2012 23:59:59 |
| H+0 | 28-01-2012 22:13:56 | 28-01-2012 22:59:59 |
| Q+1 | 28-01-2012 22:13:56 | 28-01-2012 22:29:59 |
| Q+0 | 28-01-2012 22:13:56 | 28-01-2012 22:14:59 |
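For example, the D+1 row in the table above corresponds to the following configuration:

"predictionTo": {
  "baseUnit": "Day",
  "value": 1
}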

Prediction from

This setting complements 'predictionTo' and allows skipping the first samples of the forecasting horizon. If not set, TIM defaults to one Sample ahead, i.e. nothing is skipped.

"predictionFrom": {
  "baseUnit": "Sample",
  "value": 3
}

In-sample rows

This setting defines which samples should be used for model building (training). The user can specify the in-sample timestamps as an array of timestamp ranges. If not set, all timestamps except those defined in 'outOfSampleRows' will be used.

"inSampleRows": [
  {
   "from": "2009-06-01 00:00:00",
   "to": "2009-06-10 23:00:00"
  },
  {
   "from": "2009-05-01 00:00:00",
   "to": "2009-05-10 23:00:00"  
  }
]

Alternatively, a relative notation can be used, expressed as an integer number n together with a base unit (one of Month, Day, Hour, Minute, Second and Sample); this defines the length of the time range. The type of the relative range defines where it starts and in which direction it is calculated: Last starts from the last non-missing target observation (the newest observation of the target variable) and goes backwards, while First starts from the first non-missing target observation (the oldest one) and goes forward. If no type is specified, the default is Last.

"outOfSampleRows": {
  "type": "Last",
  "baseUnit": "Day",
  "value": 2
}

If there is an intersection of the inSampleRows with the outOfSampleRows, observations in the intersection are treated as follows:

  • by default, observations in the intersection are considered out-of-sample;
  • when the outOfSampleRows are defined as a relative range starting from the first target timestamp (type First), observations in the intersection are considered in-sample; the reasoning is that for out-of-sample validation, data towards the end of the dataset are more relevant.

Out-of-sample rows

This setting defines which samples should be used to backtest (validate) the Model Zoo. These observations are not used during model building (training); therefore, the forecasts' accuracy on this region more closely resembles that of a real production setup. If not set, no records will be used.

There are two ways to configure the out-of-sample rows:

  • as an array of timestamp ranges:
"outOfSampleRows": [
  {
   "from": "2020-06-01 00:00:00",
   "to": "2020-06-10 23:00:00"
  },
  {
   "from": "2020-05-01 00:00:00",
   "to": "2020-05-10 23:00:00"  
  }
]
  • as an integer number n with a base unit (one of Month, Day, Hour, Minute, Second and Sample) defining the length of the relative time range, and a type defining its start and direction (First and Last are calculated from the first / last non-missing target observation; the default is Last):
"outOfSampleRows": {
  "type": "Last",
  "baseUnit": "Day",
  "value": 2
}

If there is an intersection of the inSampleRows with the outOfSampleRows, observations in the intersection are treated as follows:

  • by default, observations in the intersection are considered out-of-sample;
  • when the outOfSampleRows are defined as a relative range starting from the first target timestamp (type First), observations in the intersection are considered in-sample; the reasoning is that for out-of-sample validation, data towards the end of the dataset are more relevant.

Rolling window

When TIM evaluates the models built on the in-sample and out-of-sample data, it rolls backwards from where the target variable ends to the start of the dataset, forecasting the whole length of the forecasting horizon each time. The user can specify the length of this rolling window to control the size of the output (using any number of months, days, hours, minutes, seconds or samples). By default, a rolling window of 1 day is used for daily cycle datasets and a rolling window of 1 sample for nondaily cycle datasets.

"rollingWindow": {
  "baseUnit": "Day",
  "value": 2
}

Rebuilding policy

The rebuilding policy controls which model(s) of the given parent job's Model Zoo should be rebuilt and which should be dropped. There are three different options:

  • all: all models in the current Model Zoo are dropped, and new models are added;
  • newSituations: only models that are needed for the given forecasting horizon that the current Model Zoo cannot handle are built and added to the Model Zoo;
  • olderThan: the same behavior as newSituations, but models older than the given time are also deemed obsolete and replaced with newly built ones. This is the only option where it makes sense to include the time parameter; the user can specify any number of days, hours, quarter-hours or samples.
"rebuildingPolicy": {
  "type": "OlderThan",
  "time": {
    "baseUnit": "Day",
    "value": 7
  }
}
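For the other policy types, only the type field is needed. As a sketch, assuming the same capitalization convention as in the example above, the newSituations policy could be configured as:

"rebuildingPolicy": {
  "type": "NewSituations"
}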

Model quality

This setting controls the model complexity versus training time tradeoff. The higher the model quality, the longer it takes to build the Model Zoo. If not set, Combined will be used.

  • Low: dummy quality, these models can be used even without any data provided;
  • Medium: models without offsets of the target variable;
  • High: models using only a limited amount of offsets of the target variable;
  • VeryHigh: every model uses the closest target offset possible;
  • UltraHigh: every model uses the closest offset possible for every single predictor;
  • Combined: VeryHigh quality for intra-day and day-ahead forecasts, High quality for further forecasting horizons.

Note: For the qualities Medium, High and VeryHigh, the selection of offsets within a day is optimized to minimize training time. This may cause scenarios where two identical situations within two different prediction horizons have slightly different models; e.g. models for S+1 may differ depending on whether the prediction horizon is set to S+5 or to S+10.

"modelQuality": "High"

Features

TIM tries to enhance the model building process with new, artificially created features derived from the original predictors. All available transformations are listed in the example below; by default, TIM uses Polynomial, TimeOffsets, Identity, Intercept, RestOfWeek, PiecewiseLinear, ExponentialMovingAverage and Periodic.

It is possible to change the selection of features TIM can use by explicitly sending a list of the features to use (potentially also omitting features that are included by default).

"features": ["TimeOffsets", "Identity", "PiecewiseLinear", "ExponentialMovingAverage",
             "SimpleMovingAverage", "Periodic", "Fourier", "RestOfWeek", "DayOfWeek",
             "PublicHolidays", "Month", "Trend", "Intercept", "Polynomial"]

Normalization

When normalization is on, predictors are scaled by their mean and standard deviation. Switching normalization off may help to model data with structural changes. If not provided or set to automatic, TIM will decide automatically.

"normalization": true

Model complexity

This setting determines the maximal possible number of terms in each model in the Model Zoo. Challenging datasets might require a lower model complexity. If not set, TIM will calculate the model complexity automatically based on the sampling period of the dataset.

"maxModelComplexity": 50

Daily cycle

This setting is a boolean value determining whether or not to use an individual model building approach for different times within a day. Doing so is beneficial if the dynamics of the underlying problem change during the day. Switching it off leads to a common model building approach for all timestamps. If the parameter is not provided, TIM will decide automatically. Learn more about the importance of this parameter in the dedicated section on daily cycle.

"dailyCycle": false

Allow offsets

Allow offsets is a boolean value that determines whether offsets of predictors may be used in the model. If set to false, no time offsets, exponential moving averages or simple moving averages will be used in the model; they do not need to be explicitly deselected in the feature configuration. The piecewise linearity transformation will then only be made from predictors that are available at the forecasted timestamp. If allow offsets is set to false, the explicit offset limit parameter cannot be set to anything other than 0. This setting applies to all predictors, including the target variable; therefore, setting model quality to High, VeryHigh or Combined while setting allow offsets to false returns the same result as setting model quality to Medium. Calendar features may still occur in the model with offsets, since these are engine features and are derived only from the forecasted timestamp.

"allowOffsets": false

Offset limit

Offset limit can be set as an explicit value; if it is not set, it will be determined automatically. The value is a negative number defining how far into the past offsets can reach. This setting is mainly used to generate time offsets. Only offsets in the range between the offset limit and the closest available offset of a variable will be considered in the model building process. The features exponential moving average, simple moving average and piecewise linearity will be calculated from a variable only if the closest available offset of that variable is closer to the dataset end than the offset limit. The features public holidays, rest of week and day of week are not affected by this setting, since they are determined separately.

If allow offsets is set to false, the explicit offset limit cannot be set to anything other than 0. The offset limit that was used in model building can be found in the job log.

"offsetLimit": {
  "type": "Explicit",
  "value": -10
}

Backtest

This setting determines which types of forecasts should be returned. The Production option only returns the production forecast, the OutOfSample option also produces out-of-sample forecasts, and the All option also delivers in-sample forecasts.

"backtest": "All"

Prediction intervals

The prediction interval expresses the uncertainty of a prediction by creating an interval in which the prediction is likely to fall. The value of this setting expresses the probability that the prediction will lie inside the symmetrical prediction interval; therefore, with increasing value, the prediction intervals widen.

"predictionIntervals": 95

Prediction boundaries

For some datasets, values outside certain boundaries do not make sense - e.g. negative values for energy production. TIM tries to detect such boundaries automatically, but the detected values can be overridden. Both the lower and the upper boundary should be real values. It might be useful to turn prediction boundaries off for datasets with a visible trend.

"predictionBoundaries": {
  "type": "Explicit",
  "maxValue": 1000,
  "minValue": 0
}

Memory limit check

TIM tries to estimate whether the worker it currently operates on has enough memory to finish the model building and forecasting process. If it does not, and the memory limit check is turned on, TIM will drop some of the rows and columns of the dataset and turn off some of the transformations. By default, the check is turned on. Turning it off may lead to a crash of the operation for big datasets.

"memoryLimitCheck": false

Target column

This setting defines the column (given either by its name or number) that contains the target variable.

"targetColumn": 2

Holiday column

This setting defines the column (given either by its name or number) that contains the holiday variable. If not provided, TIM will assume there is none provided.

"holidayColumn": 5

Columns

This setting lists all columns (given either by their names or numbers) that should be used for model building. If not provided, TIM will use all available columns. The target column should always be included.

"columns": [5, "y"]

Imputation

The imputation setting applies if there are missing values in the dataset. Using this setting, TIM will impute all gaps in the data that are not longer than the maxLength parameter (in number of samples). There are two available imputation types: Linear (linear interpolation) and LOCF (Last Observation Carried Forward, i.e. imputation with the last non-missing observation). The type None turns imputation off. The default setting is Linear with maxLength 6.

"imputation": {
  "type": "Linear",
  "maxLength": 1
}
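For example, to fill gaps of at most three samples by carrying the last observation forward (the LOCF type described above):

"imputation": {
  "type": "LOCF",
  "maxLength": 3
}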

Time scale

This setting determines the rescaling of the original dataset to another sampling period. The baseUnit of the rescaling is limited to one of Day, Hour, Minute or Second. If not set, the originally estimated sampling period will be used. Time scaling only works from shorter sampling periods to longer ones, and does not work for monthly sampled data.

"timeScale": {
  "baseUnit": "Day",
  "value": 2
}

Aggregation

This setting defines the aggregation function used for the target variable; predictor variables are always aggregated by the default aggregation function. Available aggregation types are Mean, Sum, Minimum and Maximum. The default aggregation is Mean for numerical variables and Maximum for boolean variables. This setting is related to the time scale parameter, as the sampling period to aggregate to is defined there.

"aggregation": "Mean"

Alignment

The alignment setting provides the possibility to set the alignment at the end of the dataset, which is useful for backtesting. It enables setting the timestamp of the last target observation (lastTargetTimestamp) from which the rolling window is applied and production forecasts are calculated. If not given, the last non-missing target timestamp of the original data is used. The last target timestamp cannot be lower than any out-of-sample record. The availabilities of all other variables (except the target) may be given relative to the last non-missing target timestamp. If the alignment is not provided for some variable, the alignment from the original data is taken, i.e. the difference between the last non-missing timestamp of that variable in the data and the last non-missing target timestamp in the data. For more details, check the data alignment section.

"alignment": {
    "lastTargetTimestamp": "2021-01-31 00:00:00Z",
    "dataUntil": [
      {
        "column": "Sales",
        "baseUnit": "Hour",
        "offset": -2
      }
    ]
  }

Preprocessors

This setting provides an array of filters and transformations that will be applied on the data in the given order. (Currently only one preprocessor is defined.)

"preprocessors": [
  {
    "type": "CategoryFilter",
    "values": {
      "column": "ColumnName_1",
      "categories": [1, 2, 3]
    }
  }
]

Category filter

The category filter filters the data, selecting only the rows with the specified values - i.e. those belonging to a specific category or set of categories. Currently, this filter is applied only to columns containing group keys. For more details, check the documentation section about category filters. By default, all rows are selected.

{
  "type" : "CategoryFilter",
  "values": {
    "column": "ColumnName_1",
    "categories": [1, 2, 3]
  }
}