Dataset Properties

From the content of a dataset to its expected format, the data must be handled correctly in order to get results from it. A few restrictions apply to ensure the correct interpretation of the dataset’s contents. The explanations below cover these restrictions and apply regardless of which TIM tool you use.

Timestamps

To reflect the nature of the data (a time series), every observation should be connected to exactly one timestamp. These timestamps should form the first column of the dataset and should be in UTC.

Timestamps should be aligned to midnight, i.e. “00:00:00”. As a consequence, the subsequent timestamps “00:00:00”, “00:10:00” and “00:20:00” are accepted, whereas the subsequent timestamps “00:05:00”, “00:15:00” and “00:25:00” are not.
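
For illustration, a minimal Python sketch (not part of TIM) that checks this alignment for a given sampling rate, assuming pandas is available:

import pandas as pd

def is_midnight_aligned(timestamps: pd.DatetimeIndex, rate: pd.Timedelta) -> bool:
    # The offset of each timestamp from the preceding midnight must be
    # an exact multiple of the sampling rate.
    offsets = timestamps - timestamps.normalize()
    return bool(((offsets % rate) == pd.Timedelta(0)).all())

ok = pd.DatetimeIndex(["2021-01-01 00:00:00", "2021-01-01 00:10:00"])
bad = pd.DatetimeIndex(["2021-01-01 00:05:00", "2021-01-01 00:15:00"])
print(is_midnight_aligned(ok, pd.Timedelta(minutes=10)))   # True
print(is_midnight_aligned(bad, pd.Timedelta(minutes=10)))  # False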

Date-time values in requests and responses are represented by standard JSON “date-time” strings, which conform to the ISO 8601 specification (see section 7.3.1 of the JSON Schema Validation specification and section 5.6 of RFC 3339).

Timestamps that represent lower frequencies (such as days and years) should not include a suffix that does not change: years as "1995", months as "1995-07", days as "1995-07-05", quarter-years as "1995-03", etc.

Target variable

Each dataset should contain exactly one target variable (the variable for which one or more forecasts are wanted). The observations of this target variable should correspond to the second column of the dataset.

Explanatory variables

If desired, more variables with possible explanatory power can be added to enhance modeling results. All remaining columns, if any, can contain these possible explanatory variables (predictors).
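
A hypothetical excerpt of such a dataset (all column names and values are made up), with the timestamp in the first column, the target in the second and two predictors after that:

timestamp,consumption,temperature,is_holiday
2021-01-01 00:00:00,412.3,-1.2,1
2021-01-01 01:00:00,398.7,-1.5,1
2021-01-01 02:00:00,391.0,-1.9,1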

Sampling rate

The sampling rate is defined as the time between subsequent observations. All timestamps should be evenly spaced, resulting in a constant sampling rate. The only possible exception is when a gap in measurements occurs.

Observations cannot occur more frequently than the sampling rate allows, and timestamps must fall on the resulting grid. In other words, when the sampling rate is 15 minutes, the dataset cannot contain both the timestamp “00:00:00” and the timestamp “00:12:00”.

Every predictor should have the same sampling rate, as every timestamp is connected to an entire observation. TIM currently supports the following sampling periods (a validation sketch follows the list):

  • Fixed length:
    • 1, 2, 3, 4, 5, 6, 10, 12, 15, 20 or 30 seconds
    • 1, 2, 3, 4, 5, 6, 10, 12, 15, 20 or 30 minutes
    • 1, 2, 3, 4, 6, 8 or 12 hours
    • any number of days
  • Variable length:
    • any number of months
    • any number of years
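
For fixed-length sampling periods, the spacing can be verified before uploading. A minimal Python sketch (an illustration, not TIM's own validation), assuming pandas:

import pandas as pd

def infer_sampling_rate(timestamps: pd.DatetimeIndex) -> pd.Timedelta:
    # Differences between consecutive timestamps; with a constant
    # sampling rate, every difference is a multiple of the rate
    # (larger multiples correspond to gaps in the measurements).
    diffs = pd.Series(timestamps).diff().dropna()
    rate = diffs.min()
    if not ((diffs % rate) == pd.Timedelta(0)).all():
        raise ValueError("timestamps are not evenly spaced")
    return rate

ts = pd.DatetimeIndex(["2021-01-01 00:00", "2021-01-01 00:15", "2021-01-01 01:00"])
print(infer_sampling_rate(ts))  # 0 days 00:15:00 (the 45-minute step is a gap)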

Categorical data

Currently TIM does not support categorical variables, with the exception of binary variables (i.e. variables with two possible states); these are supported when they are represented as Booleans, i.e. through the values 0 and 1.
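
A minimal sketch of such an encoding, assuming a pandas DataFrame with a hypothetical binary column "holiday":

import pandas as pd

df = pd.DataFrame({"holiday": ["yes", "no", "yes"]})
# Map the two states onto the Boolean values 0 and 1 so TIM can use the column.
df["holiday"] = df["holiday"].map({"no": 0, "yes": 1})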

Missing data, infinite values and regularly missing data

Since API 4.1

TIM tries to parse every value to a float. If it cannot, it considers the value missing (examples are categorical values, NA strings or infinity markers). If a succession of missing values is short, TIM automatically replaces them with the last available value of the corresponding predictor. This interpolation can be turned off or switched to another type (such as linear interpolation).

The "gap" length of missing values to be interpolated can also be manipulated. This is useful when dealing with regularly missing data. If for example data are missing for every weekend, we should change the maximum interpolation length to cover for this gap (e.g. in case of hourly sampled data to 48).

modelBuilding:
  configuration:
    interpolation:
      type: LastValue   # one of LastValue, Linear or None
      maxLength: 48     # maximum number of consecutive missing values to fill
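
TIM performs this interpolation server-side; as a rough illustration of the three types (the equivalence is an assumption, not TIM's implementation), the following pandas sketch mimics their behavior on a single column:

import pandas as pd

s = pd.Series([1.0, None, None, 4.0])
filled_last = s.ffill(limit=48)                         # LastValue: repeat the last known value
filled_lin = s.interpolate(method="linear", limit=48)   # Linear: interpolate between neighbors
filled_none = s                                         # None: leave missing values untouched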

Outliers

In general, most datasets should not be modeled with more than two years of data. More data rarely contributes to forecasting accuracy and can sometimes even be detrimental when underlying dependencies change over time.

Sometimes the data may contain observable outliers when something unusual happened, such as electricity shortages, portfolio changes or the breakdown of measuring sensors. In these cases, it might be beneficial to omit the affected timestamps and the data connected to them, as the outliers may distort the model. This only applies to forecasting; in anomaly detection, finding these outliers is the exact purpose of the model.
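
One crude way to locate candidate outliers before removing the affected timestamps is a simple z-score rule; a minimal sketch, assuming a pandas DataFrame df with the target in a hypothetical column "target":

import pandas as pd

def drop_outliers(df: pd.DataFrame, column: str, k: float = 3.0) -> pd.DataFrame:
    # Drop observations further than k standard deviations from the mean.
    z = (df[column] - df[column].mean()) / df[column].std()
    return df[z.abs() <= k]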

Data quality

Data quality can vary across datasets, as well as across measuring instruments within one dataset. Meteorological predictors, for example, can be of widely varying quality depending on the instruments used to collect their values. Predictors can consist of both historical actuals and forecasts; the two categories often differ in quality.

When backtesting, it is recommended to always merge these types of predictors, resulting in one predictor containing values of the highest possible quality for every observation. Most often, this means a model is trained on historical actuals and applied using the best available forecasts.
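
A minimal merging sketch, assuming two pandas Series indexed by timestamp: a hypothetical "actuals" series (with missing recent values) and a "forecast" series for the same predictor:

import pandas as pd

idx = pd.date_range("2021-01-01", periods=4, freq="h")
actuals = pd.Series([10.1, 10.4, None, None], index=idx)
forecast = pd.Series([10.0, 10.5, 10.8, 11.0], index=idx)

# Prefer measured values where available; fall back to the forecast.
merged = actuals.combine_first(forecast)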