# Data properties

From the content of a dataset to its expected format, data must be handled correctly to obtain meaningful results. A few restrictions apply to ensure that the dataset's contents are interpreted correctly. This section explains these restrictions; they apply regardless of which TIM tool you use.

## Size

The dataset should not exceed 100 MB. This roughly corresponds to:

| Rows | Columns |
|---|---|
| 4 000 000 | 1 |
| 1 300 000 | 10 |
| 170 000 | 100 |
| 17 000 | 1000 |

Note: The table assumes timestamps in the format yyyy-mm-dd HH:MM:SS and numeric values written with four digits (e.g. 0.582).
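The row and column counts above can be sanity-checked with a back-of-the-envelope estimate. The sketch below is only an illustration, not TIM's own size check: it assumes the data is stored as plain CSV text with a one-byte separator before each value, a 19-character timestamp, 5-character values such as `0.582`, and a one-byte newline per row. The function name `estimate_csv_size_mb` is hypothetical.

```python
def estimate_csv_size_mb(rows: int, columns: int,
                         timestamp_chars: int = 19,
                         value_chars: int = 5) -> float:
    """Rough size of a CSV file in MB (1 MB taken as 1 000 000 bytes).

    Assumptions (not taken from TIM): one byte per character, a one-byte
    separator in front of every value, and a one-byte newline per row.
    """
    bytes_per_row = timestamp_chars + columns * (1 + value_chars) + 1
    return rows * bytes_per_row / 1_000_000


# The four row/column combinations listed in the table.
for rows, columns in [(4_000_000, 1), (1_300_000, 10),
                      (170_000, 100), (17_000, 1000)]:
    size = estimate_csv_size_mb(rows, columns)
    print(f"{rows:>9} rows x {columns:>4} columns: about {size:.0f} MB")
```

Under these assumptions, each combination comes out slightly above 100 MB, which agrees with the "roughly" in the text. Longer timestamps or more digits per value reduce the number of rows that fit.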

## Timestamps

Because the data is a time series, every observation should be associated with exactly one timestamp. The timestamps are usually stored in the first column of the dataset and given in UTC. However, this can be configured differently.

## Target or KPI variable

Each dataset used for forecasting should contain exactly one target variable: the variable to be forecast. Similarly, datasets used for anomaly detection (AD) have one KPI variable. The observations of this variable are usually stored in the second column of the dataset.

## Explanatory variables

If desired, additional variables with potential explanatory power can be added to improve modeling results. All remaining columns, if any, can contain these explanatory variables (predictors).
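The default column layout described in the sections above (timestamp first, target or KPI second, predictors in the remaining columns) can be illustrated with a small sketch. The sample data, the column names, and the helper `split_columns` are all hypothetical and only serve to show the layout; this is not how TIM itself reads a dataset.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample: timestamp, target, then two predictors.
SAMPLE = """timestamp,target,temperature,is_holiday
2021-01-01 00:00:00,0.582,3.1,1
2021-01-01 01:00:00,0.611,2.9,1
2021-01-01 02:00:00,0.604,2.7,1
"""


def split_columns(csv_text, timestamp_format="%Y-%m-%d %H:%M:%S"):
    """Split a CSV text into timestamps, target name and predictor names,
    assuming the default layout described above."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)
    # First column: exactly one timestamp per observation.
    timestamps = [datetime.strptime(row[0], timestamp_format) for row in rows]
    if len(set(timestamps)) != len(timestamps):
        raise ValueError("every observation needs its own timestamp")
    return {
        "timestamps": timestamps,
        "target_name": header[1],        # second column: target or KPI
        "predictor_names": header[2:],   # remaining columns: predictors
    }


layout = split_columns(SAMPLE)
print(layout["target_name"], layout["predictor_names"])
```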

## Sampling rate and period

The sampling rate is defined as the number of samples per unit of time in an equidistant time series (one sampled at a constant rate).

The sampling period is the time difference between two consecutive samples of an equidistant time series.

Once the dataset is uploaded, TIM will try to estimate its native sampling period. This will always be one of:

- 1, 2, 3, 4, 5, 6, 10, 12, 15, 20 or 30 seconds
- 1, 2, 3, 4, 5, 6, 10, 12, 15, 20 or 30 minutes given in seconds
- 1, 2, 3, 4, 6, 8 or 12 hours given in seconds
- any number of days given in seconds
- any number of months given in months

TIM will try to pick the best fit based on the median distance between consecutive observations. This does not mean that the data is stored differently from how it was uploaded. However, forecasting applications use the native sampling period to rescale the data by default. Forecasting always requires equidistant timestamps, although missing data is still allowed. For example, if your data is recorded irregularly a few times per second, TIM will internally convert the dataset to a 1-second resolution and build models that forecast at a 1-second resolution as well. If your dataset is recorded every 27 minutes, TIM Forecasting will use a version of the dataset with a 30-minute resolution instead.
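The idea of estimating a sampling period from the median distance between observations can be sketched as follows. This is an illustration under stated assumptions, not TIM's actual algorithm: it assumes that "best fit" means the allowed period closest to the median gap, and it leaves out monthly periods for brevity. The function name `estimate_sampling_period` is hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median

SECOND, MINUTE, HOUR, DAY = 1, 60, 3600, 86400

# Allowed sub-day periods from the list above, expressed in seconds.
SUB_DAY_PERIODS = (
    [n * SECOND for n in (1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30)]
    + [n * MINUTE for n in (1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30)]
    + [n * HOUR for n in (1, 2, 3, 4, 6, 8, 12)]
)


def estimate_sampling_period(timestamps):
    """Return the allowed period (in seconds) closest to the median gap.

    Assumption: closeness is measured as absolute difference in seconds.
    Monthly periods are not handled in this sketch.
    """
    ordered = sorted(timestamps)
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("at least two timestamps are needed")
    typical_gap = median(gaps)
    # "Any number of days": the nearest whole number of days, at least one.
    whole_days = max(1, round(typical_gap / DAY)) * DAY
    candidates = SUB_DAY_PERIODS + [whole_days]
    return min(candidates, key=lambda period: abs(period - typical_gap))


start = datetime(2021, 1, 1)
every_27_minutes = [start + i * timedelta(minutes=27) for i in range(10)]
print(estimate_sampling_period(every_27_minutes))  # 1800 seconds = 30 minutes
```

With this rule, data recorded several times per second has a median gap below one second and is therefore mapped to the smallest allowed period of 1 second, which matches the example in the text.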

## Categorical data

Currently, TIM does not support categorical variables, with the exception of binary variables (i.e. variables with two possible states). These are supported when represented as Booleans, i.e. through the values 0 and 1.
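A two-state variable stored as text can be converted to the 0/1 representation before upload. The helper below is a hypothetical preprocessing sketch and not part of TIM; which state becomes 0 and which becomes 1 is an arbitrary choice here (alphabetical order).

```python
def encode_binary(values):
    """Map a variable with exactly two states to 0 and 1.

    Returns the encoded values and the mapping that was used.
    Assumption: the alphabetically first state is encoded as 0.
    """
    states = sorted(set(values))
    if len(states) != 2:
        raise ValueError(f"expected exactly two states, found {len(states)}")
    mapping = {states[0]: 0, states[1]: 1}
    return [mapping[value] for value in values], mapping


encoded, mapping = encode_binary(["off", "on", "on", "off"])
print(encoded, mapping)
```

Variables with more than two states are rejected by this sketch, in line with the restriction described above.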

## Missing data, imputation and irregularly missing data

TIM tries to parse every value as a float. If it cannot, it treats the value as missing (examples are categorical values, NA strings or infinity markers). TIM offers the following options for imputing missing data:

- last nonmissing observation carried forward
- linear imputation
- no imputation

The maximum length of a gap of missing values to be imputed can be set as well. This is useful when dealing with regularly missing data. For example, if data is missing every weekend, the maximum imputation length should be increased to cover this gap (e.g. to 48 for hourly sampled data).
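The parsing rule and the imputation options above can be sketched in a few lines. This is a simplified illustration and not TIM's implementation. The names `parse_value` and `impute` are hypothetical, and the sketch assumes that a gap longer than the maximum length is left entirely unfilled and that linear imputation needs an observed value on both sides of the gap.

```python
import math


def parse_value(raw):
    """Return raw as a finite float, or None if it counts as missing."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None  # e.g. "NA" or a categorical label
    # Python parses "inf" and "nan" as floats, so reject them explicitly.
    return value if math.isfinite(value) else None


def impute(values, method="locf", max_gap=6):
    """Fill runs of None no longer than max_gap.

    method: "locf" (last observation carried forward), "linear" or "none".
    """
    filled = list(values)
    i = 0
    while i < len(filled):
        if filled[i] is not None:
            i += 1
            continue
        # Find the end of this run of missing values.
        j = i
        while j < len(filled) and filled[j] is None:
            j += 1
        gap = j - i
        left = filled[i - 1] if i > 0 else None
        right = filled[j] if j < len(filled) else None
        if gap <= max_gap:
            if method == "locf" and left is not None:
                for k in range(i, j):
                    filled[k] = left
            elif method == "linear" and left is not None and right is not None:
                step = (right - left) / (gap + 1)
                for k in range(i, j):
                    filled[k] = left + step * (k - i + 1)
        i = j
    return filled


parsed = [parse_value(raw) for raw in ["1.0", "NA", "inf", "4.0"]]
print(parsed)                                    # [1.0, None, None, 4.0]
print(impute(parsed, method="locf", max_gap=2))  # [1.0, 1.0, 1.0, 4.0]
print(impute(parsed, method="locf", max_gap=1))  # [1.0, None, None, 4.0]
```

For hourly data with missing weekends, calling `impute` with `max_gap=48` would cover the whole weekend gap, as in the example in the text.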

## Outliers and number of observations

In general, most datasets with a higher sampling rate should not be modeled with more than 2 years of data. More data rarely improves forecasting accuracy and can sometimes even be detrimental when underlying dependencies change over time.

Sometimes the data contains observable outliers caused by unusual events, such as electricity shortages, portfolio changes or a breakdown of measuring sensors. In these cases, it might be beneficial to omit the respective timestamps and the data connected to them, as the outliers may distort the model. This applies only to forecasting; in anomaly detection, finding these outliers is the exact purpose of the model.

## Predictors and their forecasts

In some applications, the values of certain predictors can be "known" in advance, because forecasts of them are available. However, the quality of these forecasts can vary across datasets, as well as within one dataset across measuring instruments. Meteorological predictors, for example, can vary widely in quality depending on the instruments used to collect their values. Predictors like these have both historical actuals and forecasts. In general, it is preferable to build your models using historical actuals and then use forecasts for model evaluation in backtesting and production.