From the content of the dataset to its expected format, the data must be handled correctly in order to get meaningful results from it. A few restrictions apply to ensure the correct interpretation of the dataset's contents. The explanations of these restrictions below apply regardless of which TIM tool you use.
The dataset shouldn't exceed 100 MB. This roughly equals to:

| Number of observations | Number of variables |
| --- | --- |
| 12 500 000 | 1 |
| 1 250 000 | 10 |
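As a rough sanity check before uploading, the rule of thumb above can be expressed in a few lines of Python. The 8-bytes-per-value figure is an illustrative assumption chosen to reproduce the table, not a TIM constant:

```python
def approx_max_observations(n_variables: int,
                            limit_mb: float = 100.0,
                            bytes_per_value: float = 8.0) -> int:
    # Rough rule of thumb mirroring the table above: the observation
    # budget scales inversely with the number of variables.
    # bytes_per_value is an illustrative assumption, not a TIM constant.
    total_values = limit_mb * 1_000_000 / bytes_per_value
    return int(total_values / n_variables)

approx_max_observations(1)    # 12_500_000
approx_max_observations(10)   # 1_250_000
```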
To reflect the nature of the data (time series), every single observation should be connected to exactly one timestamp. These timestamps should correspond to the first column of the dataset and should be in UTC.
Date-time objects in requests/responses are represented by standard JSON "date-time" strings in the format "yyyy-mm-dd HH:MM:SS.sss".
Timestamps that represent lower frequencies (such as days or years) should not include a suffix that does not change: years "1995", months "1995-07", days "1995-07-05", quarter-years "1995-03", etc.
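A minimal sketch of these formats using Python's standard `datetime` module (the date itself is arbitrary):

```python
from datetime import datetime

ts = datetime(1995, 7, 5)

# Full date-time, as used in requests and responses.
full = ts.strftime("%Y-%m-%d %H:%M:%S") + ".000"   # '1995-07-05 00:00:00.000'

# Lower frequencies drop the constant suffix entirely.
yearly = ts.strftime("%Y")         # '1995'
monthly = ts.strftime("%Y-%m")     # '1995-07'
daily = ts.strftime("%Y-%m-%d")    # '1995-07-05'
```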
Each dataset should contain exactly one target variable (the variable for which one or more forecasts are wanted). The observations of this target variable should correspond to the second column of the dataset.
If desired, more variables with possible explanatory power can be added to enhance modeling results. All remaining columns, if any, can contain these possible explanatory variables (predictors).
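The resulting column layout can be sketched as follows; the column names and values here are hypothetical, only the ordering matters:

```python
import csv
import io

# Hypothetical dataset: the timestamp comes first, the target second,
# and optional predictors occupy the remaining columns.
raw = """Timestamp,Consumption,Temperature,PublicHoliday
2011-01-01 00:00:00,870.5,-1.2,1
2011-01-01 01:00:00,842.1,-1.5,1
"""

reader = csv.DictReader(io.StringIO(raw))
rows = list(reader)
target_column = reader.fieldnames[1]   # 'Consumption'
predictors = reader.fieldnames[2:]     # ['Temperature', 'PublicHoliday']
```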
Sampling rate and period¶
The sampling rate is defined as the number of samples of an equidistant (sampled at a constant rate) time series per unit of time.
The sampling period is the time difference between two consecutive samples of an equidistant time series.
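The sampling period of an equidistant series can be recovered from its timestamps; taking the smallest gap between consecutive samples makes the estimate robust to gaps in measurements. A minimal stdlib sketch:

```python
from datetime import datetime

def sampling_period_seconds(timestamps):
    # Smallest gap between consecutive samples; for an equidistant
    # series this equals the sampling period even when a gap occurs.
    gaps = [(b - a).total_seconds()
            for a, b in zip(timestamps, timestamps[1:])]
    return min(gaps)

ts = [datetime(2021, 1, 1, 0, 0),
      datetime(2021, 1, 1, 0, 15),
      datetime(2021, 1, 1, 1, 0)]   # one 45-minute gap in the series
sampling_period_seconds(ts)          # 900.0 seconds, i.e. 15 minutes
```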
Fixed sampling rate¶
All timestamps are evenly spaced, resulting in a constant sampling rate. The only possible exception is when a gap in measurements occurs.
If the dataset were fully sampled throughout a given day, it would contain the midnight timestamp "00:00:00". As a consequence, for a 10-minute sampling period, subsequent timestamps of "00:10:00", "00:20:00" and "00:30:00" are accepted, whereas subsequent timestamps of "00:05:00", "00:15:00" and "00:25:00" are not.
The frequency of observations in the dataset cannot exceed the sampling rate. In other words, when the sampling period is 15 minutes, the dataset cannot contain both timestamp “00:00:00” and timestamp “00:12:00”.
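Both rules amount to requiring that every timestamp lies on a sampling grid anchored at midnight. A small sketch of such a check (the function name is illustrative):

```python
from datetime import datetime

def is_aligned(ts: datetime, period_seconds: int) -> bool:
    # A timestamp is valid only if it lies on the sampling grid
    # anchored at midnight of its own day.
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return (ts - midnight).total_seconds() % period_seconds == 0

period = 15 * 60   # 15-minute sampling period
is_aligned(datetime(2021, 1, 1, 0, 30), period)   # True
is_aligned(datetime(2021, 1, 1, 0, 12), period)   # False
```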
TIM currently supports the following regular sampling periods:
- Fixed length:
- 1, 2, 3, 4, 5, 6, 10, 12, 15, 20 or 30 seconds
- 1, 2, 3, 4, 5, 6, 10, 12, 15, 20 or 30 minutes
- 1, 2, 3, 4, 6, 8 or 12 hours
- any number of days
- Variable length:
- any number of months
- any number of years
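The fixed-length part of this list can be checked programmatically. This is a sketch based on the enumeration above, not an official TIM validator:

```python
SUPPORTED_SECONDS = {1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30}
SUPPORTED_MINUTES = {1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30}
SUPPORTED_HOURS = {1, 2, 3, 4, 6, 8, 12}

def is_supported_fixed_period(seconds: int) -> bool:
    # True if a fixed-length sampling period (given in seconds)
    # appears in the list above; any whole number of days is allowed.
    if seconds % 86_400 == 0:
        return True
    if seconds % 3_600 == 0:
        return seconds // 3_600 in SUPPORTED_HOURS
    if seconds % 60 == 0:
        return seconds // 60 in SUPPORTED_MINUTES
    return seconds in SUPPORTED_SECONDS

is_supported_fixed_period(15 * 60)    # True  (15 minutes)
is_supported_fixed_period(7 * 60)     # False (7 minutes)
is_supported_fixed_period(5 * 3600)   # False (5 hours)
```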
If your dataset cannot be converted to a unified sampling rate, or is sampled more frequently than 1 record per second, your data will automatically be treated as irregularly recorded. See the section Irregularly recorded data below.
Mixed sampling rates¶
In some applications it can be useful to deal with data recorded at different sampling rates because they come from different sources (e.g. meteorological data are sampled less frequently than your target). TIM can handle this with a few restrictions:
- the sampling period is detected as the smallest time distance between the combined timestamps of all predictors
- all timestamps should be evenly spaced according to the detected sampling period; otherwise the data will be treated as irregularly recorded
TIM will gather the data into a common tabular form using the detected period. This causes some records of the less frequently sampled predictors to be missing. These missing values can be removed by manipulating the "imputation length" parameter - see the section below.
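The tabularization step can be illustrated with a stdlib sketch: an hourly target combined with a daily predictor yields one row per detected period, with gaps where the slower predictor has no record (all names and values here are illustrative):

```python
from datetime import datetime, timedelta

# Hourly target combined with a daily predictor (illustrative values).
target = {datetime(2021, 1, 1, h): 100.0 + h for h in range(4)}
meteo = {datetime(2021, 1, 1, 0): -1.2}

period = timedelta(hours=1)   # smallest gap across all timestamps
grid = sorted(set(target) | set(meteo))
start, end = grid[0], grid[-1]

rows, ts = [], start
while ts <= end:
    # None marks a missing record of the less frequent predictor.
    rows.append((ts, target.get(ts), meteo.get(ts)))
    ts += period

missing_meteo = sum(1 for _, _, m in rows if m is None)   # 3
```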
Irregularly recorded data¶
Sometimes, especially in anomaly detection cases, data are gathered very frequently, and even at irregular time intervals, so it is difficult to say what exactly the "sampling period" should be. Furthermore, if the data tick e.g. in milliseconds, the day of week, month and similar transformations become meaningless. There are two possible ways of using TIM in these situations:
TIM detects irregularly recorded data automatically if the sampling rate is neither fixed nor mixed. In these cases, timestamps are converted to indices and all time-related features are disabled. After the computations are done, the forecasting and detection outputs are converted back to the original timestamp format. Keep in mind that for truly irregular data "forecast the following three timestamps" (e.g. in the RTInstantML call) may be ambiguous, but TIM tries to estimate the output timestamps for you.
Converting data to indexed data¶
Before using TIM, the user converts timestamps to indices, keeping their original order. Indices should start from 1; the length of the gaps between subsequent records can be taken into consideration, but does not have to be. TIM then treats this data as it would treat data with a yearly sampling period. This has very few mathematical consequences, and TIM can still find most time-related dependencies if they are present in the data. However, some settings, like "time specific models", cannot be used (following the same logic as for yearly data).
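Both variants of this conversion (ignoring gap lengths or preserving them) can be sketched as follows; the function name and parameters are illustrative:

```python
from datetime import datetime

def to_indices(timestamps, keep_gaps=False, period_seconds=None):
    # Replace timestamps with 1-based indices, preserving order.
    # With keep_gaps=True the index spacing reflects the gap lengths,
    # measured in multiples of period_seconds.
    if not keep_gaps:
        return list(range(1, len(timestamps) + 1))
    t0 = timestamps[0]
    return [1 + int((t - t0).total_seconds() // period_seconds)
            for t in timestamps]

ts = [datetime(2021, 1, 1, 0, 0),
      datetime(2021, 1, 1, 0, 15),
      datetime(2021, 1, 1, 1, 0)]
to_indices(ts)                                      # [1, 2, 3]
to_indices(ts, keep_gaps=True, period_seconds=900)  # [1, 2, 5]
```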
Currently TIM does not support categorical variables, with the exception of binary variables (i.e. variables with two possible states). TIM supports them when they are represented as Booleans, i.e. through the values 0 and 1.
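A two-state categorical column therefore has to be recoded to 0/1 before upload. A minimal sketch (the column states and chosen mapping are illustrative):

```python
def encode_binary(series, true_value):
    # Recode a two-state categorical series as 0/1 Booleans,
    # the only categorical form TIM accepts.
    return [1 if v == true_value else 0 for v in series]

encode_binary(["on", "off", "on"], true_value="on")   # [1, 0, 1]
```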
Missing data and infinite values, regularly missing data¶
TIM tries to parse every value to a float. If it cannot, it considers the value missing (examples are categorical values, NA strings or infinity markers). If a succession of missing values is short, TIM will automatically replace them with the last available value of the corresponding predictor. This interpolation can be turned off or switched to another type (such as linear interpolation).
The "gap" length of missing values to be interpolated can also be adjusted. This is useful when dealing with regularly missing data. If, for example, data are missing every weekend, the maximum interpolation length should be changed to cover this gap (e.g. to 48 in the case of hourly sampled data).
    modelBuilding:
      configuration:
        interpolation:
          type: LastValue/Linear/None
          maxLength: 48
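The appropriate `maxLength` for a regular gap follows from the gap duration and the sampling period; a small illustrative helper:

```python
def interpolation_max_length(gap_hours: int, period_minutes: int) -> int:
    # Number of consecutive missing samples a regular gap spans,
    # e.g. a 48-hour weekend gap in hourly data spans 48 samples.
    return gap_hours * 60 // period_minutes

interpolation_max_length(gap_hours=48, period_minutes=60)   # 48
interpolation_max_length(gap_hours=48, period_minutes=15)   # 192
```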
In general, most datasets should not be modeled with more than 2 years of data. More data rarely improves forecasting accuracy and can sometimes even be detrimental when the underlying dependencies change over time.
Sometimes the data may contain observable outliers where something unusual happened - examples include electricity shortages, portfolio changes and a breakdown of measuring sensors. In these cases, it might be beneficial to omit the respective timestamps and the data connected to them, as the outliers may distort the model. This only applies to forecasting purposes; in anomaly detection, finding these outliers is the exact purpose of the model.
Data quality can vary across datasets, as well as across measuring instruments within one dataset. Meteorological predictors, for example, can be of widely varying quality depending on the instruments used to collect their values. Predictors of varying quality can consist of both historical actuals and forecasts; the two categories are often of different quality.
When backtesting, it is recommended to always merge these types of predictors, resulting in one predictor containing the highest-quality value available for every observation. Most often, this means a model is trained on historical actuals and applied using the best available forecasts.
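Such a merge can be sketched as a per-timestamp fallback from actuals to forecasts; the timestamps and values are illustrative:

```python
def merge_best_available(actuals, forecasts):
    # Per timestamp, prefer the historical actual and fall back to the
    # best available forecast - a common backtesting setup.
    merged = dict(forecasts)
    merged.update({ts: v for ts, v in actuals.items() if v is not None})
    return merged

actuals = {"2021-01-01": 10.2, "2021-01-02": None}
forecasts = {"2021-01-01": 9.8, "2021-01-02": 10.5}
merge_best_available(actuals, forecasts)
# {'2021-01-01': 10.2, '2021-01-02': 10.5}
```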