Input data properties

How a dataset is structured and formatted determines whether TIM can interpret it correctly, so a few restrictions apply to input data. This section explains those restrictions; they hold regardless of which interface is used to work with TIM.

Dataset type

TIM supports two main types of datasets: single time series and panel datasets.

Time-series data

Time-series data is a collection of observations for an individual entity over time. By default, every dataset is treated as time-series data.

Timestamps in a time-series dataset have to be unique. If duplicate timestamps occur, the first occurrence in the data with corresponding observations is selected and all others are ignored.
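The duplicate rule can be mimicked before upload to see which rows would survive. The following is a minimal sketch in plain Python, not TIM's implementation; the function name `keep_first_timestamps` and the row layout (tuples whose first element is the timestamp) are assumptions made for illustration.

```python
def keep_first_timestamps(rows):
    """Return rows with only the first occurrence of each timestamp kept.

    Assumes each row is a tuple whose first element is the timestamp.
    """
    seen = set()
    kept = []
    for row in rows:
        timestamp = row[0]
        if timestamp in seen:
            continue  # a later duplicate of an already seen timestamp: ignored
        seen.add(timestamp)
        kept.append(row)
    return kept


rows = [
    ("2022-01-01", 11),
    ("2022-01-02", 10),
    ("2022-01-01", 99),  # duplicate timestamp, dropped
]
deduplicated = keep_first_timestamps(rows)
```

Running a check like this locally makes it easier to spot duplicates that would otherwise be discarded silently.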

A dataset must contain a timestamp column and at least one variable, which can be used as the target variable.

Example

Timestamp    Sales   Holidays   Temperature
2022-01-01   11      1          11
2022-01-02   10      0          10
2022-01-03   16      0          12
2022-01-04   20      0          9
2022-01-05           0          8

Panel data

Panel data is a collection of observations for multiple entities over time. This documentation will refer to the individual entities as groups and to the variables that split the data into different groups as group keys.

A dataset is treated as panel data when group keys are specified; if none are specified, it is treated as single time-series data. Group keys can correspond to one or more columns. They must be defined when a dataset is first uploaded, because validation needs to know from the beginning which columns are group keys, and the validation of panel data differs slightly from that of classical time-series data. Once set, the group keys property cannot be changed.

To define a dataset as panel data through the TIM API, specify its group keys in JSON like this:

{
  "groupKeys" : ["Store ID", "Category"]
}

Unlike classical time-series data, panel data may - and most often, does - include duplicate timestamps. However, each group should contain only unique timestamps. If there is a duplicate timestamp within a group, this is handled as it is for classical time series: the first occurrence of the timestamp in the group with corresponding data is selected, and all others are ignored.

A dataset must contain a timestamp column, group keys and at least one variable, which can be used as the target variable.

The sampling period of a panel dataset is calculated across the whole dataset, taking the groups into account. Individual groups should therefore be sampled at similar rates.

Example

Store ID and Category represent group keys.

Store ID   Category    Timestamp    Sales   Holidays   Temperature   Store Size
1          Food        2022-01-01   11      1          11            25
1          Food        2022-01-02   10      0          10            25
1          Food        2022-01-03           0          12            25
1          Household   2022-01-01   20      1          11            25
1          Household   2022-01-02   22      0          10            25
1          Household   2022-01-03           0          12            25
2          Food        2022-01-01   40      1          13            100
2          Food        2022-01-02   43      0          9             100
2          Food        2022-01-03           0          10            100

Dataset size

The dataset size shouldn't exceed 100 MB. The table below gives some rough estimates of what this means in terms of rows (observations) and columns (variables).

Rows        Columns
4 000 000   1
1 300 000   10
170 000     100
17 000      1000

Note: This table assumes the timestamp format yyyy-mm-dd HH:MM:SS and numeric values written with four digits (e.g. 0.582).
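The row estimates can be approximated with simple arithmetic. The sketch below assumes a CSV-like layout that the documentation does not state explicitly: a 19-character timestamp, one delimiter plus a 5-character value per variable, a single-byte line ending, and "100 MB" read as 100 × 1024 × 1024 bytes. The function name `estimate_max_rows` is hypothetical.

```python
TIMESTAMP_CHARS = 19   # "yyyy-mm-dd HH:MM:SS"
VALUE_CHARS = 5        # e.g. "0.582"
SEPARATOR_CHARS = 1    # one delimiter before each value (assumed)
NEWLINE_CHARS = 1      # single-byte line ending (assumed)
SIZE_LIMIT_BYTES = 100 * 1024 * 1024  # 100 MB read as 100 MiB (assumed)


def estimate_max_rows(columns, limit_bytes=SIZE_LIMIT_BYTES):
    """Estimate how many rows fit within the limit for a number of variables."""
    bytes_per_row = (
        TIMESTAMP_CHARS + columns * (SEPARATOR_CHARS + VALUE_CHARS) + NEWLINE_CHARS
    )
    return limit_bytes // bytes_per_row


for columns in (1, 10, 100, 1000):
    print(columns, estimate_max_rows(columns))
```

Under these assumptions the results fall in the same range as the table; longer values or additional text columns reduce the number of rows that fit.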

Number of observations

In general, most datasets with a higher sampling rate should not be modeled with more than 2 years of data. Additional history rarely improves accuracy and can even be detrimental when the underlying dependencies change over time.

Timestamps

Because the data is a time series, every observation should be connected to exactly one timestamp. For panel data, a single observation pertains to a single group, so duplicate timestamps can occur across groups, but not within groups. Timestamps usually correspond to the first column of the dataset and are by default assumed to be in the UTC timezone. Both the column and the timezone can, however, be set differently.

{
  "timestampColumn" : "Timestamp",
  "timeZone": "+02:00"
}
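The sketch below only illustrates what a fixed UTC offset such as "+02:00" means for a timestamp; how TIM applies the setting internally is not described here. The helper names `parse_offset` and `to_utc` are hypothetical, and the code assumes the offset is always written as a sign followed by HH:MM.

```python
from datetime import datetime, timedelta, timezone


def parse_offset(offset):
    """Turn an offset string such as "+02:00" into a datetime.timezone."""
    sign = -1 if offset.startswith("-") else 1
    hours, minutes = offset.lstrip("+-").split(":")
    return timezone(sign * timedelta(hours=int(hours), minutes=int(minutes)))


def to_utc(naive_timestamp, offset):
    """Read a naive ISO 8601 timestamp as local time at `offset`, return UTC."""
    local = datetime.fromisoformat(naive_timestamp)
    local = local.replace(tzinfo=parse_offset(offset))
    return local.astimezone(timezone.utc)


utc_time = to_utc("2022-01-01 00:00:00", "+02:00")
print(utc_time.isoformat())  # 2021-12-31T22:00:00+00:00
```

Midnight at +02:00 is two hours ahead of UTC, so the same instant falls on the previous day in UTC. This matters when daily data is aligned with other sources.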

The TIM Platform is ISO 8601 compliant. The formatting of timestamps is an important topic for time-series analysis and thus has its own section in the documentation that explains accepted formats in more detail.

Sampling rate and sampling period

The sampling rate of an equidistant time series (one sampled at a constant rate) is the number of samples or observations per unit of time. Conversely, the sampling period is the time difference between two consecutive samples or observations of an equidistant time series.

Once a dataset is uploaded, TIM will try to estimate the dataset's original sampling period. This will always be one of the following time differences:

  • 1, 2, 3, 4, 5, 6, 10, 12, 15, 20 or 30 seconds
  • 1, 2, 3, 4, 5, 6, 10, 12, 15, 20 or 30 minutes (expressed in seconds)
  • 1, 2, 3, 4, 6, 8 or 12 hours (expressed in seconds)
  • any number of days (expressed in seconds)
  • any number of months (expressed in months)

TIM will try to determine the best fit based on the median distance between consecutive observations.
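To get a feel for this estimate, the idea can be sketched in a few lines of Python. This is not TIM's algorithm: snapping the median gap to the nearest allowed value is an assumption, monthly periods are left out for brevity, and the function name `estimate_sampling_period` is hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median

SECOND_STEPS = [1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30]
MINUTE_STEPS = [60 * m for m in (1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30)]
HOUR_STEPS = [3600 * h for h in (1, 2, 3, 4, 6, 8, 12)]
DAY = 86400
SUB_DAILY_PERIODS = SECOND_STEPS + MINUTE_STEPS + HOUR_STEPS


def estimate_sampling_period(timestamps):
    """Estimate the sampling period in seconds from sorted datetime objects."""
    gaps = [
        (later - earlier).total_seconds()
        for earlier, later in zip(timestamps, timestamps[1:])
    ]
    typical_gap = median(gaps)
    # Candidates: every listed sub-daily period plus the nearest whole days.
    whole_days = max(1, round(typical_gap / DAY))
    candidates = SUB_DAILY_PERIODS + [whole_days * DAY]
    # Assumption: pick the candidate closest to the median gap.
    return min(candidates, key=lambda period: abs(period - typical_gap))


start = datetime(2022, 1, 1)
every_27_minutes = [start + timedelta(minutes=27 * i) for i in range(10)]
period = estimate_sampling_period(every_27_minutes)  # 1800 seconds (30 minutes)
```

Using the median rather than the mean keeps a few long gaps, such as an outage, from distorting the estimate.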

This doesn't mean that the data is stored differently from how it was uploaded. However, for forecasting applications, the original sampling period is used to rescale the data by default. Forecasting applications always require an equidistant distribution of timestamps, although missing data is still allowed. If, for example, the data is recorded irregularly several times per second, TIM will internally convert the dataset to a 1-second resolution and build models that forecast at a 1-second resolution as well. If the dataset is recorded every 27 minutes, TIM Forecasting will use a version of the dataset with a 30-minute resolution instead.

Variables

Target or KPI variable

Each dataset to be analysed should contain exactly one target or KPI variable: the variable to forecast or to run detection on. The observations of this variable usually correspond to the second column of the dataset. As with the timestamp column, a user can indicate a different target or KPI column.

Explanatory variables or predictors

TIM supports multivariate time series analysis. This means that, if desired, more variables with potential explanatory power can be added to enhance modeling results. Any remaining dataset columns (i.e. any columns except the timestamp, group key and target or KPI columns) can contain these potential explanatory variables or predictors. TIM will only take into account those variables that are relevant for a specific modeling use case; however, a user can overrule this default behavior and exclude specific variables or variable transformations.

Group keys

Group keys are a special case of categorical variables and are relevant for panel data. Unique combinations of group key values split panel data into smaller groups; each group forms an individual time series. Group keys can be represented as Strings or Integers and cannot contain missing values. Rows with missing group key values will not be stored, since such rows cannot be assigned to any group. The strings "na", "nan", "n/a", "missing", "null", "none", "nothing" are treated as missing values and are therefore ignored.
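The grouping behavior can be previewed locally. The sketch below is illustrative only: the names `is_missing_key` and `split_into_groups` are hypothetical, and treating the marker strings case-insensitively and counting empty strings as missing are assumptions, not documented behavior.

```python
from collections import defaultdict

MISSING_MARKERS = {"na", "nan", "n/a", "missing", "null", "none", "nothing"}


def is_missing_key(value):
    """True if a group key value counts as missing (assumed case-insensitive)."""
    if value is None:
        return True
    text = str(value).strip().lower()
    return text == "" or text in MISSING_MARKERS


def split_into_groups(rows, group_keys):
    """Group dict rows by their group key values, skipping rows with missing keys."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row.get(name) for name in group_keys)
        if any(is_missing_key(value) for value in key):
            continue  # cannot be assigned to any group
        groups[key].append(row)
    return dict(groups)


rows = [
    {"Store ID": 1, "Category": "Food", "Timestamp": "2022-01-01", "Sales": 11},
    {"Store ID": 1, "Category": "Household", "Timestamp": "2022-01-01", "Sales": 20},
    # Illustrative row with a missing group key; it ends up in no group.
    {"Store ID": 2, "Category": "n/a", "Timestamp": "2022-01-01", "Sales": 40},
]
groups = split_into_groups(rows, ["Store ID", "Category"])
```

Each resulting key is one unique combination of group key values, and the rows under it form that group's time series.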

Categorical data

Apart from group keys, TIM currently supports only binary categorical variables (i.e. variables with two possible states). For TIM to be able to interpret them, these variables should be represented as Booleans, i.e. with the values 0 and 1 or true and false.

Predictors and their forecasts

In some applications, the values of certain predictors can be "known" in advance because forecasts of them are available. (Examples include binary variables indicating public holidays, or meteorological variables that have been forecasted.) Predictors like these have both historical actuals and forecasts. The quality of these forecasts tends to vary across datasets, as well as in a single dataset across measuring instruments. Meteorological predictors, for example, can be of largely varying quality depending on the instruments used to collect their values. Because of this variation in quality, it can be beneficial to take into account which observations contain historical actuals and which contain predictor forecasts, and to potentially treat them differently. In general, it's preferable to build your models using historical actuals and then use the predictor forecasts for model evaluation in backtesting and production. TIM can handle such variables as it supports varying data availability per variable and generates models that are aware of the situation they have been built in.

Missing data or gaps

For categorical non-Boolean variables (currently only group keys are supported), TIM treats the strings "na", "nan", "n/a", "missing", "null", "none", "nothing" as missing values.

For other types of variables, TIM tries to parse every value to a float. If it cannot, it will consider the value to be missing. Examples of values interpreted as missing are null values, categorical (non-Boolean) values, NA strings and infinity markers. This is not a problem though, as TIM can handle missing data during model building, and also offers multiple ways to impute missing data.
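The parsing rule can be approximated as follows. This is a sketch under stated assumptions rather than TIM's parser: the function name `parse_value` is hypothetical, and mapping "true" and "false" to 1.0 and 0.0 is an assumption based on the Boolean representation described earlier.

```python
import math


def parse_value(raw):
    """Return a float, or None when the value should be treated as missing."""
    if raw is None:
        return None
    text = str(raw).strip().lower()
    # Assumption: Boolean strings map to 1.0 and 0.0.
    if text == "true":
        return 1.0
    if text == "false":
        return 0.0
    try:
        number = float(text)
    except ValueError:
        return None  # e.g. "NA" or a non-Boolean category such as "Food"
    # Python's float() accepts "nan" and "inf"; both count as missing here.
    return number if math.isfinite(number) else None


parsed = [parse_value(v) for v in ["0.582", "NA", "inf", "Food", "true", None]]
```

Applying such a check to a column before upload shows how many of its values would end up as gaps.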