# Data for Tangent

From the content of a dataset to its expected format, data must be handled correctly to get meaningful results from it. A few restrictions apply to ensure that a dataset’s contents are interpreted correctly. This page explains these restrictions, which apply regardless of which interface is used to work with Tangent.

## Time-series data

Tangent works with time-series data: a collection of observations for an individual entity over time. Every dataset is treated as time-series data by default.

Timestamps in a time-series dataset have to be unique. If duplicate timestamps occur, the first occurrence in the data with corresponding observations is selected and all others are ignored.

The timestamp column cannot contain missing values (i.e. rows with no timestamp value but with values in other columns; not to be confused with missing data or gaps). Such a row is invalid and should be removed or, if possible, the missing timestamp should be filled in.

A dataset must contain a timestamp column and at least one variable, which can be used as the target variable.
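The rules above can be mirrored in a small pre-processing step before a dataset is uploaded. The following is a minimal sketch using only the Python standard library; the function name `clean_rows` and the row layout (a list of dictionaries) are assumptions made for illustration and are not part of Tangent.

```python
from datetime import datetime


def clean_rows(rows, timestamp_key="Timestamp"):
    """Drop rows without a timestamp and keep the first row per timestamp.

    Hypothetical helper: it imitates the documented rules, it is not Tangent code.
    """
    seen = set()
    cleaned = []
    for row in rows:
        raw = row.get(timestamp_key)
        if raw is None or str(raw).strip() == "":
            continue  # invalid row: has values but no timestamp
        ts = datetime.fromisoformat(str(raw).strip())
        if ts in seen:
            continue  # duplicate timestamp: only the first occurrence is kept
        seen.add(ts)
        cleaned.append({**row, timestamp_key: ts})
    return cleaned


rows = [
    {"Timestamp": "2022-01-01", "Sales": 11},
    {"Timestamp": "2022-01-02", "Sales": 10},
    {"Timestamp": "2022-01-02", "Sales": 99},  # duplicate timestamp, ignored
    {"Timestamp": "", "Sales": 16},            # no timestamp, dropped
]
cleaned = clean_rows(rows)
print([(r["Timestamp"].date().isoformat(), r["Sales"]) for r in cleaned])
```

Here the two valid rows for 2022-01-01 and 2022-01-02 remain, and the value 10 (the first occurrence) is kept for the duplicated timestamp.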

### Example

Timestamp | Sales | Holidays | Temperature |
---|---|---|---|
2022-01-01 | 11 | 1 | 11 |
2022-01-02 | 10 | 0 | 10 |
2022-01-03 | 16 | 0 | 12 |
2022-01-04 | 20 | 0 | 9 |
2022-01-05 | | 0 | 8 |

## Dataset Size

### Number of observations

In general, most datasets with a higher sampling rate should not be modeled with more than 2 years of data. Typical datasets hold between 100 and 100,000 observations. The algorithm has no theoretical limit; however, more data rarely improves accuracy and can even be detrimental when underlying dependencies change over time. In practice, dataset size is limited by the memory of the infrastructure on which Tangent is running.

### Number of columns

Typical datasets hold between 1 and 100 columns. The algorithm has no theoretical limit; however, it is best practice to let Tangent focus on the columns that carry real predictive value. In practice, dataset size is limited by the memory of the infrastructure on which Tangent is running.

## Timestamps

To indicate the nature of the data (time series), every observation should be linked to exactly one timestamp. The timestamps are usually found in the first column of the dataset and are assumed to be in the UTC time zone by default. Both the column and the time zone can be set differently, however.

```
{
  "timestampColumn": "Timestamp",
  "timeZoneName": "Europe/Bratislava"
}
```

Tangent is **ISO 8601** compliant. The formatting of timestamps is an important topic for time-series analysis and thus has its own section in the documentation that explains accepted formats in more detail.

Note: Only timestamps later than or equal to "0001-01-01 00:00:00" are supported.

## Sampling rate and sampling period

A sampling rate is defined as the number of samples or observations in equidistant (sampled at a constant rate) time series per unit of time. Conversely, the sampling period is defined as the time difference between two consecutive samples or observations of equidistant time series.

Once a dataset is uploaded, Tangent will try to estimate the dataset's original sampling period. This will always be one of the following time differences:

- 1, 2, 3, 4, 5, 6, 10, 12, 15, 20 or 30 seconds
- 1, 2, 3, 4, 5, 6, 10, 12, 15, 20 or 30 minutes (expressed in seconds)
- 1, 2, 3, 4, 6, 8 or 12 hours (expressed in seconds)
- any number of days (expressed in seconds)
- any number of months (expressed in months)

Tangent will try to determine the best fit based on the median distance between consecutive observations.
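To make the idea concrete, the sketch below estimates a sampling period from the median gap between consecutive timestamps and snaps it to the supported periods. It is an illustration only: the "nearest candidate" rule, the function name `estimate_sampling_period`, and the omission of monthly periods are assumptions, since the exact fitting rule Tangent applies is not spelled out here.

```python
from datetime import datetime, timedelta
from statistics import median

MINUTE, HOUR, DAY = 60, 3600, 86400

# Supported sub-daily periods from the list above, in seconds, plus one day.
CANDIDATES = (
    [n for n in (1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30)]
    + [n * MINUTE for n in (1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30)]
    + [n * HOUR for n in (1, 2, 3, 4, 6, 8, 12)]
    + [DAY]
)


def estimate_sampling_period(timestamps):
    """Return an estimated sampling period in seconds (months are not handled)."""
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        raise ValueError("at least two timestamps are needed")
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    typical = median(gaps)
    if typical >= DAY:
        # Any whole number of days is allowed, so round to the nearest day count.
        return max(1, round(typical / DAY)) * DAY
    # Assumption: pick the supported period closest to the median gap.
    return min(CANDIDATES, key=lambda c: abs(c - typical))


start = datetime(2022, 1, 1)
stamps = [start + timedelta(minutes=27 * i) for i in range(10)]
print(estimate_sampling_period(stamps))  # 1800, i.e. 30 minutes
```

With observations every 27 minutes, the median gap is 1620 seconds, and the closest supported period is 30 minutes (1800 seconds).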

This doesn't mean that the data is stored differently from how it was uploaded. However, for forecasting applications, the original sampling period is used to rescale the data by default. Forecasting applications always require an equidistant distribution of timestamps, although missing data is still allowed. This means that if the data is, for example, recorded irregularly several times per second, Tangent will internally convert the dataset to a 1-second resolution and build models that forecast at a 1-second resolution as well. If the dataset is recorded every 27 minutes, Tangent Forecasting will use a version of the dataset with a 30-minute resolution instead.

If fewer than two rows of data are uploaded during the first upload, the sampling period should be provided, as it cannot be estimated from the data.

## Data types

There are two main data types supported in Tangent: numerical and categorical. Numerical data consists of values that can be measured and represented using numbers, such as age or height. Categorical data, on the other hand, consists of values that represent categories or groups, such as gender or color. Categorical data can be further divided into Boolean variables and all other categorical variables.

### Numerical variables

Numerical variables refer to data that can be represented by numbers and can be measured or quantified. This type of data includes variables that are continuous, such as temperature or pressure, as well as variables that are discrete, such as the number of people in a household or the number of items sold.

In Tangent, numerical variables are fully supported and can be used in all modules. Tangent provides various tools and techniques for analyzing and modeling numerical data, making it a valuable tool for time-series analysis.

### Boolean variables

Tangent treats Boolean variables, which have only two possible values (0/1 or True/False), as a special subcase of categorical data. Support for this case is extensive and includes the automatic creation of features and the ability to use a Boolean variable as the target variable in classification tasks.

### Categorical variables

In addition to Boolean variables, Tangent also supports other types of categorical variables.

When uploading a dataset, categorical variables can be specified by listing them explicitly. If categorical variables are not provided, automatic detection will be run. By default, any column containing at least one non-missing string value will be considered a categorical variable.
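The default detection rule can be approximated as follows. This is a sketch under stated assumptions, not Tangent's implementation: a "string value" is taken to be any value that is not a missing-value marker (see "Missing data or gaps"), not a Boolean token, and not parseable as a number; the case-insensitive matching and the names `is_string_value` and `detect_categorical` are likewise hypothetical.

```python
# Missing-value markers as listed in the "Missing data or gaps" section.
MISSING_MARKERS = {"", "na", "nan", "n/a", "missing", "null", "none", "nothing"}
# Assumption: these tokens denote Boolean values and are not plain strings.
BOOLEAN_TOKENS = {"true", "false"}


def is_string_value(raw):
    """Return True if the raw value looks like a non-missing string."""
    text = str(raw).strip().lower()
    if text in MISSING_MARKERS or text in BOOLEAN_TOKENS:
        return False
    try:
        float(text)
    except ValueError:
        return True  # not a number, not missing, not Boolean
    return False


def detect_categorical(columns):
    """columns maps a column name to its list of raw values."""
    return [
        name
        for name, values in columns.items()
        if any(is_string_value(v) for v in values)
    ]


columns = {
    "Sales": ["11", "10", "", "20"],
    "Holidays": ["True", "False", "False", "na"],
    "Weather": ["sunny", "", "rain", "sunny"],
}
print(detect_categorical(columns))  # ['Weather']
```

Listing the categorical variables explicitly at upload time avoids relying on any such heuristic.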

There are some restrictions on using categorical data in Tangent. Anomaly Detection does not support categorical data; only numerical and Boolean variables should be used with this module. In Forecasting, categorical variables can be used as predictors or features, but cannot be set as the target variable for classification.

## Variables

### Target

Each dataset to be analyzed should contain exactly one target variable: the variable to forecast or to run anomaly detection on. The observations of this variable are usually found in the second column of the dataset. As with the timestamp column, a user can indicate a different target column.

### Explanatory variables or predictors

Tangent supports multivariate time-series analysis. This means that, if desired, more variables with potential explanatory power can be added to enhance modeling results. Any remaining dataset columns (i.e. any columns except the timestamp and target columns) can contain these potential explanatory variables or predictors. Tangent will only take into account those variables that are relevant for a specific modeling use case; however, a user can override this default behavior and exclude certain variables or variable transformations.

### Predictors and their forecasts

In some applications, there are predictors whose values can be "known" in advance because forecasts of them have been made. (Examples include binary variables indicating public holidays, or meteorological variables that have been forecasted.) The quality of these forecasts tends to vary across datasets, as well as within a single dataset across measuring instruments. Meteorological predictors, for example, can be of widely varying quality depending on the instruments used to collect their values. Predictors like these have both historical actuals and forecasts. Because of this variation in quality, it can be beneficial to take into account which observations contain historical actuals and which contain predictor forecasts, and to potentially treat them differently. In general, it's preferable to build models using historical actuals and then use the predictor forecasts for model evaluation in backtesting and production. Tangent can handle such variables, as it supports varying data availability per variable and generates models that are aware of the situation in which they were built.

### Missing data or gaps

The strings "", "na", "nan", "n/a", "missing", "null", "none", "nothing" are ignored by Tangent and are therefore treated as missing values for categorical variables.

For other types of variables, Tangent tries to parse every value to a float. If it cannot, it will consider the value to be missing. Examples of values interpreted as missing are *null* values, categorical (non-Boolean) values, NA strings and infinity markers. This is not a problem though, as Tangent can handle missing data during model building, and also offers multiple ways to impute missing data.
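The parsing behavior described above can be sketched as a small function that returns either a float or `None` for a missing value. This is an approximation for illustration: the name `parse_numeric`, the case-insensitive marker matching, and the mapping of True/False to 1.0/0.0 are assumptions rather than documented Tangent behavior.

```python
import math

# Markers from the text above that are treated as missing values.
MISSING_MARKERS = {"", "na", "nan", "n/a", "missing", "null", "none", "nothing"}
# Assumption: Boolean tokens map onto 1.0 and 0.0.
BOOLEAN_VALUES = {"true": 1.0, "false": 0.0}


def parse_numeric(raw):
    """Return the value as a float, or None when it counts as missing."""
    if raw is None:
        return None
    text = str(raw).strip().lower()
    if text in MISSING_MARKERS:
        return None
    if text in BOOLEAN_VALUES:
        return BOOLEAN_VALUES[text]
    try:
        value = float(text)
    except ValueError:
        return None  # e.g. a categorical label in a numerical column
    if math.isnan(value) or math.isinf(value):
        return None  # infinity markers are treated as missing
    return value


raw_values = ["12.5", "n/a", "inf", "sunny", "True", "7"]
print([parse_numeric(v) for v in raw_values])
# [12.5, None, None, None, 1.0, 7.0]
```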