
Size your Infrastructure for a TIM 5.0 Implementation

Introduction​

This paper explains the capacity planning methodologies available for TIM InstantML and the calculations used to obtain metrics for estimating and sizing a TIM Engine environment.

The TIM InstantML technology is deployable in various ways. TIM InstantML can be deployed as a SaaS service that you call from your IT environment.

Alternatively, TIM InstantML can be deployed in an On-Premise environment where you provide the server infrastructure, or in a Bring Your Own License (BYOL on Azure, AWS, ...) setup.

This document describes sizing considerations for an On-Premise or BYOL environment.

In the SaaS scenario, scaling of the service is done automatically. The BYOL/On-Premise solution also provides scaling, but you will need to provide sufficient resources.

This paper applies to the following versions of the TIM software:

  • TIM Engine 5.X
  • TIM Studio 5.X

Architectural Components​

TIM InstantML runs mainly on a Kubernetes cluster.

As an example, we provide an Azure Deployment Scheme:

  • Scalability Fabric TIM Engine with queuing - AKS Cluster with D3 v2 VM – This is the Kubernetes Cluster Service implementation by Azure.
  • Database - Azure Database for PostgreSQL – This is an Azure database service.
  • TIM Workers - ACI for TIM worker instances – This is the fast scaling Azure Container Instances service (or Kubernetes).

Typically at least two of these services will be set up for redundancy.

The number of TIM Workers is scaled up depending on the number of requests you send to the TIM Engine, and therefore on the number of requests in the queue.

In an On-Premise environment, local Kubernetes and PostgreSQL installations are used.

On other cloud environments the appropriate services will be used. As an example, on AWS:

  • Scalability Fabric TIM Engine and TIM Workers - EKS Cluster with m5.xlarge VM – This is the Kubernetes Cluster Service implementation by AWS.
  • Database - PostgreSQL
  • Queuing of TIM Engine tasks - AmazonMQ for RabbitMQ

Difficulties in Sizing the Environment​

The CPU time and memory required to create a model, or to create a forecast, classification or anomaly detection, are determined by the following elements:

  • The size of the data structure
  • The number of predictors (columns)
  • The number of timestamps (rows)
  • The predictor feature importance
  • The correlation between the predictor candidates and the target

This makes it difficult to define an algorithm that yields the exact memory and CPU consumption. Instead, this document provides benchmarking figures that let you size the architecture from benchmark data rather than from rock-solid calculations.

Capacity Planning And Performance Overview​

Data input size Considerations​

The lightning-fast speed of TIM InstantML is the result of efficient in-memory processing and parallelization of computation. The default maximum input size is a dataset of 100 MB (measured in CSV format). Check out Data Properties for details.

Memory Usage​

The only noteworthy objects that require significant space are:

  • dataset
  • forecasting / detection tabular output
  • root cause analysis output

There are many more objects involved in the process, such as the model, logs and accuracy metrics, but they all require less than 1 MB of space. The tabular output grows significantly with a larger forecasting horizon and a smaller rolling window in the case of forecasting; however, this is only relevant in "backtesting" scenarios where users try different settings on historical data. In production setups, the tabular output shrinks to kilobytes because only the new timestamps are evaluated. The same goes for the root cause analysis output. All in all, the only memory-intensive object in a real production pipeline is the dataset itself.

Processing Time​

There are two significant bottlenecks in the whole forecasting / detection process:

  • dataset upload / update
  • model building / rebuilding

There are usually many more steps in the whole pipeline, however they require little to no time to process. This includes the model evaluation (the forecasting / detection itself): once the model is ready, generating forecasts / detections is lightning fast (under a second). That is why we restrict the benchmarking times to model building.

Benchmark Data and Scenarios​

In most cases the sizing calculation is straightforward.

A typical TIM Worker runs on the following configuration:

  • CPU: 4 virtual CPU cores
  • Memory: 12 GB of RAM

In this benchmark we provide performance data for a single TIM Worker instance across different dataset sizes.

Benchmark results​

In the following tables you can find the processing response time and the CPU load created by the request, based on one TIM Worker (running on one 4-core CPU). The datasets were already uploaded before the benchmark started.

We provide benchmarks for two forecasting endpoints and different situations:

| Case | Request type | New model | Backtesting |
| --- | --- | --- | --- |
| 1 | forecasting/forecast-jobs/build-model | yes | yes |
| 2 | forecasting/forecast-jobs/{id}/rebuild-model | yes | yes |
| 3 | forecasting/forecast-jobs/{id}/rebuild-model | no | yes |
| 4 | forecasting/forecast-jobs/{id}/rebuild-model | no | no |

Forecasting and classification jobs​

This benchmark was done for forecasting jobs that build models for 1-sample-ahead forecasts. The benchmark relates to the forecasting execution request. There are different job types (build and rebuild), however they always call the same core underneath. The benchmark result does not differ per request type as such; it differs with the number of models that have to be built. Imagine you call a build request first for the S+1 to S+3 horizon and then rebuild the same Model Zoo for the S+1 to S+6 horizon: in both cases, only three models are added to the Model Zoo, so the benchmark stays the same - slightly less than 3 times the respective number in the tables provided (the benchmark is for a Model Zoo with 1 model and the scaling is less than linear). The tables always show the number of rows on the vertical axis and the number of variables (target variable plus predictors) on the horizontal axis.
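The Model Zoo behaviour described above can be sketched in a few lines: only horizon steps that do not yet have a model trigger new model building, so a rebuild over a longer horizon costs roughly the same as the original build. The function and the set-based bookkeeping below are illustrative assumptions, not part of the TIM API.

```python
def new_models_needed(built_horizons, requested_horizons):
    """Count the models a (re)build request adds to the Model Zoo:
    only horizon steps without an existing model are built."""
    return sum(1 for h in requested_horizons if h not in built_horizons)

# First build for S+1..S+3: three new models.
zoo = set()
first = new_models_needed(zoo, range(1, 4))
zoo.update(range(1, 4))

# Rebuild of the same Model Zoo for S+1..S+6: only S+4..S+6 are new,
# so again three models and roughly the same benchmark time.
second = new_models_needed(zoo, range(1, 7))
```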

| Dataset size in DB | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 8192 bytes | 40 kB | 40 kB | 72 kB | 120 kB | 160 kB | 424 kB | 824 kB |
| 1000 | 88 kB | 120 kB | 160 kB | 472 kB | 920 kB | 1360 kB | 4024 kB | 8024 kB |
| 10000 | 616 kB | 936 kB | 1336 kB | 4472 kB | 8920 kB | 13 MB | 39 MB | 78 MB |
| 100000 | 5912 kB | 9120 kB | 13 MB | 43 MB | 87 MB | 130 MB | N/A | N/A |
| 1000000 | 57 MB | 89 MB | 128 MB | N/A | N/A | N/A | N/A | N/A |
| Size of CSV file | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 2.6 kB | 4.9 kB | 7.8 kB | 30.9 kB | 59.9 kB | 88.9 kB | 291.9 kB | 581.7 kB |
| 1000 | 25.7 kB | 48.4 kB | 77.1 kB | 307.5 kB | 595.2 kB | 883.0 kB | 2.8 MB | 5.6 MB |
| 10000 | 256.5 kB | 482.9 kB | 770.3 kB | 3.0 MB | 5.8 MB | 8.6 MB | 28.3 MB | 56.4 MB |
| 100000 | 2.5 MB | 4.7 MB | 7.5 MB | 30.0 MB | 58.1 MB | 86.2 MB | N/A | N/A |
| 1000000 | 25.0 MB | 47.2 MB | 75.2 MB | N/A | N/A | N/A | N/A | N/A |
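To check your own data against these figures and the 100 MB default input limit before uploading, a rough CSV size can be estimated from the row and column counts. This is a back-of-the-envelope sketch with assumed character widths for timestamps and values, not a TIM formula; it is deliberately conservative and may overestimate compared with the measured sizes above.

```python
def estimate_csv_bytes(rows, columns, timestamp_chars=19, value_chars=6):
    """Rough CSV size: per row, one timestamp plus `columns` numeric values,
    each preceded by a separator, plus a newline. Widths are assumptions."""
    bytes_per_row = timestamp_chars + columns * (value_chars + 1) + 1
    return rows * bytes_per_row

# 100 rows x 1 variable -> ~2.7 kB, in line with the 2.6 kB measured above.
small = estimate_csv_bytes(100, 1)

# A 1,000,000-row, 150-variable dataset clearly exceeds the 100 MB limit,
# matching the N/A cells in the benchmark tables.
fits = estimate_csv_bytes(1_000_000, 150) <= 100 * 1024 * 1024
```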

Case 1 and 2​

As described above, what influences the benchmark results is the number of new models TIM builds; therefore cases 1 and 2 are merged into one benchmark. In both cases, one model is built and an in-sample forecast and a production forecast are calculated.

| Max CPU usage in % | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 23 | 30 | 30 | 32 | 32 | 32 | 34 | 33 |
| 1000 | 28 | 30 | 34 | 34 | 34 | 34 | 34 | 34 |
| 10000 | 60 | 80 | 120 | 129 | 133 | 137 | 144 | 200 |
| 100000 | 165 | 176 | 215 | 338 | 372 | 380 | N/A | N/A |
| 1000000 | 149 | 192 | 198 | N/A | N/A | N/A | N/A | N/A |

| Max RAM usage in % | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 46 | 46 | 46 | 46 | 47 | 47 | 47 | 47 |
| 1000 | 47 | 47 | 47 | 48 | 49 | 49 | 49 | 50 |
| 10000 | 52 | 53 | 53 | 57 | 57 | 58 | 58 | 58 |
| 100000 | 52 | 53 | 53 | 67 | 67 | 67 | N/A | N/A |
| 1000000 | 53 | 60 | 60 | N/A | N/A | N/A | N/A | N/A |

| Model building and prediction time in seconds | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 0.04 | 0.04 | 0.05 | 0.09 | 0.11 | 0.2 | 0.5 | 0.8 |
| 1000 | 0.2 | 0.2 | 0.4 | 0.6 | 0.9 | 1.0 | 2.0 | 5 |
| 10000 | 2.5 | 3 | 4 | 10 | 13 | 14 | 32 | 61 |
| 100000 | 21 | 25 | 42 | 102 | 131 | 156 | N/A | N/A |
| 1000000 | 329 | 347 | 384 | N/A | N/A | N/A | N/A | N/A |

| Total execution time in seconds | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 0.4 | 0.5 | 0.5 | 0.8 | 1.4 | 1.7 | 5 | 7 |
| 1000 | 0.5 | 0.6 | 0.8 | 1.6 | 2.5 | 2.9 | 8 | 17 |
| 10000 | 3.8 | 3.8 | 5 | 14 | 20 | 25 | 65 | 133 |
| 100000 | 25 | 30 | 50 | 135 | 195 | 262 | N/A | N/A |
| 1000000 | 364 | 405 | 471 | N/A | N/A | N/A | N/A | N/A |

| Forecasting result table size in DB | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB |
| 1000 | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB |
| 10000 | 957 kB | 957 kB | 957 kB | 957 kB | 957 kB | 957 kB | 957 kB | 957 kB |
| 100000 | 9.4 MB | 9.4 MB | 9.4 MB | 9.4 MB | 9.4 MB | 9.4 MB | N/A | N/A |
| 1000000 | 93.5 MB | 93.5 MB | 93.5 MB | N/A | N/A | N/A | N/A | N/A |

Case 3​

In this case, no new situation is detected and no new model is built. An out-of-sample forecast (the out-of-sample rows have to be set) and a production forecast are calculated.

| Max CPU usage in % | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 23 | 30 | 30 | 32 | 32 | 32 | 33 | 33 |
| 1000 | 28 | 30 | 34 | 34 | 34 | 34 | 34 | 34 |
| 10000 | 60 | 80 | 101 | 113 | 119 | 144 | 100 | 86 |
| 100000 | 100 | 127 | 157 | 130 | 108 | 101 | N/A | N/A |
| 1000000 | 111 | 118 | 255 | N/A | N/A | N/A | N/A | N/A |

| Max RAM usage in % | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
| 1000 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
| 10000 | 34 | 34 | 35 | 34 | 34 | 34 | 34 | 34 |
| 100000 | 39 | 43 | 53 | 45 | 41 | 42 | N/A | N/A |
| 1000000 | 52 | 58 | 62 | N/A | N/A | N/A | N/A | N/A |

| Model building and prediction time in seconds | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 0.03 | 0.04 | 0.04 | 0.05 | 0.07 | 0.08 | 0.24 | 0.4 |
| 1000 | 0.05 | 0.06 | 0.07 | 0.14 | 0.23 | 0.31 | 1.1 | 2.4 |
| 10000 | 0.4 | 0.6 | 0.7 | 3.2 | 4.5 | 6.4 | 19 | 37 |
| 100000 | 7 | 8 | 11 | 29 | 49 | 69 | N/A | N/A |
| 1000000 | 288 | 309 | 343 | N/A | N/A | N/A | N/A | N/A |

| Total execution time in seconds | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 0.4 | 0.4 | 0.5 | 0.7 | 1.2 | 1.6 | 5 | 6 |
| 1000 | 0.4 | 0.4 | 0.5 | 1.1 | 2.0 | 2.8 | 7.3 | 15 |
| 10000 | 0.9 | 1.4 | 2.0 | 6 | 12 | 16 | 50 | 105 |
| 100000 | 10 | 14 | 20 | 76 | 114 | 184 | N/A | N/A |
| 1000000 | 326 | 368 | 433 | N/A | N/A | N/A | N/A | N/A |

| Forecasting result table size in DB | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB |
| 1000 | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB |
| 10000 | 957 kB | 957 kB | 957 kB | 957 kB | 957 kB | 957 kB | 957 kB | 957 kB |
| 100000 | 9.4 MB | 9.4 MB | 9.4 MB | 9.4 MB | 9.4 MB | 9.4 MB | N/A | N/A |
| 1000000 | 93.5 MB | 93.5 MB | 93.5 MB | N/A | N/A | N/A | N/A | N/A |

Case 4​

No new situation is detected and no new model is built. Only the production forecast is calculated.

| Max CPU usage in % | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 23 | 30 | 30 | 32 | 32 | 32 | 34 | 33 |
| 1000 | 28 | 30 | 34 | 34 | 34 | 34 | 34 | 34 |
| 10000 | 34 | 34 | 34 | 35 | 36 | 36 | 36 | 47 |
| 100000 | 45 | 45 | 45 | 45 | 49 | 60 | N/A | N/A |
| 1000000 | 45 | 45 | 56 | N/A | N/A | N/A | N/A | N/A |

| Max RAM usage in % | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
| 1000 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
| 10000 | 34 | 34 | 35 | 34 | 34 | 34 | 34 | 34 |
| 100000 | 39 | 43 | 53 | 45 | 41 | 42 | N/A | N/A |
| 1000000 | 52 | 58 | 62 | N/A | N/A | N/A | N/A | N/A |

| Model building and prediction time in seconds | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 0.02 | 0.02 | 0.02 | 0.03 | 0.04 | 0.13 | 0.2 | 0.5 |
| 1000 | 0.02 | 0.02 | 0.03 | 0.10 | 0.19 | 0.4 | 1.0 | 2.4 |
| 10000 | 0.04 | 0.2 | 0.3 | 1.6 | 3.0 | 4.6 | 16 | 33 |
| 100000 | 0.4 | 2 | 2 | 15 | 33 | 52 | N/A | N/A |
| 1000000 | 5 | 17 | 22 | N/A | N/A | N/A | N/A | N/A |

| Total execution time in seconds | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 0.3 | 0.4 | 0.4 | 1.0 | 1.6 | 2.3 | 7 | 9 |
| 1000 | 0.3 | 0.4 | 0.5 | 1.3 | 2.3 | 3.5 | 11 | 17 |
| 10000 | 0.8 | 0.9 | 1.4 | 6 | 11 | 17 | 53 | 110 |
| 100000 | 2.5 | 7 | 7 | 47 | 99 | 152 | N/A | N/A |
| 1000000 | 26 | 61 | 107 | N/A | N/A | N/A | N/A | N/A |

| Forecasting result table size in DB | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB |
| 1000 | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 95.8 kB | 95.8 kB |
| 10000 | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 957 kB | 957 kB |
| 100000 | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | N/A | N/A |
| 1000000 | 0.1 kB | 0.1 kB | 0.1 kB | N/A | N/A | N/A | N/A | N/A |

Remarks​

  1. The data used 4-digit precision (e.g. 0.582).
  2. Some datasets exceed 100 MB storage size in the database because we include db indices.
  3. The computing time increases with bigger forecasting horizons, however, the increase is smaller than linear.
  4. The output table size is only relevant for backtesting tasks. It scales up with bigger forecasting horizon and down with bigger rolling window.
  5. As the memory usage approaches 100 percent, TIM starts to preprocess the data by discarding rows and switching off features, which results in smaller numbers across the tables after that breaking point (RAM, CPU and forecasting output table size). This is why the numbers do not always rise along the axes. The numbers where such preprocessing took place were denoted in italics; in this benchmark only polynomial features were switched off.
  6. Some fields are not filled because the respective dataset would be bigger than the 100 MB threshold.
  7. The CPU load is expressed per core, i.e. a 4-core machine can reach 400% (4 × 100% per core).
  8. The performance figures are for sequential execution of the ML requests without scaling and spinning up more TIM Workers.

Scaling the workers​

What do you do if you need more transactions per hour?

The TIM Engine provides queueing and automatically spins up new TIM Workers to cater for the volume of requests being handled.
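One way to estimate how many workers keep that queue from growing is an offered-load calculation: requests per second times seconds per request gives the average number of busy workers. The helper below and its 20% headroom factor are illustrative assumptions for planning, not a TIM Engine feature.

```python
import math

def workers_needed(requests_per_hour, seconds_per_request, headroom=1.2):
    """Minimum worker count so that the average offered load (busy-worker
    seconds per second) stays below capacity, with headroom for peaks."""
    offered_load = requests_per_hour * seconds_per_request / 3600.0
    return max(1, math.ceil(offered_load * headroom))

# e.g. 4000 requests/hour at 1.6 s each -> ~1.8 busy workers on average,
# rounded up to 3 workers with 20% headroom.
n = workers_needed(4000, 1.6)
```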

How to calculate the size and pricing of your infrastructure?​

The benchmark figures give you an indication of the performance in your use case. Determine the profile of ML requests you need, then calculate the number of TIM Workers required.

In this table, we give an example of a calculation:

| Component | Sizing Consideration | Cost | Costing Example |
| --- | --- | --- | --- |
| TIM Engine Fabric | This Kubernetes-installed component ensures a REST endpoint is available. | We recommend 2 VMs with a 4-core CPU and 32 GB of memory for this. | 140 Euro / Month for 2 D3 servers to support the cluster |
| Queueing Service | RabbitMQ is available as a Kubernetes cluster deployment; alternatively you can use a platform service for this. | RabbitMQ services are available on AWS and Azure. | Optional |
| Database | This is the PostgreSQL database service. | | Azure Database for PostgreSQL - 130 Euro / Month |
| TIM Workers | The TIM Workers are the scalable component. | You can find the CPU load and response time in the benchmark tables; this allows you to calculate the number of 4-core / 12 GB servers you need. | 2 ACI containers for TIM Workers - 240 Euro / Month |
| Total | | | 510 Euro / Month |

This is a two-TIM-Worker configuration. Some example throughputs:

  • RTInstantML scenario (/forecasting/forecast-jobs/build-model) - 1000 observations, 50 variables - 1.6 s response time - 2250 transactions/hour/worker = 4500 transactions per hour for this configuration
  • RTInstantML scenario (/forecasting/forecast-jobs/build-model) - 10000 observations, 50 variables - 14 s response time - 257 transactions/hour/worker = 514 transactions per hour for this configuration
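The per-configuration throughputs above follow directly from the benchmark response times: transactions per hour equal 3600 seconds divided by the response time, times the number of workers. A minimal sketch of that arithmetic:

```python
def throughput_per_hour(response_time_s, workers=1):
    """Sequential transactions per hour for a given benchmark response time."""
    return round(workers * 3600 / response_time_s)

per_worker = throughput_per_hour(1.6)      # 1.6 s response time, one worker
two_workers = throughput_per_hour(1.6, 2)  # the two-worker configuration
slow = throughput_per_hour(14)             # 14 s response time, one worker
```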

Notes:

  • Do not forget to cater for data collection. The measurements in the tables above are the processing time (response time) of the TIM Worker.
  • The prices are indicative and depend on your plan with Azure.
  • The Azure prices are based on a 3-year upfront commitment.
  • Similar pricing is possible for AWS or on-premise.
  • You might want to consider servers with fewer cores if your use case does not benefit from parallelization across multiple cores.

Sizing And Estimation Methodology​

Estimating anything can be a complex and error-prone process; that is why it is called an 'estimation' rather than a 'calculation'. There are three primary approaches to sizing a TIM InstantML implementation:

  • Algorithm, or Calculation Based
  • Size By Example Based
  • Proof of Concept Based

Typical implementations of TIM InstantML do not require complex sizing and estimation processes. An algorithm-based approach, taking into account the data size and the number of ML transactions per hour per worker, allows you to determine the number of parallel workers and design your architecture.

In more complex cases a Proof of Concept might be useful. This is typically the case with more complicated peak time ML consumption requirements.

Algorithm, Or Calculation Based​

An algorithm or process that accepts data input is probably the most commonly accepted tool for delivering sizing estimations. Unfortunately, this approach is generally the most inaccurate.

When considering a multiple-model, multiple-use-case implementation, a calculation that even approaches a realistic sizing answer requires more than one hundred input values, and the calculations are so complex and sensitive that supplying an input value off by just 1% produces wildly inaccurate results.

The other approach to calculation-based solutions is to simplify the calculation to the point where it is simple to understand and simple to use. This paper shows how this kind of simplification can provide us with a sizing calculator.

Size-By-Example Based​

A size-by-example (SBE) approach requires a set of known samples to use as data points along the thermometer of system size. The more examples available for SBE, the more accurately the intended implementation can be sized.

By using these real world examples, both customers and Tangent Works can be assured that the configurations proposed have been implemented before and will provide the performance and functionality unique to the proposed implementation. Tangent Works Engineering can help here.

Proof Of Concept Based​

A proof of concept (POC), or pilot based approach, offers the most accurate sizing data of all three approaches.

A POC allows you to do the following:

  • Test your InstantML implementation design
  • Test your chosen hardware or cloud platform
  • Simulate projected load
  • Validate design assumptions
  • Validate Usage
  • Provide iterative feedback for your implementation team
  • Adjust or validate the implementation decisions made prior to the POC

There are, however, two downsides to a POC based approach, namely time and money. Running a POC requires the customer to have manpower, hardware, and the time available to implement the solution, validate the solution, iterate changes, re-test, and finally analyze the POC findings.

A POC is always the best and recommended approach for any sizing exercise. It delivers results that are accurate for the unique implementation of the specific customer and as close to deploying the real live solution as possible, without the capital outlay on hardware and project resources.