Size your infrastructure for a TIM 5.0 implementation

Introduction

This paper describes the capacity planning methodologies available for TIM InstantML and the calculations used to obtain metrics for estimating and sizing a TIM Engine environment.

The TIM InstantML technology can be deployed in various ways. TIM InstantML can be consumed as a SaaS service that is called from your IT environment.

Alternatively, TIM InstantML can be deployed in an On-Premise environment where you provide the server infrastructure, or under a Bring Your Own License model in your own cloud (BYOL on Azure, AWS, ...).

This document describes sizing considerations for an On-Premise or BYOL environment.

In the SaaS scenario, scaling of the service is done automatically. The BYOL/On-Premise solution also provides scaling, but you will need to provision sufficient resources.

This paper applies to the following versions of the TIM software:

  • TIM Engine 5.X
  • TIM Studio 5.X

Architectural Components

TIM InstantML runs mainly on a Kubernetes cluster.

As an example, we provide an Azure Deployment Scheme:

  • Scalability Fabric: TIM Engine with queuing - AKS Cluster with D3 v2 VMs – This is the Kubernetes cluster service implementation by Azure.
  • Database - Azure Database for PostgreSQL – This is an Azure database service.
  • TIM Workers - ACI for TIM Worker instances – This is the fast-scaling Azure Container Instances service (or Kubernetes).

Typically, at least two instances of these services will be set up for redundancy.

The number of TIM Workers is scaled up depending on the number of requests you send to the TIM Engine, and therefore the number of requests in the queue.

In an On-Premise environment, local Kubernetes and PostgreSQL installations are used.

On other cloud environments, the appropriate services are used. As an example, on AWS:

  • Scalability Fabric: TIM Engine and TIM Workers - EKS Cluster with m5.xlarge VMs – This is the Kubernetes cluster service implementation by AWS.
  • Database - PostgreSQL
  • Queuing of TIM Engine tasks - Amazon MQ for RabbitMQ

Difficulties in Sizing the Environment

The following elements determine the CPU time and memory requirements for creating a model or producing a forecast, classification, or anomaly detection:

  • The size of the data structure
  • The number of predictors (columns)
  • The number of timestamps (rows)
  • The predictor feature importance
  • The correlation between the predictor candidates and the target

This makes it difficult to devise an algorithm that predicts exact memory and CPU consumption. Instead, this document provides benchmarking figures that make it easy to size the architecture from benchmark data rather than from precise calculations.

Capacity Planning And Performance Overview

Data Input Size Considerations

The lightning-fast speed of TIM InstantML is the result of efficient in-memory processing and parallelization of computation. The default maximum dataset size is 100 MB (measured in CSV format). Check out Data Properties.
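To check whether a dataset stays under the default 100 MB CSV limit, its serialized size can be roughly estimated from the row and column counts. The per-value and per-timestamp byte counts below are illustrative assumptions (4-digit numeric precision plus a delimiter, ISO-style timestamps), not TIM constants:

```python
# Rough CSV-size estimator for checking the default 100 MB dataset limit.
# BYTES_PER_VALUE and BYTES_PER_TIMESTAMP are illustrative assumptions.

BYTES_PER_VALUE = 7        # e.g. "0.5820," -> ~7 bytes per numeric cell
BYTES_PER_TIMESTAMP = 20   # e.g. "2021-01-01 00:00:00," -> ~20 bytes
MAX_CSV_BYTES = 100 * 1024 * 1024  # default 100 MB limit

def estimated_csv_bytes(rows: int, variables: int) -> int:
    """Approximate CSV size for `rows` timestamps and `variables` columns
    (target plus predictors), excluding the header line."""
    return rows * (BYTES_PER_TIMESTAMP + variables * BYTES_PER_VALUE)

def fits_default_limit(rows: int, variables: int) -> bool:
    return estimated_csv_bytes(rows, variables) <= MAX_CSV_BYTES

print(fits_default_limit(10_000, 50))      # True: well under 100 MB
print(fits_default_limit(1_000_000, 150))  # False: far above the limit
```

The estimate agrees reasonably with the "Size of CSV file" benchmark table further below; for precise figures, measure your actual export.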

Memory Usage

The only noteworthy objects that require significant space are:

  • dataset
  • forecasting / detection tabular output
  • root cause analysis output

There are many more objects involved in the process, such as the model, logs, and accuracy metrics, but they all require less than 1 MB of space. The tabular output grows significantly with a higher forecasting horizon and a smaller rolling window in the case of forecasting; however, this is only relevant in "backtesting" scenarios where users try different settings on historical data. In production setups, the tabular output shrinks to kilobytes because only the new timestamps are evaluated. The same goes for the root cause analysis output. All in all, the only memory-intensive object in a real production pipeline remains the dataset itself.
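For backtesting scenarios, the result table is the one output object worth estimating up front. The per-evaluated-timestamp constant below (~98 bytes) is inferred from the forecasting result table sizes in the benchmark tables further down; it is an observation from that data, not a documented TIM parameter:

```python
# Rough estimator for the forecasting result table size in backtesting.
# BYTES_PER_EVALUATED_ROW (~98) is inferred from the benchmark tables
# below for 1-sample-ahead forecasts; it is an assumption, not a TIM constant.

BYTES_PER_EVALUATED_ROW = 98

def result_table_bytes(evaluated_rows: int, horizon: int = 1) -> int:
    """Estimate tabular output size; it grows roughly with the horizon and
    shrinks with a larger rolling window (folded into `evaluated_rows`)."""
    return evaluated_rows * horizon * BYTES_PER_EVALUATED_ROW

# A 100,000-row backtest with horizon 1 -> roughly 9.8 MB, in line with
# the ~9.4 MB figure in the benchmark tables.
print(result_table_bytes(100_000) / 1e6)  # 9.8
```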

Processing Time

There are two significant bottlenecks in the whole forecasting / detection process:

  • dataset upload / update
  • model building / rebuilding

There are usually many more steps in the whole pipeline, but they require little to no time to process. This includes model evaluation (the forecasting / detection itself): once the model is ready, generating forecasts / detections is lightning fast (under a second). That is why we restrict the benchmarking times to model building.

Benchmark Data and Scenarios

In most cases the sizing calculation is straightforward.

A typical TIM Worker runs on the following configuration:

  • CPU
    • 4 virtual CPU cores
  • Memory
    • 12 GB of RAM

In this benchmark we provide performance data for a single TIM Worker instance across different dataset sizes.

Benchmark results

In the following tables you can find the processing response time and the CPU load created by the request, based on one TIM Worker (running on one 4-core CPU). The datasets were already uploaded before the benchmark started.

We provide benchmarks for two forecasting endpoints and different situations:

Case | Request type | New model | Backtesting
1 | forecasting/forecast-jobs/build-model | yes | yes
2 | forecasting/forecast-jobs/{id}/rebuild-model | yes | yes
3 | forecasting/forecast-jobs/{id}/rebuild-model | no | yes
4 | forecasting/forecast-jobs/{id}/rebuild-model | no | no

Forecasting and classification jobs

This benchmark was done for forecasting jobs that build models for 1-sample-ahead forecasts, and it relates to the forecasting execution request. There are different job types (build and rebuild), but they all call the same core underneath. The benchmark result does not differ by request type per se; it differs by the number of models that have to be built. Imagine you first call a build request for the S+1 to S+3 horizon and then rebuild the same Model Zoo for the S+1 to S+6 horizon: in both cases only three models are added to the Model Zoo, so the benchmark stays the same - slightly less than 3 times the respective number in the tables provided (the benchmark is for a 1-model Model Zoo and the scaling is less than linear). In the tables, rows denote the number of timestamps and columns the number of variables (target variable and predictors).
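The scaling rule above can be turned into a rough estimator: take the 1-model benchmark time and scale it by the number of new models, slightly less than linearly. The exponent below is an illustrative assumption, not a measured TIM constant:

```python
# Hypothetical estimator for (re)build time as a function of new models.
# `one_model_seconds` comes from the benchmark tables; SUBLINEAR_EXP is an
# assumed exponent expressing "slightly less than linear" scaling.

SUBLINEAR_EXP = 0.9  # illustrative assumption, not a measured TIM constant

def estimated_build_seconds(one_model_seconds: float, new_models: int) -> float:
    """Estimate build time when `new_models` models are added to the Model Zoo."""
    return one_model_seconds * new_models ** SUBLINEAR_EXP

# Example: 1000 rows x 50 variables builds one model in ~0.6 s (Case 1/2
# table); extending the horizon so that 3 new models are built:
print(round(estimated_build_seconds(0.6, 3), 2))  # 1.61
```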

Dataset size in DB (rows \ variables) 1 5 10 50 100 150 500 1000
100 8192 bytes 40 kB 40 kB 72 kB 120 kB 160 kB 424 kB 824 kB
1000 88 kB 120 kB 160 kB 472 kB 920 kB 1360 kB 4024 kB 8024 kB
10000 616 kB 936 kB 1336 kB 4472 kB 8920 kB 13 MB 39 MB 78 MB
100000 5912 kB 9120 kB 13 MB 43 MB 87 MB 130 MB N/A N/A
1000000 57 MB 89 MB 128 MB N/A N/A N/A N/A N/A
Size of CSV file (rows \ variables) 1 5 10 50 100 150 500 1000
100 2.6 kB 4.9 kB 7.8 kB 30.9 kB 59.9 kB 88.9 kB 291.9 kB 581.7 kB
1000 25.7 kB 48.4 kB 77.1 kB 307.5 kB 595.2 kB 883.0 kB 2.8 MB 5.6 MB
10000 256.5 kB 482.9 kB 770.3 kB 3.0 MB 5.8 MB 8.6 MB 28.3 MB 56.4 MB
100000 2.5 MB 4.7 MB 7.5 MB 30.0 MB 58.1 MB 86.2 MB N/A N/A
1000000 25.0 MB 47.2 MB 75.2 MB N/A N/A N/A N/A N/A

Case 1 and 2

As described above, what influences the benchmark results is the number of new models TIM builds; therefore case 1 and case 2 are merged into one benchmark. In both cases, one model is built and an in-sample forecast plus a production forecast are calculated.

Max CPU usage in % (rows \ variables) 1 5 10 50 100 150 500 1000
100 23 30 30 32 32 32 34 33
1000 28 30 34 34 34 34 34 34
10000 60 80 120 129 133 137 144 200
100000 165 176 215 338 372 380 N/A N/A
1000000 149 192 198 N/A N/A N/A N/A N/A
Max RAM usage in % (rows \ variables) 1 5 10 50 100 150 500 1000
100 46 46 46 46 47 47 47 47
1000 47 47 47 48 49 49 49 50
10000 52 53 53 57 57 58 58 58
100000 52 53 53 67 67 67 N/A N/A
1000000 53 60 60 N/A N/A N/A N/A N/A
Model building and prediction time in seconds (rows \ variables) 1 5 10 50 100 150 500 1000
100 0.04 0.04 0.05 0.09 0.11 0.2 0.5 0.8
1000 0.2 0.2 0.4 0.6 0.9 1.0 2.0 5
10000 2.5 3 4 10 13 14 32 61
100000 21 25 42 102 131 156 N/A N/A
1000000 329 347 384 N/A N/A N/A N/A N/A
Total execution time in seconds (rows \ variables) 1 5 10 50 100 150 500 1000
100 0.4 0.5 0.5 0.8 1.4 1.7 5 7
1000 0.5 0.6 0.8 1.6 2.5 2.9 8 17
10000 3.8 3.8 5 14 20 25 65 133
100000 25 30 50 135 195 262 N/A N/A
1000000 364 405 471 N/A N/A N/A N/A N/A
Forecasting result table size in DB (rows \ variables) 1 5 10 50 100 150 500 1000
100 9.7 kB 9.7 kB 9.7 kB 9.7 kB 9.7 kB 9.7 kB 9.7 kB 9.7 kB
1000 95.8 kB 95.8 kB 95.8 kB 95.8 kB 95.8 kB 95.8 kB 95.8 kB 95.8 kB
10000 957 kB 957 kB 957 kB 957 kB 957 kB 957 kB 957 kB 957 kB
100000 9.4 MB 9.4 MB 9.4 MB 9.4 MB 9.4 MB 9.4 MB N/A N/A
1000000 93.5 MB 93.5 MB 93.5 MB N/A N/A N/A N/A N/A

Case 3

In this case, no new situation is detected and no new model is built. An out-of-sample forecast (out-of-sample rows have to be set) and a production forecast are calculated.

Max CPU usage in % (rows \ variables) 1 5 10 50 100 150 500 1000
100 23 30 30 32 32 32 33 33
1000 28 30 34 34 34 34 34 34
10000 60 80 101 113 119 144 100 86
100000 100 127 157 130 108 101 N/A N/A
1000000 111 118 255 N/A N/A N/A N/A N/A
Max RAM usage in % (rows \ variables) 1 5 10 50 100 150 500 1000
100 32 32 32 32 32 32 32 32
1000 32 32 32 32 32 32 32 32
10000 34 34 35 34 34 34 34 34
100000 39 43 53 45 41 42 N/A N/A
1000000 52 58 62 N/A N/A N/A N/A N/A
Model building and prediction time in seconds (rows \ variables) 1 5 10 50 100 150 500 1000
100 0.03 0.04 0.04 0.05 0.07 0.08 0.24 0.4
1000 0.05 0.06 0.07 0.14 0.23 0.31 1.1 2.4
10000 0.4 0.6 0.7 3.2 4.5 6.4 19 37
100000 7 8 11 29 49 69 N/A N/A
1000000 288 309 343 N/A N/A N/A N/A N/A
Total execution time in seconds (rows \ variables) 1 5 10 50 100 150 500 1000
100 0.4 0.4 0.5 0.7 1.2 1.6 5 6
1000 0.4 0.4 0.5 1.1 2.0 2.8 7.3 15
10000 0.9 1.4 2.0 6 12 16 50 105
100000 10 14 20 76 114 184 N/A N/A
1000000 326 368 433 N/A N/A N/A N/A N/A
Forecasting result table size in DB (rows \ variables) 1 5 10 50 100 150 500 1000
100 9.7 kB 9.7 kB 9.7 kB 9.7 kB 9.7 kB 9.7 kB 9.7 kB 9.7 kB
1000 95.8 kB 95.8 kB 95.8 kB 95.8 kB 95.8 kB 95.8 kB 95.8 kB 95.8 kB
10000 957 kB 957 kB 957 kB 957 kB 957 kB 957 kB 957 kB 957 kB
100000 9.4 MB 9.4 MB 9.4 MB 9.4 MB 9.4 MB 9.4 MB N/A N/A
1000000 93.5 MB 93.5 MB 93.5 MB N/A N/A N/A N/A N/A

Case 4

No new situation is detected and no new model is built. Only the production forecast is calculated.

Max CPU usage in % (rows \ variables) 1 5 10 50 100 150 500 1000
100 23 30 30 32 32 32 34 33
1000 28 30 34 34 34 34 34 34
10000 34 34 34 35 36 36 36 47
100000 45 45 45 45 49 60 N/A N/A
1000000 45 45 56 N/A N/A N/A N/A N/A
Max RAM usage in % (rows \ variables) 1 5 10 50 100 150 500 1000
100 32 32 32 32 32 32 32 32
1000 32 32 32 32 32 32 32 32
10000 34 34 35 34 34 34 34 34
100000 39 43 53 45 41 42 N/A N/A
1000000 52 58 62 N/A N/A N/A N/A N/A
Model building and prediction time in seconds (rows \ variables) 1 5 10 50 100 150 500 1000
100 0.02 0.02 0.02 0.03 0.04 0.13 0.2 0.5
1000 0.02 0.02 0.03 0.10 0.19 0.4 1.0 2.4
10000 0.04 0.2 0.3 1.6 3.0 4.6 16 33
100000 0.4 2 2 15 33 52 N/A N/A
1000000 5 17 22 N/A N/A N/A N/A N/A
Total execution time in seconds (rows \ variables) 1 5 10 50 100 150 500 1000
100 0.3 0.4 0.4 1.0 1.6 2.3 7 9
1000 0.3 0.4 0.5 1.3 2.3 3.5 11 17
10000 0.8 0.9 1.4 6 11 17 53 110
100000 2.5 7 7 47 99 152 N/A N/A
1000000 26 61 107 N/A N/A N/A N/A N/A
Forecasting result table size in DB (rows \ variables) 1 5 10 50 100 150 500 1000
100 0.1 kB 0.1 kB 0.1 kB 0.1 kB 0.1 kB 0.1 kB 0.1 kB 0.1 kB
1000 0.1 kB 0.1 kB 0.1 kB 0.1 kB 0.1 kB 0.1 kB 95.8 kB 95.8 kB
10000 0.1 kB 0.1 kB 0.1 kB 0.1 kB 0.1 kB 0.1 kB 957 kB 957 kB
100000 0.1 kB 0.1 kB 0.1 kB 0.1 kB 0.1 kB 0.1 kB N/A N/A
1000000 0.1 kB 0.1 kB 0.1 kB N/A N/A N/A N/A N/A

Remarks

  1. The data used 4-digit precision (e.g. 0.582).
  2. Some datasets exceed 100 MB of storage in the database because DB indices are included.
  3. Computing time increases with bigger forecasting horizons; however, the increase is less than linear.
  4. The output table size is only relevant for backtesting tasks. It scales up with a bigger forecasting horizon and down with a bigger rolling window.
  5. As memory usage approaches 100 percent, TIM starts to preprocess the data by discarding rows and switching off features, which results in smaller numbers across the tables after that breaking point (RAM, CPU, and forecasting output table size). This is why the numbers do not always rise along the axes. The numbers where such preprocessing took place are denoted in italics; in this benchmark, only polynomial features were switched off.
  6. Some fields are not filled because the respective dataset would exceed the 100 MB threshold.
  7. The CPU load is expressed per core, i.e. 400% = 4 cores × 100%.
  8. The performance figures are for sequential execution of the ML requests, without scaling and spinning up more TIM Workers.

Scaling the workers

What do you do if you need more transactions per hour?

The TIM Engine provides queueing and automatically spins up new TIM Workers to cater for the volume of requests being handled.

How to calculate the size and pricing of your infrastructure?

The benchmark figures give you an indication of the performance you can expect in your use case. Determine the profile of ML requests you need and calculate the number of TIM Workers required.
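This worker calculation can be sketched as a small helper: given the benchmarked response time for your data profile and a target number of transactions per hour, it returns the number of sequential TIM Workers needed. The example figures below are taken from the Case 1/2 benchmark table (1000 rows × 50 variables, ~1.6 s total execution time):

```python
import math

def workers_needed(target_per_hour: int, response_seconds: float) -> int:
    """Number of TIM Workers needed to sustain `target_per_hour` requests,
    assuming each worker processes requests sequentially (as benchmarked)."""
    per_worker_per_hour = 3600.0 / response_seconds
    return math.ceil(target_per_hour / per_worker_per_hour)

# Example: 1000 observations x 50 variables builds in ~1.6 s, i.e. about
# 2250 requests/hour/worker; sustaining 4000 requests/hour needs:
print(workers_needed(4000, 1.6))  # 2
```

Remember that this covers TIM Worker processing only; data collection time comes on top, as noted below.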

In this table, we give an example of a calculation:

Component | Sizing Consideration | Costing Example
TIM Engine Fabric | This Kubernetes-installed component ensures a REST endpoint is available. We recommend 2 VMs with a 4-core CPU and 32 GB of memory. | 140 Euro / month for 2 D3 servers to support the cluster
Queueing Service | RabbitMQ is available as a Kubernetes cluster deployment; alternatively you can use a platform service (RabbitMQ services are available on AWS and Azure). | Optional
Database | This is the PostgreSQL database service. | Azure Database for PostgreSQL - 130 Euro / month
TIM Workers | The TIM Workers are the scalable component. The benchmark tables give the CPU load and response time, allowing you to calculate the number of 4-core/12 GB servers you need. | 2 ACI containers for TIM Workers - 240 Euro / month
Total | | 510 Euro / month

This is a two-TIM-Worker configuration. Some example throughputs:

  • RTInstantML scenario (/forecasting/forecast-jobs/build-model) - 1000 observations, 50 variables - 1.6 s response time - 2250 transactions/hour/worker = 4500 transactions per hour for this configuration
  • RTInstantML scenario (/forecasting/forecast-jobs/build-model) - 10000 observations, 50 variables - 14 s response time - 257 transactions/hour/worker = 514 transactions per hour for this configuration

Notes:

  • Do not forget to cater for data collection. The measurements in the tables above are the processing time (response time) of the TIM Worker.
  • The prices are indicative and depend on your plan with Azure.
  • The Azure prices are based on a 3-year upfront commitment.
  • Similar pricing is possible for AWS or on-premise.
  • You might want to consider servers with fewer cores if your cases do not benefit from parallelization over multiple cores.

Sizing And Estimation Methodology

Estimating anything can be a complex and error-prone process. That’s why it's called an 'estimation', rather than a 'calculation'. There are three primary approaches to sizing a TIM InstantML implementation:

  • Algorithm, or Calculation Based
  • Size By Example Based
  • Proof of Concept Based

Typical implementations of TIM InstantML do not require complex sizing and estimation processes. An algorithm-based approach, taking into account the data size and the number of ML transactions per hour per worker, allows you to determine the number of parallel workers and design your architecture.

In more complex cases a Proof of Concept might be useful. This is typically the case with more complicated peak time ML consumption requirements.

Algorithm, Or Calculation Based

An algorithm or process that accepts data input is probably the most commonly accepted tool for delivering sizing estimations. Unfortunately, this approach is generally the most inaccurate.

When considering a multiple-model, multiple-use-case implementation, a calculation that even approaches a realistic sizing answer requires well over one hundred input values, and the calculations are so complex and sensitive that an input value off by as little as 1% can produce wildly inaccurate results.

The other approach to calculation-based solutions is to simplify the calculation to the point where it is simple to understand and simple to use. This paper shows how this kind of simplification can provide us with a sizing calculator.
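One possible simplification, sketched below, is to interpolate between the benchmark data points instead of modelling the algorithm itself. The (rows, seconds) pairs are taken from the 50-variable column of the Case 1/2 total execution time table; log-log interpolation is an assumption about how the values behave between the measured points:

```python
import math
from bisect import bisect_left

# (rows, total seconds) pairs from the 50-variable column of the
# Case 1/2 "Total execution time" benchmark table.
BENCHMARK_50_VARS = [(100, 0.8), (1_000, 1.6), (10_000, 14.0), (100_000, 135.0)]

def estimated_seconds(rows: int) -> float:
    """Log-log interpolation between benchmark points (clamped at the ends)."""
    xs = [r for r, _ in BENCHMARK_50_VARS]
    if rows <= xs[0]:
        return BENCHMARK_50_VARS[0][1]
    if rows >= xs[-1]:
        return BENCHMARK_50_VARS[-1][1]
    i = bisect_left(xs, rows)
    (x0, y0), (x1, y1) = BENCHMARK_50_VARS[i - 1], BENCHMARK_50_VARS[i]
    t = (math.log(rows) - math.log(x0)) / (math.log(x1) - math.log(x0))
    return math.exp(math.log(y0) + t * (math.log(y1) - math.log(y0)))

print(round(estimated_seconds(5_000), 1))  # 7.3
```

The same lookup-and-interpolate approach extends to the other benchmark dimensions (variables, CPU, RAM) by adding more data points.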

Size-By-Example Based

A size-by-example (SBE) approach requires a set of known samples to use as data points along the thermometer of system size. The more examples available for SBE, the more accurate the sizing of the intended implementation will be.

By using these real world examples, both customers and Tangent Works can be assured that the configurations proposed have been implemented before and will provide the performance and functionality unique to the proposed implementation. Tangent Works Engineering can help here.

Proof Of Concept Based

A proof of concept (POC), or pilot based approach, offers the most accurate sizing data of all three approaches.

A POC allows you to do the following:

  • Test your InstantML implementation design
  • Test your chosen hardware or cloud platform
  • Simulate projected load
  • Validate design assumptions
  • Validate Usage
  • Provide iterative feedback for your implementation team
  • Adjust or validate the implementation decisions made prior to the POC

There are, however, two downsides to a POC based approach, namely time and money. Running a POC requires the customer to have manpower, hardware, and the time available to implement the solution, validate the solution, iterate changes, re-test, and finally analyze the POC findings.

A POC is always the best and recommended approach for any sizing exercise. It delivers results that are accurate for the unique implementation of the specific customer that are as close to deploying the real live solution as possible, without the capital outlay on hardware and project resources.