
Size your Infrastructure for a TIM 5.0 Implementation

Introduction​

This paper explains the capacity planning methodologies available for TIM InstantML and the calculations used to obtain metrics for estimating and sizing a TIM Engine environment.

The TIM InstantML technology is deployable in various ways. TIM InstantML can be deployed as a SaaS service that you call from your IT environment.

Alternatively, TIM InstantML can be deployed in an On-Premise environment where you provide the server infrastructure, or in a Bring Your Own License (BYOL on Azure, AWS, ...) setup.

This document describes sizing considerations for an On-Premise or BYOL environment.

In the SaaS scenario, scaling of the service is done automatically. The BYOL/On-Premise solution also provides scaling, but you will need to provide sufficient resources.

This paper applies to the following versions of the TIM software:

  • TIM Engine 5.X
  • TIM Studio 5.X

Architectural Components​

TIM InstantML runs mainly on a Kubernetes cluster.

As an example, we provide an Azure Deployment Scheme:

  • Scalability Fabric TIM Engine with queuing - AKS Cluster with D3 v2 VM – This is the Kubernetes Cluster Service implementation by Azure.
  • Database - Azure Database for PostgreSQL – This is an Azure database service.
  • TIM Workers - ACI for TIM worker instances – This is the fast scaling Azure Container Instances service (or Kubernetes).

Typically at least two of these services will be set up for redundancy.

The number of TIM Workers is scaled up depending on the number of requests you send to the TIM Engine, and therefore on the number of requests in the queue.

In an On-Premise environment, local Kubernetes and PostgreSQL installations are used.

On other cloud environments the appropriate services will be used. As an example, on AWS:

  • Scalability Fabric TIM Engine and TIM Workers - EKS Cluster with m5.xlarge VM – This is the Kubernetes Cluster Service implementation by AWS.
  • Database - PostgreSQL
  • Queuing of TIM Engine tasks - AmazonMQ for RabbitMQ

Difficulties in Sizing the Environment​

The CPU time and memory required to create a model, or to create a forecast, classification or anomaly detection, are determined by the following elements:

  • The size of the data structure
  • The number of predictors (columns)
  • The number of timestamps (rows)
  • The predictor feature importance
  • The correlation between the predictor candidates and the target

This makes it difficult to define an algorithm that yields the exact memory and CPU consumption. Instead, this document provides benchmarking figures that let you size the architecture from benchmark data rather than from rock-solid calculations.

Capacity Planning And Performance Overview​

Data input size Considerations​

The lightning-fast speed of TIM InstantML is the result of efficient in-memory processing and parallelization of computation. The default maximum input size is a dataset of 100 MB (measured in CSV format). Check out Data Properties for details.

Memory Usage​

The only noteworthy objects that require significant space are:

  • dataset
  • forecasting / detection tabular output
  • root cause analysis output

There are many more objects involved in the process, such as the model, logs and accuracy metrics, but they all require less than 1 MB of space. The tabular output grows significantly with a larger forecasting horizon and a smaller rolling window in the case of forecasting; however, this is only relevant in "backtesting" scenarios where users try different settings on historical data. In production setups, the tabular output shrinks to kilobytes because only the new timestamps are evaluated. The same goes for the root cause analysis output. All in all, the only memory-intensive object in a real production pipeline is the dataset itself.

Processing Time​

There are two significant bottlenecks in the whole forecasting / detection process:

  • dataset upload / update
  • model building / rebuilding

There are usually many more steps in the whole pipeline, however they require little to no time to process. This includes the model evaluation (the forecasting / detection itself): once the model is ready, generating forecasts / detections is lightning fast (under a second). That is why we restrict the benchmarking times to model building.

Benchmark Data and Scenarios​

In most cases the sizing calculation is straightforward.

A typical TIM Worker runs on the following configuration:

  • CPU: 4 virtual CPU cores
  • Memory: 12 GB of RAM

In this benchmark we provide performance data for a single TIM Worker instance across different dataset sizes.

Benchmark results​

In the following tables you can find the processing response time and the CPU load created by the request, based on one TIM Worker (running on one 4-core CPU). The datasets were already uploaded before the benchmark started.

We provide benchmarks for two forecasting endpoints and different situations:

| Case | Request type | New model | Backtesting |
| --- | --- | --- | --- |
| 1 | forecasting/forecast-jobs/build-model | yes | yes |
| 2 | forecasting/forecast-jobs/{id}/rebuild-model | yes | yes |
| 3 | forecasting/forecast-jobs/{id}/rebuild-model | no | yes |
| 4 | forecasting/forecast-jobs/{id}/rebuild-model | no | no |

Forecasting and classification jobs​

This benchmark was done for forecasting jobs that build models for 1-sample-ahead forecasts. The benchmark relates to the forecasting execution request. There are different job types (build and rebuild), however they always call the same core underneath. The benchmark result does not differ per request type as such; it differs with the number of models that have to be built. Imagine you call a build request first for the S+1 to S+3 horizon and then rebuild the same Model Zoo for the S+1 to S+6 horizon: in both cases, only three models are added to the Model Zoo, so the benchmark stays the same - slightly less than 3 times the respective number in the tables provided (the benchmark is for a Model Zoo with 1 model and the scaling is less than linear). The tables always show the number of rows on the vertical axis and the number of variables (target variable plus predictors) on the horizontal axis.
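The Model Zoo behaviour described above can be sketched in a few lines: only horizon steps that do not yet have a model trigger new model building, so a rebuild over a longer horizon costs roughly the same as the original build. The function and the set-based bookkeeping below are illustrative assumptions, not part of the TIM API.

```python
def new_models_needed(built_horizons, requested_horizons):
    """Count the models a (re)build request adds to the Model Zoo:
    only horizon steps without an existing model are built."""
    return sum(1 for h in requested_horizons if h not in built_horizons)

# First build for S+1..S+3: three new models.
zoo = set()
first = new_models_needed(zoo, range(1, 4))
zoo.update(range(1, 4))

# Rebuild of the same Model Zoo for S+1..S+6: only S+4..S+6 are new,
# so again three models and roughly the same benchmark time.
second = new_models_needed(zoo, range(1, 7))
```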

| Dataset size in DB | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 8192 bytes | 40 kB | 40 kB | 72 kB | 120 kB | 160 kB | 424 kB | 824 kB |
| 1000 | 88 kB | 120 kB | 160 kB | 472 kB | 920 kB | 1360 kB | 4024 kB | 8024 kB |
| 10000 | 616 kB | 936 kB | 1336 kB | 4472 kB | 8920 kB | 13 MB | 39 MB | 78 MB |
| 100000 | 5912 kB | 9120 kB | 13 MB | 43 MB | 87 MB | 130 MB | N/A | N/A |
| 1000000 | 57 MB | 89 MB | 128 MB | N/A | N/A | N/A | N/A | N/A |
| Size of CSV file | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 2.6 kB | 4.9 kB | 7.8 kB | 30.9 kB | 59.9 kB | 88.9 kB | 291.9 kB | 581.7 kB |
| 1000 | 25.7 kB | 48.4 kB | 77.1 kB | 307.5 kB | 595.2 kB | 883.0 kB | 2.8 MB | 5.6 MB |
| 10000 | 256.5 kB | 482.9 kB | 770.3 kB | 3.0 MB | 5.8 MB | 8.6 MB | 28.3 MB | 56.4 MB |
| 100000 | 2.5 MB | 4.7 MB | 7.5 MB | 30.0 MB | 58.1 MB | 86.2 MB | N/A | N/A |
| 1000000 | 25.0 MB | 47.2 MB | 75.2 MB | N/A | N/A | N/A | N/A | N/A |
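To check your own data against these figures and the 100 MB default input limit before uploading, a rough CSV size can be estimated from the row and column counts. This is a back-of-the-envelope sketch with assumed character widths for timestamps and values, not a TIM formula; it is deliberately conservative and may overestimate compared with the measured sizes above.

```python
def estimate_csv_bytes(rows, columns, timestamp_chars=19, value_chars=6):
    """Rough CSV size: per row, one timestamp plus `columns` numeric values,
    each preceded by a separator, plus a newline. Widths are assumptions."""
    bytes_per_row = timestamp_chars + columns * (value_chars + 1) + 1
    return rows * bytes_per_row

# 100 rows x 1 variable -> ~2.7 kB, in line with the 2.6 kB measured above.
small = estimate_csv_bytes(100, 1)

# A 1,000,000-row, 150-variable dataset clearly exceeds the 100 MB limit,
# matching the N/A cells in the benchmark tables.
fits = estimate_csv_bytes(1_000_000, 150) <= 100 * 1024 * 1024
```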

Case 1 and 2​

As described above, what influences the benchmark results is the number of new models TIM builds; therefore cases 1 and 2 are merged into one benchmark. In both cases, one model is built and an in-sample forecast and a production forecast are calculated.

| Max CPU usage in % | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 23 | 30 | 30 | 32 | 32 | 32 | 34 | 33 |
| 1000 | 28 | 30 | 34 | 34 | 34 | 34 | 34 | 34 |
| 10000 | 60 | 80 | 120 | 129 | 133 | 137 | 144 | 200 |
| 100000 | 165 | 176 | 215 | 338 | 372 | 380 | N/A | N/A |
| 1000000 | 149 | 192 | 198 | N/A | N/A | N/A | N/A | N/A |

| Max RAM usage in % | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 46 | 46 | 46 | 46 | 47 | 47 | 47 | 47 |
| 1000 | 47 | 47 | 47 | 48 | 49 | 49 | 49 | 50 |
| 10000 | 52 | 53 | 53 | 57 | 57 | 58 | 58 | 58 |
| 100000 | 52 | 53 | 53 | 67 | 67 | 67 | N/A | N/A |
| 1000000 | 53 | 60 | 60 | N/A | N/A | N/A | N/A | N/A |

| Model building and prediction time in seconds | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 0.04 | 0.04 | 0.05 | 0.09 | 0.11 | 0.2 | 0.5 | 0.8 |
| 1000 | 0.2 | 0.2 | 0.4 | 0.6 | 0.9 | 1.0 | 2.0 | 5 |
| 10000 | 2.5 | 3 | 4 | 10 | 13 | 14 | 32 | 61 |
| 100000 | 21 | 25 | 42 | 102 | 131 | 156 | N/A | N/A |
| 1000000 | 329 | 347 | 384 | N/A | N/A | N/A | N/A | N/A |

| Total execution time in seconds | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 0.4 | 0.5 | 0.5 | 0.8 | 1.4 | 1.7 | 5 | 7 |
| 1000 | 0.5 | 0.6 | 0.8 | 1.6 | 2.5 | 2.9 | 8 | 17 |
| 10000 | 3.8 | 3.8 | 5 | 14 | 20 | 25 | 65 | 133 |
| 100000 | 25 | 30 | 50 | 135 | 195 | 262 | N/A | N/A |
| 1000000 | 364 | 405 | 471 | N/A | N/A | N/A | N/A | N/A |

| Forecasting result table size in DB | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB |
| 1000 | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB |
| 10000 | 957 kB | 957 kB | 957 kB | 957 kB | 957 kB | 957 kB | 957 kB | 957 kB |
| 100000 | 9.4 MB | 9.4 MB | 9.4 MB | 9.4 MB | 9.4 MB | 9.4 MB | N/A | N/A |
| 1000000 | 93.5 MB | 93.5 MB | 93.5 MB | N/A | N/A | N/A | N/A | N/A |

Case 3​

In this case, no new situation is detected and no new model is built. An out-of-sample forecast (the out-of-sample rows have to be set) and a production forecast are calculated.

| Max CPU usage in % | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 23 | 30 | 30 | 32 | 32 | 32 | 33 | 33 |
| 1000 | 28 | 30 | 34 | 34 | 34 | 34 | 34 | 34 |
| 10000 | 60 | 80 | 101 | 113 | 119 | 144 | 100 | 86 |
| 100000 | 100 | 127 | 157 | 130 | 108 | 101 | N/A | N/A |
| 1000000 | 111 | 118 | 255 | N/A | N/A | N/A | N/A | N/A |

| Max RAM usage in % | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
| 1000 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
| 10000 | 34 | 34 | 35 | 34 | 34 | 34 | 34 | 34 |
| 100000 | 39 | 43 | 53 | 45 | 41 | 42 | N/A | N/A |
| 1000000 | 52 | 58 | 62 | N/A | N/A | N/A | N/A | N/A |

| Model building and prediction time in seconds | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 0.03 | 0.04 | 0.04 | 0.05 | 0.07 | 0.08 | 0.24 | 0.4 |
| 1000 | 0.05 | 0.06 | 0.07 | 0.14 | 0.23 | 0.31 | 1.1 | 2.4 |
| 10000 | 0.4 | 0.6 | 0.7 | 3.2 | 4.5 | 6.4 | 19 | 37 |
| 100000 | 7 | 8 | 11 | 29 | 49 | 69 | N/A | N/A |
| 1000000 | 288 | 309 | 343 | N/A | N/A | N/A | N/A | N/A |

| Total execution time in seconds | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 0.4 | 0.4 | 0.5 | 0.7 | 1.2 | 1.6 | 5 | 6 |
| 1000 | 0.4 | 0.4 | 0.5 | 1.1 | 2.0 | 2.8 | 7.3 | 15 |
| 10000 | 0.9 | 1.4 | 2.0 | 6 | 12 | 16 | 50 | 105 |
| 100000 | 10 | 14 | 20 | 76 | 114 | 184 | N/A | N/A |
| 1000000 | 326 | 368 | 433 | N/A | N/A | N/A | N/A | N/A |

| Forecasting result table size in DB | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB |
| 1000 | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB |
| 10000 | 957 kB | 957 kB | 957 kB | 957 kB | 957 kB | 957 kB | 957 kB | 957 kB |
| 100000 | 9.4 MB | 9.4 MB | 9.4 MB | 9.4 MB | 9.4 MB | 9.4 MB | N/A | N/A |
| 1000000 | 93.5 MB | 93.5 MB | 93.5 MB | N/A | N/A | N/A | N/A | N/A |

Case 4​

No new situation is detected and no new model is built. Only the production forecast is calculated.

| Max CPU usage in % | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 23 | 30 | 30 | 32 | 32 | 32 | 34 | 33 |
| 1000 | 28 | 30 | 34 | 34 | 34 | 34 | 34 | 34 |
| 10000 | 34 | 34 | 34 | 35 | 36 | 36 | 36 | 47 |
| 100000 | 45 | 45 | 45 | 45 | 49 | 60 | N/A | N/A |
| 1000000 | 45 | 45 | 56 | N/A | N/A | N/A | N/A | N/A |

| Max RAM usage in % | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
| 1000 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
| 10000 | 34 | 34 | 35 | 34 | 34 | 34 | 34 | 34 |
| 100000 | 39 | 43 | 53 | 45 | 41 | 42 | N/A | N/A |
| 1000000 | 52 | 58 | 62 | N/A | N/A | N/A | N/A | N/A |

| Model building and prediction time in seconds | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 0.02 | 0.02 | 0.02 | 0.03 | 0.04 | 0.13 | 0.2 | 0.5 |
| 1000 | 0.02 | 0.02 | 0.03 | 0.10 | 0.19 | 0.4 | 1.0 | 2.4 |
| 10000 | 0.04 | 0.2 | 0.3 | 1.6 | 3.0 | 4.6 | 16 | 33 |
| 100000 | 0.4 | 2 | 2 | 15 | 33 | 52 | N/A | N/A |
| 1000000 | 5 | 17 | 22 | N/A | N/A | N/A | N/A | N/A |

| Total execution time in seconds | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 0.3 | 0.4 | 0.4 | 1.0 | 1.6 | 2.3 | 7 | 9 |
| 1000 | 0.3 | 0.4 | 0.5 | 1.3 | 2.3 | 3.5 | 11 | 17 |
| 10000 | 0.8 | 0.9 | 1.4 | 6 | 11 | 17 | 53 | 110 |
| 100000 | 2.5 | 7 | 7 | 47 | 99 | 152 | N/A | N/A |
| 1000000 | 26 | 61 | 107 | N/A | N/A | N/A | N/A | N/A |

| Forecasting result table size in DB | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB |
| 1000 | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 95.8 kB | 95.8 kB |
| 10000 | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 957 kB | 957 kB |
| 100000 | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | N/A | N/A |
| 1000000 | 0.1 kB | 0.1 kB | 0.1 kB | N/A | N/A | N/A | N/A | N/A |

Remarks​

  1. The data used 4-digit precision (e.g. 0.582).
  2. Some datasets exceed 100 MB storage size in the database because we include db indices.
  3. The computing time increases with bigger forecasting horizons, however, the increase is smaller than linear.
  4. The output table size is only relevant for backtesting tasks. It scales up with bigger forecasting horizon and down with bigger rolling window.
  5. As the memory usage approaches 100 percent, TIM starts to preprocess the data by discarding rows and switching off features, which results in smaller numbers across the tables after that breaking point (RAM, CPU and forecasting output table size). This is why the numbers do not always rise along the axes. The numbers where such preprocessing took place were denoted in italics; in this benchmark only polynomial features were switched off.
  6. Some fields are not filled because the respective dataset would be bigger than the 100 MB threshold.
  7. The CPU load is expressed per core, i.e. a 4-core machine can reach 400% (4 × 100% per core).
  8. The performance figures are for sequential execution of the ML requests without scaling and spinning up more TIM Workers.

Scaling the workers​

What do you do if you need more transactions per hour?

The TIM Engine provides queueing and automatically spins up new TIM Workers to cater for the volume of requests being handled.
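One way to estimate how many workers keep that queue from growing is an offered-load calculation: requests per second times seconds per request gives the average number of busy workers. The helper below and its 20% headroom factor are illustrative assumptions for planning, not a TIM Engine feature.

```python
import math

def workers_needed(requests_per_hour, seconds_per_request, headroom=1.2):
    """Minimum worker count so that the average offered load (busy-worker
    seconds per second) stays below capacity, with headroom for peaks."""
    offered_load = requests_per_hour * seconds_per_request / 3600.0
    return max(1, math.ceil(offered_load * headroom))

# e.g. 4000 requests/hour at 1.6 s each -> ~1.8 busy workers on average,
# rounded up to 3 workers with 20% headroom.
n = workers_needed(4000, 1.6)
```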

How to calculate the size and pricing of your infrastructure?​

The benchmark figures give you an indication of the performance in your use case. Determine the profile of ML requests you need, then calculate the number of TIM Workers required.

In this table, we give an example of a calculation:

| Component | Sizing Consideration | Cost | Costing Example |
| --- | --- | --- | --- |
| TIM Engine Fabric | This Kubernetes-installed component ensures a REST endpoint is available. | We recommend 2 VMs with a 4-core CPU and 32 GB of memory for this. | 140 Euro / Month for 2 D3 servers to support the cluster |
| Queueing Service | RabbitMQ is available as a Kubernetes cluster deployment; alternatively you can use a platform service for this. | RabbitMQ services are available on AWS and Azure. | Optional |
| Database | This is the PostgreSQL database service. | | Azure Database for PostgreSQL - 130 Euro / Month |
| TIM Workers | The TIM Workers are the scalable component. | You can find the CPU load and response time in the benchmark tables; this allows you to calculate the number of 4-core / 12 GB servers you need. | 2 ACI containers for TIM Workers - 240 Euro / Month |
| Total | | | 510 Euro / Month |

This is a two-TIM-Worker configuration. Some example throughputs:

  • RTInstantML scenario (/forecasting/forecast-jobs/build-model) - 1000 observations, 50 variables - 1.6 s response time - 2250 transactions/hour/worker = 4500 transactions per hour for this configuration
  • RTInstantML scenario (/forecasting/forecast-jobs/build-model) - 10000 observations, 50 variables - 14 s response time - 257 transactions/hour/worker = 514 transactions per hour for this configuration
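The per-configuration throughputs above follow directly from the benchmark response times: transactions per hour equal 3600 seconds divided by the response time, times the number of workers. A minimal sketch of that arithmetic:

```python
def throughput_per_hour(response_time_s, workers=1):
    """Sequential transactions per hour for a given benchmark response time."""
    return round(workers * 3600 / response_time_s)

per_worker = throughput_per_hour(1.6)      # 1.6 s response time, one worker
two_workers = throughput_per_hour(1.6, 2)  # the two-worker configuration
slow = throughput_per_hour(14)             # 14 s response time, one worker
```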

Notes:

  • Do not forget to cater for data collection. The measurements in the tables above are the processing time (response time) of the TIM Worker.
  • The prices are indicative and depend on your plan with Azure.
  • The Azure prices are based on a 3-year upfront commitment.
  • Similar pricing is possible for AWS or on-premise.
  • You might want to consider servers with fewer cores if your use case does not benefit from parallelization across multiple cores.

Sizing And Estimation Methodology​

Estimating anything can be a complex and error-prone process; that is why it is called an 'estimation' rather than a 'calculation'. There are three primary approaches to sizing a TIM InstantML implementation:

  • Algorithm, or Calculation Based
  • Size By Example Based
  • Proof of Concept Based

Typical implementations of TIM InstantML do not require complex sizing and estimation processes. An algorithm-based approach, taking into account the data size and the number of ML transactions per hour per worker, allows you to determine the number of parallel workers and design your architecture.

In more complex cases a Proof of Concept might be useful. This is typically the case with more complicated peak time ML consumption requirements.

Algorithm, Or Calculation Based​

An algorithm or process that accepts data input is probably the most commonly accepted tool for delivering sizing estimations. Unfortunately, this approach is generally the most inaccurate.

When considering a multiple-model, multiple-use-case implementation, a calculation that even approaches a realistic sizing answer requires more than one hundred input values, and the calculations are so complex and sensitive that supplying an input value off by just 1% produces wildly inaccurate results.

The other approach to calculation-based solutions is to simplify the calculation to the point where it is simple to understand and simple to use. This paper shows how this kind of simplification can provide us with a sizing calculator.

Size-By-Example Based​

A size-by-example (SBE) approach requires a set of known samples to use as data points along the thermometer of system size. The more examples available for SBE, the more accurately the intended implementation can be sized.

By using these real world examples, both customers and Tangent Works can be assured that the configurations proposed have been implemented before and will provide the performance and functionality unique to the proposed implementation. Tangent Works Engineering can help here.

Proof Of Concept Based​

A proof of concept (POC), or pilot based approach, offers the most accurate sizing data of all three approaches.

A POC allows you to do the following:

  • Test your InstantML implementation design
  • Test your chosen hardware or cloud platform
  • Simulate projected load
  • Validate design assumptions
  • Validate Usage
  • Provide iterative feedback for your implementation team
  • Adjust or validate the implementation decisions made prior to the POC

There are, however, two downsides to a POC based approach, namely time and money. Running a POC requires the customer to have manpower, hardware, and the time available to implement the solution, validate the solution, iterate changes, re-test, and finally analyze the POC findings.

A POC is always the best and recommended approach for any sizing exercise. It delivers results that are accurate for the unique implementation of the specific customer and as close to deploying the real live solution as possible, without the capital outlay on hardware and project resources.