Performance

SSF was designed to be part of a larger deployment solution and its main focus is ease of use. The operational area of an SSF instance is a single host unit: one IPU-POD machine or one VM with assigned compute and IPU resources. SSF performance targets mid-scale, single-model deployments. When used with higher-level platforms such as Kubernetes, it can serve as a base unit for larger production deployments.

Overview

As a rule, in production use cases the maximum number of requests per second (RPS) for machine learning applications is limited by the time required for low level inference requests. In other words, the time spent computing the model output is many times greater than the time spent delivering the message from the user to the input of the sample processing algorithms and the model.

On the other hand, the maximum theoretical performance depends only on the performance of the transport and scheduling mechanisms, and can be useful as a reference number.
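As a rough illustration of why compute time usually dominates, a simple serial model treats each request as paying a fixed dispatch/transport overhead plus the model's compute time. The sketch below is an assumption for intuition only (not SSF's internals); the ~1/750 s overhead figure is back-derived from the ~750 RPS pass-through ceiling reported in the table further down.

```python
# Illustrative only: estimates how much of the end-to-end request time is
# spent in dispatch/transport versus model compute. The default overhead of
# 1/750 s is back-derived from the ~750 RPS pass-through ceiling below.

def effective_rps(model_time_s: float, dispatch_time_s: float = 1 / 750) -> float:
    """Serial model: each request pays dispatch overhead plus compute time."""
    return 1.0 / (model_time_s + dispatch_time_s)

# A model that computes for 50 ms is barely affected by dispatch overhead:
print(round(effective_rps(0.050), 1))   # 19.5 -> close to the 20 RPS compute-only bound
# A near-instant model is dominated by dispatch overhead instead:
print(round(effective_rps(0.0001)))     # 698  -> approaches the ~750 RPS dispatch ceiling
```

Under this model, once compute time is tens of milliseconds, the dispatch layer contributes only a few percent of the total latency, which is why practical RPS is inference-bound.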

Upper bound of performance

The theoretical maximum RPS for SSF with a pass-through application (a workload that performs no operation, for example the application in the examples/simple directory) is limited by the internally used dispatching mechanism. Note that using the server replication factor (--rs) (only available for FastAPI) is practically equivalent to running multiple independent SSF instances with separate communication channels: it allows linear scaling at the expense of linear hardware allocation. The app replication factor (--ra) is not included in this test since it has no effect for pass-through workloads (there is still only a single internal dispatching mechanism). The table below shows the maximum RPS for SSF with a pass-through application deployed on a single host machine.

API       --rs   RPS
FastAPI   1      750
FastAPI   2      1400
GRPC      1      1150

Practical application performance

Workload performance for machine learning applications will be influenced by multiple factors. Such applications use both CPUs and IPUs and are dependent on network and transportation layers. Performance, therefore, needs to be evaluated with these factors in mind.

For reference, the table below contains results gathered by testing one SSF instance running the test application from examples/simple_delay. This application imitates a workload that is IPU performance bound: each request waits on a sleep system call, releasing CPU resources. For test purposes, the sleep time was set to 1/30 of a second to simulate a model capable of processing 30 FPS.
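Because each replica of this workload can serve at most 30 requests per second, ideal scaling is simply linear in the replication factor. The sketch below (an illustration, not part of SSF) computes that ideal bound so it can be compared with the measured numbers in the table, which fall short of it at higher replication factors as the dispatch layer saturates.

```python
# Illustrative: ideal linear scaling for the examples/simple_delay workload,
# where each replica sleeps 1/30 s per request (a simulated 30 FPS model).

FPS = 30  # one replica can serve at most 30 requests per second

def ideal_rps(replicas: int, fps: int = FPS) -> int:
    # Ideal scaling: throughput grows linearly with the number of replicas.
    return replicas * fps

print(ideal_rps(1))   # 30  -> matches the measured 30 RPS
print(ideal_rps(16))  # 480 -> measured: 455 (FastAPI), 477 (GRPC)
print(ideal_rps(32))  # 960 -> measured: 510 (FastAPI), 670 (GRPC)
```

The widening gap between the ideal and measured values at 16 and 32 replicas shows the internal dispatching mechanism becoming the bottleneck.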

The server replication factor (--rs) and the app replication factor (--ra) are similar in that both aim to occupy more IPUs by replicating the same model. However, the server replication factor relies on the operating system for task scheduling, while the app replication factor relies on the SSF internal queue. The operating system may assign jobs unevenly since, from a system-level perspective, a task waiting on a system call (for example an external hardware call) is ready to be assigned the next task. In contrast, the internal SSF scheduling uses a queue-length predicate. Depending on the application being run, optimal replication performance may require only one of these parameters, or both.
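A queue-length predicate can be sketched in a few lines. The following is a minimal illustration of the idea (not SSF's actual implementation): each new request is dispatched to the replica whose queue is currently shortest, which keeps load even regardless of how the operating system schedules the workers.

```python
# Minimal sketch of queue-length-based dispatching (not SSF's actual code):
# new work goes to the replica with the shortest internal queue.

from collections import deque

class Dispatcher:
    def __init__(self, num_replicas: int):
        self.queues = [deque() for _ in range(num_replicas)]

    def dispatch(self, request) -> int:
        # Pick the replica with the fewest queued requests (ties -> lowest index).
        idx = min(range(len(self.queues)), key=lambda i: len(self.queues[i]))
        self.queues[idx].append(request)
        return idx

d = Dispatcher(3)
targets = [d.dispatch(f"req-{n}") for n in range(6)]
print(targets)  # [0, 1, 2, 0, 1, 2] -> load spreads evenly across replicas
```

When replicas drain at different speeds (for example because the OS scheduled one worker unevenly), this predicate automatically steers new requests toward the less-loaded replicas.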

API       --ra   --rs   RPS
FastAPI   1      1      30
FastAPI   4      1      119
FastAPI   8      1      239
FastAPI   16     1      455
FastAPI   32     1      510
FastAPI   16     2      910
GRPC      1      1      30
GRPC      4      1      119
GRPC      8      1      239
GRPC*     16     1      477
GRPC*     32     1      670

NOTE: (*) For the rows marked with an asterisk (app replication factor > 10), grpc-max-connections was set to 20, since achieving maximum RPS required starting ~50 simulated users.

Performance recommendation

To achieve optimal performance, consider the following recommendations:

- Using application trace incurs a significant performance penalty (up to 20%). To disable application logs, use --modify-config "application.trace=False".
- Prometheus metrics incur a minor performance impact (less than 5% for a pass-through workload) but can be disabled using the --prometheus-disable option.
- Internal SSF scheduling can process a maximum of ~1200 RPS. To exceed this with a single SSF deployment, use the FastAPI server replication factor, which starts multiple SSF instances under the hood, adding a further scheduling layer at the cost of additional hardware allocation.
- For FastAPI, depending on the application and the expected load on the endpoint, SSF internal queuing may decrease performance significantly, so consider using a combination of the server replication factor (--rs) and the app replication factor (--ra) to achieve the optimal result.
- GRPC starts 10 workers by default. This is enough for mid-size deployments, but the number can be increased using grpc-max-connections if tens of users are expected to stress the server at the same time.
- For large-scale applications, it is recommended to use Kubernetes and a HorizontalPodAutoscaler combined with the server replication factor (--rs) and/or the app replication factor (--ra).
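The interaction between the internal scheduling cap and the server replication factor can be summarized with a back-of-the-envelope estimate. The sketch below is an assumption-laden illustration (not an SSF API): per-instance throughput is taken to be capped by the ~1200 RPS dispatcher limit mentioned above, and --rs multiplies that cap by launching independent instances.

```python
# Back-of-the-envelope estimator (illustrative assumptions, not a guarantee):
# one SSF instance is capped by the internal dispatcher (~1200 RPS), and the
# server replication factor (--rs) multiplies that cap via independent instances.

DISPATCH_CAP_RPS = 1200  # approximate internal SSF scheduling limit

def estimated_max_rps(app_rps: int, rs: int = 1) -> int:
    """app_rps: throughput the replicated application itself can sustain
    within one instance; rs: server replication factor (--rs)."""
    return min(app_rps, DISPATCH_CAP_RPS) * rs

print(estimated_max_rps(3000, rs=1))  # 1200 -> dispatcher-bound, app speed is irrelevant
print(estimated_max_rps(455, rs=2))   # 910  -> matches the measured --ra 16, --rs 2 row
```

In other words, once the application itself can exceed ~1200 RPS inside one instance, only server replication (or multiple SSF instances behind an external balancer) raises the ceiling further.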