
# High-level comparison

## Inference engine

The inference engine performs the computation only; it doesn't manage the communication layer (HTTP/GRPC API, etc.).

Summary

- don't use Pytorch in production for inference
- ONNX Runtime is a good-enough choice for most inference jobs (see the sketch below this list)
- if you need the best performance, use TensorRT
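
As a concrete illustration of the summary above, here is a minimal sketch of the export-once, serve-with-ONNX-Runtime path, assuming a Hugging Face classification model; the model name, file names, and shapes are illustrative assumptions, not values prescribed by this page:

```python
# Export a Pytorch transformer to ONNX once, then serve it with ONNX Runtime.
import numpy as np
import onnxruntime as ort
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # assumption: any HF classifier works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()
model.config.return_dict = False  # return plain tuples so the tracer sees a single logits tensor

encoded = tokenizer("transformer-deploy makes model serving easier", return_tensors="pt")
torch.onnx.export(
    model,
    (encoded["input_ids"], encoded["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "logits": {0: "batch"},
    },
    opset_version=13,
)

# CPU and GPU share the same API, only the execution provider changes.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
logits = session.run(
    output_names=["logits"],
    input_feed={
        "input_ids": encoded["input_ids"].numpy(),
        "attention_mask": encoded["attention_mask"].numpy(),
    },
)[0]
print(np.argmax(logits, axis=-1))
```
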
|  | Nvidia TensorRT | Microsoft ONNX Runtime | Meta Pytorch | comments |
| --- | --- | --- | --- | --- |
| transformer-deploy support |  |  |  |  |
| Licence | Apache 2 (optimization engine is closed source) | MIT | Modified BSD |  |
| ease of use (API) | :fontawesome-regular-angry: |  |  | Nvidia has chosen not to hide technical details; an engine is tied to a single hardware + model + data shapes combination (see the build sketch below the table) |
| ease of use (documentation) | (spread out, incomplete) | (improving) | (strong community) |  |
| Hardware support | GPU + Jetson | CPU + GPU + IoT + Edge + Mobile | CPU + GPU |  |
| Performance |  |  |  | TensorRT is usually 5 to 10X faster than Pytorch when you use quantization, etc. |
| Accuracy |  |  |  | TensorRT optimizations may be a bit too aggressive and decrease model accuracy; manual work is needed to recover it. |
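
The ease-of-use comment in the table refers to the fact that a TensorRT engine is compiled for a given GPU, a given model, and a declared range of input shapes. Below is a minimal build sketch, assuming the TensorRT 8+ Python bindings and the `model.onnx` file from the previous example; the input names and shape ranges are illustrative assumptions:

```python
# Build a TensorRT engine from an ONNX file with an explicit shape profile.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(f"ONNX parsing failed: {parser.get_error(0)}")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # mixed precision is part of the 5-10X speed-up story

# The engine is optimized for these shape ranges and for the GPU doing the build,
# which is why a plan file is tied to a hardware + model + data shapes combination.
profile = builder.create_optimization_profile()
for name in ("input_ids", "attention_mask"):
    profile.set_shape(name, min=(1, 16), opt=(8, 128), max=(32, 256))
config.add_optimization_profile(profile)

serialized_engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(serialized_engine)
```

The resulting `model.plan` is only usable on a compatible GPU and within the declared shape ranges, which is the price paid for the performance gain mentioned in the table.
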

## Inference HTTP/GRPC server

|  | Nvidia Triton | Meta TorchServe | FastAPI | comments |
| --- | --- | --- | --- | --- |
| transformer-deploy support |  |  |  |  |
| Licence | Modified BSD | Apache 2 | MIT |  |
| ease of use (API) |  |  |  | as a classic HTTP server, FastAPI may appear easier to use (see the client sketch after the table) |
| ease of use (documentation) |  |  |  | FastAPI has some of the most beautiful documentation ever! |
| Performance |  |  |  | FastAPI is 6-10X slower than Triton at managing user queries |
| CPU support |  |  |  |  |
| GPU support |  |  |  |  |
| dynamic batching |  |  |  | combines individual inference requests to improve inference throughput |
| concurrent model execution |  |  |  | runs multiple models (or multiple instances of the same model) |
| pipeline |  |  |  | chains one or more models and connects input and output tensors between those models |
| native multiple backends* support |  |  |  | *backends: Microsoft ONNX Runtime, Nvidia TensorRT, Meta Pytorch |
| REST API |  |  |  |  |
| GRPC API |  |  |  |  |
| Inference metrics |  |  |  | GPU utilization, server throughput, and server latency |
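
To make the Triton side of the table concrete, here is a minimal client sketch using `tritonclient` over HTTP; the server address, model name, and tensor names are illustrative assumptions (a real deployment uses whatever the Triton model repository declares), and switching to `tritonclient.grpc` keeps the same call structure over GRPC:

```python
# Query a running Triton server over its REST API.
import numpy as np
import tritonclient.http as triton_http

client = triton_http.InferenceServerClient(url="localhost:8000")

# Tokenization is done client-side here: a dummy batch of 1 sequence of 8 tokens.
input_ids = np.ones((1, 8), dtype=np.int64)
attention_mask = np.ones((1, 8), dtype=np.int64)

inputs = [
    triton_http.InferInput("input_ids", list(input_ids.shape), "INT64"),
    triton_http.InferInput("attention_mask", list(attention_mask.shape), "INT64"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)
outputs = [triton_http.InferRequestedOutput("logits")]

result = client.infer(model_name="transformer_onnx_model", inputs=inputs, outputs=outputs)
print(result.as_numpy("logits"))
```

Dynamic batching, concurrent model execution, and inference metrics are handled server-side by Triton; with FastAPI those concerns have to be implemented by hand.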