
# High-level comparison

## Inference engine

The inference engine performs the computation only; it doesn't manage the communication layer (HTTP/GRPC API, etc.).

Summary

- don't use Pytorch in production for inference
- ONNX Runtime is a good-enough choice for most inference jobs (see the sketch below this list)
- if you need the best performance, use TensorRT
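
As a concrete illustration of the summary above, here is a minimal sketch of the export-once, serve-with-ONNX-Runtime path, assuming a Hugging Face classification model; the model name, file names, and shapes are illustrative assumptions, not values prescribed by this page:

```python
# Export a Pytorch transformer to ONNX once, then serve it with ONNX Runtime.
import numpy as np
import onnxruntime as ort
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # assumption: any HF classifier works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()
model.config.return_dict = False  # return plain tuples so the tracer sees a single logits tensor

encoded = tokenizer("transformer-deploy makes model serving easier", return_tensors="pt")
torch.onnx.export(
    model,
    (encoded["input_ids"], encoded["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "logits": {0: "batch"},
    },
    opset_version=13,
)

# CPU and GPU share the same API, only the execution provider changes.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
logits = session.run(
    output_names=["logits"],
    input_feed={
        "input_ids": encoded["input_ids"].numpy(),
        "attention_mask": encoded["attention_mask"].numpy(),
    },
)[0]
print(np.argmax(logits, axis=-1))
```
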
|  | Nvidia TensorRT | Microsoft ONNX Runtime | Meta Pytorch | comments |
| --- | --- | --- | --- | --- |
| transformer-deploy support |  |  |  |  |
| Licence | Apache 2 (optimization engine is closed source) | MIT | Modified BSD |  |
| ease of use (API) | :fontawesome-regular-angry: |  |  | Nvidia has chosen not to hide technical details; an engine is tied to a single hardware + model + data shapes combination (see the build sketch below the table) |
| ease of use (documentation) | (spread out, incomplete) | (improving) | (strong community) |  |
| Hardware support | GPU + Jetson | CPU + GPU + IoT + Edge + Mobile | CPU + GPU |  |
| Performance |  |  |  | TensorRT is usually 5 to 10X faster than Pytorch when you use quantization, etc. |
| Accuracy |  |  |  | TensorRT optimizations may be a bit too aggressive and decrease model accuracy; manual work is needed to recover it. |
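
The ease-of-use comment in the table refers to the fact that a TensorRT engine is compiled for a given GPU, a given model, and a declared range of input shapes. Below is a minimal build sketch, assuming the TensorRT 8+ Python bindings and the `model.onnx` file from the previous example; the input names and shape ranges are illustrative assumptions:

```python
# Build a TensorRT engine from an ONNX file with an explicit shape profile.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(f"ONNX parsing failed: {parser.get_error(0)}")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # mixed precision is part of the 5-10X speed-up story

# The engine is optimized for these shape ranges and for the GPU doing the build,
# which is why a plan file is tied to a hardware + model + data shapes combination.
profile = builder.create_optimization_profile()
for name in ("input_ids", "attention_mask"):
    profile.set_shape(name, min=(1, 16), opt=(8, 128), max=(32, 256))
config.add_optimization_profile(profile)

serialized_engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(serialized_engine)
```

The resulting `model.plan` is only usable on a compatible GPU and within the declared shape ranges, which is the price paid for the performance gain mentioned in the table.
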

## Inference HTTP/GRPC server

|  | Nvidia Triton | Meta TorchServe | FastAPI | comments |
| --- | --- | --- | --- | --- |
| transformer-deploy support |  |  |  |  |
| Licence | Modified BSD | Apache 2 | MIT |  |
| ease of use (API) |  |  |  | as a classic HTTP server, FastAPI may appear easier to use (see the client sketch after the table) |
| ease of use (documentation) |  |  |  | FastAPI has some of the most beautiful documentation ever! |
| Performance |  |  |  | FastAPI is 6-10X slower than Triton at managing user queries |
| CPU support |  |  |  |  |
| GPU support |  |  |  |  |
| dynamic batching |  |  |  | combines individual inference requests to improve inference throughput |
| concurrent model execution |  |  |  | runs multiple models (or multiple instances of the same model) |
| pipeline |  |  |  | chains one or more models and connects input and output tensors between those models |
| native multiple backends* support |  |  |  | *backends: Microsoft ONNX Runtime, Nvidia TensorRT, Meta Pytorch |
| REST API |  |  |  |  |
| GRPC API |  |  |  |  |
| Inference metrics |  |  |  | GPU utilization, server throughput, and server latency |
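
To make the Triton side of the table concrete, here is a minimal client sketch using `tritonclient` over HTTP; the server address, model name, and tensor names are illustrative assumptions (a real deployment uses whatever the Triton model repository declares), and switching to `tritonclient.grpc` keeps the same call structure over GRPC:

```python
# Query a running Triton server over its REST API.
import numpy as np
import tritonclient.http as triton_http

client = triton_http.InferenceServerClient(url="localhost:8000")

# Tokenization is done client-side here: a dummy batch of 1 sequence of 8 tokens.
input_ids = np.ones((1, 8), dtype=np.int64)
attention_mask = np.ones((1, 8), dtype=np.int64)

inputs = [
    triton_http.InferInput("input_ids", list(input_ids.shape), "INT64"),
    triton_http.InferInput("attention_mask", list(attention_mask.shape), "INT64"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(attention_mask)
outputs = [triton_http.InferRequestedOutput("logits")]

result = client.infer(model_name="transformer_onnx_model", inputs=inputs, outputs=outputs)
print(result.as_numpy("logits"))
```

Dynamic batching, concurrent model execution, and inference metrics are handled server-side by Triton; with FastAPI those concerns have to be implemented by hand.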