# High-level comparison

## Inference engine
The inference engine performs the computation only; it doesn't manage the communication part (HTTP/GRPC API, etc.).
Summary:

- don't use Pytorch for inference in production
- ONNX Runtime is good enough for most inference jobs (see the export sketch below)
- if you need the best performance, use TensorRT
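To make the first two points concrete, here is a minimal sketch of exporting a Hugging Face classifier from Pytorch to ONNX; the checkpoint name, tensor names and opset below are assumptions, adapt them to your own model.

```python
# Hypothetical export sketch: checkpoint, tensor names and opset are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint
model = AutoModelForSequenceClassification.from_pretrained(checkpoint).eval()
model.config.return_dict = False  # export plain tuples instead of ModelOutput objects
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

encodings = tokenizer("ONNX export example", return_tensors="pt")

torch.onnx.export(
    model,
    (encodings["input_ids"], encodings["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits": {0: "batch"},
    },
    opset_version=13,
)
```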
|  | Nvidia TensorRT | Microsoft ONNX Runtime | Meta Pytorch | comments |
|---|---|---|---|---|
| transformer-deploy support |  |  |  |  |
| Licence | Apache 2, optimization engine is closed source | MIT | Modified BSD |  |
| ease of use (API) | :fontawesome-regular-angry: |  |  | Nvidia has chosen to not hide technical details; a compiled model is specific to a single hardware + model + data shapes association |
| ease of use (documentation) | (spread out, incomplete) | (improving) | (strong community) |  |
| Hardware support | GPU + Jetson | CPU + GPU + IoT + Edge + Mobile | CPU + GPU |  |
| Performance |  |  |  | TensorRT is usually 5 to 10X faster than Pytorch when you use quantization, etc. |
| Accuracy |  |  |  | TensorRT optimizations may be a bit too aggressive and decrease model accuracy; manual modifications are required to recover it |
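And a rough idea of what serving the exported model with ONNX Runtime looks like (tensor names match the export sketch above; the token ids are dummy placeholders):

```python
# Hypothetical inference sketch with ONNX Runtime; falls back to CPU if no GPU build.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Dummy token ids; in practice they come from the tokenizer used at export time.
inputs = {
    "input_ids": np.array([[101, 2023, 2003, 2307, 999, 102]], dtype=np.int64),
    "attention_mask": np.ones((1, 6), dtype=np.int64),
}
logits = session.run(["logits"], inputs)[0]
print(logits)
```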
## Inference HTTP/GRPC server
|  | Nvidia Triton | Meta TorchServe | FastAPI | comments |
|---|---|---|---|---|
| transformer-deploy support |  |  |  |  |
| Licence | Modified BSD | Apache 2 | MIT |  |
| ease of use (API) |  |  |  | as a classic HTTP server, FastAPI may appear easier to use |
| ease of use (documentation) |  |  |  | FastAPI's documentation is one of the most beautiful ever! |
| Performance |  |  |  | FastAPI is 6-10X slower than Triton at managing user queries |
| Support |  |  |  |  |
| CPU |  |  |  |  |
| GPU |  |  |  |  |
| dynamic batching |  |  |  | combine individual inference requests together to improve inference throughput |
| concurrent model execution |  |  |  | run multiple models (or multiple instances of the same model) in parallel |
| pipeline |  |  |  | a pipeline of one or more models and the connection of input and output tensors between those models |
| native multiple backends* support |  |  |  | *backends: Microsoft ONNX Runtime, Nvidia TensorRT, Meta Pytorch |
| REST API |  |  |  |  |
| GRPC API |  |  |  |  |
| Inference metrics |  |  |  | GPU utilization, server throughput, and server latency |
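For a feel of the Triton client side of the rows above, here is a hedged sketch using the `tritonclient` package over HTTP; the model name, tensor names and port are assumptions, and features like dynamic batching or concurrent model execution are enabled server-side in the model's `config.pbtxt`, not in this client code.

```python
# Hypothetical Triton HTTP client call; model/tensor names must match the model repository.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")  # Triton's default HTTP port

input_ids = httpclient.InferInput("input_ids", [1, 6], "INT64")
input_ids.set_data_from_numpy(np.array([[101, 2023, 2003, 2307, 999, 102]], dtype=np.int64))

attention_mask = httpclient.InferInput("attention_mask", [1, 6], "INT64")
attention_mask.set_data_from_numpy(np.ones((1, 6), dtype=np.int64))

# Dynamic batching happens on the server: concurrent client requests like this one
# can be grouped into a single batch before being sent to the backend.
response = client.infer(model_name="my_transformer_onnx", inputs=[input_ids, attention_mask])
print(response.as_numpy("logits"))
```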