# Nvidia GPU INT-8 quantization

## What is it about?
Quantization is one of the most effective and generic approaches to make model inference faster. It replaces the high-precision floating-point numbers in model tensors, encoded on 32 or 16 bits, with lower-precision ones encoded on 8 bits or fewer:

- the model takes less memory
- computation is simpler and faster (see the sketch below)

In theory it can be applied to any model, and, if done well, it should maintain accuracy.
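To make the idea concrete, here is a minimal sketch of symmetric linear quantization, the mapping most INT-8 backends rely on. The function names and the simple max-based calibration are illustrative assumptions, not the method used later in this notebook.

```python
import torch

def quantize_int8(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Map float values to the INT-8 range [-128, 127] (symmetric, zero point = 0)
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)

def dequantize(q: torch.Tensor, scale: float) -> torch.Tensor:
    # Recover an approximation of the original float values
    return q.to(torch.float32) * scale

x = torch.randn(4)
scale = x.abs().max().item() / 127  # naive max-based calibration
q = quantize_int8(x, scale)
print(x)
print(dequantize(q, scale))  # matches x up to quantization error
```

The whole game of quantization is choosing the `scale` (calibration) so that this rounding error stays small on real data.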
The purpose of this notebook is to show a process to perform quantization on any Transformer architecture.
Moreover, the library is designed to offer a simple API while still letting advanced users tweak the algorithm.
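As an illustration of what such a workflow can look like, below is a hedged sketch using Nvidia's `pytorch-quantization` toolkit, a common way to prepare a Transformer for INT-8 inference on GPU. This is an assumption for illustration; the exact API exposed by this library may differ, and the model name is just an example.

```python
# Sketch: preparing a Transformer for INT-8 with Nvidia's pytorch-quantization
# toolkit (an assumption; the notebook's own API may wrap these steps).
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn
from pytorch_quantization.tensor_quant import QuantDescriptor

# Calibrate activations with a histogram method instead of plain min/max.
input_desc = QuantDescriptor(num_bits=8, calib_method="histogram")
quant_nn.QuantLinear.set_default_quant_desc_input(input_desc)

# Monkey-patch torch.nn layers with quantized equivalents *before* the model
# is instantiated, so every Linear gets input/weight quantizers.
quant_modules.initialize()

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("roberta-base")

# Next steps (not shown): run a few hundred calibration batches so each
# quantizer learns its scale, fine-tune for one epoch (quantization-aware
# training), then export to ONNX / TensorRT.
```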
## Benchmark
TL;DR: we benchmarked PyTorch, ONNX Runtime, and Nvidia TensorRT, on both CPU and GPU, with and without quantization. Our method provides the fastest inference by a large margin.
| Framework | Precision | Latency (ms) | Accuracy | Speedup | Hardware |
|---|---|---|---|---|---|
| PyTorch | FP32 | 4267 | 86.6 % | X 0.02 | CPU |
| PyTorch | FP16 | 4428 | 86.6 % | X 0.02 | CPU |
| PyTorch | INT-8 | 3300 | 85.9 % | X 0.02 | CPU |
| PyTorch | FP32 | 77 | 86.6 % | X 1 | GPU |
| PyTorch | FP16 | 56 | 86.6 % | X 1.38 | GPU |
| ONNX Runtime | FP32 | 76 | 86.6 % | X 1.01 | GPU |
| ONNX Runtime | FP16 | 34 | 86.6 % | X 2.26 | GPU |
| ONNX Runtime | FP32 | 4023 | 86.6 % | X 0.02 | CPU |
| ONNX Runtime | FP16 | 3957 | 86.6 % | X 0.02 | CPU |
| ONNX Runtime | INT-8 | 3336 | 86.5 % | X 0.02 | CPU |
| TensorRT | FP16 | 30 | 86.6 % | X 2.57 | GPU |
| TensorRT (our method) | INT-8 | 17 | 86.2 % | X 4.53 | GPU |
**Note:**

- Measurements were done on an Nvidia RTX 3090 GPU plus a 12-core Intel i7 CPU (with AVX-2 instruction support).
- Model: RoBERTa base architecture, batch size 32 / sequence length 256; similar results were obtained for other batch sizes and sequence lengths not included in the table.
- Accuracy was obtained after a single epoch of fine-tuning, with no learning-rate search or any other hyperparameter optimization.

Check the end-to-end demo to see where these numbers come from.