Nvidia GPU INT-8 quantization

What is it about?

Quantization is one of the most effective and generic approaches to making model inference faster. It replaces the high-precision floating-point numbers in model tensors, encoded in 32 or 16 bits, with lower-precision ones encoded in 8 bits or fewer:

it takes less memory
computation is easier / faster

In theory, it can be applied to any model and, if done well, it should preserve most of the model's accuracy.
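
To make the idea concrete, here is a minimal sketch of symmetric per-tensor INT-8 quantization in plain Python. It is only an illustration: the scheme, the function names `quantize_int8` and `dequantize`, and the sample weights are assumptions made for this example, not necessarily what the library or TensorRT does internally.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.42, -1.27, 0.003, 0.88, -0.51]

q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized:", q_weights)
print("max round-trip error:", max_error)
print(f"storage: {4 * len(weights)} bytes in FP32 vs {len(weights)} bytes in INT-8")
```

Each value now fits in one byte instead of four, and the round-trip error stays within half a quantization step (`scale / 2`). Choosing the scale well is what "done well" comes down to in practice.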

The purpose of this notebook is to show a process for quantizing any Transformer architecture.

Moreover, the library is designed to offer a simple API and still let advanced users tweak the algorithm.

Benchmark

TL;DR

We benchmarked Pytorch, ONNX Runtime and Nvidia TensorRT, on both CPU and GPU, with and without quantization. Our method provides the fastest inference by a large margin.

Framework	Precision	Latency (ms)	Accuracy	Speedup	Hardware
Pytorch	FP32	4267	86.6 %	X 0.02	CPU
Pytorch	FP16	4428	86.6 %	X 0.02	CPU
Pytorch	INT-8	3300	85.9 %	X 0.02	CPU
Pytorch	FP32	77	86.6 %	X 1	GPU
Pytorch	FP16	56	86.6 %	X 1.38	GPU
ONNX Runtime	FP32	76	86.6 %	X 1.01	GPU
ONNX Runtime	FP16	34	86.6 %	X 2.26	GPU
ONNX Runtime	FP32	4023	86.6 %	X 0.02	CPU
ONNX Runtime	FP16	3957	86.6 %	X 0.02	CPU
ONNX Runtime	INT-8	3336	86.5 %	X 0.02	CPU
TensorRT	FP16	30	86.6 %	X 2.57	GPU
TensorRT (our method)	INT-8	17	86.2 %	X 4.53	GPU

Note

Measurements were made on an Nvidia RTX 3090 GPU and a 12-core Intel i7 CPU (with AVX-2 instruction support), using a Roberta base architecture with a batch size of 32 and a sequence length of 256. Similar results were obtained for other batch sizes and sequence lengths not included in the table. Accuracy was measured after a single epoch, with no LR search or any other hyperparameter optimization.
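
The Speedup column in the table appears to be the latency of the Pytorch FP32 GPU run (the "X 1" row) divided by the latency of each configuration. That reading is inferred from the numbers rather than stated in the source. The short sketch below reproduces a few rows under that assumption; the `speedup` helper and the dictionary keys are hypothetical names used only here.

```python
# Baseline: Pytorch FP32 on GPU, the row marked "X 1" in the table.
BASELINE_MS = 77

# Latencies (ms) copied from the GPU rows of the table.
latencies_ms = {
    "Pytorch FP16": 56,
    "ONNX Runtime FP16": 34,
    "TensorRT FP16": 30,
    "TensorRT INT-8 (our method)": 17,
}


def speedup(latency_ms, baseline_ms=BASELINE_MS):
    """Speedup relative to the baseline: values above 1 are faster."""
    return baseline_ms / latency_ms


for name, latency in latencies_ms.items():
    print(f"{name}: X {speedup(latency):.2f}")
```

The same ratio also explains the CPU rows: latencies in the thousands of milliseconds give a ratio of about 0.02 against the 77 ms GPU baseline.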

Check the end-to-end demo to see where these numbers come from.