# Post Training Quantization (PTQ) and Quantization Aware Training (QAT)
## What is it?
A PTQ model is basically a fine-tuned model to which we add quantization nodes and which we then calibrate.
Calibration is a key step in the static quantization process. The final accuracy depends on its quality (the inference speed will stay the same).
Moreover, a good PTQ model is a good basis for a successful Quantization Aware Training (QAT).
By calling `with QATCalibrate(...) as qat:`, the library will patch the Transformer model AST (source code) in RAM, basically adding quantization support to the model.
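A minimal sketch of that pattern, based on the library's demo notebook (the import path and the `method`/`percentile` arguments may differ across versions; `model` and `calibration_batches` are placeholders you must provide):

```python
import torch
from transformer_deploy.QDQModels.calibration_utils import QATCalibrate

# `model` is your fine-tuned Transformer, `calibration_batches` a few
# hundred representative input batches (dicts of tensors).
with QATCalibrate(method="histogram", percentile=99.999) as qat:
    model_q = model.cuda()
    qat.setup_model_qat(model_q)  # patch the model AST to insert quantization nodes
    with torch.no_grad():
        for batch in calibration_batches:
            model_q(**batch)  # forward passes feed the calibration histograms
```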
## How to tune a PTQ?
One of the things we try to guess during calibration is which range of tensor values captures most of the information stored in the tensor.
Indeed, an FP32 tensor can store both very large and very small values at the same time; we obviously can't do the same with an 8-bit integer tensor and a scale.
An 8-bit signed integer can only encode 255 values (from -127 to +127), so we need to fix some limits and say: if a value falls outside them, it is clipped to the limit instead of keeping its real value.
For instance, if we say our range is -1000 to +1000 and a tensor contains the value +4000, it will be replaced by the range maximum, +1000.
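To make the mechanism concrete, here is a toy fake-quantization function (a generic illustration of symmetric 8-bit quantization, not the library's internals):

```python
import numpy as np

def fake_quantize(x: np.ndarray, amax: float) -> np.ndarray:
    """Clip to [-amax, +amax], map to integers in [-127, +127], map back."""
    scale = amax / 127.0
    q = np.clip(np.round(x / scale), -127, 127)  # out-of-range values saturate
    return q * scale

print(fake_quantize(np.array([4000.0, 250.0, 3.9]), amax=1000.0))
# -> [1000.0, ~251.97, 0.0]: +4000 is clipped to +1000, and +3.9 collapses
# to 0 because one integer step is worth ~7.87 with such a wide range.
```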
As said before, we will use the histogram method to find the perfect range. We also need to choose a percentile.
Usually, you will choose something very close to 100.
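A toy illustration of why (again generic NumPy, not the library's code): a percentile close to 100 keeps the range tight even when a few extreme outliers are present.

```python
import numpy as np

rng = np.random.default_rng(0)
observed = np.concatenate([rng.standard_normal(100_000), [4000.0]])  # one extreme outlier
amax = np.percentile(np.abs(observed), 99.99)
print(amax)  # ~3.9: the outlier at 4000 no longer dictates the range
```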
**Danger**
If the percentile is too small, we put too many values outside the covered range.
Values outside the range will all be clipped to the range limit, and you lose some granularity in the model weights.
If the percentile is too big, your range will be very large, and because 8-bit signed integers can only encode values between -127 and +127, even with a scale you lose granularity: each integer step then represents a larger span of real values.
Therefore, in our demo, we launched a grid search on the percentile hyperparameter and retrieved 1 accuracy point with very little effort.
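A hypothetical sketch of such a grid search, reusing the calibration pattern from above (`evaluate`, `model`, and `calibration_batches` are placeholders for your own metric function, model, and data):

```python
# continues from the calibration snippet above (same imports and placeholders)
best_percentile, best_accuracy = None, 0.0
for percentile in [99.9, 99.99, 99.999, 99.9999]:
    with QATCalibrate(method="histogram", percentile=percentile) as qat:
        model_q = model.cuda()
        qat.setup_model_qat(model_q)
        with torch.no_grad():
            for batch in calibration_batches:
                model_q(**batch)
    accuracy = evaluate(model_q)  # your own evaluation on a dev set
    if accuracy > best_accuracy:
        best_percentile, best_accuracy = percentile, accuracy
```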
## One step further, the QAT
If that's not enough, the last step is to fine-tune the model a second time. It helps to retrieve a bit of accuracy.
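A hedged sketch of what that second fine-tuning pass can look like (plain PyTorch with a small learning rate for about one epoch; `model_q` and `train_batches` are placeholders):

```python
import torch

# Low learning rate: we only want small adjustments so the weights adapt
# to the quantization noise without drifting away from the PTQ solution.
optimizer = torch.optim.AdamW(model_q.parameters(), lr=1e-5)
model_q.train()
for batch in train_batches:
    loss = model_q(**batch).loss  # assumes a Hugging Face-style model returning a loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```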