This chapter introduces the quantization design of TPU-MLIR, focusing on the application of the paper in practical quantization.
Basic Concepts
INT8 quantization is divided into symmetric and asymmetric quantization. Symmetric quantization is a special case of asymmetric quantization. Symmetric quantization usually runs faster than asymmetric quantization, while its accuracy is usually lower.
Asymmetric Quantization
As shown in the figure (Asymmetric quantization), asymmetric quantization maps values in the range [min, max] to fixed-point values in the interval [-128, 127] or [0, 255].
The formula for converting from INT8 back to float (dequantization) is:
\[\begin{split}r &= S(q-Z) \\
S &= \frac{max-min}{qmax-qmin} \\
Z &= Round(- \frac{min}{S} + qmin)\end{split}\]
where r is the real value of type float and q is the quantized value of type INT8 or UINT8.
S denotes the scale, which is a float; Z is the zero point, which is of type INT8.
When quantizing to INT8, qmax=127 and qmin=-128; for UINT8, qmax=255 and qmin=0.
The quantization formula from float to INT8 is:
\[\begin{split}q = Round(\frac{r}{S} + Z)\end{split}\]
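The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not TPU-MLIR code: the function names are hypothetical, and the rounding mode (Python's `round`, which rounds ties to even) and the saturation to [qmin, qmax] are assumptions added here.

```python
def quant_params(rmin, rmax, qmin=-128, qmax=127):
    """Return (S, Z) for mapping the real range [rmin, rmax] onto [qmin, qmax]."""
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = round(-rmin / scale + qmin)
    return scale, zero_point


def quantize(r, scale, zero_point, qmin=-128, qmax=127):
    """float -> integer: q = Round(r / S + Z), saturated to [qmin, qmax] (assumption)."""
    q = round(r / scale + zero_point)
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """integer -> float: r = S * (q - Z)."""
    return scale * (q - zero_point)


# Illustrative range [-1.0, 3.0] quantized to INT8.
S, Z = quant_params(-1.0, 3.0)
q = quantize(1.0, S, Z)
r = dequantize(q, S, Z)   # close to 1.0, within half a quantization step
```

The round trip loses at most S/2 for values inside [min, max]; values outside the range are saturated.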
Symmetric Quantization
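Symmetric quantization is a special case of asymmetric quantization when Z=0, so the dequantization formula reduces to \(r = S \times q\). In fixed-point computation, the float Scale is approximated by an integer Multiplier and a right shift (rshift). For example, for a scale of 0.1234: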
\[\begin{split}&y = x \times 0.1234 \\
&=> y = x \times 0.9872 \times 2^{-3} \\
&=> y = x \times (0.9872 \times 2^{31}) \times 2^{-34} \\
&=> y = x \times \frac{2119995857}{1 \ll 34} \\
&=> y = (x \times 2119995857) \gg 34\end{split}\]
The more bits the Multiplier supports, the closer the result is to Scale, but the worse the performance. Chips therefore generally use a 32-bit or 8-bit Multiplier.
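The decomposition in the worked example can be sketched as follows. The function names are hypothetical, and the sketch assumes a positive scale small enough that the resulting shift is non-negative, with the Multiplier stored as a signed integer (so a 32-bit Multiplier has 31 magnitude bits).

```python
import math


def scale_to_multiplier(scale, bits=32):
    """Approximate a positive float scale as multiplier * 2**(-rshift)."""
    # frexp gives scale = mantissa * 2**exponent with 0.5 <= mantissa < 1.
    mantissa, exponent = math.frexp(scale)
    multiplier = round(mantissa * (1 << (bits - 1)))
    rshift = (bits - 1) - exponent
    if multiplier == 1 << (bits - 1):
        # The mantissa rounded up to 1.0; renormalize so the multiplier still fits.
        multiplier >>= 1
        rshift -= 1
    return multiplier, rshift


def apply_scale(x, multiplier, rshift):
    """Compute x * scale with integers only, using a plain right shift as in the example."""
    return (x * multiplier) >> rshift


m32, r32 = scale_to_multiplier(0.1234)          # 32-bit Multiplier
m8, r8 = scale_to_multiplier(0.1234, bits=8)    # 8-bit Multiplier
y = apply_scale(1000, m32, r32)                 # integer approximation of 1000 * 0.1234
```

For a scale of 0.1234 and a 32-bit Multiplier this reproduces the Multiplier 2119995857 and the shift of 34 from the example above. The 8-bit variant yields 126 with a shift of 10, that is 126/1024 ≈ 0.12305, which shows the loss of precision with a narrower Multiplier.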
Quantization Derivation
We can use the quantization formulas to derive the corresponding INT8 computation for each OP.
Activations use either symmetric or asymmetric quantization, while weights generally use only symmetric quantization.
Convolution
Convolution can be abbreviated as: \(Y = X_{(n,ic,ih,iw)}\times W_{(oc,ic,kh,kw)} + B_{(1,oc,1,1)}\).
Substituting it into the INT8 quantization formula, the derivation is as follows:
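One way to carry out this substitution is sketched here, assuming symmetric weights (\(Z_w = 0\)). The symbols \(S_x, S_w, S_y\), \(Z_x, Z_y\), \(q_x, q_w, q_y\) and \(b_{i32}\) are names chosen for illustration:

\[\begin{split}S_y(q_y - Z_y) &= S_x(q_x - Z_x) \times S_w q_w + B \\
q_y - Z_y &= \frac{S_x S_w}{S_y} \left( (q_x - Z_x) \times q_w + \frac{B}{S_x S_w} \right) \\
q_y &\approx \frac{S_x S_w}{S_y} \left( q_x \times q_w + b_{i32} \right) + Z_y, \quad b_{i32} = Round\left(\frac{B}{S_x S_w}\right) - Z_x \times q_w \\
q_y &\approx \left( \left( (q_x \times q_w + b_{i32}) \times Multiplier \right) \gg rshift \right) + Z_y\end{split}\]

Here Multiplier and rshift approximate \(\frac{S_x S_w}{S_y}\) in the same way as in the scale example above. The term \(Z_x \times q_w\) depends only on the weights and the input zero point, so it can be folded into the integer bias ahead of time.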
In particular, for asymmetric quantization, Pad is filled with Zx.
In the symmetric case, Pad is filled with 0 (both Zx and Zy are 0).
In PerAxis (or PerChannel) quantization, each output channel (OC) of the Filter is quantized separately. The derivation is unchanged, but there are OC Multiplier and rshift values, one pair per output channel.
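The integer computation for a single output element of one output channel can be sketched as below. This is a simplified illustration with hypothetical helper names and made-up values: it uses flattened inputs, asymmetric activations, symmetric weights, and a plain right shift. With PerAxis quantization the same function would be called with a different (multiplier, rshift) pair for each output channel.

```python
import math


def to_multiplier(scale, bits=32):
    """Hypothetical helper: scale ~= multiplier * 2**(-rshift)."""
    mantissa, exponent = math.frexp(scale)
    return round(mantissa * (1 << (bits - 1))), (bits - 1) - exponent


def quantize(values, scale, zero_point=0, qmin=-128, qmax=127):
    """q = Round(r / S + Z), saturated to [qmin, qmax]."""
    return [max(qmin, min(qmax, round(v / scale + zero_point))) for v in values]


def conv_point_int8(qx, zx, qw, bias_i32, multiplier, rshift, zy):
    # Integer accumulation; a padded input equal to zx contributes 0 here.
    acc = sum((x - zx) * w for x, w in zip(qx, qw)) + bias_i32
    q = ((acc * multiplier) >> rshift) + zy
    return max(-128, min(127, q))


# Illustrative float data: one receptive field, one filter, one bias.
x = [0.5, -1.0, 2.2, 1.5]
w = [0.2, -0.1, 0.05, 0.3]
b = 0.1
y_float = sum(xi * wi for xi, wi in zip(x, w)) + b

# Assumed quantization parameters (not taken from any real calibration).
sx, zx = 4.0 / 255, -64      # activations in [-1, 3]
sw = 0.3 / 127               # symmetric weights, threshold 0.3
sy, zy = 2.0 / 255, -128     # output in [0, 2]

qx = quantize(x, sx, zx)
qw = quantize(w, sw, qmin=-127)
bias_i32 = round(b / (sx * sw))
multiplier, rshift = to_multiplier(sx * sw / sy)

qy = conv_point_int8(qx, zx, qw, bias_i32, multiplier, rshift, zy)
y_int8 = sy * (qy - zy)      # dequantized result, close to y_float
```

Subtracting `zx` inside the accumulation is equivalent to folding the zero-point term into the bias; it is written out here only to keep the sketch short.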
InnerProduct
The expression and derivation are the same as for Convolution.
Add
The expression for addition is: \(Y = A + B\)
Substituting it into the INT8 quantization formula, the derivation is as follows:
In the symmetric case, both Zx and Zy are 0, so the padded value is round(value/Sy). In the asymmetric case, the padded value is round(value/Sy + Zy).
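The quantized form of \(Y = A + B\) and the padded-value rule can be sketched as below. The function names and the parameter values are illustrative assumptions. The rescaling is written in float for readability, whereas a fixed-point implementation would replace each scale ratio with a Multiplier and rshift.

```python
def add_int8(qa, sa, za, qb, sb, zb, sy, zy):
    """Quantized Y = A + B: dequantize both inputs, add, requantize to the output scale."""
    r = sa * (qa - za) + sb * (qb - zb)      # real value of A + B
    q = round(r / sy + zy)
    return max(-128, min(127, q))


def pad_value_int8(value, sy, zy=0):
    """Quantized pad constant: round(value/Sy + Zy); zy=0 gives the symmetric case."""
    return round(value / sy + zy)


# Illustrative parameters: A = 0.01 * (50 - 0) = 0.5, B = 0.02 * (35 - 10) = 0.5.
qy = add_int8(qa=50, sa=0.01, za=0, qb=35, sb=0.02, zb=10, sy=0.03, zy=-5)
y = 0.03 * (qy - (-5))                       # dequantized sum, close to 1.0
pad = pad_value_int8(0.3, 0.03, -5)
```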
PReLU
The expression of PReLU can be abbreviated as: \(Y_i = \begin{cases} X_i, if \ X_i \geq 0\\ \alpha_i X_i, if \ X_i < 0 \end{cases}\)
Substituting it into the INT8 quantization formula, the derivation is as follows:
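A minimal sketch of the quantized PReLU follows, under the assumption that the slope \(\alpha_i\) is kept as a float and the rescaling is done in float. The function name and values are illustrative. Since \(X_i = S_x(q_x - Z_x)\) with \(S_x > 0\), the sign of the real input is the sign of \(q_x - Z_x\), so the branch can be chosen on the integer value.

```python
def prelu_int8(qx, sx, zx, alpha, sy, zy):
    """Quantized PReLU for one element of channel i with slope alpha."""
    x_centered = qx - zx                     # same sign as the real input X_i
    slope = 1.0 if x_centered >= 0 else alpha
    q = round(sx * slope * x_centered / sy + zy)
    return max(-128, min(127, q))


# Symmetric example with equal input and output scales (illustrative values).
pos = prelu_int8(qx=40, sx=0.05, zx=0, alpha=0.25, sy=0.05, zy=0)    # X = 2.0
neg = prelu_int8(qx=-40, sx=0.05, zx=0, alpha=0.25, sy=0.05, zy=0)   # X = -2.0
```

In a fixed-point implementation, the factors \(S_x / S_y\) and \(\alpha_i S_x / S_y\) would each be replaced by a Multiplier and rshift pair.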