# Introduction to sensitive layer search of TPU-MLIR

## Background

The TPU-MLIR compiler converts machine learning models into bmodels that run on Sophgo chips. Floating-point computation consumes more computing resources and storage space than fixed-point computation, so quantized models (also known as fixed-point models) are often used in practice. Compared with the floating-point model, however, a quantized model loses some inference accuracy. When the accuracy loss is large, it is necessary to find the layers that affect accuracy the most, known as sensitive layers, convert them back to floating-point, and generate a mixed-precision model for inference.

Using the mobilenet-v2 network as an example, 50,000 images from the ILSVRC-2012 validation set were used to measure the accuracy of the floating-point model and the quantized model (denoted FLOAT and INT8 in the table below). The Top1 accuracy of the INT8 model drops by 3.2% and the Top5 accuracy by 2%.

| Type  | Top1 (%) | Top5 (%) |
| ----- | -------- | -------- |
| FLOAT | 70.72    | 89.81    |
| INT8  | 67.53    | 87.84    |

## Sensitive layer search

The sensitive layer search function measures the model's output loss after each layer, in turn, is converted from floating-point to fixed-point. Because the quantization threshold also affects the accuracy of the fixed-point model, three threshold methods are tried during the search: KL, MAX and Percentile. The KL method first builds a 2048-bin histogram of the absolute values of the FLOAT tensor to obtain a reference probability distribution P. It then simulates the INT8 representation of this histogram with 128 bins to obtain a quantized distribution Q, and computes the KL divergence between P and Q at different truncation positions; the truncation position with the minimum divergence gives the KL threshold. The MAX method uses the maximum absolute value of the floating-point tensor as the quantization threshold. The Percentile method takes the value at a specified percentile of the sorted absolute values as the threshold.
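The KL method described above can be sketched as follows. This is a generic implementation of the standard KL-divergence calibration scheme (2048 reference bins merged into 128 quantized bins), not the actual TPU-MLIR source; the bin-merging and smoothing details are assumptions.

```python
import numpy as np

def kl_threshold(tensor, num_bins=2048, num_quant_bins=128):
    """Pick the clipping threshold that minimizes KL(P || Q)."""
    abs_vals = np.abs(np.asarray(tensor, dtype=np.float64).ravel())
    max_val = abs_vals.max()
    hist, edges = np.histogram(abs_vals, bins=num_bins, range=(0.0, max_val))
    best_div, best_i = np.inf, num_quant_bins
    for i in range(num_quant_bins, num_bins + 1):
        # Reference distribution P: bins [0, i), outliers folded into the last bin.
        p = hist[:i].astype(np.float64)
        p[-1] += hist[i:].sum()
        # Quantized distribution Q: merge the i bins into 128 coarse bins, then
        # expand back, spreading each coarse count over its non-empty fine bins.
        q = np.zeros(i)
        start = 0
        for chunk in np.array_split(hist[:i].astype(np.float64), num_quant_bins):
            nz = chunk > 0
            if nz.any():
                q[start:start + len(chunk)][nz] = chunk.sum() / nz.sum()
            start += len(chunk)
        # Normalize both, smooth Q to avoid log(0), and compute KL(P || Q).
        p /= p.sum()
        q = np.where(q > 0, q, 1e-12)
        q /= q.sum()
        mask = p > 0
        div = np.sum(p[mask] * np.log(p[mask] / q[mask]))
        if div < best_div:
            best_div, best_i = div, i
    # The histogram edge at the best truncation position is the KL threshold.
    return float(edges[best_i])
```

For a roughly Gaussian tensor, the selected threshold typically lies well below the raw maximum, which is exactly why KL often beats MAX on long-tailed distributions.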

## Algorithm workflow

The workflow of the sensitive layer search algorithm is as follows.

*Figure: Workflow of sensitive layer search*

Step 1: Run inference on both the FLOAT and INT8 models with a small number of pictures (for example 30) and compute the average cosine similarity of their outputs. If the average cosine reaches the expected value (for example 0.99), the INT8 model is considered close enough to the FLOAT model and no sensitive layer search is needed.
Step 2: Generate three calibration tables with the KL, MAX and Percentile methods.
Step 3: Iterate over each op with each threshold: 1) set the op to INT8 and generate a mixed model; 2) compute the output cosine loss (1 minus cosine similarity); 3) update the op's best threshold to the one that gives the lowest loss; 4) set the op back to FLOAT.
Step 4: Rank all op losses from largest to smallest, output the loss information, and select the first five ops to generate a qtable.

During the search, detailed loss information (op name, type and loss) for each op under each quantization method is recorded in a log file. Users can inspect this log and adjust the qtable manually.
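Steps 3 and 4 above can be condensed into a small sketch. The `output_loss` callable stands in for the real compiler machinery (building a mixed model with one op in INT8 and comparing outputs against the FLOAT model); it and all names here are hypothetical, not the TPU-MLIR API.

```python
def sensitive_layer_search(ops, methods, output_loss, top_k=5):
    """Rank ops by the output cosine loss caused by quantizing each one alone.

    output_loss(op, method) is assumed to set only `op` to INT8, using the
    threshold from `method`'s calibration table, rebuild the mixed model,
    and return 1 - cosine_similarity against the FLOAT model's output.
    """
    records = []
    for op in ops:                    # Step 3: try every op ...
        # ... under every threshold method, keeping the lowest-loss threshold.
        best_method = min(methods, key=lambda m: output_loss(op, m))
        records.append((op, output_loss(op, best_method), best_method))
    # Step 4: rank losses from largest to smallest; top ops go into the qtable.
    records.sort(key=lambda r: r[1], reverse=True)
    return records[:top_k], records
```

The full `records` list corresponds to the log file mentioned above, while the `top_k` slice corresponds to the generated qtable.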

## Usage

Sensitive layer search takes as input the mlir file produced by the model_transform step, the calibration table produced by the run_calibration step, and a dataset for inference. The following command runs the search.

```shell
run_sensitive_layer.py mobilenet.mlir \
--dataset ../ILSVRC2012 \
--input_num 100 \
--inference_num 30 \
--max_float_layers 5 \
--expected_cos 0.99 \
--post_process postprocess.py \
--calibration_table mobilenet_cali_table \
--processor bm1684 \
-o mobilenet_qtable
```

The meanings of each parameter are shown in the following table.

| Parameter | Meaning |
| --- | --- |
| dataset | Dataset for calibration and inference; bad cases are recommended |
| input_num | Number of pictures for calibration |
| inference_num | Number of pictures for inference |
| max_float_layers | Number of ops in the qtable |
| expected_cos | Cosine similarity threshold between the INT8 and FLOAT models |
| post_process | Path of a user-defined post-process file; the post-process function must be named PostProcess |
| calibration_table | Calibration table generated in the run_calibration step |
| processor | Chip type |
| -o | Output qtable name |

After sensitive layer search, the following files are generated.

  1. A qtable file used to generate the mixed model. Each line records one op and its type, such as “input3.1 F32”.
  2. A new calibration table named new_cali_table, in which the threshold of each op is updated to the best of the three candidate thresholds.
  3. A log file named SensitveLayerSearch, which records the loss of each op under each threshold during the search.

Note that the qtable and new calibration table must be used in the model_deploy step to generate the mixed model.

## Accuracy test result

Taking the mobilenet-v2 network above as an example, 100 pictures from the ILSVRC2012 dataset were used for calibration and 30 for inference. The sensitive layer search took 402 seconds and used about 800 MB of memory. The output is as follows.

```
the layer input3.1 is 0 sensitive layer, loss is 0.008808857469573828, type is top.Conv
the layer input11.1 is 1 sensitive layer, loss is 0.0016958347875666302, type is top.Conv
the layer input128.1 is 2 sensitive layer, loss is 0.0015641432811860367, type is top.Conv
the layer input130.1 is 3 sensitive layer, loss is 0.0014325751094084183, type is top.Scale
the layer input127.1 is 4 sensitive layer, loss is 0.0011817314259702227, type is top.Add
the layer input13.1 is 5 sensitive layer, loss is 0.001018420214596527, type is top.Scale
the layer 787 is 6 sensitive layer, loss is 0.0008603856180608993, type is top.Scale
the layer input2.1 is 7 sensitive layer, loss is 0.0007558935451825732, type is top.Scale
the layer input119.1 is 8 sensitive layer, loss is 0.000727441637624282, type is top.Add
the layer input0.1 is 9 sensitive layer, loss is 0.0007138056757098887, type is top.Conv
the layer input110.1 is 10 sensitive layer, loss is 0.000662179506136229, type is top.Conv
……
run result:
int8 outputs_cos:0.978847 old
mix model outputs_cos:0.989741
Output mix quantization table to mobilenet_qtable
total time:402.15848112106323
```

The loss of input3.1 is the largest, at least five times larger than that of any other op. We therefore add only input3.1 to the qtable, keep all other layers in INT8, generate the mixed-precision model, and run inference on the ILSVRC2012 validation set. The accuracy of the mixed model is as follows.

| Type         | Top1 (%) | Top5 (%) |
| ------------ | -------- | -------- |
| FLOAT        | 70.72    | 89.81    |
| INT8         | 67.53    | 87.84    |
| MIX(oricali) | 68.19    | 88.33    |
| MIX(newcali) | 69.07    | 88.73    |

MIX(oricali) is the mixed model generated with the original calibration table; MIX(newcali) is generated with the new calibration table. Tuning the thresholds with the three quantization methods clearly has a positive impact on accuracy. Compared with the INT8 model, the Top1 and Top5 accuracy of the mixed model increase by about 1.5% and 1% respectively.

## Comparison with mix_precision search

Mix_precision search is another layer search method in TPU-MLIR for generating a qtable. It first finds a layer whose layer cosine does not meet the requirement, turns that layer and the next one from INT8 to FLOAT, generates the mixed-precision model, and computes the cosine similarity between the outputs of the mixed model and the FLOAT model. The search stops as soon as the output cosine similarity reaches the expected value. Note that mix_precision search sets an op back from FLOAT to INT8 only if the op performs poorly (i.e., it yields a lower output cosine than the INT8 model); otherwise the op keeps its type. Because of this, mix_precision search does not need to run inference from the network input for every op, so it is very fast. The two methods are compared below.
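The greedy loop described above might be sketched like this. The callables `layer_cos` and `output_cos`, and the per-layer cosine limit of 0.99, are assumed placeholders for illustration, not the actual TPU-MLIR implementation.

```python
def mix_precision_search(ops, layer_cos, output_cos,
                         expected_cos=0.99, layer_cos_limit=0.99):
    """Greedy mix-precision search (sketch, assumed interfaces).

    layer_cos(op): cosine of this layer's INT8 vs FLOAT activation.
    output_cos(float_ops): network output cosine when `float_ops` run in FLOAT.
    """
    float_ops = []
    best = output_cos(float_ops)          # all-INT8 baseline
    for i, op in enumerate(ops):
        if layer_cos(op) >= layer_cos_limit:
            continue                      # this layer already meets the per-layer requirement
        # Promote the failing layer and its successor to FLOAT.
        trial = float_ops + [op] + ([ops[i + 1]] if i + 1 < len(ops) else [])
        cos = output_cos(trial)
        if cos > best:                    # keep the promotion only if it helps
            float_ops, best = trial, cos
        if best >= expected_cos:          # stop early once the target is reached
            break
    return float_ops, best
```

The early `break` is what makes this method fast, and also what can make it miss sensitive layers near the output, as discussed below.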

| Comparison | Sensitive layer search | Mix_precision search |
| --- | --- | --- |
| Algorithm | Iterates all ops to find sensitive layers | Considers only ops with a low layer cosine; stops when the output cosine meets the requirement |
| Considers multiple thresholds | Supported | KL threshold only |
| Modifies cali_table | Supported | Uses the original cali_table |
| Considers two adjacent layers | One layer only | Supported |
| Considers layer cosine | Not supported | Supported |
| Considers output cosine | Supported | Supported |
| Qtable needs manual modification | Yes | No; using the generated qtable as-is is recommended |
| Supports user-defined post process | Supported | Not supported |
| Iterates all ops | Yes | Stops when the output cosine meets the expected value |
| Op type conversion | FLOAT to INT8, computes loss | INT8 to FLOAT, computes cosine |

Mix_precision search starts from the layer after the input and keeps adding layers that improve the network output similarity to the qtable, terminating as soon as the output cosine reaches the expected value. It may therefore miss sensitive layers near the output of the network. Sensitive layer search, by contrast, traverses all ops and cannot make this omission, which is one of its advantages. In practice, users can try mix_precision search first and, if the expected output cosine is not achieved, fall back to sensitive layer search for a global traversal.

## Summary

Sensitive layer search finds the layers that most affect the accuracy of the quantized model. It iterates over all ops and three quantization methods, selects the best threshold for each op, and records all loss information. The searched layers are set to FLOAT while the remaining layers stay INT8, and a mixed model is then generated to improve accuracy. The function currently performs well on the mobilenet-v2 network: setting only the layer with the greatest loss to FLOAT already yields a 1.5% accuracy improvement. Compared with mix_precision search, sensitive layer search takes longer, but it considers all ops and three quantization methods and will not miss sensitive layers near the output of the network. Future improvements can focus on three aspects: 1) following mix_precision search, consider the influence of each layer together with its adjacent layer on the network output; 2) automatically select the ops that bring the network output similarity up to the expected value, instead of relying on a user-defined number of layers; 3) parallelize the traversal of all ops to shorten the search time.