Overall Design

Layered

TPU-MLIR divides the compilation of a network model into two layers.

Top Dialect

The chip-independent layer, covering graph optimization, quantization, inference, and so on.

Tpu Dialect

The chip-dependent layer, covering weight reordering, operator slicing, address assignment, inference, and so on.

The overall flow is shown in the (TPU-MLIR overall process) diagram: the model is converted step by step into the final instructions by a sequence of passes. This section describes what each pass does in the Top layer and the Tpu layer; the following chapters explain the key points of each pass in detail.

Top Pass

Canonicalize

Graph optimizations tied to specific OPs, such as merging a ReLU into the preceding Conv, shape merging, and so on.
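
As an illustration of this kind of rewrite, the sketch below fuses a ReLU that directly follows a Conv into the Conv itself. The list-of-dicts graph representation and the do_relu attribute are hypothetical stand-ins, not the actual TPU-MLIR data structures:

    def fuse_relu_into_conv(ops):
        """ops: topologically ordered list of {"name", "type", "inputs", "attrs"}."""
        consumers = {}
        for op in ops:
            for inp in op["inputs"]:
                consumers.setdefault(inp, []).append(op)
        fused = []
        for op in ops:
            if op["type"] == "Relu":
                prod = next((p for p in fused if p["name"] == op["inputs"][0]), None)
                # Fuse only when the Conv's sole consumer is this ReLU.
                if (prod is not None and prod["type"] == "Conv"
                        and len(consumers[prod["name"]]) == 1):
                    prod["attrs"]["do_relu"] = True
                    prod["name"] = op["name"]  # Conv now produces the ReLU's result
                    continue
            fused.append(op)
        return fused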

Calibration

Inserts min and max values for each OP according to the calibration table, for use in subsequent quantization; for symmetric quantization, a threshold is inserted as well.
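
The sketch below shows how such a table entry could be derived, assuming the calibration activations for one OP are available as numpy arrays; the real tool derives thresholds from richer statistics (e.g. histograms), so this is illustrative only:

    import numpy as np

    def calibrate(activations):
        """activations: arrays observed for one OP's output across samples."""
        lo = min(float(a.min()) for a in activations)
        hi = max(float(a.max()) for a in activations)
        # For symmetric quantization, a single threshold covers both signs.
        threshold = max(abs(lo), abs(hi))
        return {"min": lo, "max": hi, "threshold": threshold}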

Lowering

Lowers OPs to the Tpu layer according to the quantization type. Supported types are F32, F16, BF16, INT8 symmetric, and INT8 asymmetric.
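
For the INT8 symmetric case, lowering boils down to mapping floats onto int8 with a scale derived from the calibration threshold. The sketch below assumes the common convention scale = threshold / 127; the asymmetric case would derive a scale and zero point from (min, max) instead:

    import numpy as np

    def quantize_int8_symmetric(x, threshold):
        scale = threshold / 127.0
        q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale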

Tpu Pass

Canonicalize

Graph optimizations tied to specific OPs, such as merging consecutive Requant OPs.
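
Modeling a requant as output = input * scale + zero_point (a simplification of the multiplier-and-shift form used on real hardware), two consecutive Requants collapse into one, as this sketch shows:

    def merge_requants(r1, r2):
        """r2 consumes r1's output; both are {"scale", "zero_point"} dicts."""
        # (x * s1 + z1) * s2 + z2 == x * (s1 * s2) + (z1 * s2 + z2)
        return {
            "scale": r1["scale"] * r2["scale"],
            "zero_point": r1["zero_point"] * r2["scale"] + r2["zero_point"],
        }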

WeightReorder

Reorders the weights of individual OPs according to chip characteristics, e.g. the filter and bias of a convolution.
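
As a toy example, the sketch below reorders a filter from the framework-native (O, I, H, W) layout into a hypothetical (O, H*W, I) layout; the layout an actual chip expects differs per chip and per OP:

    import numpy as np

    def reorder_filter_oihw(filt):
        """filt: convolution filter with shape (O, I, H, W)."""
        o, i, h, w = filt.shape
        return np.ascontiguousarray(
            filt.transpose(0, 2, 3, 1).reshape(o, h * w, i))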

Subnet

Splits the network into subnets according to whether OPs run on the TPU or the CPU; if all operators run on the TPU, there is only one subnet.
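
A minimal sketch of the idea, assuming each OP carries a hypothetical "runtime" tag: walk the topologically ordered OP list and start a new subnet whenever the runtime changes:

    def split_subnets(ops):
        """ops: ordered list of dicts with a "runtime" key ("TPU" or "CPU")."""
        subnets = []
        for op in ops:
            if subnets and subnets[-1]["runtime"] == op["runtime"]:
                subnets[-1]["ops"].append(op)
            else:
                subnets.append({"runtime": op["runtime"], "ops": [op]})
        return subnets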

LayerGroup

Slices the network so that as many OPs as possible are computed back-to-back in local memory.
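
The sketch below captures only the core constraint: slice a group's tensors along H until each slice's working set fits the local memory budget. The budget and the per-row cost model are illustrative, not any chip's real numbers:

    LOCAL_MEM_BYTES = 256 * 1024  # hypothetical local memory budget

    def h_slices_needed(bytes_per_row, h):
        """Smallest number of H-slices whose working set fits in local mem."""
        for n in range(1, h + 1):
            rows_per_slice = -(-h // n)  # ceil(h / n)
            if bytes_per_row * rows_per_slice <= LOCAL_MEM_BYTES:
                return n
        return h  # even a single row exceeds the budget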

MemAssign

Assigns global memory addresses to the OPs that need them.
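
In its simplest form this is a bump allocator over aligned addresses, as sketched below; a real assignment pass additionally reuses the addresses of tensors whose lifetimes have ended:

    def assign_addresses(tensor_sizes, alignment=64):
        """tensor_sizes: dict name -> bytes; returns dict name -> address."""
        addr, table = 0, {}
        for name, size in tensor_sizes.items():
            addr = (addr + alignment - 1) // alignment * alignment
            table[name] = addr
            addr += size
        return table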

CodeGen

Uses the Builder module to generate the final model in FlatBuffers format.
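
The sketch below uses the low-level Builder API of the flatbuffers Python package to serialize a toy two-field table (a name string and a version number); real code generation works against tables generated from TPU-MLIR's own schema:

    import flatbuffers

    builder = flatbuffers.Builder(1024)
    name = builder.CreateString("my_model")

    builder.StartObject(2)                           # table with 2 slots
    builder.PrependUOffsetTRelativeSlot(0, name, 0)  # slot 0: name
    builder.PrependInt32Slot(1, 1, 0)                # slot 1: version = 1
    model = builder.EndObject()

    builder.Finish(model)
    buf = builder.Output()  # serialized bytes, ready to write to disk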