4. Overall Design
4.1. Layered
TPU-MLIR divides the compilation of a network model into two layers.
- Top Dialect
The hardware-independent layer, including graph optimization, quantization, inference, etc.
- Tpu Dialect
The hardware-related layer, including weight reordering, operator slicing, address assignment, inference, etc.
The overall flow is shown in the figure (TPU-MLIR overall process), where the model is gradually converted into the final instructions by a sequence of Passes. The rest of this chapter describes what each Pass does in the Top layer and the Tpu layer; the following chapters explain the key points of each Pass in detail.

Fig. 4.1 TPU-MLIR overall process
4.2. Top Pass
- shape-infer
Performs shape inference and constant folding.
- canonicalize
Graph optimizations tied to specific OPs, such as merging ReLU into Conv, Shape merging, etc. (a sketch of the ReLU-into-Conv idea follows this list).
- extra-optimize
Applies extra patterns, such as computing FLOPs, removing unused outputs, etc.
- processor-assign
Assigns the target processor, such as BM1684X, CV183X, etc., and adjusts the top MLIR for that processor; for example, all CV18XX input types are set to F32.
- import-calibration-table
Imports the calibration table and assigns min and max values to every OP for later quantization.
- processor-top-optimize
Performs processor-specific optimization of top OPs.
- convert-top-to-tpu
Lowers top OPs to tpu OPs. For F32/F16/BF16 modes, a top OP is normally converted to the corresponding tpu OP directly; for INT8, quantization is required (see the quantization sketch after this list).
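
To picture the "merge ReLU into Conv" canonicalization mentioned above, here is a minimal sketch of the pattern-rewrite idea. The `Op` class and the `merge_relu_into_conv` helper are invented for this example and are not TPU-MLIR's IR classes; the sketch only shows the matching condition (a Relu whose producer is a Conv with no other user) and the rewrite (set the Conv's `do_relu` attribute and drop the Relu).

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str
    kind: str                              # e.g. "Conv", "Relu", "Add"
    inputs: list = field(default_factory=list)
    do_relu: bool = False

def merge_relu_into_conv(ops):
    """Fold each Relu that directly follows a Conv into that Conv's do_relu flag."""
    # map each op (by id) to the list of ops that consume its result
    users = {}
    for op in ops:
        for inp in op.inputs:
            users.setdefault(id(inp), []).append(op)

    fused = {}                             # id(removed Relu) -> its Conv producer
    kept = []
    for op in ops:
        if (op.kind == "Relu" and len(op.inputs) == 1
                and op.inputs[0].kind == "Conv"
                and users.get(id(op.inputs[0])) == [op]):
            op.inputs[0].do_relu = True    # absorb the activation into the Conv
            fused[id(op)] = op.inputs[0]
            continue
        kept.append(op)

    # rewire consumers of a removed Relu to read from the fused Conv instead
    for op in kept:
        op.inputs = [fused.get(id(inp), inp) for inp in op.inputs]
    return kept

x = Op("x", "Input")
conv = Op("conv1", "Conv", [x])
relu = Op("relu1", "Relu", [conv])
out = Op("out", "Add", [relu, x])
graph = merge_relu_into_conv([x, conv, relu, out])  # relu removed, conv.do_relu == True
```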
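
For the INT8 path, the min/max values imported from the calibration table are turned into quantization parameters when top OPs are lowered. The sketch below shows one common symmetric per-tensor scheme (threshold = max(|min|, |max|), signed 8-bit range); the function names are hypothetical and this is an illustration of the idea, not TPU-MLIR's exact quantization code.

```python
# Minimal sketch (not TPU-MLIR source): derive INT8 parameters from a calibration range.

def calib_to_scale(cmin: float, cmax: float, qmax: int = 127) -> float:
    """Per-tensor symmetric scale from a calibration [min, max] range."""
    threshold = max(abs(cmin), abs(cmax))
    return threshold / qmax            # real value ~= int8 code * scale

def quantize(x: float, scale: float, qmin: int = -128, qmax: int = 127) -> int:
    """Quantize one real value to an INT8 code with the given scale."""
    q = round(x / scale)
    return max(qmin, min(qmax, q))

def dequantize(q: int, scale: float) -> float:
    """Recover an approximate real value from its INT8 code."""
    return q * scale

scale = calib_to_scale(-0.42, 1.73)    # hypothetical values from one calibration-table entry
q = quantize(0.5, scale)
print(scale, q, dequantize(q, scale))
```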
4.3. Tpu Pass
- canonicalize
Graph optimizations tied to specific OPs, such as merging consecutive Requant OPs, etc.
- strip-io-quant
If true, input and output types are quantized; otherwise they are F32.
- processor-tpu-optimize
Performs processor-specific optimization of tpu OPs.
- weight-reorder
Reorders the weights of individual OPs according to processor characteristics, such as the filter and bias of convolution (an illustrative reorder sketch follows this list).
- subnet-divide
Divides the network into subnets according to the processor type. If the Tensor Computing Processor can compute all operators, the whole network forms a single subnet.
- op-reorder
Reorders OPs so that each OP stays close to its users.
- layer-group
Slices the network so that as many OPs as possible are computed consecutively in local memory.
- address-assign
Assigns addresses to the OPs that need global memory (a first-fit allocation sketch follows this list).
- codegen
Uses the Builder module to generate the final model in FlatBuffers format.
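
As an illustration of what weight reordering can involve, the sketch below tiles a convolution filter's input-channel dimension into fixed-size groups so that weights can be fetched in a hardware-friendly layout. The source layout (OC, IC, KH, KW), the target layout, and the tile size of 32 are assumptions made for this example, not the actual layout used by any particular processor.

```python
import numpy as np

def reorder_conv_filter(filt: np.ndarray, ic_parallel: int = 32) -> np.ndarray:
    """(OC, IC, KH, KW) -> (OC, ceil(IC/ic_parallel), KH, KW, ic_parallel), zero-padded.

    Illustrative only: the layout and tile size are assumptions for this sketch.
    """
    oc, ic, kh, kw = filt.shape
    groups = -(-ic // ic_parallel)                    # ceil division
    padded = np.zeros((oc, groups * ic_parallel, kh, kw), dtype=filt.dtype)
    padded[:, :ic] = filt                             # pad IC up to a multiple of the tile
    tiled = padded.reshape(oc, groups, ic_parallel, kh, kw)
    return tiled.transpose(0, 1, 3, 4, 2)             # tile becomes the innermost axis
```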
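
The core idea of address assignment, giving each tensor that lives in global memory an offset while letting tensors with disjoint lifetimes share the same space, can be sketched as a greedy first-fit allocator. The `Tensor` fields, the 64-byte alignment, and the largest-first ordering below are assumptions for the illustration, not the pass's real algorithm.

```python
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    size: int        # bytes of global memory needed
    start: int       # index of the op that produces it
    end: int         # index of its last user

def assign_addresses(tensors, alignment=64):
    """Return {name: address}; tensors with non-overlapping lifetimes may share memory."""
    placed = []                                  # list of (tensor, address) already assigned
    result = {}
    for t in sorted(tensors, key=lambda t: t.size, reverse=True):
        addr = 0
        while True:
            conflict = None
            for other, other_addr in placed:
                live_overlap = not (t.end < other.start or other.end < t.start)
                mem_overlap = not (addr + t.size <= other_addr or
                                   other_addr + other.size <= addr)
                if live_overlap and mem_overlap:
                    conflict = other_addr + other.size
                    break
            if conflict is None:
                break
            # bump past the conflicting allocation, keep alignment, and re-check
            addr = (conflict + alignment - 1) // alignment * alignment
        placed.append((t, addr))
        result[t.name] = addr
    return result

tensors = [Tensor("a", 4096, 0, 2), Tensor("b", 2048, 1, 3), Tensor("c", 4096, 4, 5)]
print(assign_addresses(tensors))   # "c" reuses "a"'s address: their lifetimes do not overlap
```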
4.4. Other Passes
There are also some optional passes, not shown in the diagram, that provide special functions.
- fuse-preprocess
Fuses image preprocessing into the model.
- add-postprocess
Adds postprocessing to the model; only SSD/YOLOv3/YOLOv5 are supported.