How to Add a New Operator for TPU-MLIR

As we all know, a complete model is actually composed of a series of operators, so if we want to make the compiler more general, supporting as many operators as possible is a must.

That way, no matter whether an op comes from ONNX, Caffe, PyTorch, or whatnot, we can find a corresponding op in TPU-MLIR to express it.

So the first step in adding a new operator is its definition, as we mentioned in the front-end conversion episode.

In MLIR, you can just use the TableGen tool to do the definition work instead of hand-writing all the C++ boilerplate yourself. In the td file, the inputs, outputs, and attributes of each operator are specified.

In TPU-MLIR, operators of different dialects are defined in different td files, and these operators are registered under the corresponding dialect when the compiler is built.
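For example, here is a minimal sketch of what such a definition might look like in a td file; the op name MyOp and its fields are hypothetical and purely for illustration, with Top_Op as the base class mentioned later in this episode:

```tablegen
// Illustrative only: a hypothetical "MyOp" defined under the Top dialect.
// Real definitions may carry more traits, attributes, and documentation.
def Top_MyOp : Top_Op<"MyOp"> {
  let summary = "my example operator";
  let arguments = (ins
    AnyTensor:$input,   // the input tensor
    F64Attr:$alpha      // a scalar attribute of the op
  );
  let results = (outs AnyTensor:$output);  // the output tensor
}
```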

But the definition part only generates the operator template.

In other words, our compiler still doesn't know what this operator does with its input tensors, so we have to finish that work by implementing the inference function under the corresponding directory.
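To make this concrete, here is a rough sketch of what an inference function for the hypothetical MyOp above could look like. The InferenceParameter fields and the module:: helpers are assumptions for illustration, not the exact TPU-MLIR API:

```cpp
// Sketch only: CPU inference for the hypothetical element-wise MyOp.
// This is a fragment of the op's .cpp file; field and helper names are
// assumptions for illustration.
LogicalResult top::MyOp::inference(InferenceParameter &p) {
  const int64_t num_elem = module::getNumElements(getOutput()); // assumed helper
  const float alpha = static_cast<float>(getAlpha().convertToDouble());
  for (int64_t i = 0; i < num_elem; ++i) {
    // this is where the operator's actual computation goes
    p.outputs[0][i] = p.inputs[0][i] * alpha;
  }
  return success();
}
```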

In the Top dialect, besides the inference interface, the other interfaces we are required to implement for each operator are the FLOPs and shape interfaces. The former computes the number of floating-point operations, and the latter infers the output shape in case the output is unranked.

Yeah, in MLIR we have RankedTensorType and UnrankedTensorType.

The declaration of these interfaces is required in the td file, so all operators derived from the Top_Op class need to declare them.
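Continuing with the hypothetical MyOp, these two interfaces could be sketched roughly like this (again, the helper names are assumptions, not the exact TPU-MLIR API):

```cpp
// Sketch only: FLOPs and shape interfaces for the hypothetical MyOp.
int64_t top::MyOp::getFLOPs() {
  // one multiply per output element, so FLOPs == number of output elements
  return module::getNumElements(getOutput()); // assumed helper
}

void top::MyOp::shape_inference() {
  // element-wise op: the output shape simply follows the input shape
  auto in_shape = module::getShape(getInput());     // assumed helper
  module::setShapeOrVerify(getOutput(), in_shape);  // assumed helper
}
```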

Similarly, we also have to implement the inference interface for each Tpu operator. Since we can get the FLOPs and shape information directly from the Top operators, those two interfaces are unnecessary here.

Since both Top and Tpu operators do their inference work on the CPU, we sometimes hand that work over to oneDNN, a cross-platform neural network library mainly used to improve inference performance on CPUs. But I won't go further into it here; maybe we can make another video to introduce it if you guys are interested.

So let me know by leaving a comment below, ok?

As we know, the Tpu operators will eventually be used for codegen on different hardware, so for operators in this dialect, additional interfaces have to be implemented for each chip.

The LocalGenInterface is for operators that take part in LayerGroup, while the GlobalGenInterface covers the rest. So you will see that every operator declares GlobalGenInterface, but only some of them also get LocalGen.

In the GlobalGen case, the tensors are in global memory, so what we need to do here is prepare all the arguments required by the backend API, such as the operator's attributes as well as the global addresses of the input and output tensors.
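To give a rough idea, here is a sketch of what a GlobalGen implementation could look like for the hypothetical MyOp; the chip suffix, the parameter struct, and backend_api_myop_global() are placeholders, not the real backend API:

```cpp
// Sketch only: global codegen for the hypothetical MyOp on some chip.
// The parameter struct and the backend call are placeholders; the real
// backend API and its fields differ per hardware.
typedef struct {
  uint64_t input_addr;   // global address of the input tensor
  uint64_t output_addr;  // global address of the output tensor
  float alpha;           // the op's attribute
} myop_global_param_t;

extern void backend_api_myop_global(const myop_global_param_t *param); // placeholder

void tpu::MyOp::codegen_global_mychip() { // one such method per supported chip
  myop_global_param_t param = {};
  param.input_addr = module::getAddress(getInput());   // assumed helper
  param.output_addr = module::getAddress(getOutput()); // assumed helper
  param.alpha = static_cast<float>(getAlpha().convertToDouble());
  backend_api_myop_global(&param); // hand everything over to the backend
}
```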

When it comes to LocalGen, the tensors are already in local memory, which means the move from global to local memory has already been done, so we need to call the backend API for local execution.

Also, we sometimes need to compute a buffer size for storing intermediate results when the tensor is quantized. That's because intermediate results are usually stored with wider data types. For example, in int8 quantization, we first store the computation result as int16 or int32 data and then requantize it back to int8.
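For the buffer part, here is a sketch of how that computation could look when MyOp is quantized to int8 but accumulates in int32; the method name and parameter list are illustrative, and the real interface passes more slicing information:

```cpp
// Sketch only: extra local-memory buffer needed by the hypothetical MyOp
// when it is quantized to int8 but keeps int32 intermediate results.
int64_t tpu::MyOp::getBufferSize_mychip(int64_t in_lmem_bytes,
                                        int64_t out_lmem_bytes) {
  (void)in_lmem_bytes; // not needed for this simple element-wise case
  // the int8 output occupies out_lmem_bytes, so an int32 intermediate
  // buffer with the same element count needs 4x as many bytes
  return out_lmem_bytes * sizeof(int32_t);
}
```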

After finishing the definition and interface implementation work, there is still one thing left to do: lowering.

In the TopToTpu pass, we have to apply the pattern set for op conversion, which requires us to implement a conversion pattern for every single operator on each hardware.

There are three steps in total: first, declare the lowering pattern in the header file; next, implement the pattern; then add it to the pattern set.

The main thing we need to do in the implementation is to replace the current Top op with the corresponding Tpu op, and the type of the new op should be set according to the specified quantization mode, roughly as sketched below.
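Here is a rough sketch of the three steps for the hypothetical MyOp; the class and helper names such as TopLowering and lowering_common_f32 are modeled on the flow described here and should be treated as assumptions:

```cpp
// Step 1 (header): declare the lowering pattern for the hypothetical MyOp.
class MyOpLowering : public TopLowering<top::MyOp> { // assumed base class
public:
  using TopLowering<top::MyOp>::TopLowering; // inherit constructors (assumed)
  void LoweringF32(PatternRewriter &rewriter, top::MyOp op) const override;
  void LoweringINT8(PatternRewriter &rewriter, top::MyOp op,
                    bool asymmetric) const override;
};

// Step 2 (cpp): replace the Top op with the corresponding Tpu op, with the
// result type chosen by the quantization mode.
void MyOpLowering::LoweringF32(PatternRewriter &rewriter, top::MyOp op) const {
  lowering_common_f32<tpu::MyOp>(rewriter, op); // assumed shared helper
}
void MyOpLowering::LoweringINT8(PatternRewriter &rewriter, top::MyOp op,
                                bool asymmetric) const {
  lowering_common_int8<tpu::MyOp>(rewriter, op, asymmetric); // assumed helper
}

// Step 3 (pass setup): register the pattern in the TopToTpu pattern set,
// e.g. something along the lines of:
//   patterns.add<MyOpLowering>(patterns.getContext());
```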

By this point, the work of adding a new operator is done.