18. Implementing Backend Operators with PPL

PPL (Programming Language for TPUs) is a domain-specific language (DSL) with C/C++-style syntax extensions, designed for programming Tensor Processing Units (TPUs). This chapter uses the add_const_fp operator as an example to demonstrate how to implement backend operators in PPL, and how PPL code is compiled and used within TPU-MLIR.

The PPL backend operator implementations can be found in the tpu-mlir/lib/PplBackend/src directory; in release packages they are located in the PplBackend/src directory of the TPU-MLIR release package. For detailed instructions on writing PPL source code, refer to the documentation in tpu-mlir/third_party/ppl/doc.

18.1. How to Write and Call Backend Operators

Step 1: Implement Three Source Files

You need to create three source files: the device-side kernel code (a .pl file), the host-side tiling function (a .cpp file), and the host-side API wrapper (a .cpp file). For the add_const_fp example, these files are:

  • add_const_fp.pl: Implements the add_const_f32, add_const_f16, add_const_bf16, and related kernel interfaces.

  • add_const_fp_tile.cpp: Implements the add_tiling function, which calls these kernel interfaces.

  • add_const_fp_api.cpp: Implements the api_add_const_fp_global function, which calls the add_tiling interface.
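To make the division of labor between these files concrete, the host-side API file can be sketched roughly as follows. The struct layout, field names, and stubbed tiling function below are hypothetical placeholders, not the real definitions from Param.h or add_const_fp_tile.cpp:

```cpp
#include <cassert>
#include <cstdint>

using gaddr_t = uint64_t; // device global address, matching the pl convention

// Hypothetical stand-in for the real parameter struct defined in Param.h
struct add_const_spec_t {
  float const_val;
  bool do_relu;
  int dtype;
};

// Stub of the tiling entry implemented in add_const_fp_tile.cpp; the real
// one selects a kernel by dtype and retries with smaller blocks.
static int add_tiling(gaddr_t dst, gaddr_t src, float rhs, int N, int C,
                      int H, int W, bool relu, int dtype) {
  return (dst != 0 && src != 0 && N > 0) ? 0 : -1;
}

// Sketch of the api_* entry: unpack the spec and shape, then delegate
// to the tiling function.
extern "C" int api_add_const_fp_global(gaddr_t out, gaddr_t in,
                                       const add_const_spec_t *spec,
                                       const int shape[4]) {
  return add_tiling(out, in, spec->const_val, shape[0], shape[1], shape[2],
                    shape[3], spec->do_relu, spec->dtype);
}
```

The point of this layering is that the API file only adapts MLIR-side structures to plain arguments; all hardware-aware decisions live in the tiling file.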

Tiling File Example (add_const_fp_tile.cpp)

// Include the automatically generated header file from the pl file
#include "add_const_fp.h"
// Include the header file for MLIR data types and structures
#include "tpu_mlir/Backend/BM168x/Param.h"

// The entry function must be defined using extern "C"
extern "C" {
// If the pl file provides multiple operators, a function-pointer type can be
// defined in advance to avoid repetitive code. Note that pointer parameters
// in the pl file must be declared with the `gaddr_t` type.
using KernelFunc = int (*)(gaddr_t, gaddr_t, float, int, int, int, int, int, bool);

// The entry function, with user-defined input parameters
int add_tiling(gaddr_t ptr_dst, gaddr_t ptr_src, float rhs, int N, int C, int H,
               int W, bool relu, int dtype) {
  KernelFunc func;
  // Select the appropriate kernel based on the input data type
  if (dtype == DTYPE_FP32) {
    func = add_const_f32;
  } else if (dtype == DTYPE_FP16) {
    func = add_const_f16;
  } else if (dtype == DTYPE_BFP16) {
    func = add_const_bf16;
  } else {
    assert(0 && "unsupported dtype");
  }

  // Calculate the block size, aligned to `EU_NUM` to reduce memory allocation
  // failures. Since most TPU memory is aligned to `EU_NUM`, this alignment
  // does not affect memory allocation.
  int block_w = align_up(N * C * H * W, EU_NUM);
  int ret = -1;
  while (block_w > 1) {
    ret = func(ptr_dst, ptr_src, rhs, N, C, H, W, block_w, relu);
    if (ret == 0) {
      return 0;
    } else if (ret == PplLocalAddrAssignErr) {
      // `PplLocalAddrAssignErr` means the block is too large for local
      // memory; shrink the block and retry.
      int next = align_up(block_w / 2, EU_NUM);
      if (next == block_w)
        break; // cannot shrink below one EU_NUM-aligned block
      block_w = next;
      continue;
    } else if (ret == PplL2AddrAssignErr) {
      // `PplL2AddrAssignErr` means the block is too large for L2 memory.
      // This example allocates no L2 memory, so this error cannot occur here.
      assert(0);
    } else {
      // Other errors require debugging
      assert(0);
      return ret;
    }
  }
  return ret;
}
}
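The retry strategy in add_tiling can be exercised in isolation. The sketch below simulates the kernel call with a simple capacity check; kEuNum is an illustrative placeholder for the real EU_NUM from the generated header, and the capacity argument stands in for the local-memory limit that would otherwise trigger PplLocalAddrAssignErr:

```cpp
#include <cassert>
#include <vector>

constexpr int kEuNum = 16; // illustrative placeholder for EU_NUM

// Round x up to the next multiple of a (same role as PPL's align_up)
static int align_up(int x, int a) { return (x + a - 1) / a * a; }

// Mimics the halving loop in add_tiling: shrink the EU_NUM-aligned block
// until the simulated kernel (a capacity check) accepts it. Each attempted
// block size is recorded in `tried`.
static int pick_block_w(int total_elems, int capacity,
                        std::vector<int> *tried) {
  int block_w = align_up(total_elems, kEuNum);
  while (block_w > 1) {
    tried->push_back(block_w);
    if (block_w <= capacity)
      return block_w; // kernel returned 0: this block size fits
    int next = align_up(block_w / 2, kEuNum);
    if (next == block_w)
      return -1; // cannot shrink below one aligned block
    block_w = next;
  }
  return -1;
}
```

For example, with 1000 elements and a simulated capacity of 512, the loop first tries 1008 (the aligned total) and then settles on 512.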

Notes

  • The add_const_fp.h header file contains some error codes and chip-related parameter definitions.

  • The pointers in the pl file need to be defined using the gaddr_t type.
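Based on the KernelFunc function-pointer type used above, the auto-generated add_const_fp.h can be expected to declare the kernels along these lines. This is an illustrative sketch only (stub bodies are added here so it is self-contained); the actual generated header also carries the error codes and chip parameters:

```cpp
#include <cstdint>

using gaddr_t = uint64_t; // pl pointer parameters are exposed as gaddr_t

// Expected shape of the kernel declarations in the generated header
extern "C" {
int add_const_f32(gaddr_t dst, gaddr_t src, float rhs, int N, int C, int H,
                  int W, int block_w, bool relu) { return 0; }
int add_const_f16(gaddr_t dst, gaddr_t src, float rhs, int N, int C, int H,
                  int W, int block_w, bool relu) { return 0; }
int add_const_bf16(gaddr_t dst, gaddr_t src, float rhs, int N, int C, int H,
                   int W, int block_w, bool relu) { return 0; }
}
```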

Table 18.1 Built-in Error Codes

  Parameter Name          Description
  PplLocalAddrAssignErr   Local memory allocation failed
  FileErr
  LlvmFeErr
  PplFeErr                AST to IR conversion failed
  PplOpt1Err              Optimization pass opt1 failed
  PplOpt2Err              Optimization pass opt2 failed
  PplFinalErr             Optimization pass final failed
  PplTransErr             Code generation failed
  EnvErr                  Environment variable exception
  PplL2AddrAssignErr      L2 memory allocation failed
  PplShapeInferErr        Shape inference failed
  PplSetMemRefShapeErr
  ToPplErr
  PplTensorConvErr
  PplDynBlockErr

Table 18.2 Built-in Chip Parameters

  Parameter Name   Description
  EU_NUM           Number of EUs
  LANE_NUM         Number of lanes

Step 2: Call the Kernel Interface

In the function void tpu::AddConstOp::codegen_global_bm1684x() within lib/Dialect/Tpu/Interfaces/BM1684X/AddConst.cpp, call api_add_const_fp_global as follows:

BM168x::call_ppl_global_func("api_add_const_fp_global", &param,
                             sizeof(param), input_spec->data(),
                             output_spec->data());

If the operator supports local execution, implement api_xxxxOp_local and call it using BM168x::call_ppl_local_func:

BM168x::call_ppl_local_func("api_xxxx_local", &spec, sizeof(spec),
                            &sec_info, input_spec->data(),
                            output_spec->data());

This completes the implementation of the backend operator.

18.2. PPL Workflow in TPU-MLIR

  1. Place the PPL compiler in the third_party/ppl directory and update it by referring to the README.md file in that directory.

  2. Integrate the PPL source code compilation in model_deploy.py. The process is illustrated in the following diagram:

_images/ppl_flow.png

Fig. 18.1 PPL Workflow