Implementing Backend Operators with PPL
=========================================

PPL (Programming Language for TPUs) is a domain-specific language (DSL) that extends C/C++ syntax for programming Tensor Processing Units (TPUs). This chapter demonstrates how to implement backend operators in PPL, using the ``add_const_fp`` operator as an example, and illustrates how PPL code is compiled and used within TPU-MLIR.

The implementations of the PPL backend operators are located in the ``tpu-mlir/lib/PplBackend/src`` directory; for release packages, they are located in the ``PplBackend/src`` directory of the TPU-MLIR release package. For detailed instructions on writing PPL source code, refer to the documentation in ``tpu-mlir/third_party/ppl/doc``.

How to Write and Call Backend Operators
-----------------------------------------

**Step 1: Implement Three Source Files**

You need to create three source files: one for the device-side kernel (``.pl``), one for the host-side tiling function (``.cpp``), and one for the host-side API entry (``.cpp``). For the ``add_const_fp`` example, these files are:

- ``add_const_fp.pl``: Implements the ``add_const_f32``, ``add_const_f16``, ``add_const_bf16``, etc., kernel interfaces.
- ``add_const_fp_tile.cpp``: Implements the ``add_tiling`` function, which calls these kernel interfaces.
- ``add_const_fp_api.cpp``: Implements the ``api_add_const_fp_global`` function, which calls ``add_tiling`` (a sketch follows the tiling example below).

**tiling.cpp File Example**

.. code-block:: cpp

   // Include the automatically generated header file from the pl file
   #include "add_const_fp.h"
   // Include the header file for MLIR data types and structures
   #include "tpu_mlir/Backend/BM168x/Param.h"

   // The entry function must be defined using extern "C"
   extern "C" {

   // If the pl file provides multiple operators, you can define a function
   // pointer type in advance to reduce repetitive code. Note that pointers
   // in the pl file must be declared with the `gaddr_t` type.
   using KernelFunc = int (*)(gaddr_t, gaddr_t, float, int, int, int, int,
                              int, bool);

   // Add the entry function with user-defined input parameters
   int add_tiling(gaddr_t ptr_dst, gaddr_t ptr_src, float rhs, int N, int C,
                  int H, int W, bool relu, int dtype) {
     KernelFunc func;
     // Select the appropriate kernel based on the input data type
     if (dtype == DTYPE_FP32) {
       func = add_const_f32;
     } else if (dtype == DTYPE_FP16) {
       func = add_const_f16;
     } else if (dtype == DTYPE_BFP16) {
       func = add_const_bf16;
     } else {
       assert(0 && "unsupported dtype");
     }

     // Calculate the block size, aligned to `EU_NUM` to reduce memory
     // allocation failures. Since most TPU memory is aligned to `EU_NUM`,
     // this alignment does not affect memory allocation.
     int block_w = align_up(N * C * H * W, EU_NUM);
     int ret = -1;
     while (block_w > 1) {
       ret = func(ptr_dst, ptr_src, rhs, N, C, H, W, block_w, relu);
       if (ret == 0) {
         return 0;
       } else if (ret == PplLocalAddrAssignErr) {
         // `PplLocalAddrAssignErr` means the block size is too large and
         // local memory cannot accommodate it; reduce the block size and
         // retry.
         block_w = align_up(block_w / 2, EU_NUM);
         continue;
       } else if (ret == PplL2AddrAssignErr) {
         // `PplL2AddrAssignErr` means the block size is too large and L2
         // memory cannot accommodate it. This example does not allocate
         // L2 memory, so this error will not occur here.
         assert(0);
       } else {
         // Other errors require debugging
         assert(0);
         return ret;
       }
     }
     return ret;
   }
   }
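The retry loop above relies on ``align_up`` keeping ``block_w`` a multiple of ``EU_NUM`` as it shrinks. The helper is available to the tiling code through the included headers; a minimal sketch of its assumed round-up semantics:

.. code-block:: cpp

   // Assumed semantics of align_up: round x up to the nearest multiple of a.
   // The real helper ships with the PPL/TPU-MLIR headers; this sketch is for
   // illustration only.
   static inline int align_up(int x, int a) {
     return ((x + a - 1) / a) * a;
   }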
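The third file, ``add_const_fp_api.cpp``, is a thin wrapper that unpacks the parameter and tensor structures passed in from TPU-MLIR and forwards them to ``add_tiling``. The sketch below shows its general shape; the struct and field names (``constbinary_global_spec_t``, ``tensor_spec_t``, ``B_const_val``, ``if_relu``) are assumptions based on ``Param.h`` and should be verified against the actual sources:

.. code-block:: cpp

   #include "add_const_fp.h" // for gaddr_t
   #include "tpu_mlir/Backend/BM168x/Param.h"

   extern "C" {

   // Implemented in add_const_fp_tile.cpp
   int add_tiling(gaddr_t ptr_dst, gaddr_t ptr_src, float rhs, int N, int C,
                  int H, int W, bool relu, int dtype);

   // Entry point invoked from TPU-MLIR via BM168x::call_ppl_global_func.
   // The parameter/tensor struct names below are assumptions, not the
   // verbatim definitions from Param.h.
   void api_add_const_fp_global(void *param, size_t param_size,
                                void *input_spec, void *output_spec) {
     (void)param_size;
     auto _param = (constbinary_global_spec_t *)param;
     auto in = (tensor_spec_t *)input_spec;
     auto out = (tensor_spec_t *)output_spec;
     add_tiling(out->addr, in->addr, _param->common.B_const_val,
                in->shape[0], in->shape[1], in->shape[2], in->shape[3],
                _param->common.if_relu, in->dtype);
   }
   }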
**Notes**

- The ``add_const_fp.h`` header file contains the error codes and chip-related parameter definitions.
- Pointers in the ``pl`` file must be declared with the ``gaddr_t`` type.

.. list-table:: Built-in Error Codes
   :widths: 30 30
   :header-rows: 1

   * - Error Code
     - Description
   * - PplLocalAddrAssignErr
     - Local memory allocation failed
   * - FileErr
     -
   * - LlvmFeErr
     -
   * - PplFeErr
     - AST to IR conversion failed
   * - PplOpt1Err
     - Optimization pass opt1 failed
   * - PplOpt2Err
     - Optimization pass opt2 failed
   * - PplFinalErr
     - Optimization pass final failed
   * - PplTransErr
     - Code generation failed
   * - EnvErr
     - Environment variable exception
   * - PplL2AddrAssignErr
     - L2 memory allocation failed
   * - PplShapeInferErr
     - Shape inference failed
   * - PplSetMemRefShapeErr
     -
   * - ToPplErr
     -
   * - PplTensorConvErr
     -
   * - PplDynBlockErr
     -

.. list-table:: Built-in Chip Parameters
   :widths: 30 30
   :header-rows: 1

   * - Parameter Name
     - Description
   * - EU_NUM
     - Number of EUs
   * - LANE_NUM
     - Number of lanes

**Step 2: Call the Kernel Interface**

In the function ``void tpu::AddConstOp::codegen_global_bm1684x()`` within ``lib/Dialect/Tpu/Interfaces/BM1684X/AddConst.cpp``, call ``api_add_const_fp_global`` as follows:

.. code-block:: cpp

   BM168x::call_ppl_global_func("api_add_const_fp_global", &param,
                                sizeof(param), input_spec->data(),
                                output_spec->data());

If the operator supports local execution, implement ``api_xxxxOp_local`` and call it using ``BM168x::call_ppl_local_func``:

.. code-block:: cpp

   BM168x::call_ppl_local_func("api_xxxx_local", &spec, sizeof(spec),
                               &sec_info, input_spec->data(),
                               output_spec->data());

This completes the implementation of the backend operator.

PPL Workflow in TPU-MLIR
-------------------------

1. Place the PPL compiler in the ``third_party/ppl`` directory; to update it, refer to the ``README.md`` file in that directory.
2. Integrate the compilation of the PPL source code into ``model_deploy.py``. The process is illustrated in the following diagram:

.. _ppl_flow:

.. figure:: ../assets/ppl_flow.png
   :height: 9.5cm
   :align: center

   PPL Workflow