TPU-MLIR is an open-source machine-learning compiler that converts
pre-trained networks from PyTorch, ONNX, TFLite, Caffe and HuggingFace
LLMs into .bmodel files that run efficiently on Sophgo TPUs.
A two-dialect MLIR pipeline keeps the path from algorithm to silicon clean,
retargetable and easy to extend.
A pragmatic, scriptable flow built around three small Python entry points.
Import the source network into the framework-agnostic Top dialect.
$ model_transform.py \
    --model_name yolov5s \
    --model_def yolov5s.onnx \
    --input_shapes [[1,3,640,640]] \
    --mlir yolov5s.mlir
Optional: build an INT8 calibration table from a small dataset slice.
$ run_calibration.py yolov5s.mlir \
    --dataset ../COCO2017 \
    --input_num 100 \
    -o yolov5s_cali_table
Lower to the TPU dialect and emit a .bmodel for the target chip.
$ model_deploy.py \
    --mlir yolov5s.mlir \
    --quantize INT8 \
    --calibration_table yolov5s_cali_table \
    --processor bm1684x \
    --model yolov5s_int8.bmodel
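To verify that the deployed model still matches the original network, model_deploy.py can compare TPU results against a reference. A minimal sketch, assuming model_transform.py was run with --test_input/--test_result beforehand; the file names yolov5s_in_f32.npz and yolov5s_top_outputs.npz are assumptions, not outputs shown above:

# Compare TPU inference against the Top-dialect reference.
# Input and reference file names are assumed to come from an earlier
# model_transform.py run with --test_input/--test_result.
$ model_deploy.py \
    --mlir yolov5s.mlir \
    --quantize INT8 \
    --calibration_table yolov5s_cali_table \
    --processor bm1684x \
    --test_input yolov5s_in_f32.npz \
    --test_reference yolov5s_top_outputs.npz \
    --tolerance 0.85,0.45 \
    --model yolov5s_int8.bmodel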
Multiple frameworks in, multiple precisions out — on a clean MLIR foundation that's easy to extend.
PyTorch, ONNX, TFLite and Caffe are supported out of the box; models from other frameworks can be brought in via ONNX export.
First-class support for HuggingFace LLMs — including AWQ, GPTQ and AutoRound variants — through a single llm_convert.py.
F32 / F16 / BF16 / INT8 / W4A16 / W8A16, including post-training calibration for INT8; see the precision-switch sketch below.
Targets Sophgo TPU processors such as bm1684x and bm1688 from the same MLIR pipeline.
Aggressive graph- and tensor-level optimizations (Layer Group, fused preprocess and postprocess) keep TPUs busy and memory traffic low; a fused-preprocess sketch appears below.
Vision (YOLO family), speech (Whisper) and large language models (Qwen3.5, Qwen3-VL, MiniCPM-V-4, …) ship in the regression set.
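For instance, moving the same network to a float precision is a one-flag change. A minimal sketch, assuming the yolov5s.mlir produced by the quick start; float deployments need no calibration table:

# Redeploy the same Top-dialect MLIR at F16.
$ model_deploy.py \
    --mlir yolov5s.mlir \
    --quantize F16 \
    --processor bm1684x \
    --model yolov5s_f16.bmodel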
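And as a sketch of fused preprocessing: input normalization can be declared at import time and folded into the bmodel at deploy time. The mean/scale values below are the usual YOLOv5 ones and are an assumption, not taken from the examples above:

# Declare input normalization when importing (mean/scale values assumed).
$ model_transform.py \
    --model_name yolov5s \
    --model_def yolov5s.onnx \
    --input_shapes [[1,3,640,640]] \
    --mean 0.0,0.0,0.0 \
    --scale 0.0039216,0.0039216,0.0039216 \
    --pixel_format rgb \
    --mlir yolov5s.mlir

# Fold that preprocessing into the TPU model itself.
$ model_deploy.py \
    --mlir yolov5s.mlir \
    --quantize INT8 \
    --calibration_table yolov5s_cali_table \
    --processor bm1684x \
    --fuse_preprocess \
    --model yolov5s_int8_fused.bmodel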
Pick a frontend, choose a precision, deploy to the chip you ship on.
A non-exhaustive snapshot of the families that ship through TPU-MLIR's regression and demo flow.
Converting a HuggingFace checkpoint with llm_convert.py and chatting on a Sophgo BM1684X. Both flows assume the official sophgo/tpuc_dev Docker image and pip install tpu_mlir.
# 1. Pull a quantized build from HuggingFace
$ git lfs install
$ git clone https://huggingface.co/Intel/Qwen3.5-2B-int4-AutoRound

# 2. Convert to a bmodel for the BM1684X
$ llm_convert.py \
    -m /workspace/Qwen3.5-2B-int4-AutoRound \
    --max_input_length 1024 -s 2048 \
    -c bm1684x --max_pixels 768,768 \
    -o qwen3.5_2b

# → qwen3.5_2b/*.bmodel ✔ ready for TPU
# 1) Frontend → Top dialect MLIR
$ model_transform.py \
    --model_name yolov5s \
    --model_def ../yolov5s.onnx \
    --input_shapes [[1,3,640,640]] \
    --mlir yolov5s.mlir

# 2) Calibrate for INT8
$ run_calibration.py yolov5s.mlir \
    --dataset ../COCO2017 --input_num 100 \
    -o yolov5s_cali_table

# 3) Lower to a bmodel
$ model_deploy.py \
    --mlir yolov5s.mlir --quantize INT8 \
    --calibration_table yolov5s_cali_table \
    --processor bm1684x \
    --model yolov5s_int8.bmodel
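For a quick off-line smoke test of the resulting bmodel, the tpu_mlir package also ships a model_runner.py utility. A minimal sketch; the input file name is an assumption:

# Run the compiled bmodel on a saved input tensor and dump the outputs.
$ model_runner.py \
    --input yolov5s_in_f32.npz \
    --model yolov5s_int8.bmodel \
    --output yolov5s_out.npz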
Everything you need to go deeper, in English and Chinese.