TPU-MLIR is an open-source machine-learning compiler that converts
pre-trained networks from PyTorch, ONNX, TFLite, Caffe and HuggingFace
LLMs into .bmodel files that run efficiently on Sophgo TPUs.
A two-dialect MLIR pipeline keeps the path from algorithm to silicon clean,
retargetable and easy to extend.
A pragmatic, scriptable flow built around three small Python entry points.
Import the source network into the framework-agnostic Top dialect.
$ model_transform.py \
    --model_name yolov5s \
    --model_def yolov5s.onnx \
    --input_shapes [[1,3,640,640]] \
    --mlir yolov5s.mlir
Optional: build an INT8 calibration table from a small dataset slice.
$ run_calibration.py yolov5s.mlir \
    --dataset ../COCO2017 \
    --input_num 100 \
    -o yolov5s_cali_table
Lower to the TPU dialect and emit a .bmodel for the target chip.
$ model_deploy.py \
    --mlir yolov5s.mlir \
    --quantize INT8 \
    --calibration_table yolov5s_cali_table \
    --processor bm1684x \
    --model yolov5s_int8.bmodel
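To verify that the deployed model still matches the original network, model_deploy.py can compare TPU results against a reference. A minimal sketch, assuming model_transform.py was run with --test_input/--test_result beforehand; the file names yolov5s_in_f32.npz and yolov5s_top_outputs.npz are assumptions, not outputs shown above:

# Compare TPU inference against the Top-dialect reference.
# Input and reference file names are assumed to come from an earlier
# model_transform.py run with --test_input/--test_result.
$ model_deploy.py \
    --mlir yolov5s.mlir \
    --quantize INT8 \
    --calibration_table yolov5s_cali_table \
    --processor bm1684x \
    --test_input yolov5s_in_f32.npz \
    --test_reference yolov5s_top_outputs.npz \
    --tolerance 0.85,0.45 \
    --model yolov5s_int8.bmodel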
Multiple frameworks in, multiple precisions out — on a clean MLIR foundation that's easy to extend.
PyTorch, ONNX, TFLite and Caffe are supported out of the box; models from other frameworks can be brought in via ONNX export.
First-class support for HuggingFace LLMs — including AWQ, GPTQ and AutoRound variants — through a single llm_convert.py.
F32 / F16 / BF16 / INT8 / W4A16 / W8A16, including post-training calibration for INT8; see the precision-switch sketch below.
Targets Sophgo TPU processors such as bm1684x and bm1688 from the same MLIR pipeline.
Aggressive graph- and tensor-level optimizations (Layer Group, fused preprocess and postprocess) keep TPUs busy and memory traffic low; a fused-preprocess sketch appears below.
Vision (YOLO family), speech (Whisper) and large language models (Qwen3.5, Qwen3-VL, MiniCPM-V-4, …) ship in the regression set.
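For instance, moving the same network to a float precision is a one-flag change. A minimal sketch, assuming the yolov5s.mlir produced by the quick start; float deployments need no calibration table:

# Redeploy the same Top-dialect MLIR at F16.
$ model_deploy.py \
    --mlir yolov5s.mlir \
    --quantize F16 \
    --processor bm1684x \
    --model yolov5s_f16.bmodel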
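And as a sketch of fused preprocessing: input normalization can be declared at import time and folded into the bmodel at deploy time. The mean/scale values below are the usual YOLOv5 ones and are an assumption, not taken from the examples above:

# Declare input normalization when importing (mean/scale values assumed).
$ model_transform.py \
    --model_name yolov5s \
    --model_def yolov5s.onnx \
    --input_shapes [[1,3,640,640]] \
    --mean 0.0,0.0,0.0 \
    --scale 0.0039216,0.0039216,0.0039216 \
    --pixel_format rgb \
    --mlir yolov5s.mlir

# Fold that preprocessing into the TPU model itself.
$ model_deploy.py \
    --mlir yolov5s.mlir \
    --quantize INT8 \
    --calibration_table yolov5s_cali_table \
    --processor bm1684x \
    --fuse_preprocess \
    --model yolov5s_int8_fused.bmodel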
Pick a frontend, choose a precision, deploy to the chip you ship on.
A non-exhaustive snapshot of the families that ship through TPU-MLIR's regression and demo flow.
Converting a HuggingFace checkpoint with llm_convert.py and chatting on a Sophgo BM1684X. Both flows assume the official sophgo/tpuc_dev Docker image and pip install tpu_mlir.
# 1. Pull a quantized build from HuggingFace
$ git lfs install
$ git clone https://huggingface.co/Intel/Qwen3.5-2B-int4-AutoRound

# 2. Convert to a bmodel for the BM1684X
$ llm_convert.py \
    -m /workspace/Qwen3.5-2B-int4-AutoRound \
    --max_input_length 1024 -s 2048 \
    -c bm1684x --max_pixels 768,768 \
    -o qwen3.5_2b

# → qwen3.5_2b/*.bmodel ✔ ready for TPU
# 1) Frontend → Top dialect MLIR
$ model_transform.py \
    --model_name yolov5s \
    --model_def ../yolov5s.onnx \
    --input_shapes [[1,3,640,640]] \
    --mlir yolov5s.mlir

# 2) Calibrate for INT8
$ run_calibration.py yolov5s.mlir \
    --dataset ../COCO2017 --input_num 100 \
    -o yolov5s_cali_table

# 3) Lower to a bmodel
$ model_deploy.py \
    --mlir yolov5s.mlir --quantize INT8 \
    --calibration_table yolov5s_cali_table \
    --processor bm1684x \
    --model yolov5s_int8.bmodel
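For a quick off-line smoke test of the resulting bmodel, the tpu_mlir package also ships a model_runner.py utility. A minimal sketch; the input file name is an assumption:

# Run the compiled bmodel on a saved input tensor and dump the outputs.
$ model_runner.py \
    --input yolov5s_in_f32.npz \
    --model yolov5s_int8.bmodel \
    --output yolov5s_out.npz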
Everything you need to go deeper, in English and Chinese.