10. Compile LLM Model
10.1. Overview
llm_convert.py converts the original weights of a Large Language Model
(LLM) — including multimodal LLMs — into a .bmodel for efficient
inference on the supported chips (bm1684x, bm1688, cv186x).
The tool takes care of:

- loading the model config with transformers.AutoConfig and dispatching
  to the appropriate converter (LLM / VLM / ASR / OMNI);
- exporting MLIR for each transformer block, the embedding, the LM head,
  and (for VLMs) the vision tower;
- compiling and combining everything into a single .bmodel.
Supported model families include qwen2, qwen3, qwen2_vl,
qwen2_5_vl, qwen3_vl, qwen3_5, llama, minicpm,
minicpmv, internvl_chat, chatglm, phi3, glm4v,
gemma3, mllama, paddleocr_vl, lfm2_vl, janus,
qwen2_5_omni, qwen3_asr.
AWQ / GPTQ / AutoRound pre-quantized weights are accepted directly.
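The dispatch step can be pictured as a lookup from the config's model_type to a converter family. The following is an illustrative sketch only — the class names, the mapping function, and the exact membership of each family are assumptions, not the tool's actual code:

```python
# Hypothetical sketch of the model_type -> converter dispatch.
# The function name and the family sets are illustrative assumptions.

VLM_TYPES = {"qwen2_vl", "qwen2_5_vl", "qwen3_vl", "minicpmv",
             "internvl_chat", "glm4v"}

def pick_converter(model_type: str) -> str:
    """Return the converter family for a given config.model_type."""
    if model_type == "qwen2_5_omni":
        return "OMNI"
    if model_type == "qwen3_asr":
        return "ASR"
    if model_type in VLM_TYPES:
        return "VLM"
    return "LLM"    # plain text-only LLMs (qwen2, llama, phi3, ...)
```

A --dry_run prints the converter class the real tool resolved, so any guesswork here can be checked against its output.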
10.2. Prerequisites
Install Python dependencies once after installing the tpu_mlir wheel
or after cloning the source tree:
pip3 install -r requirements.txt
If AutoConfig.from_pretrained fails for a recent model, upgrade
transformers and torchvision:
pip3 install -U transformers torchvision
10.3. Command-Line Arguments
10.3.1. Required
-m, --model_path (string)
    Path to the original model directory, e.g. ./Qwen3.5-2B-int4-AutoRound.

-s, --seq_length (int)
    Maximum sequence length supported by the compiled bmodel.
10.3.2. Quantization
-q, --quantize (default: ``auto``)
    One of auto, bf16, f16, w8bf16, w4bf16, w8f16, w4f16. auto honors a
    pre-quantized config when present (AWQ / GPTQ / AutoRound), otherwise
    falls back to bf16.

-g, --q_group_size (int, default: ``64``)
    Group size for per-group W4 quantization.

--symmetric
    Use symmetric quantization.
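The resolution of -q auto can be sketched as: look for a pre-quantization record in the model's config and fall back to bf16. This is a simplified illustration of the stated behavior, assuming a Hugging Face style config dict; it is not the tool's actual code:

```python
# Simplified sketch of how "-q auto" might resolve the quantization mode
# from a config dict; the key names follow Hugging Face conventions and
# the exact logic is an assumption.

def resolve_quantize(config: dict, requested: str = "auto") -> str:
    if requested != "auto":
        return requested                     # explicit mode wins
    qcfg = config.get("quantization_config")
    if qcfg and qcfg.get("quant_method") in ("awq", "gptq", "auto-round"):
        bits = qcfg.get("bits", 4)
        return f"w{bits}bf16"                # pre-quantized weights
    return "bf16"                            # no pre-quantization found
```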
10.3.3. Target hardware
-c, --chip, --processor (default: ``bm1684x``)
    Target chip: bm1684x, bm1688, cv186x.

--num_device (int, default: ``1``)
    Number of devices the bmodel will run on.

--distribute_strategy (``tp`` | ``pp``, default: ``tp``)
    Multi-device strategy. Only takes effect when --num_device > 1.

--num_core (int, default: ``0``)
    Number of cores. 0 means use the chip's maximum.
10.3.4. Sequence and prefill
-b, --batch (int, default: ``1``)

--max_input_length (int, default: ``0`` ⇒ same as ``seq_length``)

--max_prefill_kv_length (int, default: ``0`` ⇒ same as ``seq_length``)

--use_block_with_kv
    Reuse history KV during prefill. Required for --share_prompt.

--share_prompt
    Share the same prompt prefix across multiple dialogs.
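The "0 means same as seq_length" defaulting can be written out explicitly. A minimal sketch of that rule, assuming nothing beyond what the option descriptions state:

```python
# Sketch of the prefill-length defaulting: a value of 0 falls back to
# seq_length. Function name is illustrative, not the tool's.

def resolve_lengths(seq_length: int,
                    max_input_length: int = 0,
                    max_prefill_kv_length: int = 0):
    """Return (max_input_length, max_prefill_kv_length) after defaulting."""
    max_input_length = max_input_length or seq_length
    max_prefill_kv_length = max_prefill_kv_length or seq_length
    return max_input_length, max_prefill_kv_length
```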
10.3.5. Vision (multimodal models only)
--max_pixels (``W,H``)
    Maximum image size for the vision tower, given as width,height
    (e.g. 672,896). When omitted, a default is picked by model type:

    ==========  =======
    model_type  default
    ==========  =======
    qwen2_5_vl  672,896
    minicpmv    980,980
    others      768,768
    ==========  =======

    W*H must be a multiple of 28*28 for Qwen2-VL / Qwen2.5-VL / GLM4V /
    MiniCPM-V, and a multiple of 32*32 for Qwen3-VL / Qwen3.5. The tool
    validates this up front.
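The divisibility rule can be checked locally before launching a compile. This is a standalone re-implementation of the stated rule, not the tool's own validation code:

```python
# Check the W*H divisibility rule for --max_pixels.
# Qwen2-VL / Qwen2.5-VL / GLM4V / MiniCPM-V require a multiple of 28*28;
# Qwen3-VL / Qwen3.5 require a multiple of 32*32.

def max_pixels_ok(width: int, height: int, patch: int) -> bool:
    """True if width*height is a multiple of patch*patch."""
    return (width * height) % (patch * patch) == 0

# The defaults from the table all satisfy their rule:
assert max_pixels_ok(672, 896, 28)   # qwen2_5_vl default
assert max_pixels_ok(980, 980, 28)   # minicpmv default
assert max_pixels_ok(768, 768, 32)   # "others" default, Qwen3-VL rule
```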
10.3.6. LoRA / sampling / misc
--lora_max_rank (int, default: ``0``)
    Reserve LoRA capacity. 0 disables LoRA support.

--do_sample
    Add a sampling head separate from the LM head (greedy + topk/topp).

--embedding_disk
    Export the word embedding as a .bin file and run it on the CPU
    (saves on-chip memory).

--dynamic
    Enable dynamic shape compilation for prefill. qwen3_5 always enables
    this.
10.3.7. Workflow control
-o, --out_dir (string)
    Output directory. If omitted, defaults to
    ./<model_name>_<chip>_<quantize> and is printed to stdout.

--only_mlir
    Stop after MLIR export; do not compile to bmodel.

--again
    Resume an interrupted run by skipping stages whose outputs already
    exist in --out_dir.

--debug
    Keep intermediate temp files for debugging.

--dry_run
    Resolve the configuration (defaults, per-model overrides,
    divisibility checks, chosen converter class) and print it without
    compiling. Recommended before launching a long conversion.

-V, --version
    Print the underlying pymlir version.
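The default output-directory name follows directly from the pattern ./<model_name>_<chip>_<quantize>. A minimal sketch of that naming, assuming model_name is the last path component (the exact normalization the tool applies is not documented here):

```python
# Sketch of the default --out_dir naming; illustrative only.
import os

def default_out_dir(model_path: str, chip: str, quantize: str) -> str:
    """Build ./<model_name>_<chip>_<quantize> from the model path."""
    model_name = os.path.basename(model_path.rstrip("/"))
    return f"./{model_name}_{chip}_{quantize}"
```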
10.4. Examples
Plain LLM (Qwen2-7B-Instruct, w4bf16):
llm_convert.py -m /workspace/Qwen2-7B-Instruct \
-s 384 -q w4bf16 -g 128 -c bm1684x \
-o qwen2_7b
Pre-quantized weights (AWQ / GPTQ work the same way):
llm_convert.py -m /workspace/Qwen2.5-0.5B-Instruct-AWQ -s 384 -c bm1684x -o qwen2.5_0.5b
llm_convert.py -m /workspace/Qwen2.5-0.5B-Instruct-GPTQ-Int4 -s 384 -c bm1684x -o qwen2.5_0.5b
Qwen3.5 multimodal (AutoRound int4):
llm_convert.py -m /workspace/Qwen3.5-2B-int4-AutoRound \
--max_input_length 1024 -s 2048 -c bm1684x \
--max_pixels 768,768 -o qwen3.5_2b
Sanity-check first with --dry_run:
llm_convert.py -m /workspace/Qwen3.5-2B-int4-AutoRound \
-s 2048 -c bm1684x --dry_run
The dry-run prints the resolved chip, quantization, sequence lengths,
max_shape / max_pixels, and the converter class that will be
used — useful for catching typos before a multi-hour compile.