10. Compile LLM Model
10.1. Overview
llm_convert.py is a tool for converting large language models (LLMs) into the bmodel format. It converts the original model weights directly to bmodel, enabling efficient inference on chip platforms such as BM1684X, BM1688, and CV186AH.
Currently supported LLM types include qwen2 and llama, for example, Qwen2-7B-Instruct, Llama-2-7b-chat-hf, etc.
10.2. Command-Line Arguments
Below is an explanation of the command-line arguments supported by this tool:
-m, --model_path (string, required): Specify the path to the original model weights, for example ./Qwen2-7B-Instruct.
-s, --seq_length (integer, required): Specify the sequence length to be used during the conversion.
-q, --quantize (string, required): Specify the quantization type for the bmodel. You must choose one of: bf16, w8bf16, w4bf16, f16, w8f16, w4f16.
-g, --q_group_size (integer, default: 64): When using the W4A16 quantization mode, this sets the group size for quantization.
-c, --chip (string, default: bm1684x): Specify the chip platform for generating the bmodel. Supported options are bm1684x, bm1688, and cv186ah.
--num_device (integer, default: 1): Specify the number of devices for bmodel deployment.
--num_core (integer, default: 0): Specify the number of cores used for bmodel deployment, where 0 indicates using the maximum number of cores.
--symmetric: Set this flag to use symmetric quantization.
--embedding_disk: Set this flag to export the word_embedding as a binary file and run it on the CPU.
--max_pixels (integer): For multimodal models such as qwen2.5vl, specifies the maximum image size. For example, 672,896 indicates an image of 672x896, while 602112 specifies the maximum number of pixels.
-o, --out_dir (string, default: ./tmp): Specify the output directory for the generated bmodel files.
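The flags above can be combined as needed. As an illustration only (the model path and sequence length below are hypothetical, not values taken from this guide), a conversion that deploys across two devices with symmetric quantization and the word embedding kept on disk might look like:
llm_convert.py -m /workspace/Llama-2-7b-chat-hf -s 512 -q w4bf16 -c bm1684x --num_device 2 --symmetric --embedding_disk -o llama2_7b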
10.3. Example Usage
Assume you need to convert a large model located at /workspace/Qwen2-7B-Instruct into a bmodel for the bm1684x platform, using a sequence length of 384 and the w4bf16 quantization type, with a group size of 128, and store the output files in the directory qwen2_7b. To do this, first download Qwen2-7B-Instruct locally from Hugging Face, then run:
llm_convert.py -m /workspace/Qwen2-7B-Instruct -s 384 -q w4bf16 -g 128 -c bm1684x -o qwen2_7b
Note: If you encounter an error indicating that transformers is not found, install it with the command below (install any other missing pip packages in the same way):
pip3 install transformers
The tool also supports AWQ and GPTQ models, such as Qwen2.5-0.5B-Instruct-AWQ and Qwen2.5-0.5B-Instruct-GPTQ-Int4. The conversion commands are as follows:
llm_convert.py -m /workspace/Qwen2.5-0.5B-Instruct-AWQ -s 384 -q w4bf16 -c bm1684x -o qwen2.5_0.5b
llm_convert.py -m /workspace/Qwen2.5-0.5B-Instruct-GPTQ-Int4 -s 384 -q w4bf16 -c bm1684x -o qwen2.5_0.5b
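For multimodal models, the --max_pixels option described in the argument list can be added. The following is an illustrative sketch only; the Qwen2.5-VL checkpoint path and sequence length are assumptions, not values taken from this guide:
llm_convert.py -m /workspace/Qwen2.5-VL-3B-Instruct -s 2048 -q w4bf16 -c bm1684x --max_pixels 672,896 -o qwen2.5vl_3b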