21. LLMC Guidance

21.1. TPU-MLIR weight-only quantization

TPU-MLIR supports weight-only quantization for large models, utilizing the RTN (round to nearest) quantization algorithm with a quantization granularity of per-channel or per-group. The specific quantization configurations are as follows:

Table 21.1 weight-only quantization parameters

bit    symmetric    granularity                  group_size
4      False        per-channel or per-group     -1 or 64 (default)
8      True         per-channel                  -1
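
To make the RTN scheme above concrete, the following is a minimal NumPy sketch of round-to-nearest weight quantization followed by dequantization. It is an illustration only, not the TPU-MLIR implementation; it assumes the input-channel count is divisible by group_size and skips edge cases such as constant groups:

import numpy as np

def rtn_fake_quant(w, bit=4, symmetric=False, group_size=64):
    """RTN (round-to-nearest) fake-quantize a 2-D weight matrix.

    group_size = -1 -> per-channel (one scale per output row);
    group_size > 0  -> per-group (one scale per group_size input weights).
    """
    orig_shape = w.shape
    if group_size > 0:
        # per-group: assumes in_channels is divisible by group_size
        w = w.reshape(-1, group_size)
    if symmetric:
        # e.g. 8-bit symmetric: zero point fixed at 0, integer range [-128, 127]
        qmax = 2 ** (bit - 1) - 1
        scale = np.abs(w).max(axis=1, keepdims=True) / qmax
        scale = np.maximum(scale, 1e-8)          # avoid division by zero
        q = np.clip(np.round(w / scale), -qmax - 1, qmax)
        deq = q * scale
    else:
        # e.g. 4-bit asymmetric: per-row scale and zero point, integer range [0, 15]
        qmax = 2 ** bit - 1
        wmin = w.min(axis=1, keepdims=True)
        wmax = w.max(axis=1, keepdims=True)
        scale = np.maximum((wmax - wmin) / qmax, 1e-8)
        zp = np.round(-wmin / scale)
        q = np.clip(np.round(w / scale) + zp, 0, qmax)
        deq = (q - zp) * scale
    return deq.reshape(orig_shape)

w = np.random.randn(128, 512).astype(np.float32)
w4 = rtn_fake_quant(w, bit=4, symmetric=False, group_size=64)   # 4-bit row of the table
w8 = rtn_fake_quant(w, bit=8, symmetric=True, group_size=-1)    # 8-bit row of the table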

The RTN quantization algorithm is straightforward and efficient, but it also has some limitations. In scenarios that require higher model accuracy, models quantized using the RTN algorithm may not meet the precision requirements. In such cases, it is necessary to utilize the large model quantization tool llmc_tpu to further enhance accuracy.

21.2. llmc_tpu

This project originates from ModelTC/llmc. ModelTC/llmc is an excellent project specifically designed for compressing Large Language Models (LLMs). It leverages state-of-the-art compression algorithms to enhance efficiency and reduce model size without compromising prediction accuracy. If you want to learn more about the llmc project, please visit https://github.com/ModelTC/llmc.

This project is based on ModelTC/llmc with some customized modifications to support the Sophgo processor.

21.2.1. Environment Setup

  1. Download This Project

git clone git@github.com:sophgo/llmc-tpu.git

  2. Prepare the LLM or VLM Model for Quantization. Place the model you need to quantize in the directory at the same level as llmc-tpu.

For Example: Download Qwen2-VL-2B-Instruct from Huggingface

git lfs install
git clone git@hf.co:Qwen/Qwen2-VL-2B-Instruct

  3. Download Docker and Set Up a Docker Container

Pull the Docker image:

docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-latest

Create the container. llmc_test is just a name; you can set your own:

docker run --privileged --name llmc_test -it --shm-size 64G --gpus all -v $PWD:/workspace registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-latest

  4. Enter the llmc-tpu Directory and Install Dependencies

Note that at this point you are already inside the Docker container.

cd /workspace/llmc-tpu
pip3 install -r requirements.txt
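
As an optional sanity check (a suggestion of this guide, not a required step), you can verify from inside the container that PyTorch sees the GPUs passed through with --gpus all before starting a quantization run:

# Optional sanity check, run inside the container.
import torch

print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())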

21.2.2. tpu Directory

├── README.md
├── data
│   ├── LLM
│   │   ├── cali      # Calibration dataset
│   │   └── eval      # Eval dataset
│   └── VLM
│       ├── cali
│       └── eval
├── config
│   ├── LLM           # LLM quant configs
│   │   ├── Awq.yml   # Awq config
│   │   └── GPTQ.yml  # GPTQ config
│   └── VLM           # VLM quant configs
│       └── Awq.yml   # Awq config
├── example.yml       # Quantization parameters reference example
├── llm_quant.py      # Quantization main program
└── run_llmc.sh       # Quantization run script

21.2.3. Operating Steps

21.2.3.1. [Phase 1] Prepare Calibration and Eval Datasets

  • Note 1: The calibration dataset can be an open-source dataset or a business dataset. If the model has been fine-tuned on a downstream business dataset, a business dataset needs to be selected for calibration.

  • Note 2: The eval dataset is primarily used to evaluate the accuracy of the current model, either the pre-trained (pretrain) model or the quantized (fake_quant) model.

You can choose to use an open-source dataset or a business dataset.

21.2.3.1.1. open-source dataset

If a business dataset is available, it is preferable. Otherwise, you can use an open-source dataset as follows:

Table 21.2 Dataset Selection

Model Type    Quantization Algorithm    Calibration Dataset (Open-source)    Eval Dataset (Open-source)
LLM           Awq                       pileval                              wikitext2
LLM           GPTQ                      wikitext2                            wikitext2
VLM           Awq                       MME                                  MME

The selection of the calibration dataset depends on the model type and the quantization algorithm. For example, if the model being quantized is an LLM and the Awq algorithm is used, the pileval dataset is typically recommended as the calibration set. For these open-source datasets, this document provides the corresponding download commands, which can be executed to download the respective datasets. The specific steps are as follows: open the llmc-tpu/tools directory, which contains two Python scripts, download_calib_dataset.py and download_eval_dataset.py, used to download the calibration and eval datasets respectively.

If it is a VLM model, it is recommended to use the Awq algorithm. The commands to download the datasets are as follows:

cd /workspace/llmc-tpu

  • Calibration Dataset

python3 tools/download_calib_dataset.py --dataset_name MME --save_path tpu/data/VLM/cali

  • Eval Dataset

python3 tools/download_eval_dataset.py --dataset_name MME --save_path tpu/data/VLM/eval

If it is an LLM model, it is recommended to use the Awq algorithm. The commands to download the datasets are as follows:

cd /workspace/llmc-tpu

  • Calibration Dataset

python3 tools/download_calib_dataset.py --dataset_name pileval --save_path tpu/data/LLM/cali

  • Eval Dataset

python3 tools/download_eval_dataset.py --dataset_name wikitext2 --save_path tpu/data/LLM/eval

21.2.3.1.2. business dataset
  1. Business calibration dataset

If the model has been fine-tuned on a downstream business dataset, it is generally recommended to select the business dataset as the calibration set.

  • If it is an LLM, simply place the business dataset in the aforementioned LLM/cali directory. As for the format, write each data entry as a separate line in a .txt file, with each line representing one text sample (see the sketch after this list). With this configuration, calibration can be performed with a custom dataset.

  • If it is a VLM, simply place the business dataset in the aforementioned VLM/cali directory. For the format, refer to VLM/cali/general_custom_data and choose the variant that meets your needs. Note that the final JSON file must be named samples.json.
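For the LLM case, here is a minimal sketch of preparing such a .txt calibration file; the file name and sample sentences are hypothetical placeholders for your business data:

# Write one text sample per line into the LLM calibration directory.
samples = [
    "How do I reset my account password?",
    "Summarize the warranty terms of product X.",
]
with open("tpu/data/LLM/cali/business_cali.txt", "w", encoding="utf-8") as f:
    for s in samples:
        # keep each sample on a single line, as the .txt format requires
        f.write(s.replace("\n", " ").strip() + "\n")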

  2. Business eval dataset

If the model has been calibrated with a downstream business dataset, it is generally recommended to also use a business dataset as the eval set.

  • If it is an LLM, simply place the business dataset in the aforementioned LLM/eval directory. The format is the same as for calibration: each data entry is a separate line of text in a .txt file. With this configuration, evaluation can be performed on a custom dataset.

  • If it is a VLM, simply place the business dataset in the aforementioned VLM/eval directory. For the format, refer to VLM/cali/general_custom_data and choose the variant that meets your needs. Note that the final JSON file must be named samples.json.

21.2.3.2. [Phase 2] Configure the Quantization Configuration File

  • Note: The quantization configuration file includes the settings required for the quantization process. Users can select configurations according to their needs. Additionally, to align with the TPU hardware configuration, certain parameters may have restrictions. Please refer to the detailed explanation below for more information.

21.2.3.2.1. Configuration File Parameter Description
base:
    seed: &seed 42
model:
    type: Qwen2VL # Set the model name. For specific supported models, refer to the llmc/models directory.
    path: /workspace/Qwen2-VL-2B-Instruct    # Set the model weights path; change to your desired model
    torch_dtype: auto
calib:
    name: mme   # Set to the actual calibration dataset name, such as mme, pileval, etc.
    download: False
    path: /workspace/llmc-tpu/tpu/data/VLM/cali/MME  # Set the calibration dataset path
    n_samples: 128
    bs: 1
    seq_len: 512
    preproc: pileval_awq
    seed: *seed
eval:
    eval_pos: [pretrain, fake_quant]
    name: mme  # Set to the actual eval dataset name, such as mme, wikitext2, etc.
    download: False
    path: /workspace/llmc-tpu/tpu/data/VLM/eval/MME # Set the eval dataset path
    bs: 1
    seq_len: 2048
quant:
    method: Awq
    quant_objects: [language] # By default, only quantize the LLM part. If you want to quantize the ViT part, set it to [vision, language].
    weight:
        bit: 4 # Set to the desired quantization bit width; supports 4 or 8
        symmetric: False # Set to False for 4-bit and True for 8-bit
        granularity: per_group # Set to per_group for 4-bit and per_channel for 8-bit
        group_size: 64 # Set to 64 for 4-bit (corresponding to TPU-MLIR); set to -1 for 8-bit
    special:
        trans: True
        trans_version: v2
        weight_clip: True
        clip_sym: True
save:
    save_trans: True       # When set to True, the adjusted floating-point weights are saved.
    save_path: ./save_path # Set the path to save the weights
run:
    task_name: awq_w_only
    task_type: VLM   # Set to VLM or LLM

The above is a complete config file, constructed using the Awq algorithm as an example. To simplify setup, users can copy it directly into their own config and modify only the annotated parameters; an optional way to sanity-check the result is sketched below.
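Before launching a long quantization run, you can quickly confirm that the edited file still parses and carries the intended weight settings. This check is a suggestion of this guide, not part of llmc-tpu, and assumes PyYAML is installed (it ships with the quantization environment's dependencies):

# Optional: parse the edited config and inspect the weight-quantization fields.
import yaml

with open("tpu/example.yml") as f:
    cfg = yaml.safe_load(f)

# Expect e.g. {'bit': 4, 'symmetric': False, 'granularity': 'per_group', 'group_size': 64}
print(cfg["quant"]["weight"])
print(cfg["run"]["task_type"])  # VLM or LLM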

Below are detailed explanations of some important parameters:

Table 21.3 Introduction of Relevant Parameters

Parameter    Description
model        Model name. The supported models are listed in the llmc/models directory; new models can be added as llmc/models/xxxx.py.
calib        Parameters related to the calibration set.
eval         Parameters related to the eval set.
quant        Quantization parameters. The Awq algorithm is generally recommended. For quant_objects, typically select language. For the weight quantization parameters, refer to the table below.

To align with TPU-MLIR, the configuration of weight quantization related parameters is as follows:

Table 21.4 weight-only quantization parameters

bit    symmetric    granularity                  group_size
4      False        per-channel or per-group     -1 or 64 (default)
8      True         per-channel                  -1
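
As an illustration, the table can be encoded as a small helper that rejects combinations TPU-MLIR does not accept. This function is hypothetical, not part of llmc-tpu:

# Hypothetical helper: check a weight-quantization config against the
# TPU-MLIR-compatible combinations in Table 21.4.
def is_tpu_compatible(bit, symmetric, granularity, group_size):
    if bit == 4:
        # 4-bit: asymmetric, per-channel (group_size -1) or per-group (64)
        return not symmetric and (granularity, group_size) in (
            ("per_channel", -1),
            ("per_group", 64),
        )
    if bit == 8:
        # 8-bit: symmetric per-channel only
        return symmetric and (granularity, group_size) == ("per_channel", -1)
    return False

assert is_tpu_compatible(4, False, "per_group", 64)
assert is_tpu_compatible(8, True, "per_channel", -1)
assert not is_tpu_compatible(4, True, "per_group", 64)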

21.2.3.3. [Phase 3] Execute the Quantization Algorithm

cd /workspace/llmc-tpu
python3 tpu/llm_quant.py --llmc_tpu_path . --config_path ./tpu/example.yml
  • config_path is the path of the quantization configuration file, and llmc_tpu_path is the path of the current llmc-tpu directory.