
Using the MiniCPM-V Model

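Preface
1. Downloading and choosing model files
2. Environment setup
3. Fine-tuning
- 3.1 QLoRA fine-tuning of MiniCPM-V-int4
- 3.2 LoRA fine-tuning of MiniCPM-V
- 3.3 merge_lora
- 3.4 int4 quantization after LoRA fine-tuning
4. Inference
- 4.1 huggingface API
- 4.2 swift API
  - (A) swift (no batch inference support)
  - (B) swift with vLLM
- 4.3 vLLM
  - (A) Single inference
  - (B) batch inference
5. References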

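Preface

Having previously studied the architectures of several common multimodal models, I am now learning to use the MiniCPM-V-2.6 model. This post records that process; corrections and feedback are welcome.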

Leaderboard figures are for reference only. In my own tests, Qwen2-VL performed slightly better and MiniCPM-V-2.6 slightly worse.

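1. Downloading and choosing model files

Download the model from ModelScope. The int4 model needs about 7-9 GB of GPU memory for inference, and its quality is very close to that of the full-precision model. The full-precision model may be somewhat slower to download and run, so int4 is sufficient. Int4 inference is also fast: under 0.5 s per image on average, and roughly 4 images per second with 4 processes running at the same time.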

# Download the model
from modelscope import snapshot_download
model_dir = snapshot_download('OpenBMB/MiniCPM-V-2_6-int4', cache_dir='path where the model should be stored')
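Running several inference processes side by side, as mentioned above, needs some way to divide the images among them. A minimal sketch of one option, round-robin sharding, follows; the helper name shard_round_robin and the file names are made up for illustration, and how each shard is handed to a worker process is left open.

```python
from typing import List, Sequence


def shard_round_robin(items: Sequence[str], num_workers: int) -> List[List[str]]:
    """Assign item i to worker i % num_workers."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return [list(items[w::num_workers]) for w in range(num_workers)]


# Hypothetical file names, for illustration only
image_paths = [f"img_{i:03d}.jpg" for i in range(10)]
shards = shard_round_robin(image_paths, num_workers=4)
for worker_id, shard in enumerate(shards):
    print(worker_id, shard)
```

Each worker would then load its own copy of the int4 model and process only its shard, so memory use grows with the number of processes.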

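2. Environment setup

A few points to note:

- The official Feishu documentation says DeepSpeed must be installed manually for fine-tuning. For reasons I could not determine, the build from manually downloaded source did not run for me, whereas a recent release installed with pip install deepspeed (for example 0.15.0) fine-tuned without errors.
- Install swift directly with pip install 'ms-swift[llm]' -U rather than from source. With a source install, deleting the source tree breaks the environment, and it is also less convenient.
- To share the environment with Qwen2-VL, pin the version: pip install transformers==4.46.1
- For flash-attn, pip install flash-attn usually just works; if your network is slow, download the whl file from the official GitHub repository first.
- To use vLLM, run pip install vllm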
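Since several of these notes depend on which version ends up installed, it can help to print what the environment actually contains. The sketch below uses only the standard library; the package list simply mirrors the pip names above, and installed_versions is a name chosen for this example.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional

# pip distribution names mentioned in the notes above
PACKAGES = ("deepspeed", "ms-swift", "transformers", "flash-attn", "vllm")


def installed_versions(packages: Iterable[str] = PACKAGES) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions()
for name, version in report.items():
    print(f"{name:<14}{version or 'not installed'}")
```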

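3. Fine-tuning

There are several options: (1) QLoRA fine-tuning of MiniCPM-V-int4; (2) LoRA fine-tuning of MiniCPM-V; (3) LoRA fine-tuning of MiniCPM-V, followed by quantization to int4 (section 3.4).

In my tests the accuracy gap between (1), (2), and (3) was small; in some cases (2) was slightly better than (3).
For GPUs I tried the RTX-8000 and the A100 (40 GB and 80 GB). The A100 is clearly preferable: a run that takes most of a day on the RTX-8000 finishes in 30 to 60 minutes on the A100.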

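3.1 QLoRA fine-tuning of MiniCPM-V-int4

For QLoRA fine-tuning, set --tune_vision to false and --qlora to true.
With ds_config_zero2 and a batch size of 1, QLoRA uses roughly 30-40 GB of GPU memory. Use ds_config_zero2 if you have enough memory; otherwise use ds_config_zero3, which trains more slowly.
If a data fetch error appears during training, check the file paths and the format of the data JSON file; there should be no other likely cause.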
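When a data fetch error does show up, a quick pre-flight check of the JSON file can narrow it down. The sketch below assumes the layout of the repository's demo data: a top-level list whose records carry an image field (one path, or a dict of paths) and a conversations list of role/content turns. The function name find_data_problems and the exact field names are assumptions based on that layout, not an official validator.

```python
import json
import os
import tempfile
from typing import List


def find_data_problems(json_path: str) -> List[str]:
    """Return human-readable problems found in a training JSON file."""
    try:
        with open(json_path, "r", encoding="utf-8") as f:
            records = json.load(f)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"cannot read {json_path}: {exc}"]
    if not isinstance(records, list):
        return ["top level must be a list of conversation records"]

    problems = []
    for idx, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"record {idx}: expected an object")
            continue
        # Assumed field: "image" is one path or a dict of placeholder -> path
        image = record.get("image")
        paths = list(image.values()) if isinstance(image, dict) else [image]
        for path in paths:
            if not isinstance(path, str) or not os.path.isfile(path):
                problems.append(f"record {idx}: image not found: {path!r}")
        # Assumed field: "conversations" is a non-empty list of role/content turns
        turns = record.get("conversations")
        if not isinstance(turns, list) or not turns:
            problems.append(f"record {idx}: 'conversations' must be a non-empty list")
            continue
        for t_idx, turn in enumerate(turns):
            if not isinstance(turn, dict) or "role" not in turn or "content" not in turn:
                problems.append(f"record {idx}, turn {t_idx}: needs 'role' and 'content'")
    return problems


# Self-contained demo on throwaway files (all names here are made up)
with tempfile.TemporaryDirectory() as tmp:
    image_path = os.path.join(tmp, "demo.jpg")
    open(image_path, "wb").close()
    good = [{"id": "0", "image": image_path, "conversations": [
        {"role": "user", "content": "<image>\nWhat is this?"},
        {"role": "assistant", "content": "A demo image."}]}]
    bad = [{"id": "1", "image": os.path.join(tmp, "missing.jpg"), "conversations": []}]
    good_file = os.path.join(tmp, "good.json")
    bad_file = os.path.join(tmp, "bad.json")
    for name, data in ((good_file, good), (bad_file, bad)):
        with open(name, "w", encoding="utf-8") as f:
            json.dump(data, f)
    good_problems = find_data_problems(good_file)
    bad_problems = find_data_problems(bad_file)

print(good_problems)
print(bad_problems)
```

Running it against the real training file before launching a job costs a few seconds and catches relative image paths that only resolve from a different working directory.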

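#!/bin/bash
GPUS_PER_NODE=1   # number of GPUs per node on your machine; 8 for a single 8-GPU server
NNODES=1          # number of nodes; 1 for a single server
NODE_RANK=0       # rank of the server used for training
MASTER_ADDR=localhost
MASTER_PORT=6001

MODEL="/root/ld/ld_model_pretrained/Minicpmv2_6"   # local model path, or openbmb/MiniCPM-V-2.5
# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="/root/ld/ld_project/MiniCPM-V/finetune/mllm_demo.json"   # path to the training data file
LLM_TYPE="qwen2"   # if use openbmb/MiniCPM-V-2, please set LLM_TYPE=minicpm

export NCCL_P2P_DISABLE=1   # remove this line on GPUs that support NCCL P2P, such as the A100
export NCCL_IB_DISABLE=1    # remove this line on GPUs such as the A100

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT
"

# The arguments live in a bash array so that each one can carry a comment;
# a comment placed after a trailing backslash would break the line continuation.
TRAIN_ARGS=(
    --model_name_or_path "$MODEL"
    --llm_type "$LLM_TYPE"
    --data_path "$DATA"
    --remove_unused_columns false
    --label_names "labels"             # data construction, do not change
    --prediction_loss_only false
    --bf16 false                       # train in bf16; can be enabled on 4090, A100, H100, etc.
    --bf16_full_eval false             # evaluate in bf16
    --fp16 true                        # train in fp16
    --fp16_full_eval true              # evaluate in fp16
    --do_train                         # run training
    --tune_vision false                # whether to fine-tune the SigLIP (ViT) module; false for QLoRA, as noted above
    --tune_llm false                   # whether to fine-tune the LLM module
    --use_lora true                    # whether to use LoRA fine-tuning
    # For QLoRA, also turn on the QLoRA switch described above (check finetune.py for the exact flag spelling)
    --lora_target_modules "llm\..*layers\.\d+\.self_attn\.(q_proj|k_proj|v_proj)"   # layers that receive LoRA; this is a regular expression, best left unchanged
    --model_max_length 2048            # maximum training sequence length
    --max_slice_nums 9                 # maximum number of image slices
    --max_steps 1000                   # maximum number of training steps
    --output_dir output/output_minicpmv2_lora    # where the LoRA weights are saved
    --logging_dir output/output_minicpmv2_lora   # where logs are saved
    --logging_strategy "steps"         # logging strategy (epoch is also possible)
    --per_device_train_batch_size 2    # batch size per GPU
    --gradient_accumulation_steps 1    # gradient accumulation; raise it when memory is tight so per_device_train_batch_size can be lowered
    --save_strategy "steps"            # save strategy (epoch is also possible); works together with save_steps
    --save_steps 1000                  # save every 1000 steps
    --save_total_limit 1               # maximum number of checkpoints kept
    --learning_rate 1e-6               # learning rate
    --weight_decay 0.1                 # weight decay (regularization)
    --adam_beta2 0.95
    --warmup_ratio 0.01                # warmup fraction: total training steps * warmup_ratio = warmup steps
    --lr_scheduler_type "cosine"       # learning-rate scheduler
    --logging_steps 10
    --gradient_checkpointing false     # gradient checkpointing; enabling it is recommended and greatly reduces memory use
    --deepspeed ds_config_zero2.json   # ds_config_zero2.json is recommended when memory is sufficient; otherwise use ds_config_zero3.json
)

# torchrun is started through its module, torch.distributed.run; "python -m torchrun" does not resolve to a module.
.conda/envs/yourenv/python -m torch.distributed.run $DISTRIBUTED_ARGS finetune.py "${TRAIN_ARGS[@]}"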
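The batch and warmup settings in the script interact, so it is worth working the arithmetic once. The sketch below plugs in the values from the script above (1 GPU, 1 node, per-device batch size 2, no gradient accumulation, 1000 steps, warmup ratio 0.01). Rounding the warmup step count up is my assumption about how the trainer treats fractional values; training_schedule is a name chosen for this example.

```python
import math
from typing import Dict


def training_schedule(gpus_per_node: int, nnodes: int, per_device_batch: int,
                      grad_accum: int, max_steps: int, warmup_ratio: float) -> Dict[str, int]:
    """Derive effective batch size, warmup steps, and samples consumed."""
    world_size = gpus_per_node * nnodes
    effective_batch = world_size * per_device_batch * grad_accum
    return {
        "world_size": world_size,
        "effective_batch": effective_batch,
        # warmup steps = total training steps * warmup_ratio (rounded up, by assumption)
        "warmup_steps": math.ceil(max_steps * warmup_ratio),
        "samples_seen": effective_batch * max_steps,
    }


schedule = training_schedule(gpus_per_node=1, nnodes=1, per_device_batch=2,
                             grad_accum=1, max_steps=1000, warmup_ratio=0.01)
print(schedule)
```

With these settings one optimizer step sees 2 samples, the first 10 steps are warmup, and the whole run consumes 2000 samples. Halving per_device_train_batch_size while doubling gradient_accumulation_steps keeps the effective batch unchanged.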

3.2 LoRA fine-tuning of MiniCPM-V

With ds_config_zero2 and a batch size of 1, LoRA uses roughly 77-79 GB of GPU memory. Use ds_config_zero2 if you have enough memory; otherwise use ds_config_zero3, which trains more slowly.
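The rule of thumb for choosing between the two DeepSpeed configs can be written down as a small lookup. This is only a sketch built from the memory figures reported in this post (upper ends of the observed ranges at batch size 1); the function name and the idea of comparing against a single threshold are my simplifications, and real usage varies with sequence length and image slicing.

```python
# Peak GPU memory observed in this post with ds_config_zero2 and batch size 1 (GB)
OBSERVED_PEAK_GB = {"qlora": 40, "lora": 79}


def suggest_deepspeed_config(method: str, gpu_memory_gb: float) -> str:
    """Pick zero2 when the card covers the observed peak, otherwise zero3."""
    peak = OBSERVED_PEAK_GB[method]
    if gpu_memory_gb >= peak:
        return "ds_config_zero2.json"
    return "ds_config_zero3.json"  # slower to train, but needs less memory per GPU


for method, memory_gb in (("qlora", 48), ("lora", 40), ("lora", 80)):
    print(method, memory_gb, suggest_deepspeed_config(method, memory_gb))
```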

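3.3 merge_lora

After running the copy step from the official Feishu documentation, check that everything was actually copied. Temporary entries such as asset, created under the original model directory, commonly cause an error, after which image_processing_minicpmv.py, preprocessor_config.json, and processing_minicpmv.py are left uncopied.

Note that if the merged model is saved as .bin files rather than in safetensors format, the AWQ quantization from the official Feishu documentation will fail later, because AWQ quantization requires the input model to be stored in safetensors format.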

from peft import PeftModel
from transformers import AutoModel, AutoTokenizer
import os
import shutil

model_type = "path to the original minicpm-v model"  # Local model path or huggingface id
path_to_adapter = "path to the saved LoRA output"  # Path to the saved LoRA adapter
merge_path = "path for the merged model"  # Path to save the merged model


# Make sure no file of the original model is left out of merge_path
def copy_files_not_in_B(A_path, B_path):
    """
    Copies files from directory A to directory B if they exist in A but not in B.

    :param A_path: Path to the source directory (A).
    :param B_path: Path to the destination directory (B).
    """
    # Make sure the paths exist
    if not os.path.exists(A_path):
        raise FileNotFoundError(f"The directory {A_path} does not exist.")
    if not os.path.exists(B_path):
        os.makedirs(B_path)

    # Collect all non-weight files in A
    files_in_A = os.listdir(A_path)
    files_in_A = set([file for file in files_in_A if not (".bin" in file or "safetensors" in file)])
    # List all files in directory B
    files_in_B = set(os.listdir(B_path))

    # Find the files that exist in A but not in B
    files_to_copy = files_in_A - files_in_B

    # Copy those files into B
    for file in files_to_copy:
        src_file = os.path.join(A_path, file)
        dst_file = os.path.join(B_path, file)
        # Test the full source path (a bare file name would be resolved against the
        # working directory) and skip sub-directories, which shutil.copy2 cannot copy
        if os.path.isfile(src_file):
            shutil.copy2(src_file, dst_file)


# Load the original model
model = AutoModel.from_pretrained(
    model_type,
    trust_remote_code=True
)

# Load the LoRA module into the original model
lora_model = PeftModel.from_pretrained(
    model,
    path_to_adapter,
    device_map="auto",
    trust_remote_code=True
).eval()

# Merge the loaded LoRA module into the original model
merge_model = lora_model.merge_and_unload()

# Save the merged model (safe_serialization=True writes safetensors)
merge_model.save_pretrained(merge_path, safe_serialization=True)

# Load the tokenizer and save it next to the merged weights
tokenizer = AutoTokenizer.from_pretrained(model_type, trust_remote_code=True)
tokenizer.save_pretrained(merge_path)

copy_files_not_in_B(model_type, merge_path)
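Because a partial copy is the usual failure here, a small check on the merged directory is worth running before moving on to quantization. The sketch below looks for the three files named above and reports which weight format is present; check_merged_dir is a made-up helper, and the demo directory is a throwaway stand-in for a real merge output.

```python
import os
import tempfile
from typing import Dict

# File names taken from the note above; extend the tuple if your model ships more
REQUIRED_FILES = (
    "image_processing_minicpmv.py",
    "preprocessor_config.json",
    "processing_minicpmv.py",
)


def check_merged_dir(merge_path: str) -> Dict[str, object]:
    """Report missing support files and which weight formats are present."""
    names = os.listdir(merge_path)
    return {
        "missing": [name for name in REQUIRED_FILES if name not in names],
        "has_safetensors": any(name.endswith(".safetensors") for name in names),
        "has_bin": any(name.endswith(".bin") for name in names),
    }


# Demo on a throwaway directory that imitates an incomplete merge output
with tempfile.TemporaryDirectory() as tmp:
    for name in ("preprocessor_config.json", "model.safetensors"):
        open(os.path.join(tmp, name), "w").close()
    report = check_merged_dir(tmp)

print(report)
```

An empty missing list together with has_safetensors set to True is the state the AWQ step expects.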

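3.4 int4 quantization after LoRA fine-tuning

The caveats are the same as for merge_lora. If you use bnb quantization as described in the official Feishu documentation, confirm after quantization that all files are present, and copy over any that are missing.
For AWQ quantization, it is best to create a fresh environment with conda create dedicated to it, set up as described in the official Feishu documentation. As long as the model is stored in safetensors format, AWQ quantization can then be run.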

from datasets import load_dataset
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import os
import shutil

model_path = 'path to minicpm-v-2_6'
quant_path = 'path for the quantized model'
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path, trust_remote_code=True, device_map='cuda')
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)


def copy_files_not_in_B(A_path, B_path):
    """
    Copies files from directory A to directory B if they exist in A but not in B.

    :param A_path: Path to the source directory (A).
    :param B_path: Path to the destination directory (B).
    """
    # Make sure the paths exist
    if not os.path.exists(A_path):
        raise FileNotFoundError(f"The directory {A_path} does not exist.")
    if not os.path.exists(B_path):
        os.makedirs(B_path)

    # Collect all non-weight files in A
    files_in_A = os.listdir(A_path)
    files_in_A = set([file for file in files_in_A if not (".bin" in file or "safetensors" in file)])
    # List all files in directory B
    files_in_B = set(os.listdir(B_path))

    # Find the files that exist in A but not in B
    files_to_copy = files_in_A - files_in_B

    # Copy those files into B, skipping sub-directories, which shutil.copy2 cannot copy
    for file in files_to_copy:
        src_file = os.path.join(A_path, file)
        dst_file = os.path.join(B_path, file)
        if os.path.isfile(src_file):
            shutil.copy2(src_file, dst_file)


# Define data loading methods
def load_alpaca():
    # data = load_dataset('/root/ld/pull_request/MiniCPM/quantize/quantize_data/alpaca', split="train")
    data = load_dataset('tatsu-lab/alpaca', split="train")

    # concatenate data
    def concatenate_data(x):
        msgs = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": x['input']},
            # the reference answer is the assistant turn
            {"role": "assistant", "content": x['output']},
        ]
        data = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
        return {"text": data}

    concatenated = data.map(concatenate_data)
    return [text for text in concatenated["text"]][:1000]


def load_wikitext():
    data = load_dataset('wikitext', 'wikitext-2-raw-v1', split="train")
    return [text for text in data["text"] if text.strip() != '' and len(text.split(' ')) > 20]


# Quantize
model.quantize(tokenizer, quant_config=quant_config, calib_data=load_alpaca())

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')
copy_files_not_in_B(model_path, quant_path)

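4. Inference

4.1 huggingface API

To run batch inference through huggingface, find the model's modeling_minicpmv.py file, locate the chat() function, and comment out the else branch that follows if batched is False. With the int4 model on an A100-80GB, the batch size can be raised to 26.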

    def chat(
        self,
        image,
        msgs,
        tokenizer,
        processor=None,
        vision_hidden_states=None,
        max_new_tokens=2048,
        min_new_tokens=0,
        sampling=True,
        max_inp_length=8192,
        system_prompt='',
        stream=False,
        max_slice_nums=None,
        use_image_id=None,
        **kwargs
    ):
        if isinstance(msgs[0], list):
            batched = True
        else:
            batched = False
        msgs_list = msgs
        images_list = image

        if batched is False:
            images_list, msgs_list = [images_list], [msgs_list]
        #else:
        #    assert images_list is None, "Please integrate image to msgs when using batch inference."
        #    images_list = [None] * len(msgs_list)
        # assert len(images_list) == len(msgs_list), "The batch dim of images_list and msgs_list should be the same."

        if processor is None:
            if self.processor is None:
                self.processor = AutoProcessor.from_pretrained(self.config._name_or_path, trust_remote_code=True)
            processor = self.processor

For inference, use the following code:

import torch
from PIL import Image

# model and tokenizer are assumed to be loaded already
images_input_list = []
prompt_input_list = []

prompt = 'What can you see in the image?'
msgs = [{'role': 'user', 'content': prompt}]

img1 = Image.open('AAA.jpg')
img2 = Image.open('BBB.jpg')
images_input_list.append(img1)
images_input_list.append(img2)
prompt_input_list.append(msgs)
prompt_input_list.append(msgs)

# batch inference
with torch.inference_mode():
    res = model.chat(images_input_list, msgs=prompt_input_list, tokenizer=tokenizer, sampling=False, max_new_tokens=30)
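With more images than fit in one call, the two lists have to be cut into chunks no larger than the batch size that fits in memory (26 in the A100-80GB int4 case above). A minimal sketch of that chunking, using plain strings in place of PIL images so it runs on its own; iter_batches is a name chosen for this example, and each yielded pair would be passed to model.chat as in the call above.

```python
from typing import Iterator, List, Sequence, Tuple


def iter_batches(images: Sequence, msgs: Sequence,
                 batch_size: int = 26) -> Iterator[Tuple[List, List]]:
    """Yield aligned (images, msgs) chunks of at most batch_size items."""
    if len(images) != len(msgs):
        raise ValueError("images and msgs must have the same length")
    for start in range(0, len(images), batch_size):
        end = start + batch_size
        yield list(images[start:end]), list(msgs[start:end])


# Stand-ins for 60 images and their message lists
demo_images = [f"image_{i}" for i in range(60)]
demo_msgs = [[{"role": "user", "content": "What can you see in the image?"}] for _ in demo_images]

batch_sizes = [len(batch_images) for batch_images, _ in iter_batches(demo_images, demo_msgs)]
print(batch_sizes)
```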

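flash-attention can be used for acceleration. With a working network connection, run pip install flash-attn, then pass attn_implementation='flash_attention_2' when loading the model:

from transformers import AutoModel

model = AutoModel.from_pretrained('/', trust_remote_code=True, attn_implementation='flash_attention_2')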
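If the same script has to run on machines with and without flash-attn, the attention backend can be chosen at load time. The sketch below only decides on the string; falling back to 'sdpa' is my assumption about a sensible default, and pick_attn_implementation is a name made up for this example.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Use flash_attention_2 when the flash_attn package is importable, else fall back."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"  # assumed fallback backend


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The result would then be passed as attn_implementation=attn_implementation in the from_pretrained call.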

4.2 swift API

(A) swift (no batch inference support)

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
os.environ['MAX_SLICE_NUMS'] = '9'
from swift.llm import (
    get_model_tokenizer, get_template, inference, ModelType,
    get_default_template_type, inference_stream
)
from swift.utils import seed_everything
import torch

model_type = ModelType.minicpm_v_v2_6_chat
template_type = get_default_template_type(model_type)
print(f'template_type: {template_type}')

model_id_or_path = 'model path'
model, tokenizer = get_model_tokenizer(model_type, torch.bfloat16,
                                       model_id_or_path=model_id_or_path,
                                       model_kwargs={'device_map': 'auto'})
model.generation_config.max_new_tokens = 256
template = get_template(template_type, tokenizer)
seed_everything(42)
model.generation_config.do_sample = False

query = """<img>path of the image to run inference on</img>"""
prompt = " What can you see in this image?"
query = query + prompt
response, history = inference(model, template, query)
print(f'query: {query}')
print(f'response: {response}')

The minicpm-v-awq-int4 model cannot be run with swift's vLLM inference, but it can be run with plain vLLM; the speed difference is not particularly large.

(B) swift with vLLM

# Inference with swift
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
os.environ['TENSOR_PARALLEL_SIZE'] = '1'
from swift.llm import (
    ModelType, get_vllm_engine, get_default_template_type,
    get_template, inference_vllm, inference_stream_vllm
)
from swift.utils import seed_everything
import torch

model_type = ModelType.minicpm_v_v2_6_chat
model_id_or_path = 'model path'
llm_engine = get_vllm_engine(model_type, model_id_or_path=model_id_or_path)
template_type = get_default_template_type(model_type)
template = get_template(template_type, llm_engine.hf_tokenizer)
generation_info = {}

query1 = """<img>image path 1</img>"""
query2 = """<img>image path 2</img>"""
# Append the prompt text after each image tag
query1 = query1 + '.......'
query2 = query2 + '.......'
request_list = [{'query': query1}, {'query': query2}]

resp_list = inference_vllm(llm_engine, template, request_list, generation_info=generation_info)
for request, resp in zip(request_list, resp_list):
    print(f"query: {request['query']}")
    print(f"response: {resp['response']}")
print(generation_info)
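Both swift examples build the query by placing the image path inside an img tag and appending the prompt text. A small helper makes that pattern explicit for longer request lists. This is a sketch: build_query and build_request_list are names invented here, the file names are placeholders, and the request format simply mirrors the request_list shown above.

```python
from typing import Dict, List, Sequence, Tuple


def build_query(image_paths: Sequence[str], prompt: str) -> str:
    """Wrap each image path in an <img> tag and append the prompt text."""
    tags = "".join(f"<img>{path}</img>" for path in image_paths)
    return tags + prompt


def build_request_list(pairs: Sequence[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Turn (image_path, prompt) pairs into the list-of-dicts request format."""
    return [{"query": build_query([path], prompt)} for path, prompt in pairs]


# Placeholder paths and prompts, for illustration only
requests = build_request_list([
    ("cat.jpg", "What can you see in this image?"),
    ("dog.jpg", "Describe the animal."),
])
for request in requests:
    print(request["query"])
```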

4.3 vLLM

(A) Single inference

from PIL import Image
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# List of image file paths
IMAGES = [
    "image path",  # local image path
]

# Change this to your quantized AWQ model path, or the original minicpm-v path
MODEL_NAME = 'model path'

# Open and convert the image
image = Image.open(IMAGES[0]).convert("RGB")

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)

# Initialize the language model
llm = LLM(
    model=MODEL_NAME,
    gpu_memory_utilization=0.5,  # 1 means use all GPU memory; lower the value to reduce GPU usage
    trust_remote_code=True,
    max_model_len=2048,  # adjust according to available memory
)

# Build the chat messages
messages = [{'role': 'user', 'content': '(<image>./</image>)\n' + 'Please describe this image'}]

# Apply the chat template to the messages
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Set the stop token IDs
# 2.0
# stop_token_ids = [tokenizer.eos_id]
# 2.5
# stop_token_ids = [tokenizer.eos_id, tokenizer.eot_id]
# 2.6
stop_tokens = ['<|im_end|>', '<|endoftext|>']
stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]

# Set the generation parameters
# (use_beam_search and best_of follow the vLLM release used for this post;
# other releases may not accept them)
sampling_params = SamplingParams(
    stop_token_ids=stop_token_ids,
    # temperature=0.7,
    # top_p=0.8,
    # top_k=100,
    # seed=3472,
    max_tokens=1024,
    # min_tokens=150,
    temperature=0,
    use_beam_search=True,
    # length_penalty=1.2,
    best_of=3,
)

# Get the model output
outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": image},  # the PIL image opened above
}, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

(B) batch inference

from PIL import Image
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# All input images
IMAGES = [
    "image path 1",
    "image path 2",
]

# MODEL_NAME = "HwwwH/MiniCPM-V-2" # If you use the local MiniCPM-V-2 model, please update the model code from HwwwH/MiniCPM-V-2
# If using a local model, please update the model code to the latest
MODEL_NAME = 'model path'

images = [Image.open(i).convert("RGB") for i in IMAGES]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
llm = LLM(model=MODEL_NAME, gpu_memory_utilization=1, trust_remote_code=True, max_model_len=1024)

prompt = 'What can you see in this image?'
messages = [{'role': 'user', 'content': '(<image>./</image>)\n' + prompt}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Construct multiple inputs. This example shares one prompt, but the prompts do not have to be shared.
inputs = [{"prompt": prompt, "multi_modal_data": {"image": i}} for i in images]

# 2.0
# stop_token_ids = [tokenizer.eos_id]
# 2.5
# stop_token_ids = [tokenizer.eos_id, tokenizer.eot_id]
# 2.6
stop_tokens = ['<|im_end|>', '<|endoftext|>']
stop_token_ids = [tokenizer.convert_tokens_to_ids(i) for i in stop_tokens]

sampling_params = SamplingParams(
    stop_token_ids=stop_token_ids,
    # temperature=0.7,
    # top_p=0.8,
    # top_k=100,
    # seed=3472,
    max_tokens=200,
    # min_tokens=150,
    temperature=0,
    use_beam_search=True,
    # length_penalty=1.2,
    best_of=3,
)

outputs = llm.generate(inputs, sampling_params=sampling_params)
for i in range(len(inputs)):
    print(outputs[i].outputs[0].text)
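The batch example shares one prompt across all images, but each input dict carries its own prompt, so pairing every image with a different question is a small change. The sketch below builds such inputs; to stay runnable without a tokenizer it takes the chat-template step as a function argument and uses a stand-in renderer, which is purely hypothetical. In the real script the renderer would call tokenizer.apply_chat_template as shown above.

```python
from typing import Any, Callable, Dict, List, Sequence

IMAGE_PLACEHOLDER = "(<image>./</image>)\n"  # image marker used in the messages above


def build_messages(prompt: str) -> List[Dict[str, str]]:
    """Single-turn user message with the image placeholder in front of the prompt."""
    return [{"role": "user", "content": IMAGE_PLACEHOLDER + prompt}]


def build_inputs(images: Sequence[Any], prompts: Sequence[str],
                 render: Callable[[List[Dict[str, str]]], str]) -> List[Dict[str, Any]]:
    """Pair image i with prompt i in the dict format passed to llm.generate."""
    if len(images) != len(prompts):
        raise ValueError("images and prompts must have the same length")
    return [
        {"prompt": render(build_messages(prompt)), "multi_modal_data": {"image": image}}
        for image, prompt in zip(images, prompts)
    ]


def fake_render(messages: List[Dict[str, str]]) -> str:
    """Stand-in for the tokenizer's chat template; returns the raw user content."""
    return messages[0]["content"]


# Strings stand in for PIL images so the sketch runs on its own
demo_inputs = build_inputs(
    ["<image 1>", "<image 2>"],
    ["What can you see in this image?", "Read any text in this image."],
    fake_render,
)
for item in demo_inputs:
    print(repr(item["prompt"]))
```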

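5. References

Official MiniCPM-V Feishu documentation. I learned a great deal from it, and questions in the group are answered very promptly; thanks to the maintainers and the community for sharing: https://modelbest.feishu.cn/wiki/SgGpwVz4aiSDwNkVMrmcMpHsnAF
Official swift documentation: https://swift.readthedocs.io/en/stable/index.html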