
This article shows how to use the Hugging Face Transformers library to run non-streaming batch inference with a fine-tuned Qwen or DeepSeek model.

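I. Preparation

First make sure the required libraries are installed (accelerate is needed because the model is loaded with device_map="auto"):

pip install transformers torch accelerate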
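Before loading a multi-gigabyte model it can help to confirm that the packages are actually present in the active environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list is an assumption (the two packages from the install command plus accelerate, which Transformers requires when `device_map="auto"` is used).

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if it is missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed package list; adjust to whatever your setup needs.
    for name, version in installed_versions(["transformers", "torch", "accelerate"]).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```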

II. Batch inference implementation

from typing import List

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class BatchInference:
    def __init__(self, model_path: str, device: str = "cuda" if torch.cuda.is_available() else "cpu"):
        """
        Load the model and tokenizer.
        :param model_path: path of the fine-tuned model (Hugging Face model ID or local path)
        :param device: inference device (cpu/cuda)
        """
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        # Decoder-only models must be left-padded for batched generation;
        # with right padding the model would continue from pad tokens.
        self.tokenizer.padding_side = "left"
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            trust_remote_code=True,
            torch_dtype=torch.float16 if "cuda" in device else torch.float32,
            device_map="auto",
        )
        self.model.eval()
        # Fall back to eos_token_id when the tokenizer defines no pad token
        if self.tokenizer.pad_token_id is None:
            self.tokenizer.pad_token_id = self.tokenizer.eos_token_id

    def batch_predict(self, prompts: List[str], max_new_tokens: int = 512,
                      batch_size: int = 4, **generate_kwargs) -> List[str]:
        """
        Run batch inference.
        :param prompts: list of input prompts
        :param max_new_tokens: maximum number of tokens to generate
        :param batch_size: number of prompts per batch
        :param generate_kwargs: extra arguments forwarded to generate()
        :return: list of generated texts, in the same order as prompts
        """
        all_results = []
        # Process the prompts batch by batch
        for i in range(0, len(prompts), batch_size):
            batch_prompts = prompts[i:i + batch_size]
            # Encode the inputs and move them to the device that holds the model
            # (device_map="auto" decides the placement, so use model.device)
            inputs = self.tokenizer(
                batch_prompts,
                padding=True,
                truncation=True,
                max_length=1024,  # adjust as needed
                return_tensors="pt",
            ).to(self.model.device)
            # Generate
            with torch.no_grad():
                outputs = self.model.generate(
                    **inputs,
                    max_new_tokens=max_new_tokens,
                    pad_token_id=self.tokenizer.pad_token_id,
                    **generate_kwargs,
                )
            # With left padding every row shares the same prompt width,
            # so the generated part starts at the same column for all rows
            input_length = inputs["input_ids"].shape[1]
            generated = outputs[:, input_length:]
            all_results.extend(self.tokenizer.batch_decode(generated, skip_special_tokens=True))
        return all_results


# Usage example
if __name__ == "__main__":
    # Replace with your model path (local path or Hugging Face model ID)
    MODEL_PATH = "Qwen/Qwen-7B-Chat"  # or "deepseek-ai/deepseek-llm-7b", etc.
    # Create the inference helper
    inferencer = BatchInference(MODEL_PATH)
    # Sample inputs
    prompts = [
        "Explain the basic principles of quantum computing",
        "Write a poem about spring",
        "How do I implement quicksort in Python?",
        "What are the main innovations of the Transformer model?",
    ]
    # Batch inference
    results = inferencer.batch_predict(
        prompts,
        max_new_tokens=256,
        batch_size=2,  # adjust to the available GPU memory
        do_sample=True,  # temperature and top_p are ignored without sampling
        temperature=0.7,
        top_p=0.9,
    )
    # Print the results
    for prompt, result in zip(prompts, results):
        print(f"Input: {prompt}\nOutput: {result}\n{'-' * 50}")
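Batched generation with a decoder-only model generally needs left padding, so that every prompt ends at the last column and the new tokens follow it directly. The toy sketch below illustrates that idea with plain Python lists instead of tensors; `PAD_ID`, `left_pad`, and `strip_prompt` are hypothetical names, and the "model" is simulated by appending two fixed token ids.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical pad token id, used only in this toy example


def left_pad(batch: List[List[int]], pad_id: int = PAD_ID) -> Tuple[List[List[int]], List[List[int]]]:
    """Left-pad token id lists to a common width and build matching attention masks."""
    width = max(len(ids) for ids in batch)
    input_ids = [[pad_id] * (width - len(ids)) + ids for ids in batch]
    attention_mask = [[0] * (width - len(ids)) + [1] * len(ids) for ids in batch]
    return input_ids, attention_mask


def strip_prompt(outputs: List[List[int]], prompt_width: int) -> List[List[int]]:
    """Drop the first prompt_width positions of every output row."""
    return [row[prompt_width:] for row in outputs]


if __name__ == "__main__":
    prompts = [[11, 12, 13], [21]]
    input_ids, mask = left_pad(prompts)
    # Simulate generation: pretend the model appended two new tokens to each row.
    fake_outputs = [row + [91, 92] for row in input_ids]
    print(input_ids)  # [[11, 12, 13], [0, 0, 21]]
    print(mask)       # [[1, 1, 1], [0, 0, 1]]
    print(strip_prompt(fake_outputs, len(input_ids[0])))  # [[91, 92], [91, 92]]
```

Because all rows share one padded width, a single column offset separates prompt from completion for the whole batch. With right padding the pad tokens would sit between the prompt and the generated text, and the same slice would no longer be correct.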

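III. Key points

Device management:
- Detects an available GPU automatically and uses it
- Supports half-precision (fp16) inference to save GPU memory
Batch processing:
- Splits the inputs into small batches to avoid running out of memory
- Pads automatically so that all samples in a batch have the same length
Generation parameters:
- max_new_tokens: caps the number of generated tokens
- temperature and top_p: control the randomness of generation (they only take effect when do_sample=True)
- Other generation parameters can be passed through generate_kwargs
Memory optimization:
- torch.no_grad() disables gradient tracking, which reduces memory use
- Adjust batch_size to the available GPU memory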
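To make the effect of temperature and top_p concrete, here is a conceptual sketch over a toy vocabulary. It is not the Transformers implementation; the function names and the example logits are made up for illustration. Temperature divides the logits before the softmax, and top-p (nucleus) filtering keeps the smallest set of most likely tokens whose cumulative probability reaches top_p.

```python
import math
from typing import Dict


def apply_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Return softmax(logits / temperature) as a token -> probability mapping."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


if __name__ == "__main__":
    toy_logits = {"spring": 2.0, "summer": 1.0, "winter": 0.0}  # made-up values
    for t in (1.0, 0.7):
        print(t, apply_temperature(toy_logits, t))
    print(top_p_filter(apply_temperature(toy_logits, 0.7), 0.9))
```

A lower temperature concentrates probability on the top token, and a lower top_p removes more of the unlikely tail, so both make the output less random.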

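IV. Further optimizations

1. Use Flash Attention (if the model supports it and the flash-attn package is installed):

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # enable Flash Attention 2 (replaces the deprecated use_flash_attention_2=True)
    device_map="auto",
)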
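Flash Attention is only usable when the flash-attn package is installed and a CUDA GPU is present, so a loader may want to fall back gracefully. The sketch below picks a value for the `attn_implementation` argument that recent Transformers releases accept in `from_pretrained`; the helper name `pick_attn_implementation` is hypothetical, the caller is assumed to supply `has_cuda` (for example from `torch.cuda.is_available()`), and whether a particular model supports either value depends on that model.

```python
import importlib.util


def pick_attn_implementation(has_cuda: bool) -> str:
    """Return "flash_attention_2" if flash_attn is importable and a GPU is present, else "sdpa"."""
    flash_installed = importlib.util.find_spec("flash_attn") is not None
    return "flash_attention_2" if (has_cuda and flash_installed) else "sdpa"


if __name__ == "__main__":
    # has_cuda is hard-coded here to keep the sketch free of heavy imports.
    print(pick_attn_implementation(has_cuda=False))  # sdpa
```

The returned string can then be passed as `attn_implementation=pick_attn_implementation(...)` when loading the model.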

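2. Quantized inference (reduces GPU memory use; requires the bitsandbytes package):

from transformers import BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit quantization
    device_map="auto",
)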
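A quick back-of-the-envelope calculation shows why quantization helps: weight memory is roughly the parameter count times the bytes per parameter. The sketch below assumes a nominal 7 billion parameters (real checkpoints differ slightly) and counts weights only; the KV cache, activations, and quantization overhead are not included, so actual usage will be higher.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


if __name__ == "__main__":
    nominal_params = 7e9  # assumed parameter count for a "7B" model
    for dtype in ("fp32", "fp16", "int8", "int4"):
        print(f"{dtype}: {weight_memory_gib(nominal_params, dtype):.1f} GiB")
```

Under these assumptions fp16 weights take about 13 GiB and 8-bit weights about half of that.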

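3. Use an optimized inference library such as vLLM (for production environments):

from vllm import LLM, SamplingParams

llm = LLM(model=MODEL_PATH)
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(prompts, sampling_params)
# Each result holds one or more completions; print the first one
for output in outputs:
    print(output.outputs[0].text)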

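The code above is a basic implementation of batch inference for Qwen/DeepSeek models with the Transformers library; adjust and optimize it to fit your own requirements.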
