Fine-Tuning a Model with Spark AI (DataWhale AI Summer Camp)


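Preface

Hello everyone, I'm GISer Liu😁, a GIS developer who loves AI technology. This article is my entry for Session 4 of the 2024 DataWhale AI Summer Camp (large-model fine-tuning). I hope it helps you. 😲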

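Introduction

In this article I explain in detail how to build a dataset of Chinese and English gaokao (college entrance exam) multiple-choice questions from scratch, fine-tune an LLM on the iFLYTEK Open Platform, and finally test the model by calling its API. The work is split into three steps:

Dataset preparation: reading the data, preprocessing it, and extracting questions and answers.
Model training: using an existing language model for customized fine-tuning.
Local testing: how to test the trained model locally, including how to interact with it and check its accuracy.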

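1. Dataset Preparation

Before training, we need a high-quality dataset. Ours has two parts: Chinese-language and English gaokao multiple-choice questions.

1.1 Reading and Preprocessing the Data

First, we load the raw Excel file into memory and replace a few characters so that the data format is consistent.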

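# !pip install pandas openpyxl  # uncomment to install if these packages are missing
import pandas as pd
import re

# Read the Excel file
df = pd.read_excel('训练集-语文.xlsx')
df = df.replace('．', '.', regex=True)  # replace full-width periods with ASCII periods
df = df.replace('（', '(', regex=True)  # replace full-width left parentheses with ASCII ones

# Read the '选项' (options) column of the row with index 2, i.e. the third data row
second_row_option_content = df.loc[2, '选项']
# Show the content of that cell
print(second_row_option_content)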


This prints one raw record so that we can inspect the data.
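
To make the character normalization concrete without needing the Excel file, here is a minimal sketch on a plain string. The helper name `normalize_punctuation` and the sample sentence are made up for illustration, and the right-parenthesis mapping is an addition that the snippet above does not perform.

```python
# Hypothetical helper: map full-width punctuation to ASCII equivalents.
FULLWIDTH_TO_ASCII = str.maketrans({
    '\uff0e': '.',  # full-width period
    '\uff08': '(',  # full-width left parenthesis
    '\uff09': ')',  # full-width right parenthesis (assumption: not handled above)
})


def normalize_punctuation(text: str) -> str:
    """Return text with the full-width characters above replaced."""
    return text.translate(FULLWIDTH_TO_ASCII)


# Made-up sample, not taken from the dataset.
sample = '1\uff0e下列说法正确的一项是\uff083分\uff09'
normalized = normalize_punctuation(sample)
print(normalized)
```

With pandas, the same helper could presumably be applied to a text column through `Series.map`, which avoids having to think about regex escaping in `DataFrame.replace`.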

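1.2 Extracting the Multiple-Choice Questions

To extract the questions and their options, we use regular expressions that match the layout of questions and options. The chinese_multiple_choice_questions function implements this.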

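def chinese_multiple_choice_questions(questions_with_answers):
    # The raw question text
    text = questions_with_answers
    # Regex patterns for questions and for options
    question_pattern = re.compile(r'\d+\..*?(?=\d+\.|$)', re.DOTALL)
    choice_pattern = re.compile(r'([A-D])\s*(.*?)(?=[A-D]|$|\n)', re.DOTALL)
    # Find all questions
    questions = question_pattern.findall(text)
    # Initialize the multiple-choice and short-answer lists
    multiple_choice_questions = []
    short_answer_questions = []
    # Process each question
    for idx, question in enumerate(questions):
        # Check whether this is a multiple-choice question
        if re.search(r'[A-D]', question):
            choices = choice_pattern.findall(question)  # extract the options
            question_text = re.split(r'\n', question.split('(')[0])[0]  # extract the question stem
            # Drop the original leading number so that renumbering does not produce "1.12. ..."
            question_text = re.sub(r'^\d+\.', '', question_text).strip()
            # Store the question and its options as a dict
            multiple_choice_questions.append({
                'question': f"{idx+1}.{question_text}",
                'choices': choices
            })
        else:
            short_answer_questions.append(question.strip())  # short-answer question
    return multiple_choice_questions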

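This function splits the input text into individual questions, extracts each question's options and their text, and returns a list of dicts with the keys 'question' and 'choices'.
Next, we run the extraction on the first few rows: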

questions_list = []
for data_id in range(len(df[:3])):
    second_row_option_content = df.loc[data_id, '选项']
    questions_list.append(chinese_multiple_choice_questions(second_row_option_content))
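
To see what the two patterns actually capture, the sketch below runs them on a small made-up string laid out like the '选项' column (the sample text is invented, not taken from the dataset).

```python
import re

# Same patterns as in chinese_multiple_choice_questions above.
question_pattern = re.compile(r'\d+\..*?(?=\d+\.|$)', re.DOTALL)
choice_pattern = re.compile(r'([A-D])\s*(.*?)(?=[A-D]|$|\n)', re.DOTALL)

# Made-up sample: two questions, two options each.
sample = (
    "1.which fruit is red?\n"
    "A. apple\n"
    "B. pear\n"
    "2.which animal barks?\n"
    "A. cat\n"
    "B. dog"
)

questions = question_pattern.findall(sample)       # one string per question
first_choices = choice_pattern.findall(questions[0])  # (letter, text) pairs
print(len(questions))
print(first_choices)
```

Two things are worth noticing: the captured option text keeps the period that follows the letter, and because the lookahead stops at any uppercase A-D, an option that itself contains one of those letters would be cut short. For Chinese option text this is rarely a problem, but it is something to check when inspecting the parsed rows.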

1.3 Extracting the Answers

To extract the correct answers, we define chinese_multiple_choice_answers, which uses regular expressions to match the answer to each question.

def chinese_multiple_choice_answers(questions_with_answers):
    questions_with_answers = questions_with_answers.replace(" ", "").replace("\n", "")
    # Regex patterns: "number.LETTERS" for choice answers, "number.other text" for short answers
    choice_pattern = re.compile(r'(\d+)\.([A-Z]+)')
    short_pattern = re.compile(r'(\d+)\.([^A-Z]+)')
    # Find all matching answers
    choice_matches = choice_pattern.findall(questions_with_answers)
    short_matches = short_pattern.findall(questions_with_answers)
    # Convert the matches to dicts keyed by question number
    choice_answers = {int(index): answer for index, answer in choice_matches}
    short_answers = {int(index): answer for index, answer in short_matches}
    # Sort by question number
    sorted_choice_answers = sorted(choice_answers.items())
    sorted_short_answers = sorted(short_answers.items())
    # Renumber the choice answers from 1
    answers = []
    for idx in range(len(sorted_choice_answers)):
        answers.append(f"{idx+1}. {sorted_choice_answers[idx][1]}")
    return answers

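Let's test the answer extraction on one row:

# Read the '答案' (answers) column of the row with index 60
second_row_option_content = df.loc[60, '答案']
# Show the raw answer text, then the parsed answers
print(second_row_option_content)
print(chinese_multiple_choice_answers(second_row_option_content))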

Build the processed-answer column:

df['答案_processed'] = df['答案'].map(chinese_multiple_choice_answers)
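
As a quick illustration of the renumbering step, the following sketch repeats the choice-answer logic on a made-up answer string whose numbering starts at 5. The function name `renumber_choice_answers` is hypothetical.

```python
import re


def renumber_choice_answers(raw: str) -> list:
    """Extract 'number.LETTER' answers and renumber them from 1."""
    compact = raw.replace(" ", "").replace("\n", "")
    pairs = re.findall(r'(\d+)\.([A-Z]+)', compact)
    by_number = sorted({int(n): letter for n, letter in pairs}.items())
    return [f"{i + 1}. {letter}" for i, (_, letter) in enumerate(by_number)]


# Made-up answer text, numbered 5 to 8 as it might appear in an exam paper.
answers = renumber_choice_answers("5. A\n6. C\n7. B\n8. D")
print(answers)
```

Renumbering from 1 keeps the answers aligned with the questions, which are also renumbered from 1 during extraction.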

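1.4 Building the Prompt Wrapper Function

def get_prompt_cn(text):
    # The prompt stays in Chinese because the model is trained on Chinese text. In short:
    # "You are an expert gaokao question writer. Based on the reading text, write 4
    # single-answer multiple-choice questions, each with 4 options and exactly 1 correct
    # answer, and do not repeat the original text."
    prompt = f'''你是一个高考选择题出题专家，你出的题有一定深度，你将根据阅读文本，出4道单项选择题，包含题目选项，以及对应的答案，注意：不用给出原文，每道题由1个问题和4个选项组成，仅存在1个正确答案，请严格按照要求执行。 阅读文本主要是中文，你出的题目需要满足以下要点，紧扣文章内容且题干和答案为中文：

### 回答要求
(1)理解文中重要概念的含义
(2)理解文中重要句子的含意
(3)分析论点、论据和论证方法

### 阅读文本
{text}
'''
    return prompt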

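1.5 Building the Chinese Dataset

By calling the functions above we can build the final training dataset. All questions and answers are formatted into the required input/output form, and a prompt suitable for training is generated for each record.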

def process_cn(df):
    res_input = []
    res_output = []
    for idx in range(len(df)):
        data_options = df.loc[idx, '选项']
        data_answers = df.loc[idx, '答案']
        data_prompt = df.loc[idx, '阅读文本']
        data_options = chinese_multiple_choice_questions(data_options)
        data_answers = chinese_multiple_choice_answers(data_answers)
        data_prompt = get_prompt_cn(data_prompt)
        # Keep the row only when every question has a matching answer
        if len(data_answers) == len(data_options):
            res = ''
            for id_, question in enumerate(data_options):
                res += f"{question['question']}?\n"
                for choice in question['choices']:
                    res += f"{choice[0]}. {choice[1]}\n"
                res += f"答案: {data_answers[id_].split('.')[-1]}\n"
            res_output.append(res)
            res_input.append(data_prompt)
    return res_input, res_output

cn_input, cn_output = process_cn(df)

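In this way, questions and answers are extracted from each row and turned into the input and output that the model needs. Rows where the number of parsed answers differs from the number of parsed questions are skipped.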
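
Because rows are only checked for matching counts, it can help to verify afterwards that each output really contains the expected number of answer lines. The sketch below is one way to do that; `count_answer_lines` is a hypothetical helper, and the sample output is made up to mimic the format produced by process_cn.

```python
import re


def count_answer_lines(output_text: str) -> int:
    """Count lines of the form '答案: X' where X is a single letter A-D."""
    pattern = r'^答案:\s*[A-D]\s*$'
    return len(re.findall(pattern, output_text, flags=re.MULTILINE))


# Made-up output in the layout built by process_cn (two questions only).
sample_output = (
    "1.question one?\n"
    "A. first\nB. second\nC. third\nD. fourth\n"
    "答案:  A\n"
    "2.question two?\n"
    "A. first\nB. second\nC. third\nD. fourth\n"
    "答案:  C\n"
)
answer_count = count_answer_lines(sample_output)
print(answer_count)
```

Applied to every element of the output list, a check like this could flag records with fewer than four answers before they are uploaded for training.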

1.6 Building the English Dataset

The English dataset is built in the same way with similar logic. The complete code is as follows:

import pandas as pd
import re

# Read the Excel file and preprocess the data
df = pd.read_excel('训练集-英语.xlsx')

# Normalize special symbols. With regex=True an unescaped '.' in the pattern matches
# any character, so only the look-alike letters themselves are replaced here
# (Cyrillic А, В, С become Latin A, B, C). The original 'D.' -> 'D.' replacement is
# dropped: as a regex it would rewrite words such as "Do" into "D.".
df = df.replace('．', '.', regex=True) \
       .replace('А', 'A', regex=True) \
       .replace('В', 'B', regex=True) \
       .replace('С', 'C', regex=True)

def remove_whitespace_and_newlines(input_string):
    # Remove spaces, newlines, and periods with str.replace()
    result = input_string.replace(" ", "").replace("\n", "").replace(".", "")
    return result

# Extract the answer letters from the answer column
def get_answers(text):
    # Remove spaces, newlines, and periods
    text = remove_whitespace_and_newlines(text)
    # Pattern: one digit followed by an answer letter A-D
    pattern = re.compile(r'(\d)\s*([A-D])')
    # Find all matches
    matches = pattern.findall(text)
    res = []
    # Keep only the answer letter of each match
    for match in matches:
        number_dot, first_letter = match
        res.append(first_letter)
    return res

# Sample input to test get_answers
input_string = "28. A. It is simple and plain. 29. D. Influential. 30. D.33%. 31. B. Male chefs on TV programmes."
res = get_answers(input_string)
print(res)  # print the extracted answer letters

# Extract questions and options from the question column
def get_questions(text):
    # Replace newlines with two spaces and append two trailing spaces
    text = text.replace('\n', '  ') + '  '
    # Pattern matching a question followed by its four options
    pattern = re.compile(r'(\d+\..*?)(A\..*?\s{2})([B-D]\..*?\s{2})([B-D]\..*?\s{2})(D\..*?\s{2})', re.DOTALL)
    # Find all matches
    matches = pattern.findall(text)
    # List of result dicts
    questions_dict_list = []
    # Extract the question and options from each match
    for match in matches:
        question, option1, option2, option3, option4 = match
        # Extract the question text
        pattern_question = re.compile(r'(\d+)\.(.*)')
        question_text = pattern_question.findall(question.strip())[0][1]
        # Map each option letter to its content
        options = {option1[0]: option1, option2[0]: option2, option3[0]: option3, option4[0]: option4}
        # Store the question and options in a dict
        question_dict = {
            'question': question_text,
            'options': {
                'A': options.get('A', '').strip(),
                'B': options.get('B', '').strip(),
                'C': options.get('C', '').strip(),
                'D': options.get('D', '').strip()
            }
        }
        questions_dict_list.append(question_dict)
    return questions_dict_list

# Call get_questions on one row and print the result
# (the original snippet passed an undefined `text`; row 0 of the '选项' column is used here)
sample_options = df.loc[0, '选项']
questions = get_questions(sample_options)
for q in questions:
    print(q)  # print each extracted question with its options

# Build the prompt text used for training
def get_prompt_en(text):
    prompt = f'''你是一个高考选择题出题专家，你出的题有一定深度，你将根据阅读文本，出4道单项选择题，包含题目选项，以及对应的答案，注意：不用给出原文，每道题由1个问题和4个选项组成，仅存在1个正确答案，请严格按照要求执行。
The reading text is mainly in English. The questions and answers you raised need to be completed in English for at least the following points:

### 回答要求
(1)Understanding the main idea of the text.
(2)Understand the specific information in the text.
(3)Inferring the meaning of words and phrases from the context

### 阅读文本
{text}
'''
    return prompt

# Process the whole dataset
def process_en(df):
    res_input = []
    res_output = []
    # Iterate over every row of the dataset
    for idx in range(len(df)):
        # Read the options, answers, and reading text
        data_options = df.loc[idx, '选项']
        data_answers = df.loc[idx, '答案']
        data_prompt = df.loc[idx, '阅读文本']
        # Parse them with the functions defined above
        data_options = get_questions(data_options)
        data_answers = get_answers(data_answers)
        data_prompt = get_prompt_en(data_prompt)
        # Keep the row only when the numbers of answers and questions match
        if len(data_answers) == len(data_options):
            res = ''
            # Build the final text for each question
            for q_idx, question in enumerate(data_options):
                res += f'''
{q_idx+1}.{question['question']}
{question['options']['A']}
{question['options']['B']}
{question['options']['C']}
{question['options']['D']}
answer:{data_answers[q_idx]}''' + '\n'
            res_output.append(res)
            res_input.append(data_prompt)
    return res_input, res_output

# Process the dataset
en_input, en_output = process_en(df)
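
One detail in the normalization step deserves a closer look: when a replacement is done in regex mode, an unescaped `.` in the pattern matches any character. The stdlib-only sketch below shows the effect with `re.sub`, and an alternative that maps Cyrillic look-alike letters with `str.translate`. The helper name `normalize_option_letters` and the sample strings are made up for illustration.

```python
import re

# An unescaped '.' matches any character, so 'D.' also matches "Do".
unescaped = re.sub('D.', 'D.', 'Do it')
escaped = re.sub(r'D\.', 'D.', 'Do it')
print(unescaped)
print(escaped)

# Cyrillic capital letters that look like Latin A, B, C.
LOOKALIKES = str.maketrans({
    '\u0410': 'A',  # CYRILLIC CAPITAL LETTER A
    '\u0412': 'B',  # CYRILLIC CAPITAL LETTER VE
    '\u0421': 'C',  # CYRILLIC CAPITAL LETTER ES
})


def normalize_option_letters(text: str) -> str:
    """Replace Cyrillic look-alikes with their Latin counterparts."""
    return text.translate(LOOKALIKES)


# Made-up option line that mixes Cyrillic and Latin letters.
sample = '\u0410. apple  \u0412. pear  \u0421. plum  D. fig'
fixed = normalize_option_letters(sample)
print(fixed)
```

Since `str.translate` works character by character, it sidesteps regex escaping entirely, which makes it a reasonable choice for this kind of cleanup.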

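1.7 Merging the Datasets

We merge the Chinese and English datasets for the export step that follows:

# Combine the input and output lists into one DataFrame
df_new = pd.DataFrame({
    'input': cn_input + cn_input[:30] + en_input + en_input[:20],
    'output': cn_output + cn_output[:30] + en_output + en_output[:20],
})
df_new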


The merged DataFrame now holds all input/output pairs.
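
Note what the concatenation does: `cn_input + cn_input[:30]` appends a second copy of the first 30 Chinese samples (and likewise the first 20 English ones), so those records appear twice, presumably to enlarge the training set. The toy lists below are made up to show the effect on a smaller scale.

```python
from collections import Counter

# Made-up stand-ins for cn_input and en_input.
cn = [f'cn_{i}' for i in range(5)]
en = [f'en_{i}' for i in range(4)]

# Same construction as above, with smaller slice sizes.
merged = cn + cn[:2] + en + en[:1]
counts = Counter(merged)

print(len(merged))
print(counts['cn_0'], counts['cn_4'])
```

The input and output lists must be sliced identically, as in the code above, so that each prompt stays paired with its own answer text.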

The complete code is as follows:

import pandas as pd
import re
import json

# Shared helper: remove spaces, newlines, and periods
def remove_whitespace_and_newlines(input_string):
    result = input_string.replace(" ", "").replace("\n", "").replace(".", "")
    return result

# Shared helper: extract the answer letters
def get_answers(text):
    text = remove_whitespace_and_newlines(text)
    pattern = re.compile(r'(\d)\s*([A-D])')
    matches = pattern.findall(text)
    res = []
    for match in matches:
        number_dot, first_letter = match
        res.append(first_letter)
    return res

# Shared helper: extract questions and options
def get_questions(text):
    text = text.replace('\n', '  ') + '  '
    pattern = re.compile(r'(\d+\..*?)(A\..*?\s{2})([B-D]\..*?\s{2})([B-D]\..*?\s{2})(D\..*?\s{2})', re.DOTALL)
    matches = pattern.findall(text)
    questions_dict_list = []
    for match in matches:
        question, option1, option2, option3, option4 = match
        pattern_question = re.compile(r'(\d+)\.(.*)')
        question_text = pattern_question.findall(question.strip())[0][1]
        options = {option1[0]: option1, option2[0]: option2, option3[0]: option3, option4[0]: option4}
        question_dict = {
            'question': question_text,
            'options': {
                'A': options.get('A', '').strip(),
                'B': options.get('B', '').strip(),
                'C': options.get('C', '').strip(),
                'D': options.get('D', '').strip()
            }
        }
        questions_dict_list.append(question_dict)
    return questions_dict_list

# Build the English prompt
def get_prompt_en(text):
    prompt = f'''你是一个高考选择题出题专家，你出的题有一定深度，你将根据阅读文本，出4道单项选择题，包含题目选项，以及对应的答案，注意：不用给出原文，每道题由1个问题和4个选项组成，仅存在1个正确答案，请严格按照要求执行。
The reading text is mainly in English. The questions and answers you raised need to be completed in English for at least the following points:

### 回答要求
(1)Understanding the main idea of the text.
(2)Understand the specific information in the text.
(3)Inferring the meaning of words and phrases from the context

### 阅读文本
{text}
'''
    return prompt

# Process the English dataset
def process_en(df):
    res_input = []
    res_output = []
    for idx in range(len(df)):
        data_options = df.loc[idx, '选项']
        data_answers = df.loc[idx, '答案']
        data_prompt = df.loc[idx, '阅读文本']
        data_options = get_questions(data_options)
        data_answers = get_answers(data_answers)
        data_prompt = get_prompt_en(data_prompt)
        if len(data_answers) == len(data_options):
            res = ''
            for q_idx, question in enumerate(data_options):
                res += f'''
{q_idx+1}.{question['question']}
{question['options']['A']}
{question['options']['B']}
{question['options']['C']}
{question['options']['D']}
answer:{data_answers[q_idx]}''' + '\n'
            res_output.append(res)
            res_input.append(data_prompt)
    return res_input, res_output

# Read and process the English dataset.
# Only the look-alike letters are replaced (Cyrillic А, В, С), because with regex=True
# an unescaped '.' in the pattern would match any character.
df_en = pd.read_excel('训练集-英语.xlsx')
df_en = df_en.replace('．', '.', regex=True) \
             .replace('А', 'A', regex=True) \
             .replace('В', 'B', regex=True) \
             .replace('С', 'C', regex=True)
en_input, en_output = process_en(df_en)

# Build the Chinese prompt
def get_prompt_cn(text):
    prompt = f'''你是一个高考选择题出题专家，你出的题有一定深度，你将根据阅读文本，出4道单项选择题，包含题目选项，以及对应的答案，注意：不用给出原文，每道题由1个问题和4个选项组成，仅存在1个正确答案，请严格按照要求执行。
The reading text is mainly in Chinese. The questions and answers you raised need to be completed in Chinese for at least the following points:

### 回答要求
(1)理解文章的主要意思。
(2)理解文章中的具体信息。
(3)根据上下文推断词语和短语的含义。

### 阅读文本
{text}
'''
    return prompt

# Process the Chinese dataset.
# Note: this listing reuses the English-style parsers get_questions / get_answers.
# To reproduce section 1.5 exactly, swap in chinese_multiple_choice_questions and
# chinese_multiple_choice_answers from sections 1.2 and 1.3.
def process_cn(df):
    res_input = []
    res_output = []
    for idx in range(len(df)):
        data_options = df.loc[idx, '选项']
        data_answers = df.loc[idx, '答案']
        data_prompt = df.loc[idx, '阅读文本']
        data_options = get_questions(data_options)
        data_answers = get_answers(data_answers)
        data_prompt = get_prompt_cn(data_prompt)
        if len(data_answers) == len(data_options):
            res = ''
            for q_idx, question in enumerate(data_options):
                res += f'''
{q_idx+1}.{question['question']}
{question['options']['A']}
{question['options']['B']}
{question['options']['C']}
{question['options']['D']}
answer:{data_answers[q_idx]}''' + '\n'
            res_output.append(res)
            res_input.append(data_prompt)
    return res_input, res_output

# Read and process the Chinese dataset (same file and normalization as in section 1.1)
df_cn = pd.read_excel('训练集-语文.xlsx')
df_cn = df_cn.replace('．', '.', regex=True).replace('（', '(', regex=True)
cn_input, cn_output = process_cn(df_cn)

# Merge the datasets
df_new = pd.DataFrame({
    'input': cn_input + cn_input[:30] + en_input + en_input[:20],
    'output': cn_output + cn_output[:30] + en_output + en_output[:20],
})

# Convert and export the dataset:
# open a file for writing JSONL with UTF-8 encoding
with open('output.jsonl', 'w', encoding='utf-8') as f:
    # Convert each row to JSON
    for index, row in df_new.iterrows():
        row_dict = row.to_dict()
        row_json = json.dumps(row_dict, ensure_ascii=False)
        # Write the JSON string followed by a newline
        f.write(row_json + '\n')

# Print a confirmation message
print("JSONL file generated")

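2. Model Training

With the data ready, we can use it to fine-tune a model. Here we use the Spark 13B pre-trained model.

2.1 Converting the Data Format

First, we convert the prepared dataset to JSONL format so that it can be used for training.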

import json

# Save the dataset in JSONL format: one JSON object per line
with open('output.jsonl', 'w', encoding='utf-8') as f:
    for index, row in df_new.iterrows():
        row_dict = row.to_dict()
        row_json = json.dumps(row_dict, ensure_ascii=False)
        f.write(row_json + '\n')

# Print a confirmation message
print("JSONL file generated")
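
Before uploading, it can be worth reading the file back to confirm that every line parses and carries both fields. The following is a minimal sketch under that assumption; `load_jsonl` is a hypothetical helper, and the demo writes a throwaway file in a temporary directory rather than touching output.jsonl.

```python
import json
import os
import tempfile


def load_jsonl(path):
    """Read a JSONL file and check that each record has 'input' and 'output'."""
    records = []
    with open(path, encoding='utf-8') as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = {'input', 'output'} - set(record)
            if missing:
                raise ValueError(f'line {line_no} is missing keys: {sorted(missing)}')
            records.append(record)
    return records


# Demo on a made-up one-record file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.jsonl')
    with open(demo_path, 'w', encoding='utf-8') as f:
        row = {'input': '阅读文本', 'output': '1.题目'}
        f.write(json.dumps(row, ensure_ascii=False) + '\n')
    loaded = load_jsonl(demo_path)

print(len(loaded), loaded[0]['input'])
```

Passing `ensure_ascii=False` when writing, as the export code does, keeps the Chinese text human-readable in the file; reading it back with `encoding='utf-8'` restores it unchanged.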

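2.2 Uploading the Dataset

First, open the iFLYTEK Open Platform website and click New Dataset.
Next, fill in the dataset's basic information.
Then upload the dataset file we created earlier and select the correct question and answer fields (here, input and output).
Wait until the upload succeeds, then start training.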

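On the training configuration page, set the model name, pre-trained model, learning rate, dataset, and other options.
Wait for training to finish. This takes at least 30 minutes, so it is a good moment for a cup of coffee!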

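If you do not have an application yet, go to https://console.xfyun.cn/app/myapp and click Create to make one.

Click Publish. After a short wait the model is published, and the service page shows its details.

On this page we can see the parameters of the published model. Save the following values, which are needed for testing later: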

serviceId: ---------
resourceId: -----------
APPID: ------
APIKey: ---------
APISecret: ------------

That completes the model training part!
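
Since these values are secrets, one option is to keep them out of the source code and read them from environment variables. The sketch below assumes hypothetical variable names (`SPARKAI_APP_ID` and so on); neither the names nor the helper `load_spark_credentials` come from the platform.

```python
import os

# Hypothetical environment variable names, one per saved parameter.
REQUIRED_VARS = (
    'SPARKAI_APP_ID',
    'SPARKAI_API_KEY',
    'SPARKAI_API_SECRET',
    'SPARKAI_SERVICE_ID',
    'SPARKAI_RESOURCE_ID',
)


def load_spark_credentials(env=os.environ):
    """Return the credentials as a dict, or raise if any value is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError('missing environment variables: ' + ', '.join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


# Demo with a fake mapping instead of the real environment.
fake_env = {name: 'xxxxxxx' for name in REQUIRED_VARS}
credentials = load_spark_credentials(fake_env)
print(sorted(credentials))
```

Failing early with a clear message is usually easier to debug than an authentication error returned by the API later on.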

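3. Local Testing

After training, we test the model locally to make sure the questions it generates meet expectations.

3.1 Test Code

The code below runs the local test: we send the model a prompt and look at the questions and answers it generates.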

# !pip install spark_ai_python  # uncomment to install if the package is missing
from sparkai.llm.llm import ChatSparkLLM, ChunkPrintHandler
from sparkai.core.messages import ChatMessage

SPARKAI_URL = 'wss://xingchen-api.cn-huabei-1.xf-yun.com/v1.1/chat'
# Credentials for calling the Spark model. See the Feishu document and look them up in the
# iFLYTEK fine-tuning console (https://training.xfyun.cn/modelService)
SPARKAI_APP_ID = 'xxxxxxx'
SPARKAI_API_SECRET = 'xxxxxxx'
SPARKAI_API_KEY = 'xxxxxxxxxxxxxxxxxxx'
serviceId = 'xxxxxxxxx'
resourceId = 'xxxxxxxxx'

# The original snippet used `prompt` without defining it. Set it to the full instruction
# plus reading text, for example the return value of get_prompt_cn(...) from section 1.4.
prompt = 'xxxxxxx'

if __name__ == '__main__':
    spark = ChatSparkLLM(
        spark_api_url=SPARKAI_URL,
        spark_app_id=SPARKAI_APP_ID,
        spark_api_key=SPARKAI_API_KEY,
        spark_api_secret=SPARKAI_API_SECRET,
        spark_llm_domain=serviceId,
        model_kwargs={"patch_id": resourceId},
        streaming=False,
    )
    messages = [ChatMessage(role="user", content=prompt)]
    handler = ChunkPrintHandler()
    a = spark.generate([messages], callbacks=[handler])
    print(a.generations[0][0].text)

Running the script prints the questions and answers generated by the model. The output is as expected!
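
Beyond reading the output by eye, a rough automated check can confirm that a response contains the expected number of answers. The sketch below assumes the model follows the training format, with answers written as "答案: X" or "answer:X"; the helper names and the sample text are made up, and real responses may deviate from this layout.

```python
import re

# Matches '答案: B', '答案：B', 'answer:C', 'Answer: c', and similar.
ANSWER_RE = re.compile(r'(?:答案|answer)\s*[:：]\s*([A-D])\b', re.IGNORECASE)


def extract_answers(generated_text: str) -> list:
    """Return the answer letters found in the generated text, upper-cased."""
    return [letter.upper() for letter in ANSWER_RE.findall(generated_text)]


def looks_complete(generated_text: str, expected: int = 4) -> bool:
    """True if the text contains exactly the expected number of answers."""
    return len(extract_answers(generated_text)) == expected


# Made-up response with two questions, one in each answer style.
sample_response = (
    "1.question one?\nA. w\nB. x\nC. y\nD. z\n答案: B\n"
    "2.question two?\nA. w\nB. x\nC. y\nD. z\nanswer:C\n"
)
found = extract_answers(sample_response)
print(found, looks_complete(sample_response, expected=2))
```

A check like this only validates the format, not whether the answers are actually correct for the reading text.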

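Summary

This article walked through the complete workflow from data preparation and model training to local testing, with a focus on the code that builds the fine-tuning dataset, and used the iFLYTEK Open Platform to build a gaokao multiple-choice question generator on top of the Spark 13B language model.
Finally, we tested the model locally by calling the API of the published service.

I hope this post helps readers who want to build similar systems.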



If you found this article helpful, a like, a bookmark, and a follow are the best encouragement for my writing! A star🌟 works too😂.