Fine-Tuning DeepSeek R1 to Build a Medical Reasoning Model

疯哥

This article shows how to fine-tune DeepSeek on a medical chain-of-thought dataset to build a specialized medical model for medical question answering and reasoning. The resulting model is particularly useful for medical applications that require structured reasoning. Because DeepSeek R1 is open source and free, we can readily use it to build our own reasoning model.

In this article we use DeepSeek-R1-Distill-Llama-8B and fine-tune it on a medical dataset from Hugging Face. DeepSeek-R1-Distill-Llama-8B is a distilled version of DeepSeek-R1, created by fine-tuning the Llama 3.1 8B model on data generated with DeepSeek-R1, and it has reasoning capabilities similar to the original model.

Introduction to DeepSeek R1

DeepSeek's first-generation open-source reasoning models, DeepSeek-R1 and DeepSeek-R1-Zero, perform on par with OpenAI's o1 on reasoning tasks such as math, coding, and logic.

DeepSeek-R1-Zero

DeepSeek-R1-Zero is the first open-source model trained entirely with large-scale reinforcement learning (RL) instead of supervised fine-tuning (SFT) as the initial step. This approach lets the model develop chain-of-thought (CoT) reasoning on its own, solve complex problems, and iteratively refine its outputs. However, it suffers from issues such as repeated reasoning steps, poor readability, and language mixing, which hurt its accuracy and usability.

DeepSeek-R1

DeepSeek-R1 was introduced to overcome the limitations of DeepSeek-R1-Zero by adding a cold-start phase before reinforcement learning: the base model is first fine-tuned on a small amount of carefully curated long chain-of-thought data, giving it a solid foundation for both reasoning and non-reasoning tasks.

This multi-stage training allows the model to achieve performance comparable to OpenAI o1 on math, coding, and reasoning benchmarks, while improving the readability and coherence of its outputs.

DeepSeek Distillation

DeepSeek released not only the large model but also a family of smaller ones. The large model is powerful but needs serious hardware to run, whereas the smaller models are lighter and more efficient while still performing well.

These smaller models range from 1.5B to 70B parameters. Despite their size, they retain strong reasoning ability: DeepSeek-R1-Distill-Qwen-32B, for example, outperforms OpenAI's o1-mini on several benchmarks.

These smaller models are distilled from the large one and inherit its reasoning ability, which shows that this distillation approach really works.


Steps to Fine-Tune DeepSeek R1

To fine-tune the DeepSeek R1 model, follow the steps below:

1. Preparation

We recommend Kaggle as the cloud development environment because it offers free GPUs, which are usually more powerful than those on Google Colab. First, open Kaggle and add your Hugging Face and Weights & Biases keys: in the Kaggle interface, go to "Add-ons" and select "Secrets". Once the secrets are set, install the unsloth Python package, an open-source tool that makes fine-tuning large models up to 2x faster while using less memory.

Read the Unsloth guide, "Optimizing and Accelerating LLM Fine-Tuning", to learn about Unsloth's main features, its various capabilities, and how to optimize your fine-tuning workflow.

%%capture
!pip install unsloth
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

Log in to the Hugging Face CLI using the Hugging Face API token retrieved from Kaggle Secrets.

from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

hf_token = user_secrets.get_secret("HUGGINGFACE_TOKEN")
login(hf_token)

Log in to Weights & Biases (wandb) with your API key and create a new project to track the fine-tuning progress.

import wandb

wb_token = user_secrets.get_secret("wandb")

wandb.login(key=wb_token)
run = wandb.init(
    project='Fine-tune-DeepSeek-R1-Distill-Llama-8B on Medical COT Dataset', 
    job_type="training", 
    anonymous="allow"
)

2. Load the Model and Tokenizer

We use the Unsloth version of DeepSeek-R1-Distill-Llama-8B and load it with 4-bit quantization to optimize memory usage and performance.

from unsloth import FastLanguageModel

max_seq_length = 2048  # maximum sequence length used for training and inference
dtype = None           # None lets Unsloth auto-detect float16 or bfloat16 depending on the GPU
load_in_4bit = True    # load the weights in 4-bit quantization to reduce memory usage


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/DeepSeek-R1-Distill-Llama-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = hf_token, 
)

3. Inference Before Fine-Tuning

Before fine-tuning, we design a prompt template for the model. The template contains a system prompt plus placeholders for the question and the response. Its purpose is to guide the model to think step by step and produce a logical, accurate answer.

prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request. 
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Please answer the following medical question. 

### Question:
{}

### Response:
<think>{}"""

In this example, we fill prompt_style with a medical question, tokenize it, and pass the tokens to the model to generate a response.

question = "A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?"


FastLanguageModel.for_inference(model)  # switch the model into Unsloth's optimized inference mode
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])

Even without fine-tuning, the model already produces a reasoning process and thinks step by step before giving its final answer. The reasoning steps are wrapped in <think></think> tags.

So why do we still need fine-tuning? Although the reasoning is detailed, it is verbose rather than concise, and the final answer is presented as a bulleted list, which does not match the structure and style of the dataset we are going to fine-tune on.

<think>
Okay, so I have this medical question to answer. Let me try to break it down. The patient is a 61-year-old woman with a history of involuntary urine loss during activities like coughing or sneezing, but she doesn't leak at night. She's had a gynecological exam and a Q-tip test. I need to figure out what cystometry would show regarding her residual volume and detrusor contractions.

First, I should recall what I know about urinary incontinence. Involuntary urine loss during activities like coughing or sneezing makes me think of stress urinary incontinence. Stress incontinence typically happens when the urethral sphincter isn't strong enough to resist increased abdominal pressure from activities like coughing, laughing, or sneezing. This usually affects women, especially after childbirth when the pelvic muscles and ligaments are weakened.

The Q-tip test is a common diagnostic tool for stress urinary incontinence. The test involves inserting a Q-tip catheter, which is a small balloon catheter, into the urethra. The catheter is connected to a pressure gauge. The patient is asked to cough, and the pressure reading is taken. If the pressure is above normal (like above 100 mmHg), it suggests that the urethral sphincter isn't closing properly, which is a sign of stress incontinence.

So, based on the history and the Q-tip test, the diagnosis is likely stress urinary incontinence. Now, moving on to what cystometry would show. Cystometry, also known as a filling cystometry, is a diagnostic procedure where a catheter is inserted into the bladder, and the bladder is filled with a liquid to measure how much it can hold (residual volume) and how it responds to being filled (like during a cough or sneeze). This helps in assessing the capacity and compliance of the bladder.

In a patient with stress incontinence, the bladder's capacity might be normal, but the sphincter's function is impaired. So, during the cystometry, the residual volume might be within normal limits because the bladder isn't overfilled. However, when the patient is asked to cough or perform a Valsalva maneuver, the detrusor muscle (the smooth muscle layer of the bladder) might not contract effectively, leading to an increase in intra-abdominal pressure, which might cause leakage.

Wait, but detrusor contractions are usually associated with voiding. In stress incontinence, the issue isn't with the detrusor contractions but with the sphincter's inability to prevent leakage. So, during cystometry, the detrusor contractions would be normal because they are part of the normal voiding process. However, the problem is that the sphincter doesn't close properly, leading to leakage.

So, putting it all together, the residual volume might be normal, but the detrusor contractions would be normal as well. The key finding would be the impaired sphincter function leading to incontinence, which is typically demonstrated during the Q-tip test and clinical history. Therefore, the cystometry would likely show normal residual volume and normal detrusor contractions, but the underlying issue is the sphincter's inability to prevent leakage.
</think>

Based on the provided information, the cystometry findings in this 61-year-old woman with stress urinary incontinence would likely demonstrate the following:

1. **Residual Volume**: The residual volume would be within normal limits. This is because the bladder's capacity is typically normal in cases of stress incontinence, where the primary issue lies with the sphincter function rather than the bladder's capacity.

2. **Detrusor Contractions**: The detrusor contractions would also be normal. These contractions are part of the normal voiding process and are not impaired in stress urinary incontinence. The issue is not with the detrusor muscle but with the sphincter's inability to prevent leakage.

In summary, the key findings of the cystometry would be normal residual volume and normal detrusor contractions, highlighting the sphincteric defect as the underlying cause of the incontinence.<|end▁of▁sentence|>

4. Load and Process the Dataset

We slightly adjust the prompt template for the dataset by adding a placeholder for the complex reasoning steps.

train_prompt_style = """Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request. 
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
Please answer the following medical question. 

### Question:
{}

### Response:
<think>
{}
</think>
{}"""

Write a Python function that creates a "text" column in the dataset built from the training prompt style, filling the placeholders with the question, the chain of thought, and the answer.

EOS_TOKEN = tokenizer.eos_token  # Must add EOS_TOKEN


def formatting_prompts_func(examples):
    inputs = examples["Question"]
    cots = examples["Complex_CoT"]
    outputs = examples["Response"]
    texts = []
    for input, cot, output in zip(inputs, cots, outputs):
        text = train_prompt_style.format(input, cot, output) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }

We load the first 500 samples from the FreedomIntelligence/medical-o1-reasoning-SFT dataset (available on the Hugging Face Hub). Then we map the text column using the formatting_prompts_func function.

from datasets import load_dataset
dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT","en", split = "train[0:500]",trust_remote_code=True)
dataset = dataset.map(formatting_prompts_func, batched = True,)
dataset["text"][0]

We can see that the output contains the system prompt, the instruction, the chain of thought, and the answer.

"Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response. You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. Please answer the following medical question. Question: A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions? Response: Okay, let's think about this step by step. There's a 61-year-old woman who leaks urine involuntarily whenever she does something that raises her abdominal pressure, like coughing or sneezing. That sounds a lot like stress urinary incontinence. Interestingly, she has no problems at night; she doesn't leak while sleeping. That probably means her bladder holds urine normally when she isn't under physical stress, which is a clue that we're dealing with a stress-related problem rather than a bladder-muscle problem. The fact that she had a Q-tip test is also interesting. That test is usually used to assess urethral mobility. In stress incontinence, the Q-tip often moves noticeably, showing urethral hypermobility. That movement generally means the supporting structures that should keep the urethra closed under increased abdominal pressure are weak, which fits stress incontinence well. Now, what would happen during cystometry? Since stress incontinence usually isn't caused by sudden bladder contractions, I wouldn't expect to see involuntary detrusor contractions on this test. Her bladder isn't spasming or anything; it's more that the supporting structures fail under pressure. Also, she probably empties her bladder completely, because stress incontinence usually doesn't involve incomplete emptying, so her residual volume should be normal. Putting it all together, cystometry would most likely show a normal residual volume and no involuntary contractions. Yes, given her symptoms and the typical presentation of stress incontinence, I think that makes sense. In the case of stress urinary incontinence, cystometry would most likely show a normal post-void residual volume, since stress incontinence generally does not involve problems with bladder emptying. In addition, because stress incontinence is primarily related to physical exertion rather than an overactive bladder, you would not expect to see any involuntary detrusor contractions during the test."

5. Set Up the Model

We set up the model by attaching low-rank adapters (LoRA) to its target modules.

model = FastLanguageModel.get_peft_model(
    model,
    r=16,  
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,  
    bias="none",  
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=False,  
    loftq_config=None,
)

Next, we set up the training arguments and the trainer, providing the model, tokenizer, dataset, and other key training parameters to optimize the fine-tuning process.

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        # Use num_train_epochs = 1, warmup_ratio for full training runs!
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)

6. Model Training

Run the following command to start training.
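
The call below kicks off the SFTTrainer run configured above; assigning the result to trainer_stats simply keeps the returned training statistics for later inspection.

trainer_stats = trainer.train()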


You can view the fine-tuned model's evaluation report on the Weights & Biases dashboard by logging in to the site and opening the project.
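
Once training is done, it is also worth closing the run so the dashboard report is finalized; a minimal call for this:

wandb.finish()  # mark the Weights & Biases run as finished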


If you run into problems while running the code above, refer to the "Fine-Tuning DeepSeek R1 (Reasoning Model)" Kaggle notebook.

7. Inference After Fine-Tuning

To compare the results, we ask the fine-tuned model the same question as before and see what has changed.

question = "A 61-year-old woman with a long history of involuntary urine loss during activities like coughing or sneezing but no leakage at night undergoes a gynecological exam and Q-tip test. Based on these findings, what would cystometry most likely reveal about her residual volume and detrusor contractions?"


FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference!
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])

The result is much better and more accurate. The chain of thought is to the point, and the answer is direct, delivered in a single paragraph. The fine-tuning was successful.

8. Save the Model Locally

Now let's save the adapter, the full model, and the tokenizer locally so we can use them in other projects.

new_model_local = "DeepSeek-R1-Medical-COT"
model.save_pretrained(new_model_local)      # saves the LoRA adapter weights
tokenizer.save_pretrained(new_model_local)  # saves the tokenizer files

# merge the adapter into the base model and save the full 16-bit weights
model.save_pretrained_merged(new_model_local, tokenizer, save_method = "merged_16bit",)
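
As a quick sanity check that the merged weights are usable outside this notebook, here is a minimal sketch of reloading them with plain transformers; the directory name matches the one saved above.

from transformers import AutoModelForCausalLM, AutoTokenizer

reloaded_tokenizer = AutoTokenizer.from_pretrained("DeepSeek-R1-Medical-COT")
reloaded_model = AutoModelForCausalLM.from_pretrained(
    "DeepSeek-R1-Medical-COT",
    device_map="auto",    # place layers on the available GPU(s)
    torch_dtype="auto",   # use the dtype stored in the checkpoint
)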

9. Push the Model to the Hugging Face Hub

We upload the adapter, tokenizer, and model to the Hugging Face Hub so the community can use them directly and integrate them into their own systems.

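A minimal sketch of the upload, assuming a hypothetical repo name "your-username/DeepSeek-R1-Medical-COT" and that the Hugging Face token used earlier has write access. The plain push_to_hub calls upload the adapter and tokenizer, while Unsloth's push_to_hub_merged uploads the merged 16-bit weights.

new_model_online = "your-username/DeepSeek-R1-Medical-COT"  # hypothetical repo name

model.push_to_hub(new_model_online)        # upload the LoRA adapter
tokenizer.push_to_hub(new_model_online)    # upload the tokenizer
model.push_to_hub_merged(new_model_online, tokenizer, save_method="merged_16bit", token=hf_token)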

If you want to deploy the model to the cloud, follow the guide "How to Deploy an LLM with BentoML". It walks through deploying large models efficiently and cost-effectively with tools such as BentoML and vLLM.

If you prefer to run the model locally, you can convert it to GGUF format and run it on your own machine. See the guide "Fine-Tuning Llama 3.2 and Using It Locally" for detailed instructions.
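
For the local route, Unsloth also provides a helper for exporting GGUF files. A minimal sketch, assuming the output directory name and the q4_k_m quantization method (other methods such as f16 or q8_0 also exist):

# export the fine-tuned model to GGUF with 4-bit k-quant quantization
model.save_pretrained_gguf("DeepSeek-R1-Medical-COT-GGUF", tokenizer, quantization_method="q4_k_m")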

Conclusion

The AI field is changing fast. The open-source community is on the rise, challenging the dominance that a few large companies have held over the past few years.

Open-source large models are becoming stronger, faster, and more efficient, and they can now be fine-tuned with less compute and memory.

In this tutorial, we looked at the DeepSeek R1 reasoning model and learned how to fine-tune its distilled version for medical question-answering tasks. The fine-tuned model not only performs better, it can also be applied in important fields such as medicine, emergency services, and healthcare.

In response to the release of DeepSeek R1, OpenAI has also launched two powerful tools: OpenAI o3, a more advanced reasoning model, and OpenAI's Operator AI agent, powered by the new CUA (Computer-Using Agent) model, which can browse websites and complete tasks on its own.
