A Detailed Guide to Learning Rate Schedulers (lr_scheduler) in Deep Learning: Cosine Decay as an Example
During deep learning training, the learning rate (LR) is critical to how quickly a model converges and how well it ultimately performs. If the learning rate is too large, gradient updates become too big and the model may struggle to converge or even diverge; if it is too small, training can take far too long and the model may get stuck in a local optimum. Adjusting the learning rate sensibly is therefore one of the key strategies in training deep learning models.
In PyTorch and the Hugging Face transformers library, the lr_scheduler_type parameter specifies how the learning rate decays. This article explains cosine decay (the cosine scheduler) in detail and compares it with other common learning rate scheduling strategies.
1. What Is a Learning Rate Scheduler (lr_scheduler)?
A learning rate scheduler dynamically adjusts the learning rate used during training, improving convergence and helping the model avoid getting stuck in local optima.
Hugging Face exposes the lr_scheduler_type parameter to control how the learning rate decays, for example:
lr_scheduler_type = "cosine"
This means we use cosine learning rate decay to adjust the learning rate.
Different scheduling strategies behave differently during training; a sensible choice lets the model converge faster and more robustly.
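As a minimal sketch of where this parameter is typically set (assuming a standard TrainingArguments-based Trainer setup; the specific values are illustrative only):

```python
from transformers import TrainingArguments

# Minimal sketch: lr_scheduler_type selects the decay schedule.
# learning_rate, warmup_ratio, and num_train_epochs are illustrative values.
training_args = TrainingArguments(
    output_dir="./outputs",
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    num_train_epochs=3,
)
```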
2. Cosine Decay (Cosine Annealing)
lr_scheduler_type = "cosine"
selects cosine annealing of the learning rate.
The cosine decay schedule is given by:
$$\eta_t = \eta_{\min} + \frac{1}{2} (\eta_{\max} - \eta_{\min}) \left(1 + \cos\left(\frac{t}{T} \pi\right)\right)$$
where:
- $\eta_t$ is the learning rate at training step $t$.
- $\eta_{\max}$ is the initial learning rate.
- $\eta_{\min}$ is the final learning rate (usually set to 0).
- $T$ is the total number of training steps.
2.1 What Cosine Decay Does
- Lowers the learning rate gradually: early in training the learning rate is high, letting the model explore quickly; as training progresses it decreases, letting the model converge stably.
- Avoids gradient oscillation: compared with plain linear decay, the cosine curve adjusts the learning rate more smoothly, preventing updates from changing too abruptly.
2.2 What Cosine Decay Looks Like
The following code plots the cosine decay learning rate curve:
```python
import numpy as np
import matplotlib.pyplot as plt

def cosine_annealing_lr(initial_lr, min_lr, total_steps):
    steps = np.arange(total_steps)
    lrs = min_lr + 0.5 * (initial_lr - min_lr) * (1 + np.cos(np.pi * steps / total_steps))
    return steps, lrs

# Parameters
initial_lr = 5e-3   # initial learning rate
min_lr = 1e-5       # minimum learning rate
total_steps = 1000  # total number of training steps

steps, lrs = cosine_annealing_lr(initial_lr, min_lr, total_steps)

# Plot the learning rate curve
plt.plot(steps, lrs, label="Cosine Annealing LR")
plt.xlabel("Training Steps")
plt.ylabel("Learning Rate")
plt.title("Cosine Annealing Learning Rate Schedule")
plt.legend()
plt.show()
```
Result:
The cosine-decayed learning rate falls gradually from high to low and finally approaches the minimum learning rate $\eta_{\min}$, so that in the later stages of training the learning rate is small and the model converges stably.
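In real training code, the same schedule is available in plain PyTorch as torch.optim.lr_scheduler.CosineAnnealingLR. Below is a minimal sketch (the model and optimizer are placeholders, and the values mirror the plot above):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

# Placeholder model and optimizer, used only to show the scheduler wiring.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=5e-3)

# T_max: number of steps over which the LR follows the cosine curve down to eta_min.
scheduler = CosineAnnealingLR(optimizer, T_max=1000, eta_min=1e-5)

for step in range(1000):
    # ... forward pass and loss.backward() would go here ...
    optimizer.step()
    scheduler.step()  # update the learning rate according to the cosine formula
```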
3. Other Common Learning Rate Schedulers
Besides cosine decay, Hugging Face transformers offers several other common lr_scheduler_type options.
3.1 Linear Decay (linear)
lr_scheduler_type = "linear"
Formula:
$$\eta_t = \eta_{\max} \times \left(1 - \frac{t}{T}\right)$$
- Linear decay lowers the learning rate from its initial value to 0 at a constant rate (a rough PyTorch sketch follows after the pros and cons below).
- Suitable when the model is already stable but still needs small steps to converge.
Pros:
✅ Simple and easy to use; a good fit for most Transformer tasks.
Cons:
❌ The decay rate is fixed, so it cannot adapt the pace of convergence.
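As a rough illustration of the linear decay formula in plain PyTorch (a sketch using LambdaLR; the model and optimizer are placeholders):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

# Placeholder model and optimizer for illustration only.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=5e-3)

total_steps = 1000
# LambdaLR scales the base LR by the returned factor; (1 - t/T) gives linear decay to 0.
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: 1.0 - step / total_steps)
```

Hugging Face's "linear" option applies the same shape after an optional warmup phase.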
3.2 Cosine with Restarts (cosine_with_restarts)
lr_scheduler_type = "cosine_with_restarts"
- A variant of cosine decay in which the learning rate restarts after falling to its minimum; well suited to periodic tasks such as reinforcement learning or meta-learning (see the PyTorch sketch below).
- The learning rate curve shows multiple decay cycles rather than a single monotonic decline.
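In plain PyTorch the analogous scheduler is CosineAnnealingWarmRestarts; here is a minimal sketch (placeholder model and optimizer, with illustrative cycle lengths):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

# Placeholder model and optimizer for illustration only.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=5e-3)

# T_0: length of the first cosine cycle; T_mult=2 doubles the length of each subsequent cycle.
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=200, T_mult=2, eta_min=1e-5)
```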
3.3 Exponential Decay (exponential)
lr_scheduler_type = "exponential"
Formula:
$$\eta_t = \eta_{\max} \times e^{-\lambda t}$$
- Exponential decay reduces the learning rate at an exponential rate: it drops quickly at first and then flattens out (a PyTorch sketch follows after the pros and cons below).
- Suitable for training CNNs such as ResNet with SGD.
Pros:
✅ Works well for tasks with long training runs.
Cons:
❌ If λ is chosen poorly, the learning rate may decay too quickly or too slowly.
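A rough PyTorch sketch of this schedule uses ExponentialLR, which multiplies the learning rate by a fixed factor gamma at every scheduler step; setting gamma = e^{-λ} recovers the formula above (placeholder model and optimizer, illustrative λ):

```python
import math
import torch
from torch.optim.lr_scheduler import ExponentialLR

# Placeholder model and optimizer for illustration only.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=5e-3)

decay_lambda = 0.01  # illustrative decay rate
# gamma = exp(-lambda), so lr_t = lr_0 * exp(-lambda * t) after t scheduler steps.
scheduler = ExponentialLR(optimizer, gamma=math.exp(-decay_lambda))
```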
3.4 Constant Learning Rate (constant)
lr_scheduler_type = "constant"
- The learning rate stays fixed and never decays.
- Suitable for short training runs, or for fine-tuning where a stable learning rate is desired.
Pros:
✅ No risk of decaying the learning rate prematurely.
Cons:
❌ Inefficient for long training sessions.
3.5 Schedulers with Warmup
Many schedulers can be combined with a warmup phase. In transformers, warmup is normally configured alongside the chosen schedule (e.g. "linear" or "cosine") through the warmup_steps or warmup_ratio arguments; there is also a standalone "constant_with_warmup" scheduler type. A minimal sketch using get_scheduler follows the list below.
- Purpose of warmup: start training with a low learning rate and gradually raise it, which helps prevent gradient explosion and early instability.
- Widely used for pretrained Transformer models (BERT, GPT, LLaMA).
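A minimal sketch of building such a warmup-plus-decay schedule manually with transformers.get_scheduler (the optimizer is a placeholder; the step counts are illustrative):

```python
import torch
from transformers import get_scheduler

# Placeholder optimizer over a dummy parameter, for illustration only.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=5e-5)

# Linear warmup for the first 100 steps, then linear decay to 0 over the remaining steps.
scheduler = get_scheduler(
    name="linear",
    optimizer=optimizer,
    num_warmup_steps=100,
    num_training_steps=1000,
)
```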
4. Which Learning Rate Scheduler Should You Choose?
Different tasks suit different scheduling strategies:
| Task type | Recommended scheduler |
| --- | --- |
| Transformer training | cosine / linear + warmup |
| Fine-tuning BERT/GPT | linear / linear + warmup |
| CNN training | exponential |
| Long-horizon optimization tasks | cosine_with_restarts |
| Reinforcement learning | constant / cosine_with_restarts |
In GRPOConfig, cosine is chosen because:
- It suits LLM training: rapid optimization early on, then slow, steady convergence.
- It is smoother than linear decay and avoids abrupt changes in the updates (a configuration sketch follows below).
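As a minimal sketch of how this could look (assuming trl's GRPOConfig, which subclasses transformers.TrainingArguments and therefore accepts the usual scheduler fields; the values are illustrative only):

```python
from trl import GRPOConfig

# Illustrative values only; GRPOConfig inherits the standard TrainingArguments fields.
config = GRPOConfig(
    output_dir="./grpo-outputs",
    learning_rate=1e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
)
```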
5. Summary
- lr_scheduler_type = "cosine" applies cosine learning rate decay, letting the learning rate fall slowly from high to low and improving training stability.
- Besides cosine, there are linear, exponential, constant, and other strategies, each suited to different tasks.
- For LLM training (e.g. GPT, LLaMA), cosine or linear decay with warmup is recommended.
Choosing the right learning rate scheduler can greatly improve training efficiency and final performance! 🚀
Postscript
Written in Shanghai at 16:48 on February 21, 2025, with the assistance of the GPT-4o large model.