
PPO


Training language models to follow instructions with human feedback

Summary: the paper proposes a method for fine-tuning large language models (LMs) with human feedback so that they better follow user instructions and align with user intent.


  1. Which one of the steps is NOT correct for the method introduced in this paper?

    InstructGPT's training method consists of three main steps:

    • Supervised fine-tuning (SFT): train the model on labeler-written prompt–response pairs.
    • Reward model training (RM): train a reward model on human preferences (comparison labels).
    • Reinforcement learning (RLHF with PPO): use the reward model to guide the language model, optimizing it with PPO.

    So a step is NOT correct if it claims, for instance, that the reward model is used directly to generate text without RLHF optimization, or that supervised fine-tuning alone completes the whole pipeline. A minimal sketch of the three stages follows this answer.
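
    To make the ordering concrete, here is a minimal Python sketch of the pipeline. Every name in it (`supervised_fine_tune`, `train_reward_model`, `ppo_optimize`, and the data objects) is a hypothetical stub, not the paper's actual training code:

    ```python
    # Minimal sketch of the InstructGPT three-stage pipeline.
    # All functions are hypothetical stubs standing in for real training code.

    def supervised_fine_tune(lm, demonstrations):
        """Stage 1 (SFT): fit the LM on labeler-written prompt-response pairs."""
        return lm  # stub: would return the fine-tuned model

    def train_reward_model(lm, comparisons):
        """Stage 2 (RM): fit a scalar reward model on human preference rankings."""
        return lm  # stub: would return the trained reward model

    def ppo_optimize(policy, reward_model, prompts):
        """Stage 3 (RLHF): optimize the policy against the reward model with PPO."""
        return policy  # stub: would return the PPO-tuned policy

    base_lm, demonstrations, comparisons, prompts = object(), [], [], []
    sft_lm = supervised_fine_tune(base_lm, demonstrations)
    reward_model = train_reward_model(sft_lm, comparisons)
    policy = ppo_optimize(sft_lm, reward_model, prompts)
    ```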

  2. Which steps from the previous question can be iterated continuously?

    The steps that can be iterated continuously are reward model training (RM) and reinforcement learning (RLHF). The paper notes that more human feedback can be collected on the current policy's outputs, the reward model updated, and the new reward model used to keep optimizing the policy, as in the loop sketched below.
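
    Continuing the stub sketch from the previous answer (same hypothetical names), stages 2 and 3 repeat in a loop, gathering fresh comparisons each round:

    ```python
    def collect_comparisons(policy, prompts):
        """Hypothetical stand-in for gathering new human preference labels."""
        return []  # stub

    for _ in range(3):  # the number of rounds is purely illustrative
        comparisons += collect_comparisons(policy, prompts)
        reward_model = train_reward_model(sft_lm, comparisons)
        policy = ppo_optimize(policy, reward_model, prompts)
    ```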

  3. For reward modeling, if the comparisons are simply shuffled, a single pass over the dataset would cause the reward model to overfit. How is this problem solved according to the paper?


    • Problem: if all comparisons are simply shuffled into one dataset, a single full pass over it causes the reward model to overfit, because the comparisons drawn from the same prompt are highly correlated.
    • Solution: for each prompt, train on all $\binom{K}{2}$ pairwise comparisons as a single batch element, rather than scattering them across the dataset as separate training examples.

    As a result, each completion needs only a single forward pass (K passes per prompt rather than $\binom{K}{2}$), and since the model no longer overfits, validation accuracy and log loss both improve. A short sketch of this per-prompt loss follows.
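
    To illustrate the batching trick, here is a short PyTorch sketch (my own, not code from the paper): the K completions for a prompt are each scored once, and all $\binom{K}{2}$ pairwise loss terms are built from those K scores inside one batch element:

    ```python
    import torch

    def rm_loss_per_prompt(rewards: torch.Tensor) -> torch.Tensor:
        """Pairwise ranking loss over all C(K, 2) comparisons for one prompt.

        `rewards` has shape (K,) and is ordered from most to least preferred,
        so in every pair the earlier element is the labeler-preferred one.
        The K scores come from K forward passes (one per completion); the
        C(K, 2) differences are formed from them without recomputation.
        """
        K = rewards.shape[0]
        # All index pairs (w, l) with w < l, i.e. y_w preferred over y_l.
        w_idx, l_idx = torch.triu_indices(K, K, offset=1)
        diffs = rewards[w_idx] - rewards[l_idx]  # r(x, y_w) - r(x, y_l)
        # loss = -E[log sigma(diff)], averaged over the C(K, 2) pairs.
        return -torch.nn.functional.logsigmoid(diffs).mean()

    # Toy example: K = 4 completions for one prompt, already ranked.
    scores = torch.tensor([2.1, 1.4, 0.3, -0.5])
    print(rm_loss_per_prompt(scores))
    ```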

  4. Use the loss function for the reward model mentioned in Section 3.5. For a given prompt x, the reward model r_theta assigns scores to two responses y_w and y_l. Suppose the reward for y_w is 3.0 and the reward for y_l is 1.3. What is the value of the core loss term for this single comparison?

    $$
    \begin{align*}
    r_\theta(x, y_w) - r_\theta(x, y_l) &= 3.0 - 1.3 = 1.7, \\[6pt]
    \sigma(1.7) &= \frac{1}{1 + e^{-1.7}} \approx \frac{1}{1 + 0.1827} \approx 0.845, \\[6pt]
    \text{loss} &= -\log\big(\sigma(1.7)\big) = -\log(0.845) \approx 0.168.
    \end{align*}
    $$
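
    As a quick sanity check, these numbers can be reproduced with a few lines of standard-library Python:

    ```python
    import math

    # Core loss term for one comparison: -log(sigma(r_w - r_l)).
    r_w, r_l = 3.0, 1.3
    diff = r_w - r_l                       # 1.7
    sigma = 1.0 / (1.0 + math.exp(-diff))  # ≈ 0.845
    loss = -math.log(sigma)                # ≈ 0.168
    print(f"diff={diff}, sigma={sigma:.4f}, loss={loss:.4f}")
    ```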