Reading Group Week 4: Task Decomposition and Scalable Oversight

This Week's Materials

Measuring Progress on Scalable Oversight for Large Language Models by Samuel Bowman (2022)
- Read up to (but not including) the Experiment section
Learning Complex Goals with Iterated Amplification by Paul Christiano and Dario Amodei (2018)

AI Alignment Landscape by Paul Christiano (2020)
- Watch until 27:00
Summarizing Books with Human Feedback by Jeffrey Wu, Ryan Lowe and Jan Leike (2021)

Definition: Scalable oversight is the problem of enabling humans to supervise AI systems whose capabilities already exceed their own.

Proposed techniques (to be discussed further next week) include plain model interaction, debate, market-making, self-critique, amplification, and recursive reward modeling.

When evaluating scalable oversight techniques, we need to make sure they work under the following conditions:

The core idea is "sandwiching": find a task where expert > model > non-expert, have non-experts align the model by interacting with it, then have experts evaluate the result.

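Choose a task that satisfies expert > model > non-expert
Inner loop (attempt): non-experts iteratively align the model using a scalable oversight strategy
Outer loop (evaluation):
- Inspect the model's behavior to judge whether alignment succeeded
- Try a different scalable oversight strategy (repeat the inner loop)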
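
The two loops above can be sketched in a few lines of Python. This is only a minimal illustration, not the paper's procedure: `Model`, `Strategy`, `sandwiching`, and `make_toy_feedback` are hypothetical names, and the "strategy" here is a toy stand-in that simply corrects one answer per round.

```python
from typing import Callable, Dict, List

Model = Callable[[str], str]         # hypothetical: maps a question to an answer
Strategy = Callable[[Model], Model]  # hypothetical: one round of non-expert oversight


def expert_accuracy(model: Model, questions: List[str],
                    expert_answers: Dict[str, str]) -> float:
    """Outer-loop evaluation: fraction of questions where the model matches the expert."""
    return sum(model(q) == expert_answers[q] for q in questions) / len(questions)


def sandwiching(base_model: Model, strategies: Dict[str, Strategy],
                questions: List[str], expert_answers: Dict[str, str],
                inner_rounds: int = 3) -> Dict[str, float]:
    results = {}
    for name, strategy in strategies.items():  # outer loop: try each strategy
        model = base_model
        for _ in range(inner_rounds):           # inner loop: the non-expert iterates
            model = strategy(model)
        # Experts are used only to evaluate, never inside the inner loop.
        results[name] = expert_accuracy(model, questions, expert_answers)
    return results


def make_toy_feedback(corrections: Dict[str, str]) -> Strategy:
    """Toy stand-in for a real oversight strategy: each round fixes one more answer."""
    pending = list(corrections.items())

    def strategy(model: Model) -> Model:
        if not pending:
            return model
        question, answer = pending.pop(0)
        return lambda q, _prev=model: answer if q == question else _prev(q)

    return strategy


questions = ["q1", "q2", "q3"]
expert_answers = {"q1": "A", "q2": "B", "q3": "C"}
base_model: Model = lambda q: "A"  # toy model that always answers "A"

results = sandwiching(
    base_model,
    {"no_oversight": lambda m: m,
     "toy_feedback": make_toy_feedback({"q2": "B", "q3": "C"})},
    questions, expert_answers,
)
print(results)
```

The point of the structure is that expert answers appear only in `expert_accuracy`; a real strategy would have to improve the model without access to them.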

Goal: verify that tasks and models satisfying the sandwiching conditions actually exist.

Specifically, two things need to be verified:

Experiment tasks:

Results: Human + Model > Model ≥ Human, which supports the sandwiching hypothesis.

Discussion and questions:

A training signal is how we evaluate whether an ML system is performing well, for example labels in supervised learning or rewards in RL.

Iterated Amplification is a method for generating training signals for highly complex tasks:
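
As a rough illustration of the idea, the sketch below alternates an "amplify" step (a human splits a task, asks the current model about the pieces, and combines the answers) with a "distill" step (the model is trained on those combined answers). Everything here is an assumption made for illustration: the task (summing integers), `TableModel`, `amplify`, and `iterated_amplification` are hypothetical, and the lookup table stands in for real supervised training.

```python
from typing import Dict, List, Optional, Tuple

Task = Tuple[int, ...]  # toy task: "sum these integers"


class TableModel:
    """Toy 'model': answers trivial tasks directly, others only if memorized."""

    def __init__(self) -> None:
        self.table: Dict[Task, int] = {}

    def answer(self, task: Task) -> Optional[int]:
        if len(task) <= 1:               # trivial tasks the base model can do
            return task[0] if task else 0
        return self.table.get(task)      # None means "cannot answer yet"

    def train(self, examples: Dict[Task, int]) -> None:
        self.table.update(examples)      # stand-in for supervised distillation


def amplify(model: TableModel, task: Task) -> Optional[int]:
    """Human + model: split the task, ask the model about each half, combine."""
    if len(task) <= 1:
        return model.answer(task)
    mid = len(task) // 2
    left, right = model.answer(task[:mid]), model.answer(task[mid:])
    if left is None or right is None:
        return None                      # subtasks are still too hard for the model
    return left + right                  # the only step the human performs


def iterated_amplification(tasks: List[Task], rounds: int) -> TableModel:
    model = TableModel()
    for _ in range(rounds):
        # Amplified answers are the training signal for the next model.
        signal = {t: amplify(model, t) for t in tasks}
        model.train({t: y for t, y in signal.items() if y is not None})
    return model


full = (1, 2, 3, 4, 5, 6, 7, 8)
tasks = [(1, 2), (3, 4), (5, 6), (7, 8), (1, 2, 3, 4), (5, 6, 7, 8), full]
model = iterated_amplification(tasks, rounds=3)
print(model.answer(full))
```

In this toy, each round lets the model handle tasks twice as long as before, so the eight-element sum becomes answerable only after the third round, even though the human never does more than one addition at a time.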

What is scalable oversight?

The ability to provide reliable supervision—in the form of labels, reward signals, or critiques—to models in a way that will remain effective past the point that models start to achieve broadly human-level performance.
Interpreting the sandwiching diagram (see the explanation above)
OpenAI's Weak-to-strong Generalization

Use a smaller (weaker) model to supervise a larger (stronger) model. This is directly related to the core challenge of scalable oversight.
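
One way to quantify how well weak supervision works is the "performance gap recovered" (PGR) metric used in OpenAI's weak-to-strong paper: the share of the gap between the weak supervisor and the strong model's ceiling that the weakly supervised strong model closes. The function name and the accuracies below are made up for illustration only.

```python
def performance_gap_recovered(weak: float, weak_to_strong: float,
                              strong_ceiling: float) -> float:
    """Fraction of the weak-to-ceiling gap recovered under weak supervision."""
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak supervisor performance")
    return (weak_to_strong - weak) / gap


# Made-up accuracies, not results from any paper.
pgr = performance_gap_recovered(weak=0.60, weak_to_strong=0.75, strong_ceiling=0.90)
print(f"PGR = {pgr:.2f}")
```

With these made-up numbers, the strong model recovers half of the gap. A PGR of 1 would mean weak supervision is as good as ground-truth supervision, and 0 would mean the strong model does no better than its weak supervisor.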