paper reading: Verifier

10 Oct 2024

Paper List

Cobbe: Training Verifiers to Solve Math Word Problems

Lightman: Let’s Verify Step by Step

Hendrycks: Measuring Mathematical Problem Solving With the MATH Dataset

Chain-of-thought: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Uesato: Solving math word problems with process-and outcome-based feedback

Main contributions

Cobbe:

  1. We introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems.
    • GitHub: https://github.com/openai/grade-school-math
    • Benchmark accuracy has rised from 75% in April to 96.7% in recent Qwen2: https://paperswithcode.com/sota/arithmetic-reasoning-on-gsm8k
  2. We propose training verifiers to judge the correctness of model completions.
    • Best-of-N: At test time, we generate many candidate solutions and select the one ranked highest by the verifier.
    • We show that, compared to a finetuning baseline, the use of verifiers results in approximately the same performance boost as a 30x model size increase, and that verifiers scale significantly better with increased data.
      • On the full dataset, 6B verification slightly outperforms a finetuned 175B model, thereby offering a boost approximately equivalent to a 30x model size increase.
  3. We show that dropout acts as a strong regularizer, significantly improving performance for both finetuning and verification.

  4. Dataset design principles:
    1. High Quality
    2. High Diversity
    3. Moderate Difficulty: We choose a problem distribution that is challenging for large state-of-the-art language models, without being completely intractable. GSM8K will help us better understand the data scaling trends of different models and methods in this difficulty sweet spot. Problems require no concepts beyond the level of early Algebra, and the vast majority of problems can be solved without explicitly defining a variable.
    4. Natural Language Solutions
  5. We use test@N to denote the percentage of problems solved correctly at least once when allowing the model to make N separate guesses for each problem.
    • T=0 for test@1; T=0.7 for test@100; Both temperature values were chosen empirically to produce the best results.
  6. Dataset:
    1. Each problem is solved twice in labeling.

Lightman:

  1. We also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.

  2. We show that process supervision can train much more reliable reward models than outcome supervision.
    1. Process supervision provides feedback for each intermediate reasoning step.
    2. Outcome supervision provides feedback for a final result.
  3. At each model scale, we use a single fixed model to generate all solutions - Generator.
    1. We do not improve the generator with RL.
    2. When we do outcome and process supervision, we are referring to the supervision given to the reward model.

Chain-of-thought:

Uesato:

  1. We find that pure outcome-based supervision produces similar final-answer error rates with less label supervision.

Reference