paper reading: Verifier

Paper List

Main contributions

Cobbe:

We introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems.
- GitHub: https://github.com/openai/grade-school-math
- Benchmark accuracy has rised from 75% in April to 96.7% in recent Qwen2: https://paperswithcode.com/sota/arithmetic-reasoning-on-gsm8k
We propose training verifiers to judge the correctness of model completions.
- Best-of-N: At test time, we generate many candidate solutions and select the one ranked highest by the verifier.
- We show that, compared to a finetuning baseline, the use of verifiers results in approximately the same performance boost as a 30x model size increase, and that verifiers scale significantly better with increased data.
  - On the full dataset, 6B verification slightly outperforms a finetuned 175B model, thereby offering a boost approximately equivalent to a 30x model size increase.
We show that dropout acts as a strong regularizer, significantly improving performance for both finetuning and verification.
Dataset design principles:
1. High Quality
2. High Diversity
3. Moderate Difficulty: We choose a problem distribution that is challenging for large state-of-the-art language models, without being completely intractable. GSM8K will help us better understand the data scaling trends of different models and methods in this difficulty sweet spot. Problems require no concepts beyond the level of early Algebra, and the vast majority of problems can be solved without explicitly defining a variable.
4. Natural Language Solutions
We use test@N to denote the percentage of problems solved correctly at least once when allowing the model to make N separate guesses for each problem.
- T=0 for test@1; T=0.7 for test@100; Both temperature values were chosen empirically to produce the best results.
Dataset:
1. Each problem is solved twice in labeling.

Lightman:

We also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
We show that process supervision can train much more reliable reward models than outcome supervision.
1. Process supervision provides feedback for each intermediate reasoning step.
2. Outcome supervision provides feedback for a final result.
At each model scale, we use a single fixed model to generate all solutions - Generator.
1. We do not improve the generator with RL.
2. When we do outcome and process supervision, we are referring to the supervision given to the reward model.

Chain-of-thought:

Uesato:

We find that pure outcome-based supervision produces similar final-answer error rates with less label supervision.

paper reading: Verifier

Paper List

Main contributions

Cobbe:

Lightman:

Chain-of-thought:

Uesato:

Reference