On preference data quality

January 2026

There's a persistent belief in parts of the ML community that scale solves everything. More parameters, more tokens, more compute. And while scale has produced remarkable results, I've become increasingly convinced that the post-training phase deserves more attention than it gets.

The basic insight isn't new: a model's behavior is shaped not just by pretraining, but by the fine-tuning that follows. RLHF, DPO, and related methods all rely on preference data—examples of "better" and "worse" outputs that teach the model what we actually want.
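For concreteness, here's a minimal sketch of how a single preference pair feeds into the DPO objective. It assumes you already have batched tensors of summed log-probabilities for the chosen and rejected responses under the policy and under a frozen reference model; the function and variable names are mine, not from any particular library.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logp, policy_rejected_logp,
                 ref_chosen_logp, ref_rejected_logp, beta=0.1):
        # How much more (or less) likely the policy finds each response
        # than the reference model does, in log space.
        chosen_ratio = policy_chosen_logp - ref_chosen_logp
        rejected_ratio = policy_rejected_logp - ref_rejected_logp
        # DPO pushes the chosen log-ratio above the rejected one; beta
        # controls how strongly deviations from the reference are penalized.
        return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

Every "better"/"worse" pair in the dataset becomes one term of this loss, which is why the labels themselves end up carrying so much weight.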

Quality compounds

What I've seen in practice is that preference data quality has compounding effects. Clean, well-annotated examples don't just improve the immediate metric—they reduce the noise that causes instability in training, lead to faster convergence, and result in models that generalize better to novel situations.

The opposite is also true. Noisy preference data introduces contradictions that the model has to somehow reconcile. This often manifests as unpredictable behavior on edge cases, or a tendency to hedge excessively.
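To make the effect concrete, here's a toy calculation (my framing, not a claim about any particular pipeline): most preference-model training implicitly fits something like a Bradley-Terry model, where the probability of preferring the chosen response is a sigmoid of a reward gap. If a fraction of labels are flipped at random, the observed preference rate shrinks toward 50/50, and so does the reward gap the model can recover.

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def logit(p):
        return math.log(p / (1.0 - p))

    true_gap = 2.0                    # hypothetical true reward gap
    p_clean = sigmoid(true_gap)       # ~0.88 preference rate with clean labels

    for eps in (0.0, 0.1, 0.2, 0.3):  # fraction of randomly flipped labels
        p_noisy = (1 - eps) * p_clean + eps * (1 - p_clean)
        print(f"eps={eps:.1f}  observed rate={p_noisy:.2f}  "
              f"recoverable gap={logit(p_noisy):.2f}")

At 20% label noise the recoverable gap in this toy setup drops from 2.0 to roughly 1.0, even though the underlying responses haven't changed.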

What "quality" means

When I say quality, I mean several things:

  • Consistency: annotators should agree with each other and with themselves over time
  • Clarity: the preference should be unambiguous given the task definition
  • Coverage: the data should represent the actual distribution of queries the model will see
  • Calibration: the strength of a preference label should track the size of the quality gap between responses, not how confident the annotator felt

Getting all four right is surprisingly hard. Most datasets I've worked with fail on at least one dimension.
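Of the four, consistency is the easiest to check cheaply. As a rough illustration, here's what a first-pass agreement check might look like, assuming annotations arrive as (item_id, annotator_id, label) triples with labels in {"A", "B"}; the data format and function name are hypothetical.

    from collections import defaultdict
    from itertools import combinations

    def pairwise_agreement(annotations):
        """Fraction of annotator pairs that agree, pooled over items
        labeled by more than one annotator. A crude consistency proxy;
        chance-corrected measures like Cohen's kappa are stricter."""
        by_item = defaultdict(list)
        for item_id, annotator_id, label in annotations:
            by_item[item_id].append(label)
        agree = total = 0
        for labels in by_item.values():
            for a, b in combinations(labels, 2):
                agree += (a == b)
                total += 1
        return agree / total if total else float("nan")

Coverage and calibration need different tooling (query-distribution comparisons and preference-strength audits, respectively), but even a crude check like this tends to surface problems early.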

Implications

If I'm right about this, it suggests that smaller organizations can compete more effectively than the "scale is all you need" framing implies. Careful curation of training data, thoughtful annotation guidelines, and iterative refinement of the preference model can substitute for a lot of compute.

It also suggests we should invest more in evaluation. You can only improve what you can measure, and the standard benchmarks often miss the subtle behavioral changes that matter most for real-world usefulness.

More on evaluation another time.
