Flag It, Don't Fix It: Instrument Failure in LLM Scoring

A consistent internal finding across the Llama family of models in the HRCB evaluation pipeline: approximately 60–66% of lite evaluations return an editorial score of exactly 50 (maps to 0.0 on the normalized scale) with high confidence (≥0.70). (These percentages represent internal observations from the Observatory’s evaluation corpus, not externally validated benchmarks.) The prompt specification explicitly prohibits 50 as a valid output except when content demonstrably lacks all UDHR-relevant signal. Cross-validation with Haiku on the same stories shows measurable signal in 79% of these cases.

This constitutes instrument failure: the model defaults to the neutral midpoint rather than evaluating the content. The question then becomes: what should the system do with these observations?

The Fabrication Temptation

One response: detect the suspicious output and replace it. If the model returned 50 when it likely meant something else, impute a plausible score — perhaps from a Haiku cross-validation, or from domain-level averages. The imputed score would then participate normally in consensus computation.

This feels corrective. It maintains score distribution integrity. It prevents neutral-biased models from pulling the corpus average toward zero.

The problem: the imputed score doesn’t represent what the model observed. The builder doesn’t know what the correct score for this story should have looked like; neither does the validator. The imputation substitutes fabricated precision for admitted uncertainty. Downstream analyses that depend on score provenance — calibration checks, model trust metrics, ensemble weighting — would treat the fabricated score as a real observation.

Measurement theory offers a name for this error: score interpolation without a validated imputation model. The result looks like data. It behaves like data. But it lacks data’s essential property: grounding in an actual observation.

The Flag Approach

The editorial_uncertain flag (migration 0058) implements the alternative: preserve the original score, mark the observation as epistemically suspect, and let downstream processes decide how much to trust it.

The detection criterion:

const editorialUncertain =
  (agg.editorial_mean === 0.0 && (lite.evaluation.confidence ?? 0) >= 0.7) ? 1 : 0;

Three conditions fire simultaneously:

editorial_mean === 0.0 — the normalized score hit the neutral midpoint exactly
prompt_mode IN ('lite', 'light') — applies only to lite evaluations
confidence >= 0.7 — high confidence in a neutral midpoint constitutes the suspicious pattern

When the flag fires, two downstream effects follow:

1. Consensus discount: Flagged observations receive a neutral discount multiplier of 0.5:

const neutralDiscount =
  (isLite && (score === 0 || r.editorial_uncertain === 1) && r.confidence >= 0.7)
    ? 0.5 : 1.0;

The observation still contributes to consensus — its information isn’t discarded — but the system treats it with half the weight of a non-suspicious observation.

2. UI indication: The evaluation card displays a ~ prefix on flagged scores: ~0.00 rather than 0.00. The tilde signals “instrument uncertain” to readers examining individual story evaluations.

What the Flag Preserves

The flag approach maintains three properties that fabrication destroys:

Provenance chain integrity. Every score traces to a specific model output. A calibration audit can identify all flagged observations, compute what fraction came from which model, and trace whether the flag rate correlates with other quality indicators. A fabricated score breaks this chain.

Uncertainty propagation. Statistical analyses downstream can condition on editorial_uncertain. Studies comparing flagged vs. non-flagged evaluations become possible. This informs decisions about model routing, prompt redesign, and calibration set expansion.

Calibration as signal. The flag rate itself carries information. Knowing that 66% of llama-3.3-70b-wai lite evaluations trigger editorial_uncertain while only 39% of Haiku evaluations do — this differential constitutes a calibration finding. The measurement instrument reveals something about the measurement architecture.

The Broader Pattern

“Flag it, don’t fix it” generalizes beyond lazy-neutral detection:

Instrument failure mode	Wrong response	Right response
Model returns sentinel value under pressure	Impute plausible alternative	Flag as uncertain, discount in ensemble
Measurement tool produces reading outside valid range	Clamp to range boundary	Flag as out-of-spec, exclude from primary analyses
Survey respondent selects “prefer not to say”	Impute from demographic averages	Treat as missing-not-at-random, analyze separately
Sensor reads zero when physical zero unlikely	Replace with interpolated neighbor	Flag as possible dropout, preserve in raw trace

The common thread: when an instrument fails, the failure itself carries information. The failure rate, the conditions that trigger it, the correlation with other variables — these constitute a data stream about the measurement architecture. Overwriting the failure with a plausible value destroys this stream.

Caveats

The 0.7 confidence threshold requires validation. The current cutoff came from domain knowledge, not from a held-out validation against ground truth. Calibration data with known-good and known-bad zero scores would allow evidence-based threshold selection.

Haiku zeros also get flagged. Approximately 39% of Haiku lite evaluations fired the flag — possibly higher than expected if Haiku legitimately recognizes signal-free content more often than Llama. The flag rate differential between models now requires interpretation.

Discounting isn’t discarding. The 0.5 neutral discount still lets flagged observations participate in consensus. In a corpus dominated by flagged observations, the discount affects the weight distribution but doesn’t eliminate lite model influence.

Claude Code (Anthropic) drafted this post under human direction.

Sources

Human Rights Observatory — the evaluation pipeline producing the observations described here
Unratified — Methodology — machine-readable scoring specification (weights, SETL formula, evidence caps)