MAY 14, 2026

Why we don't trust LLMs to classify the call they're about to make.

By Shimon Rosenberg

When you're building an agent that can take consequential actions — send an email, modify a record, call an external API — you eventually face a version of the same question: how do you know the agent's output is actually grounded in what it was told?

The industry's default answer is: ask another LLM.

Run a second model call. Give it the original context and the agent's output. Ask it to evaluate whether the output is reasonable, accurate, or supported. Use the score to gate the action.

This is the evaluator-judge pattern, and it's everywhere. It's used in RAG pipelines to check citation fidelity, in agent frameworks to validate task completion, in safety systems to assess whether a response is within policy. It seems sensible — a second model acts as a check on the first.

We don't use it. Here's why.

The evaluator shares the generator's failure modes

When a language model hallucinates, it does so in a predictable way. It produces fluent, plausible text. It fills gaps rather than flagging them. It has a prior toward completion — toward producing an output that reads like a reasonable response to the input — and that prior is precisely calibrated to pass the surface-level test of "does this sound right?"

Now you ask a second language model to evaluate the first one's output.

That second model has the same prior. It was trained on similar data, by similar methods, toward similar objectives. It also produces fluent, plausible text. It also fills gaps rather than flagging them. And critically: it reads text the way it was trained to read text — which is to say, it reads for coherence and plausibility, not for the presence of a traceable evidential chain back to the source.

The evaluator will find the generator's output convincing for roughly the same reasons the generator produced it. Fluent, internally consistent text looks grounded to a model that learned from fluent, internally consistent text. The circular dependency is structural, not incidental.

Models grade their own outputs generously

This isn't just a theoretical concern. The empirical record on LLM self-evaluation and cross-evaluation is consistent: models favor outputs that look like outputs. They tend to rate longer, more detailed responses higher. They tend to rate responses from larger models higher. They prefer confident language over hedged language, even when the hedged response is more accurate. And when asked to evaluate a response generated by a model from the same family, they rate it substantially higher than human raters do on the same criteria.

There's a specific failure mode that matters for agent grounding: the evaluator will not reliably detect addition. If the generator produces a response that faithfully represents the source material and then adds one sentence that isn't supported by anything in the context, the evaluator will often score the whole response as grounded. The added sentence reads like the other sentences. It fits. The evaluator's architecture is not optimized to notice its presence as a gap; it's optimized to process the text as a coherent whole.

This is the hallucination problem restated for evaluation. The same mechanism that causes a model to add unsupported content in generation causes a model to miss unsupported content in evaluation.

What you actually need from a grounding classifier

Think about what a grounding check is trying to answer. It's a specific, narrow question: does each factual claim in this output have a traceable source in the provided material?

That question has a stable, auditable answer. Either "the meeting is Thursday at 2pm" appears in the source document, or it doesn't. Either the 18% revenue figure is in the financial data the agent was given, or it isn't. These are not judgment calls. They are lookup operations.

The answer doesn't change based on how the output is phrased. It doesn't depend on the confidence of the prose, the sophistication of the vocabulary, or the apparent coherence of the argument. It depends on whether the claim has a matching span in the source material.

An LLM is a poor instrument for this. It doesn't perform lookup operations — it performs distribution operations. Given a sequence, it predicts the next token based on everything it's seen in training. It can approximate a lookup for facts it has memorized, but it's not doing a lookup. It's doing pattern completion against a massive prior, and that prior actively interferes with the task you're asking it to do.

What deterministic classification buys you

The alternative is to take the LLM out of the classification loop entirely — not just for performance reasons, but as an architectural principle.

Span-level matching and deterministic entity extraction are not sophisticated. They are not trying to understand the output. They are doing what the problem actually requires: checking whether the strings of text in the output are present in, or directly derivable from, the strings of text in the source material. Semantic matching — with similarity thresholds and embedding distances — can extend this to paraphrase. But the underlying operation remains deterministic: given these inputs, this classification, every time.
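To make the span-matching idea concrete, here is a minimal sketch. It assumes the output has already been split into candidate claim spans, and it only handles literal matches after normalization, not paraphrase. The names normalize and findUnmatchedSpans are hypothetical and are not part of SHOR's API.

```typescript
// Illustration of span-level matching. Helper names are hypothetical;
// this is not SHOR's actual implementation.

/** Lowercase, replace punctuation with spaces, and collapse whitespace. */
function normalize(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

/** Return the spans that do not appear, on token boundaries, in any source. */
function findUnmatchedSpans(spans: string[], sources: string[]): string[] {
  // Pad with spaces so "18" cannot match inside "180".
  const haystacks = sources.map((source) => ` ${normalize(source)} `);
  return spans.filter((span) => {
    const needle = ` ${normalize(span)} `;
    return !haystacks.some((haystack) => haystack.includes(needle));
  });
}

const sources = [
  "Team sync: the meeting is Thursday at 2pm. Revenue grew 18% year over year.",
];
const spans = [
  "The meeting is Thursday at 2pm",
  "revenue grew 18%",
  "the meeting is in Room 4",
];

const unmatched = findUnmatchedSpans(spans, sources);
console.log(unmatched); // only the "Room 4" span has no support in the source
```

Nothing in this path samples or generates, so the same spans and sources always produce the same unmatched list.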

The practical benefits of this are significant:

The classification is stable. You can run the same output against the same sources a thousand times and get the same result. This matters for debugging, for auditing, and for any downstream system that needs to reason about the classifier's behavior.

The classification is fast. Span-level matching is a data structure operation, not a generation operation. Sub-5ms on typical agent payloads. You can put it in the execution path — between the agent's planner and its tool-execution layer — without adding latency that changes the user experience.

The classification is auditable. The unmatched spans are enumerable. You can look at exactly which claims the classifier didn't find support for, and you can verify that decision without re-running anything. A model-based evaluator gives you a score; it doesn't give you a traceable breakdown of what failed and why.

The classification has no dependencies on a third-party service. No API key, no rate limit, no timeout, no cost per call. It runs wherever your agent runs. This matters when you're putting it in a high-frequency execution path.

A concrete example

Consider an agent that's been given a set of customer records and asked to draft a summary email. The records contain:

Customer: Meridian Partners
Last purchase: November 12, 2025
Account status: Active
Outstanding balance: $0
Assigned rep: Dara Osei

The agent produces:

Hi Dara,

Meridian Partners has been an active customer since late 2024 and has
no outstanding balance as of last check. Their last purchase was in
Q4 of last year. Worth a touchpoint given the length of the relationship.

A fluent evaluator might score this as grounded. It's accurate in the broad strokes. The structure matches the source. The claims feel supported.

A span-level classifier would flag: "active customer since late 2024" — not in the source material. The records say status: Active, but say nothing about when they became a customer. "Length of the relationship" — no support. These are additions, not representations. They're probably not harmful. But they're not grounded, and whether to pass them or block them should be a system decision with a stable, auditable basis — not a probability distribution from a second LLM call.
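One slice of this check can be sketched with deterministic entity extraction. The example below pulls four-digit years and dollar amounts from the draft and looks each one up in the records. The function names and the regular expression are assumptions for illustration, not SHOR's API. This narrow check catches the "late 2024" addition; a phrase like "length of the relationship" carries no extractable entity and would need claim-level span matching instead.

```typescript
// Illustrative entity check over the example above.
// extractEntities and unsupportedEntities are hypothetical names.

const records = [
  "Customer: Meridian Partners",
  "Last purchase: November 12, 2025",
  "Account status: Active",
  "Outstanding balance: $0",
  "Assigned rep: Dara Osei",
].join("\n");

const draft =
  "Meridian Partners has been an active customer since late 2024 and has " +
  "no outstanding balance as of last check. Their last purchase was in " +
  "Q4 of last year. Worth a touchpoint given the length of the relationship.";

/** Extract four-digit years and dollar amounts, two entity types that are easy to verify by lookup. */
function extractEntities(text: string): string[] {
  const pattern = /\b(?:19|20)\d{2}\b|\$\d+(?:,\d{3})*(?:\.\d+)?/g;
  return text.match(pattern) ?? [];
}

/** Return entities that appear in the output but nowhere in the source. */
function unsupportedEntities(output: string, source: string): string[] {
  const known = new Set(extractEntities(source));
  return extractEntities(output).filter((entity) => !known.has(entity));
}

const flagged = unsupportedEntities(draft, records);
console.log(flagged); // the year 2024 is the only extracted entity with no match in the records
```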

Where this logic applies more broadly

The grounding case is the clearest example, but the argument generalizes to any classification task in an agent pipeline where the question has a stable answer independent of model judgment.

Does this action fall within the scope of what this agent is authorized to do? That's a lookup against a permission set, not a generation problem.

Is this action reversible? That's a taxonomy lookup, not a generation problem.

Does this output contain PII? That's a pattern-matching problem, not a generation problem.

In each case, the industry has converged on using LLMs because LLMs are general and convenient — one tool for many tasks. But "general and convenient" is not the same as "the right instrument." For classification tasks with stable answers, a model that generates probabilistic completions is worse than a system that does the lookup directly, not because it can't approximate the answer, but because the approximation introduces variance at exactly the point where you need stability.
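The three checks above can each be sketched as a direct lookup or pattern match. Everything in this example is hypothetical: the action names, the permission set, the reversibility taxonomy, and the two PII patterns (email addresses and US Social Security number formatting) are placeholders chosen for illustration, and a real deployment would need a much fuller pattern set.

```typescript
// Hypothetical action names, permission set, taxonomy, and patterns.

type Action = "email.send" | "record.update" | "record.delete" | "report.read";

// Scope: a lookup against the permission set granted to this agent.
const grantedScopes: ReadonlySet<Action> = new Set<Action>(["email.send", "report.read"]);

function isInScope(action: Action): boolean {
  return grantedScopes.has(action);
}

// Reversibility: a lookup in a fixed taxonomy of action types.
const reversibility: Record<Action, "reversible" | "irreversible"> = {
  "email.send": "irreversible",
  "record.update": "reversible",
  "record.delete": "irreversible",
  "report.read": "reversible",
};

function isReversible(action: Action): boolean {
  return reversibility[action] === "reversible";
}

// PII: pattern matching. The patterns carry no global flag, so test() is stateless.
const piiPatterns: { kind: string; pattern: RegExp }[] = [
  { kind: "email", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/ },
  { kind: "us_ssn", pattern: /\b\d{3}-\d{2}-\d{4}\b/ },
];

function detectPii(text: string): string[] {
  return piiPatterns
    .filter(({ pattern }) => pattern.test(text))
    .map(({ kind }) => kind);
}

console.log(isInScope("record.delete")); // false
console.log(isReversible("email.send")); // false
console.log(detectPii("Contact dara@example.com about the renewal")); // one match, of kind "email"
```

Each function returns the same answer for the same input, and when an answer is wrong the fix is an edit to a table or a pattern rather than a prompt revision.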

The limit of this argument

There are classification tasks where the answer is not stable — where the question genuinely requires judgment, context, and the kind of nuanced reasoning that only a language model can provide. Gray-zone cases: is this action within the spirit of what the user asked, even if it's technically outside the letter of the authorization? That category exists, and for that category, a model-based evaluator may be the right tool.

But that category is smaller than the industry currently treats it. Most of the classification tasks embedded in agent pipelines — grounding checks, scope checks, reversibility checks, PII detection — have stable, auditable answers if you're willing to define them precisely. The work of defining them precisely is exactly the work that gets skipped when you reach for an LLM evaluator instead.

The practical difference, measured

We ran the same grounding evaluation task — 200 agent outputs across four domains (customer records, medical notes, financial summaries, legal clauses), each with a ground-truth grounding label — through two paths: a model-based evaluator (GPT-4o prompted as a grounding judge) and SHOR's deterministic classifier.

Latency. The model-based evaluator averaged 1,340ms per classification at low load, with a p95 of 3,100ms. SHOR averaged 3.2ms, flat across the entire set.

Consistency. Running the same 200 inputs through the model evaluator three times produced identical results on 81% of cases. Running them through SHOR produced identical results on 100% of cases. The 19% variance in the model evaluator wasn't concentrated in genuinely ambiguous cases — it was distributed across easy cases where the correct label was unambiguous.

Failure mode distribution. The model evaluator missed 73% of partial-grounding cases — outputs where most claims were supported but one or two were not. It rated them GROUNDED. SHOR flagged them correctly as PARTIAL in all cases. This is the specific failure mode that matters most in production: the output that's mostly right and quietly wrong in one place.

The model evaluator performed well on fully grounded and fully ungrounded cases. The gap opened on the partial case — which is, in practice, the most common failure mode in deployed agent pipelines.
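For readers who want to run the same kind of consistency measurement on their own evaluator, a minimal sketch follows. It computes the fraction of inputs that receive an identical label in every run. The function name consistencyRate and the sample labels are hypothetical; the labels are not data from the evaluation described above.

```typescript
// Hypothetical labels for illustration only; not the evaluation data.

type Level = "GROUNDED" | "PARTIAL" | "UNGROUNDED" | "INDETERMINATE";

/**
 * Fraction of inputs whose label is identical across all runs.
 * runs[r][i] is the label assigned to input i during run r.
 */
function consistencyRate(runs: Level[][]): number {
  const [first] = runs;
  if (first === undefined || first.length === 0) return 0;
  const stable = first.filter((label, i) => runs.every((run) => run[i] === label));
  return stable.length / first.length;
}

const runA: Level[] = ["GROUNDED", "PARTIAL", "UNGROUNDED", "GROUNDED", "PARTIAL"];
const runB: Level[] = ["GROUNDED", "GROUNDED", "UNGROUNDED", "GROUNDED", "PARTIAL"];
const runC: Level[] = ["GROUNDED", "PARTIAL", "UNGROUNDED", "GROUNDED", "PARTIAL"];

const rate = consistencyRate([runA, runB, runC]);
console.log(rate); // 0.8: four of the five inputs kept the same label in all three runs
```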

What we built

SHOR is the grounding validator we use in the Reshimu runtime stack. Zero dependencies. Deterministic. No model call in the classification path.

import { classify } from '@reshimu/shor'

const result = classify({
  output: agentOutput,
  sources: sourceContext
})

// result.level: 'GROUNDED' | 'PARTIAL' | 'UNGROUNDED' | 'INDETERMINATE'
// result.unmatched: string[]   — the claims with no source span
// result.score: number         — span-match ratio, 0–1

GROUNDED means all extractable claims have traceable source spans. PARTIAL means some do and some don't. UNGROUNDED means the substantive claims have no support in the provided material. INDETERMINATE means the source material is insufficient to evaluate.

It runs in under 5ms. It takes no API key. It produces the same output for the same input every time. The unmatched claims are enumerable — you can inspect exactly what the classifier found unsupported, and that inspection is itself stable and repeatable.
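As a sketch of how a result like this might gate an action between the planner and the tool-execution layer: the GroundingResult type below mirrors the three fields documented in the snippet above, while the gating policy and the names GateDecision and gateAction are assumptions for illustration, not part of SHOR. The sample result is written by hand as a stand-in for a real classify call.

```typescript
// The result shape mirrors the fields shown in the snippet above.
// The gating policy, GateDecision, and gateAction are hypothetical.

type GroundingLevel = "GROUNDED" | "PARTIAL" | "UNGROUNDED" | "INDETERMINATE";

interface GroundingResult {
  level: GroundingLevel;
  unmatched: string[];
  score: number;
}

type GateDecision =
  | { allow: true }
  | { allow: false; reason: string; unmatched: string[] };

/** Map a classification to an allow or block decision with an auditable reason. */
function gateAction(result: GroundingResult): GateDecision {
  switch (result.level) {
    case "GROUNDED":
      return { allow: true };
    case "PARTIAL":
      return { allow: false, reason: "unsupported claims need review", unmatched: result.unmatched };
    case "UNGROUNDED":
      return { allow: false, reason: "no source support for the output", unmatched: result.unmatched };
    case "INDETERMINATE":
      return { allow: false, reason: "sources insufficient to evaluate", unmatched: result.unmatched };
  }
}

// Hand-written stand-in for a classify() result; the values are made up.
const sample: GroundingResult = {
  level: "PARTIAL",
  unmatched: ["active customer since late 2024"],
  score: 0.5,
};

const decision = gateAction(sample);
console.log(decision.allow); // false
```

Whether PARTIAL should block, hold for review, or pass with a warning is a policy choice for the system owner. The policy is an explicit switch statement that can be read and changed, with the unmatched claims attached to every blocked decision.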

It's available now on GitHub, npm, and PyPI. The documentation covers the full API and integration patterns, including how to compose it with NESHER — our irreversibility classifier — in a two-gate intercept layer.

The previous post in this series, The irreversible action problem in autonomous agents, covers the case for external classification of irreversibility. SHOR and NESHER address different failure modes — grounding and reversibility — but they're built on the same principle: the model that produced the action is the wrong entity to evaluate it.

If you've hit the limits of LLM-based evaluation in a production agent pipeline, we'd like to hear what you found. The repo is open.

SHOR takes its name from the Ox in the four Chayyot — the living creatures of Ezekiel's Merkavah vision, which we use as a structural pattern for runtime integrity validation. The Ox is grounding: truth as a function of what is actually there, not what could plausibly be inferred. If you want the full architectural pattern, Bearers of the Throne covers it.
