Token-Level Truth: Real-Time Hallucination Detection for Production LLMs
Your LLM just called a tool, received accurate data, and still got the answer wrong. Welcome to the world of extrinsic hallucination—where models confidently ignore the ground truth sitting right in front of them.
Building on our Signal-Decision Architecture, we introduce HaluGate—a conditional, token-level hallucination detection pipeline that catches unsupported claims before they reach your users. No LLM-as-judge. No Python runtime. Just fast, explainable verification at the point of delivery.
The Problem: Hallucinations Block Production Deployment
Hallucinations have become the single biggest barrier to deploying LLMs in production. Across industries—legal (fabricated case citations), healthcare (incorrect drug interactions), finance (invented financial data), customer service (non-existent policies)—the pattern is the same: AI generates plausible-sounding content that appears authoritative but crumbles under scrutiny.
The challenge isn’t obvious nonsense. It’s subtle fabrications embedded in otherwise accurate responses—errors that require domain expertise or external verification to catch. For enterprises, this uncertainty makes LLM deployment a liability rather than an asset.
The Scenario: When Tools Work But Models Don’t
Let’s make this concrete. Consider a typical function-calling interaction:
User: “When was the Eiffel Tower built?”
Tool Call:
get_landmark_info("Eiffel Tower")
Tool Response:
{"name": "Eiffel Tower", "built": "1887-1889", "height": "330 meters", "location": "Paris, France"}
LLM Response: “The Eiffel Tower was built in 1950 and stands at 500 meters tall in Paris, France.”
The tool returned correct data. The model’s response contains facts. But two of those “facts” are fabricated—extrinsic hallucinations that directly contradict the provided context.
This failure mode is particularly insidious:
- Users trust it because they see the tool was called
- Traditional filters miss it because there’s no toxic or harmful content
- Evaluation is expensive if you rely on another LLM to judge
What if we could detect these errors automatically, in real-time, with millisecond latency?
The Insight: Function Calling as Ground Truth
Here’s the key realization: modern function-calling APIs already provide grounding context. When users ask factual questions, models call tools—database lookups, API calls, document retrieval. These tool results are semantically equivalent to retrieved documents in RAG.

We don’t need to build separate retrieval infrastructure. We don’t need to call GPT-4 as a judge. We extract three components from the existing API flow:
| Component | Source | Purpose |
|---|---|---|
| Context | Tool message content | Ground truth for verification |
| Question | User message | Intent understanding |
| Answer | Assistant response | Claims to verify |
The question becomes: Is the answer faithful to the context?
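To make the extraction concrete, here is a minimal Go sketch that pulls these three components out of an OpenAI-style chat exchange. The message shape and helper names are illustrative assumptions, not the router's actual types.

```go
package halugate

import "strings"

// ChatMessage is a simplified stand-in for an OpenAI-style chat message.
type ChatMessage struct {
	Role    string // "user", "assistant", or "tool"
	Content string
}

// VerificationInput bundles the three components HaluGate verifies.
type VerificationInput struct {
	Context  string // ground truth: concatenated tool results
	Question string // the user's intent
	Answer   string // the assistant's claims
}

// extractVerificationInput pulls context, question, and answer out of the
// existing API flow: tool messages, the latest user message, and the final
// assistant response. No separate retrieval infrastructure is involved.
func extractVerificationInput(messages []ChatMessage, finalAnswer string) VerificationInput {
	var toolResults []string
	var lastUser string
	for _, m := range messages {
		switch m.Role {
		case "tool":
			toolResults = append(toolResults, m.Content)
		case "user":
			lastUser = m.Content
		}
	}
	return VerificationInput{
		Context:  strings.Join(toolResults, "\n"),
		Question: lastUser,
		Answer:   finalAnswer,
	}
}
```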
Why Not Just Use LLM-as-Judge?
The obvious solution—call another LLM to verify—has fundamental problems in production:
| Approach | Latency | Cost | Explainability |
|---|---|---|---|
| GPT-4 as judge | 2-5 seconds | $0.01-0.03/request | Low (black box) |
| Local LLM judge | 500ms-2s | GPU compute | Low |
| HaluGate | 76-162ms | CPU only | High (token-level + NLI) |
LLM judges also suffer from:
- Position bias: Tendency to favor certain answer positions
- Verbosity bias: Longer answers rated higher regardless of accuracy
- Self-preference: Models favor outputs similar to their own style
- Inconsistency: Same input can yield different judgments
We needed something faster, cheaper, and more explainable.
HaluGate: A Two-Stage Detection Pipeline
HaluGate implements a conditional two-stage pipeline that balances efficiency with precision:

Stage 1: HaluGate Sentinel (Prompt Classification)
Not every query needs hallucination detection. Consider these prompts:
| Prompt | Needs Fact-Check? | Reason |
|---|---|---|
| “When was Einstein born?” | ✅ Yes | Verifiable fact |
| “Write a poem about autumn” | ❌ No | Creative task |
| “Debug this Python code” | ❌ No | Technical assistance |
| “What’s your opinion on AI?” | ❌ No | Opinion request |
| “Is the Earth round?” | ✅ Yes | Factual claim |
Running token-level detection on creative writing or code review is wasteful—and potentially produces false positives (“your poem contains unsupported claims!”).
Why pre-classification matters: Token-level detection scales linearly with context length. For a 4K token RAG context, detection takes ~125ms; for 16K tokens, ~365ms. In production workloads where ~35% of queries are non-factual, pre-classification achieves a 72.2% efficiency gain—skipping expensive detection entirely for creative, coding, and opinion queries.
HaluGate Sentinel is a ModernBERT-based classifier that answers one question: Does this prompt warrant factual verification?

The model is trained on a carefully curated mix of:
Fact-Check Needed (Positive Class):
- Question Answering: SQuAD, TriviaQA, Natural Questions, HotpotQA
- Truthfulness: TruthfulQA (common misconceptions)
- Hallucination Benchmarks: HaluEval, FactCHD
- Information-Seeking Dialogue: FaithDial, CoQA
- RAG Datasets: neural-bridge/rag-dataset-12000
No Fact-Check Needed (Negative Class):
- Creative Writing: WritingPrompts, story generation
- Code: CodeSearchNet docstrings, programming tasks
- Opinion/Instruction: Dolly non-factual, Alpaca creative
This binary classification achieves 96.4% validation accuracy with ~12ms inference latency via native Rust/Candle integration.
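A rough sketch of how this gate sits in front of the detection stage follows; the classifier callback and result type are assumptions, and the 0.6 threshold mirrors the sample configuration shown later.

```go
package halugate

// SentinelResult is the prompt classifier's verdict for a single prompt.
type SentinelResult struct {
	NeedsFactCheck bool    // positive class: factual verification warranted
	Confidence     float64 // classifier confidence for that verdict
}

// shouldRunDetection gates the expensive token-level stage: it only runs
// when the Sentinel is confidently positive.
func shouldRunDetection(prompt string, threshold float64,
	classify func(prompt string) SentinelResult) bool {
	res := classify(prompt)
	return res.NeedsFactCheck && res.Confidence >= threshold
}
```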
Stage 2: Token-Level Detection + NLI Explanation
For prompts classified as fact-seeking, we run a two-model detection pipeline.
Token-Level Hallucination Detection
Unlike sentence-level classifiers that output a single “hallucinated/not hallucinated” label, token-level detection identifies exactly which tokens are unsupported by the context.

The model architecture:
Input: [CLS] context [SEP] question [SEP] answer [SEP]
↓
ModernBERT Encoder
↓
Token Classification Head (Binary per token)
↓
Label: 0 = Supported, 1 = Hallucinated (for answer tokens only)
Key design decisions:
- Answer-only classification: We only classify tokens in the answer segment, not context or question
- Span merging: Consecutive hallucinated tokens are merged into spans for readability
- Confidence thresholding: Configurable threshold (default 0.8) to balance precision/recall
NLI Explanation Layer
Knowing that something is hallucinated isn’t enough—we need to know why. The NLI (Natural Language Inference) model classifies each detected span against the context:

| NLI Label | Meaning | Severity | Action |
|---|---|---|---|
| CONTRADICTION | Claim conflicts with context | 4 (High) | Flag as error |
| NEUTRAL | Claim not supported by context | 2 (Medium) | Flag as unverifiable |
| ENTAILMENT | Context supports the claim | 0 | Filter false positive |
Why the ensemble works: Token-level detection alone achieves only 59% F1 on the hallucinated class—nearly half of hallucinations are missed, and one-third of flags are false positives. We experimented with training a unified 5-class model (SUPPORTED/CONTRADICTION/FABRICATION/etc.), but it achieved only 21.7% F1—token-level classification simply cannot distinguish why something is wrong. The two-stage approach turns a mediocre detector into an actionable system: the LettuceDetect-style token-level detector provides recall (catching potential issues), while NLI provides precision (filtering false positives) and explainability (categorizing why each span is problematic).
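A hedged Go sketch of that division of labor: detected spans are re-checked against the context by NLI, entailed spans are dropped as false positives, and the remainder get a severity. The types and the classifyNLI callback are illustrative assumptions, not the plugin's actual code.

```go
package halugate

// Span is a merged run of answer tokens flagged by the token-level detector.
type Span struct {
	Text       string
	Confidence float64
}

// NLILabel is the explainer model's three-way verdict.
type NLILabel int

const (
	Entailment NLILabel = iota
	Neutral
	Contradiction
)

// Finding is a span that survived NLI filtering, with an assigned severity.
type Finding struct {
	Span     Span
	Label    NLILabel
	Severity int
}

// severityFor maps NLI labels to the severity scores reported in headers.
func severityFor(label NLILabel) int {
	switch label {
	case Contradiction:
		return 4 // direct factual error
	case Neutral:
		return 2 // unverifiable claim
	default:
		return 0 // entailed: treat as a detector false positive
	}
}

// explainSpans runs each detected span through NLI against the context,
// drops entailed spans, and keeps the rest with a severity attached.
// classifyNLI is assumed to wrap the NLI explainer model.
func explainSpans(context string, spans []Span,
	classifyNLI func(premise, hypothesis string) NLILabel) []Finding {
	var findings []Finding
	for _, s := range spans {
		label := classifyNLI(context, s.Text)
		if label == Entailment {
			continue // the context supports this span; filter it out
		}
		findings = append(findings, Finding{Span: s, Label: label, Severity: severityFor(label)})
	}
	return findings
}
```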
Integration with Signal-Decision Architecture
HaluGate doesn’t operate in isolation—it’s deeply integrated with our Signal-Decision Architecture as a new signal type and plugin.
fact_check as a Signal Type
Just as we have keyword, embedding, and domain signals, fact_check is now a first-class signal type:

This allows decisions to be conditioned on whether the query is fact-seeking:

decisions:
  - name: "factual-query-with-verification"
    priority: 100
    rules:
      operator: "AND"
      conditions:
        - type: "fact_check"
          name: "needs_fact_check"
        - type: "domain"
          name: "general"
    plugins:
      - type: "hallucination"
        configuration:
          enabled: true
          use_nli: true
          hallucination_action: "header"

Note: Even frontier models show hallucination variance between releases. For example, GPT-5.2’s system card demonstrates a measurable hallucination delta compared to previous versions, highlighting the importance of continuous verification regardless of model sophistication.
Request-Response Context Propagation
A key challenge: the classification happens at request time, but detection happens at response time. We need to propagate state across this boundary.

The RequestContext structure carries all necessary state:
RequestContext:
  # Classification results (set at request time)
  FactCheckNeeded: true
  FactCheckConfidence: 0.87

  # Tool context (extracted at request time)
  HasToolsForFactCheck: true
  ToolResultsContext: "Built 1887-1889, 330 meters..."
  UserContent: "When was the Eiffel Tower built?"

  # Detection results (set at response time)
  HallucinationDetected: true
  HallucinationSpans: ["1950", "500 meters"]
  HallucinationConfidence: 0.92
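In Go terms, that state maps onto a struct along these lines; the field names mirror the pseudocode above, but the exact shape of the real struct is an assumption.

```go
package halugate

// RequestContext carries classification and detection state from the
// request phase to the response phase of a single proxied call.
type RequestContext struct {
	// Classification results (set at request time)
	FactCheckNeeded     bool
	FactCheckConfidence float64

	// Tool context (extracted at request time)
	HasToolsForFactCheck bool
	ToolResultsContext   string
	UserContent          string

	// Detection results (set at response time)
	HallucinationDetected   bool
	HallucinationSpans      []string
	HallucinationConfidence float64
}
```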
The hallucination Plugin
The hallucination plugin is configured per-decision, allowing fine-grained control:
plugins:
  - type: "hallucination"
    configuration:
      enabled: true
      use_nli: true                        # Enable NLI explanations

      # Action when hallucination detected
      hallucination_action: "header"       # "header" | "body" | "block" | "none"

      # Action when fact-check needed but no tool context
      unverified_factual_action: "header"

      # Include detailed info in response
      include_hallucination_details: true
| Action | Behavior |
|---|---|
| header | Add warning headers, pass response through |
| body | Inject warning into response body |
| block | Return error response, don’t forward LLM output |
| none | Log only, no user-visible action |
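As a sketch of how those four actions might be enforced at response time; the function signature and warning text are illustrative assumptions, and only the action names come from the configuration above.

```go
package halugate

import (
	"errors"
	"fmt"
	"net/http"
)

// ErrBlocked signals that the LLM output must not be forwarded to the client.
var ErrBlocked = errors.New("hallucination detected: response blocked")

// applyHallucinationAction enforces the configured hallucination_action on the
// outgoing response and returns the (possibly modified) body to forward.
func applyHallucinationAction(action string, headers http.Header, body []byte, spans []string) ([]byte, error) {
	switch action {
	case "header":
		// Warn via headers; pass the body through unchanged.
		headers.Set("x-vsr-hallucination-detected", "true")
		return body, nil
	case "body":
		// Inject a visible warning ahead of the original response body.
		warning := fmt.Sprintf("[warning] %d unsupported span(s) detected\n", len(spans))
		return append([]byte(warning), body...), nil
	case "block":
		// Do not forward the LLM output; the caller returns an error response instead.
		return nil, ErrBlocked
	default: // "none"
		// Log-only: no user-visible change.
		return body, nil
	}
}
```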
Response Headers: Actionable Transparency
Detection results are communicated via HTTP headers, enabling downstream systems to implement custom policies:
HTTP/1.1 200 OK
Content-Type: application/json
x-vsr-fact-check-needed: true
x-vsr-hallucination-detected: true
x-vsr-hallucination-spans: 1950; 500 meters
x-vsr-nli-contradictions: 2
x-vsr-max-severity: 4
For unverified factual responses (when tools aren’t available):
HTTP/1.1 200 OK
x-vsr-fact-check-needed: true
x-vsr-unverified-factual-response: true
x-vsr-verification-context-missing: true
These headers enable:
- UI Disclaimers: Show warnings to users when confidence is low
- Human Review Queues: Route flagged responses for manual review
- Audit Logging: Track unverified claims for compliance
- Conditional Blocking: Block high-severity contradictions
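For instance, a downstream service could implement review routing or conditional blocking purely from these headers. A minimal Go sketch, assuming only the header names documented above (the threshold of 4 matches the CONTRADICTION severity):

```go
package policy

import (
	"net/http"
	"strconv"
)

// needsHumanReview decides, from the x-vsr-* headers alone, whether a
// response should be routed to a manual review queue.
func needsHumanReview(resp *http.Response) bool {
	if resp.Header.Get("x-vsr-hallucination-detected") != "true" {
		return false
	}
	// Escalate contradictions (severity 4); neutral/unverifiable spans
	// can instead surface as a UI disclaimer.
	sev, err := strconv.Atoi(resp.Header.Get("x-vsr-max-severity"))
	return err == nil && sev >= 4
}
```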
The Complete Pipeline: Three Paths

| Path | Condition | Latency Added | Action |
|---|---|---|---|
| Path 1 | Non-factual prompt | ~12ms (classifier only) | Pass through |
| Path 2 | Factual + No tools | ~12ms | Add warning headers |
| Path 3 | Factual + Tools available | 76-162ms | Full detection + headers |
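Collapsing the table into code, path selection is a small conditional on the request-time classification state; this sketch assumes the two flags shown in the RequestContext example earlier.

```go
package halugate

// Path identifies which of the three pipeline paths a request takes.
type Path int

const (
	PathPassThrough   Path = iota // Path 1: non-factual prompt, classifier only
	PathUnverified                // Path 2: factual prompt, no tool context to verify against
	PathFullDetection             // Path 3: factual prompt with tool context, run detection + NLI
)

// selectPath chooses the pipeline path from request-time classification state.
func selectPath(factCheckNeeded, hasToolContext bool) Path {
	if !factCheckNeeded {
		return PathPassThrough
	}
	if !hasToolContext {
		return PathUnverified
	}
	return PathFullDetection
}
```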
Model Architecture Deep Dive
Let’s look at the three models that power HaluGate:

HaluGate Sentinel: Binary Prompt Classification
Architecture: ModernBERT-base + LoRA adapter + binary classification head
Training:
- Base Model: answerdotai/ModernBERT-base
- Fine-tuning: LoRA (rank=16, alpha=32, dropout=0.1)
- Training Data: 50,000 samples from 14 datasets
- Loss: CrossEntropy with class weights (handle imbalance)
- Optimization: AdamW, lr=2e-5, 3 epochs
Inference:
- Input: Raw prompt text
- Output: (class_id, confidence)
- Latency: ~12ms on CPU
The LoRA approach allows efficient fine-tuning while preserving the pretrained knowledge. Only 2.2% of parameters (3.4M out of 149M) are updated during training.
HaluGate Detector: Token-Level Binary Classification
Architecture: ModernBERT-base + token classification head
Input Format:
[CLS] The Eiffel Tower was built in 1887-1889 and is 330 meters tall.
[SEP] When was the Eiffel Tower built?
[SEP] The Eiffel Tower was built in 1950 and is 500 meters tall. [SEP]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Answer tokens (classification targets)
Output: Binary label (0=Supported, 1=Hallucinated) for each answer token
Post-processing:
- Filter predictions to answer segment only
- Apply confidence threshold (default: 0.8)
- Merge consecutive hallucinated tokens into spans
- Return spans with confidence scores
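A hedged Go sketch of this post-processing, assuming per-token probabilities and answer-segment flags are already available from the detector (the types are illustrative):

```go
package halugate

// TokenPrediction is the detector's output for one token of the input sequence.
type TokenPrediction struct {
	Text     string
	InAnswer bool    // true if the token belongs to the answer segment
	Prob     float64 // probability of the "hallucinated" class
}

// Span is a merged run of flagged answer tokens (same shape as in the NLI sketch above).
type Span struct {
	Text       string
	Confidence float64
}

// mergeSpans keeps answer tokens whose hallucination probability clears the
// threshold and merges consecutive flagged tokens into spans, scoring each
// span with its highest token probability.
func mergeSpans(preds []TokenPrediction, threshold float64) []Span {
	var spans []Span
	inRun := false
	for _, p := range preds {
		flagged := p.InAnswer && p.Prob >= threshold
		if !flagged {
			inRun = false // end the current span, if any
			continue
		}
		if !inRun {
			spans = append(spans, Span{Text: p.Text, Confidence: p.Prob})
			inRun = true
			continue
		}
		last := &spans[len(spans)-1]
		last.Text += " " + p.Text
		if p.Prob > last.Confidence {
			last.Confidence = p.Prob
		}
	}
	return spans
}
```

Joining tokens with spaces is a simplification here; a real implementation would reconstruct spans from character offsets to handle subword tokenization cleanly.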
HaluGate Explainer: Three-Way NLI Classification
Architecture: ModernBERT-base fine-tuned on NLI
Input Format:
[CLS] The Eiffel Tower was built in 1887-1889. [SEP] built in 1950 [SEP]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^
Premise (context) Hypothesis (span)
Output: Three-way classification with confidence:
- ENTAILMENT (0): Context supports the claim
- NEUTRAL (1): Cannot be determined from context
- CONTRADICTION (2): Context conflicts with claim
Severity Mapping:
| NLI Label | Severity Score | Interpretation |
|---|---|---|
| ENTAILMENT | 0 | Likely false positive—filter out |
| NEUTRAL | 2 | Claim is unverifiable |
| CONTRADICTION | 4 | Direct factual error |
Why Native Rust/Candle Matters
All three models run natively via Candle (Hugging Face’s Rust ML framework) with CGO bindings to Go:

Benefits of this approach:
| Aspect | Python (PyTorch) | Native (Candle) |
|---|---|---|
| Cold start | 5-10s | <500ms |
| Memory | 2-4GB per model | 500MB-1GB per model |
| Latency | +50-100ms overhead | Near-zero overhead |
| Deployment | Python runtime required | Single binary |
| Scaling | GIL contention | True parallelism |
This eliminates the need for a separate Python service, sidecars, or model servers—everything runs in-process.
Latency Breakdown
Here’s the measured latency for each component in the production pipeline:
| Component | P50 | P99 | Notes |
|---|---|---|---|
| Fact-check classifier | 12ms | 28ms | ModernBERT inference |
| Tool context extraction | 1ms | 3ms | JSON parsing |
| Hallucination detector | 45ms | 89ms | Token classification |
| NLI explainer | 18ms | 42ms | Per-span classification |
| Total overhead | 76ms | 162ms | When detection runs |
The total overhead (76-162ms) is negligible compared to typical LLM generation times (5-30 seconds), making HaluGate practical for synchronous request processing.
Configuration Reference
Complete configuration for hallucination mitigation:
# Model configuration
hallucination_mitigation:
  # Stage 1: Prompt classification
  fact_check_model:
    model_id: "models/halugate-sentinel"
    threshold: 0.6        # Confidence threshold for FACT_CHECK_NEEDED
    use_cpu: true

  # Stage 2a: Token-level detection
  hallucination_model:
    model_id: "models/halugate-detector"
    threshold: 0.8        # Token confidence threshold
    use_cpu: true

  # Stage 2b: NLI explanation
  nli_model:
    model_id: "models/halugate-explainer"
    threshold: 0.9        # NLI confidence threshold
    use_cpu: true

# Signal rules for fact-check classification
fact_check_rules:
  - name: needs_fact_check
    description: "Query contains factual claims that should be verified"
  - name: no_fact_check_needed
    description: "Query is creative, code-related, or opinion-based"

# Decision with hallucination plugin
decisions:
  - name: "verified-factual"
    priority: 100
    rules:
      operator: "AND"
      conditions:
        - type: "fact_check"
          name: "needs_fact_check"
    plugins:
      - type: "hallucination"
        configuration:
          enabled: true
          use_nli: true
          hallucination_action: "header"
          unverified_factual_action: "header"
          include_hallucination_details: true
Beyond Production: HaluGate as an Evaluation Framework
While HaluGate is designed for real-time production use, the same pipeline can power offline model evaluation. Instead of intercepting live requests, we feed benchmark datasets through the detection pipeline to systematically measure hallucination rates across models.

Evaluation Workflow
The evaluation framework treats HaluGate as a hallucination scorer:
- Load Dataset: Use existing QA/RAG benchmarks (TriviaQA, Natural Questions, HotpotQA) or custom enterprise datasets with context-question pairs
- Generate Responses: Run the model under test against each query with provided context
- Detect Hallucinations: Pass (context, query, response) triples through HaluGate Detector
- Classify Severity: Use HaluGate Explainer to categorize each flagged span
- Aggregate Metrics: Compute hallucination rates, contradiction ratios, and per-category breakdowns
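A hedged Go sketch of that evaluation loop; the dataset shape and the detect/severity callbacks are placeholders for the real HaluGate pipeline, not its actual API.

```go
package evaluation

// Example is one benchmark row: grounding context, a question, and the
// response generated by the model under test.
type Example struct {
	Context  string
	Question string
	Response string
}

// Report aggregates hallucination metrics across a dataset.
type Report struct {
	Total              int
	WithHallucination  int
	ContradictionSpans int
	NeutralSpans       int
}

// HallucinationRate is the fraction of responses with at least one flagged span.
func (r Report) HallucinationRate() float64 {
	if r.Total == 0 {
		return 0
	}
	return float64(r.WithHallucination) / float64(r.Total)
}

// Evaluate runs every example through the detection pipeline and aggregates results.
// detect returns flagged spans; severity returns 4 for contradiction, 2 for neutral,
// and 0 for spans the NLI stage filters out as entailed.
func Evaluate(dataset []Example,
	detect func(context, question, answer string) []string,
	severity func(context, span string) int) Report {

	var rep Report
	for _, ex := range dataset {
		rep.Total++
		spans := detect(ex.Context, ex.Question, ex.Response)
		if len(spans) > 0 {
			rep.WithHallucination++
		}
		for _, s := range spans {
			switch severity(ex.Context, s) {
			case 4:
				rep.ContradictionSpans++
			case 2:
				rep.NeutralSpans++
			}
		}
	}
	return rep
}
```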
Limitations and Scope
HaluGate specifically targets extrinsic hallucinations—where tool/RAG context provides grounding for verification. It has known limitations:
What HaluGate Cannot Detect
| Limitation | Example | Reason |
|---|---|---|
| Intrinsic hallucinations | Model says “Einstein was born in 1900” without any tool call | No context to verify against |
| No-context scenarios | User asks factual question, no tools defined | Missing ground truth |
Transparent Degradation
For requests classified as fact-seeking but lacking tool context, we explicitly flag responses as “unverified factual” rather than silently passing them through:
x-vsr-fact-check-needed: true
x-vsr-unverified-factual-response: true
x-vsr-verification-context-missing: true
This transparency allows downstream systems to handle uncertainty appropriately.
Acknowledgments
HaluGate builds on excellent work from the research community:
- Token-level detection architecture: Inspired by LettuceDetect from KRLabs—pioneering work in ModernBERT-based hallucination detection
- NLI models: Built on tasksource/ModernBERT-base-nli—high-quality NLI fine-tuning
- Training datasets: TruthfulQA, HaluEval, FaithDial, RAGTruth, and other publicly available benchmarks
We’re grateful to these teams for advancing the field of hallucination detection.
Conclusion
HaluGate brings principled hallucination detection to production LLM deployments:
- Conditional verification: Skip non-factual queries, verify factual ones
- Token-level precision: Know exactly which claims are unsupported
- Explainable results: NLI classification tells you why something is wrong
- Near-zero integration overhead: Native Rust inference in-process, no Python sidecars
- Actionable transparency: Headers enable downstream policy enforcement
The next time your LLM calls a tool, receives accurate data, and still gets the answer wrong—HaluGate will catch it before your users do.
Resources:
- Signal-Decision Architecture Blog
- vLLM Semantic Router GitHub Repo
- vLLM Semantic Router Documentation
Join the discussion: Share your use cases and feedback in the #semantic-router channel on vLLM Slack