Fingerbank · Research
Measuring LLM Consistency: A Hallucination Benchmark for Device Identification
Executive Summary
This benchmark evaluates a simple but critical property of Large Language Models (LLMs): consistency. In this study, each model received the same payload 10 times. A reliable model should produce the same device identification repeatedly; any variability across repeated runs is treated as a form of hallucination.
Key Finding
Opus-4.6 achieved perfect consistency with a 0% hallucination rate across all tested payloads. In contrast, Grok-4.3 showed the highest instability, with an 18% hallucination rate, indicating significant variance when processing identical inputs.
Methodology: How We Measure Hallucination
To ensure a rigorous comparison, we defined hallucination as instability. If the same input leads to different device names or manufacturers across runs, the model is considered to be hallucinating.
Payloads: 10 unique device-identification prompts.
Repeated Runs: Each payload was submitted 10 times to each model (100 total runs per model).
Hallucination Metric: For each payload, we identified the most common answer. The hallucination rate is computed as:
1 − (most-common-answer frequency ÷ total runs)
This value is then averaged across all payloads.
Answer Matching: Two answers are treated as equivalent if either of the following holds:
Device names are fuzzily similar (similarity threshold ≈0.7) AND manufacturers match.
The manufacturer, category, and vendor all match, where category and vendor are the first two segments of the device-name path.
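The metric and matching rules above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual code. It assumes `difflib.SequenceMatcher` as the fuzzy similarity measure, since the measure is not named here. It also assumes a hypothetical answer schema with `device_name` and `manufacturer` keys and a `/`-separated device-name path, and it treats the two matching rules as alternatives.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.7  # threshold from the text; the similarity measure is an assumption


def path_prefix(device_name, segments=2):
    """First two parts of a device-name path, e.g. 'Phone/Apple/iPhone 15' -> ('phone', 'apple')."""
    return tuple(part.strip().lower() for part in device_name.split("/")[:segments])


def equivalent(a, b):
    """True if two answers count as the same identification (hypothetical dict schema)."""
    if a["manufacturer"].lower() != b["manufacturer"].lower():
        return False  # both rules require matching manufacturers
    ratio = SequenceMatcher(None, a["device_name"].lower(), b["device_name"].lower()).ratio()
    if ratio >= SIMILARITY_THRESHOLD:
        return True  # rule 1: fuzzily similar names
    return path_prefix(a["device_name"]) == path_prefix(b["device_name"])  # rule 2


def largest_cluster_size(answers):
    """Greedy grouping: each answer joins the first cluster whose representative it matches."""
    clusters = []  # each entry is [representative_answer, count]
    for answer in answers:
        for cluster in clusters:
            if equivalent(cluster[0], answer):
                cluster[1] += 1
                break
        else:
            clusters.append([answer, 1])
    return max(count for _, count in clusters)


def hallucination_rate(runs_by_payload):
    """Average over payloads of 1 - (most-common-answer frequency / total runs).

    Expects a non-empty list of answers for every payload.
    """
    rates = [1 - largest_cluster_size(runs) / len(runs) for runs in runs_by_payload.values()]
    return sum(rates) / len(rates)


iphone = {"device_name": "Phone/Apple/iPhone 15", "manufacturer": "Apple"}
iphone_pro = {"device_name": "Phone/Apple/iPhone 15 Pro", "manufacturer": "Apple"}
router = {"device_name": "Router/Cisco/RV340", "manufacturer": "Cisco"}

example_runs = {
    "payload-1": [iphone] * 8 + [router] * 2,             # 8 of 10 runs agree
    "payload-2": [iphone, iphone_pro, iphone, iphone_pro],  # all equivalent under the rules
}
example_rate = hallucination_rate(example_runs)
```

Because fuzzy equivalence is not transitive, the greedy grouping is only one reasonable way to find the "most common answer"; a different clustering choice could shift borderline cases slightly.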
The Leaderboard: Consistency vs. Performance
The following table summarizes the performance of the 7 models compared in this benchmark. While consistency is our primary metric, we also tracked latency and cost to provide a complete picture of operational efficiency.
| Model | Hallucination rate | Matched / total runs | Mean score | Mean latency | Mean cost (USD) |
|---|---|---|---|---|---|
| Opus-4.6 | 0% | 98/98 | 85 | 19059 ms | $0.03215 |
| GPT-5.2 | 9% | 91/100 | 80 | 3191 ms | $0.00403 |
| GPT-5-mini | 10% | 90/100 | 78 | 13453 ms | $0.00246 |
| Grok-4.1 | 10% | 88/98 | 79 | 6080 ms | $0.00033 |
| Gemini-3.1-Flash-Lite | 14% | 86/100 | 88 | 1302 ms | $0.00053 |
| Sonnet-4.5 | 15% | 83/98 | 87 | 11109 ms | $0.00503 |
| Grok-4.3 | 18% | 81/99 | 81 | 4994 ms | $0.00203 |
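As a rough sanity check, the hallucination column can be reproduced from the matched and total counts in the table. The sketch below pools all runs per model, which equals the per-payload average only when every payload has the same number of runs, so treat it as an approximation of the published metric. It also assumes that "matched" counts the runs that agreed with the most common answer for their payload.

```python
# (matched, total) pairs copied from the leaderboard table
LEADERBOARD = {
    "Opus-4.6": (98, 98),
    "GPT-5.2": (91, 100),
    "GPT-5-mini": (90, 100),
    "Grok-4.1": (88, 98),
    "Gemini-3.1-Flash-Lite": (86, 100),
    "Sonnet-4.5": (83, 98),
    "Grok-4.3": (81, 99),
}


def pooled_rate_percent(matched, total):
    """Share of runs that did not match the most common answer, as a rounded percentage."""
    return round(100 * (total - matched) / total)


pooled = {model: pooled_rate_percent(m, t) for model, (m, t) in LEADERBOARD.items()}

for model, percent in pooled.items():
    print(f"{model}: {percent}%")
```

With these counts, the pooled figures round to the same percentages the table reports for all seven models.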
Model Interpretations
Based on the data, we can categorize the models into specific performance profiles:
🏆 Best Stability: Opus-4.6
Opus-4.6 returned the same canonical device identification every time, though at the highest latency (19,059 ms) and cost in the table. This is the ideal behavior for production systems that rely on deterministic enrichment and classification.
⚡ Best Efficiency: Gemini-3.1-Flash-Lite
This model delivered the lowest latency (1,302 ms), the second-lowest cost, and the highest mean score (88), making it attractive for high-throughput workloads. Its 14% hallucination rate, however, ranks fifth of the seven models, so that speed comes with less consistency than the leaders offer.
⚖️ Balanced Performer: GPT-5.2
GPT-5.2 posted a 9% hallucination rate, second only to Opus-4.6, with the second-lowest latency (3,191 ms) and a mid-range cost, offering a balanced trade-off between reliability and performance.
⚠️ Highest Instability: Grok-4.3
Grok-4.3 exhibited the largest variance across repeated identical prompts (18%), indicating less deterministic behavior under this specific benchmark.
Why Consistency Matters in Device Intelligence
In device intelligence systems like Fingerbank, consistency is often as important as accuracy. If the same network signature is mapped to different device identities on different runs, downstream systems—such as security policies, network analytics, and inventory management—become unreliable.
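A low hallucination rate indicates that a model is deterministic enough to return the same answer repeatedly. Because the metric measures repeatability rather than correctness, it is best read alongside the score column: a model is dependable only when its answers are both correct and stable.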
Deep Dive: Per-Payload Variability
The benchmark dashboard allows us to inspect individual runs. Below is a snapshot of how different models handled specific payloads. Note the similarity clusters (marked ≈1, ≈2) which highlight where models agreed or diverged.
Key Takeaways
Determinism varies significantly between models: Hallucination rates ranged from 0% to 18% under identical conditions.
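Speed does not necessarily reduce reliability: GPT-5.2 had the second-lowest latency (3,191 ms) and the second-lowest hallucination rate (9%). The fastest model, Gemini-3.1-Flash-Lite, was less consistent at 14%, so low latency is no guarantee of stability either.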
Premium models are not always the most stable: Although several high-end models performed well, only Opus-4.6 achieved perfect consistency in this test.
Benchmarking repeated prompts is valuable: Traditional accuracy tests may miss instability that appears only when the same query is executed multiple times.
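To illustrate the last takeaway, here is a small sketch of a repeated-prompt check. The `identify` callable, the flaky stand-in model, and the example payload string are all hypothetical; a real harness would call an LLM API and would use the fuzzy answer matching described in the methodology rather than exact string equality.

```python
from collections import Counter
from itertools import cycle


def repeat_and_tally(identify, payload, runs=10):
    """Call identify(payload) `runs` times and count each distinct answer."""
    return Counter(identify(payload) for _ in range(runs))


def make_flaky_model():
    """Stand-in for an LLM: answers 'Apple iPhone' four times, then 'Apple iPad' once, repeating."""
    answers = cycle(["Apple iPhone"] * 4 + ["Apple iPad"])
    return lambda payload: next(answers)


model = make_flaky_model()
tally = repeat_and_tally(model, "dhcp_fingerprint=1,121,3,6,15,119,252", runs=10)

top_answer, top_count = tally.most_common(1)[0]
instability = 1 - top_count / sum(tally.values())
print(tally, f"instability={instability:.0%}")
```

A single call to this stand-in model returns the expected answer, so a one-shot accuracy test would pass. Only the repeated runs show that two of the ten answers diverge.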
About This Benchmark
This analysis was generated from a specialized dashboard built to compare LLM behavior across repeated identical prompts. The benchmark focused specifically on device identification tasks, measuring how consistently each model returned a normalized device identity.
For more insights into device intelligence and AI-driven classification, visit the Fingerbank Blog.