Measuring LLM Consistency: A Hallucination Benchmark for Device Identification
Fingerbank Team
Executive Summary
This benchmark evaluates a simple but critical property of Large Language Models (LLMs): consistency. In this study, each model received each payload 10 times, with no changes between runs. A reliable model should return the same device identification every time; any variability across repeated runs is treated as a form of hallucination.
Opus-4.6 achieved perfect consistency with a 0% hallucination rate across all tested payloads. In contrast, Grok-4.3 showed the highest instability, with an 18% hallucination rate, indicating significant variance when processing identical inputs.
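The consistency check described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual scoring code: the function name `hallucination_rate`, the input layout, and the scoring rule are all assumptions. In particular, the sketch assumes the rate is the share of runs whose answer differs from the most frequent answer for the same payload, and that answers are compared as exact strings. The report does not state which definition it uses.

```python
from collections import Counter
from typing import Dict, List


def hallucination_rate(runs_by_payload: Dict[str, List[str]]) -> float:
    """Return the share of runs that deviate from the modal answer.

    Assumed definition (not confirmed by the report): for each payload,
    the most frequent answer is taken as the reference, and every run
    that returns something else counts as one hallucination.
    """
    total_runs = 0
    deviating_runs = 0
    for answers in runs_by_payload.values():
        if not answers:
            continue  # nothing to score for this payload
        # Size of the largest group of identical answers.
        modal_count = Counter(answers).most_common(1)[0][1]
        total_runs += len(answers)
        deviating_runs += len(answers) - modal_count
    return deviating_runs / total_runs if total_runs else 0.0


# Made-up illustrative answers, not results from the benchmark.
example = {
    "payload-a": ["Apple iPhone"] * 10,
    "payload-b": ["HP Printer"] * 8 + ["Generic Printer"] * 2,
}
rate = hallucination_rate(example)
print(f"{rate:.0%}")
```

Under this assumed definition, a model that returns the same answer on all 10 runs of every payload scores 0. A stricter alternative would count a payload as failed if any of its runs differ, which would yield a higher figure on the same data.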