Fingerbank · Research

Measuring LLM Consistency: A Hallucination Benchmark for Device Identification

Executive Summary

This benchmark evaluates a simple but critical property of Large Language Models (LLMs): consistency. In this study, each model received each of 10 payloads 10 times. A reliable model should produce the same device identification for a given payload on every run; any variability across repeated runs is treated as a form of hallucination.

Key Finding

Opus-4.6 achieved perfect consistency with a 0% hallucination rate across all tested payloads. In contrast, Grok-4.3 showed the highest instability, with an 18% hallucination rate, indicating significant variance when processing identical inputs.

Methodology: How We Measure Hallucination

To ensure a rigorous comparison, we defined hallucination as instability. If the same input leads to different device names or manufacturers across runs, the model is considered to be hallucinating.

Payloads: 10 unique device-identification prompts.

Repeated Runs: Each payload was submitted 10 times to each model (100 total runs per model).

Hallucination Metric: For each payload, we identified the most common answer. The hallucination rate is computed as:

1 − (most-common-answer frequency ÷ total runs)

This value is then averaged across all payloads.
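As a rough illustration, the metric can be sketched in a few lines of Python. This is a minimal sketch rather than the benchmark's actual code: the function names and the sample data are made up, and answers are grouped by exact string equality here, whereas the benchmark groups them with its answer-matching rules.

```python
from collections import Counter
from statistics import mean


def payload_hallucination_rate(answers):
    """Return 1 - (most-common-answer frequency / total runs) for one payload."""
    if not answers:
        raise ValueError("at least one run is required")
    most_common_count = Counter(answers).most_common(1)[0][1]
    return 1 - most_common_count / len(answers)


def model_hallucination_rate(runs_by_payload):
    """Average the per-payload rates across all payloads."""
    return mean([payload_hallucination_rate(a) for a in runs_by_payload.values()])


# Made-up sample data, not taken from the benchmark.
runs = {
    "payload-1": ["Apple iPhone"] * 10,                        # fully stable
    "payload-2": ["HP LaserJet"] * 8 + ["HP OfficeJet"] * 2,   # 2 of 10 runs deviate
}
rate = model_hallucination_rate(runs)
print(f"{rate:.0%}")
```

In this sample, the first payload contributes a rate of 0 and the second a rate of 0.2, so the model-level rate is their average.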

Answer Matching: Two answers are treated as equivalent if either of the following holds:

Device names are fuzzily similar (similarity threshold ≈0.7) AND manufacturers match.

The manufacturer, category, and vendor (first two segments of the device-name path) all match.
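The two matching rules might look something like the following sketch. Several details are assumptions, since the text does not specify them: the fuzzy-similarity measure (Python's `difflib.SequenceMatcher` is used as a stand-in), the `/` separator in the device-name path, the case-insensitive comparison, the reading of the two rules as alternatives, and the `Answer` type itself.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

# The "≈0.7" from the text; the benchmark's exact similarity measure is not stated.
NAME_SIMILARITY_THRESHOLD = 0.7


@dataclass(frozen=True)
class Answer:
    """Hypothetical container for one model answer."""
    device_name: str   # assumed to be a "/"-separated path: category/vendor/model
    manufacturer: str


def _norm(text):
    return text.strip().casefold()


def name_similarity(a, b):
    """Stand-in fuzzy score in [0, 1] based on difflib's ratio."""
    return SequenceMatcher(None, _norm(a), _norm(b)).ratio()


def path_prefix(device_name, segments=2):
    """First two path segments, read here as (category, vendor)."""
    return tuple(_norm(part) for part in device_name.split("/")[:segments])


def answers_match(a, b):
    """Both rules require matching manufacturers; either rule then suffices."""
    if _norm(a.manufacturer) != _norm(b.manufacturer):
        return False
    fuzzy_names = name_similarity(a.device_name, b.device_name) >= NAME_SIMILARITY_THRESHOLD
    same_prefix = path_prefix(a.device_name) == path_prefix(b.device_name)
    return fuzzy_names or same_prefix


pro = Answer("Phone/Apple/iPhone 15 Pro", "Apple")
base = Answer("Phone/Apple/iPhone 15", "Apple")
tablet = Answer("Tablet/Apple/iPad", "Apple")
printer = Answer("Printer/HP/LaserJet Pro", "HP")

print(answers_match(pro, base), answers_match(pro, tablet), answers_match(pro, printer))
```

A different similarity function or path convention would shift which answers fall into the same cluster, so the threshold only has meaning relative to the measure actually used.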

The Leaderboard: Consistency vs. Performance

The following table summarizes the performance of the 7 models compared in this benchmark. While consistency is our primary metric, we also tracked latency and cost to provide a complete picture of operational efficiency.

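| model | hallucination % | matched/total | mean score | mean latency | mean cost |
| --- | --- | --- | --- | --- | --- |
| Opus-4.6 | 0% | 98/98 | 85 | 19059 ms | $0.03215 |
| GPT-5.2 | 9% | 91/100 | 80 | 3191 ms | $0.00403 |
| GPT-5-mini | 10% | 90/100 | 78 | 13453 ms | $0.00246 |
| Grok-4.1 | 10% | 88/98 | 79 | 6080 ms | $0.00033 |
| Gemini-3.1-Flash-Lite | 14% | 86/100 | 88 | 1302 ms | $0.00053 |
| Sonnet-4.5 | 15% | 83/98 | 87 | 11109 ms | $0.00503 |
| Grok-4.3 | 18% | 81/99 | 81 | 4994 ms | $0.00203 |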
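As a quick sanity check, the reported hallucination percentages can be compared against the matched/total column. The sketch below pools all runs per model, which only approximates the per-payload average described in the methodology; the two agree exactly only when every payload has the same number of scored runs, and some totals here are below 100. The names `LEADERBOARD` and `pooled_rate_percent` are illustrative.

```python
# (model, reported hallucination %, matched, total), copied from the leaderboard.
LEADERBOARD = [
    ("Opus-4.6", 0, 98, 98),
    ("GPT-5.2", 9, 91, 100),
    ("GPT-5-mini", 10, 90, 100),
    ("Grok-4.1", 10, 88, 98),
    ("Gemini-3.1-Flash-Lite", 14, 86, 100),
    ("Sonnet-4.5", 15, 83, 98),
    ("Grok-4.3", 18, 81, 99),
]


def pooled_rate_percent(matched, total):
    """Pooled approximation: share of runs that did not match, as a whole percent."""
    return round(100 * (1 - matched / total))


# Models whose reported rate differs from the pooled approximation.
mismatches = [
    model
    for model, reported, matched, total in LEADERBOARD
    if pooled_rate_percent(matched, total) != reported
]
print(mismatches)
```

For the figures in the table, the pooled approximation rounds to the reported percentage for every model.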

Model Interpretations

Based on the data, we can categorize the models into specific performance profiles:

🏆 Best Stability: Opus-4.6

Opus-4.6 returned the same canonical device identification every time. This is the ideal behavior for production systems that rely on deterministic enrichment and classification.

⚡ Best Efficiency: Gemini-3.1-Flash-Lite

This model delivered the lowest latency (1,302 ms) and the highest mean score (88), with a 14% hallucination rate in the leaderboard, making it attractive for high-throughput workloads.

⚖️ Balanced Performer: GPT-5.2

GPT-5.2 showed strong consistency (9% hallucination) with moderate latency (3,191 ms) and mean cost ($0.00403), offering a balanced trade-off between reliability and performance.

⚠️ Highest Instability: Grok-4.3

Grok-4.3 exhibited the largest variance across repeated identical prompts (18% hallucination), indicating less deterministic behavior under this specific benchmark.

Why Consistency Matters in Device Intelligence

In device intelligence systems like Fingerbank, consistency is often as important as accuracy. If the same network signature is mapped to different device identities on different runs, downstream systems—such as security policies, network analytics, and inventory management—become unreliable.

A low hallucination rate indicates that a model is not only capable of producing correct answers but is also deterministic enough to produce them repeatedly.

Deep Dive: Per-Payload Variability

The benchmark dashboard allows us to inspect individual runs. Below is a snapshot of how different models handled specific payloads. Note the similarity clusters (marked ≈1, ≈2) which highlight where models agreed or diverged.

[Interactive figure: per-payload run results by model, with similarity clusters marked ≈1, ≈2]

Key Takeaways

Determinism varies significantly between models: Hallucination rates ranged from 0% to 18% under identical conditions.

Speed does not necessarily reduce reliability: GPT-5.2 combined the second-lowest latency (3,191 ms) with the second-lowest hallucination rate (9%), while the fastest model, Gemini-3.1-Flash-Lite (1,302 ms), came in at 14%.

Premium models are not always the most stable: Although several high-end models performed well, only Opus-4.6 achieved perfect consistency in this test.

Benchmarking repeated prompts is valuable: Traditional accuracy tests may miss instability that appears only when the same query is executed multiple times.
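A repeated-prompt check of this kind is straightforward to bolt onto an existing evaluation. The sketch below is a minimal, hypothetical harness: `identify` stands for whatever function calls the model, and `fake_identify` is a deterministic stub used only so the example runs without an API.

```python
from itertools import cycle
from typing import Callable, Dict, List


def collect_runs(
    identify: Callable[[str], str],
    payloads: Dict[str, str],
    repeats: int = 10,
) -> Dict[str, List[str]]:
    """Send each payload to the model `repeats` times and keep every answer."""
    return {
        payload_id: [identify(payload) for _ in range(repeats)]
        for payload_id, payload in payloads.items()
    }


def unstable_payloads(runs: Dict[str, List[str]]) -> List[str]:
    """IDs of payloads whose repeated runs produced more than one distinct answer."""
    return [pid for pid, answers in runs.items() if len(set(answers)) > 1]


# Stub standing in for a real model call: stable for printers, flaky otherwise.
_flaky_answers = cycle(["Apple iPhone", "Apple iPhone", "Apple iPad"])


def fake_identify(payload: str) -> str:
    if "printer" in payload:
        return "HP LaserJet"
    return next(_flaky_answers)


payloads = {
    "p1": "signature of a printer",
    "p2": "signature of a phone",
}
runs = collect_runs(fake_identify, payloads, repeats=6)
flagged = unstable_payloads(runs)
print(flagged)
```

A single-shot accuracy test would see only one answer per payload and could not flag `p2`; exact string comparison is used here for brevity, where a real harness would apply its own equivalence rules.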

About This Benchmark

This analysis was generated from a specialized dashboard built to compare LLM behavior across repeated identical prompts. The benchmark focused specifically on device identification tasks, measuring how consistently each model returned a normalized device identity.

For more insights into device intelligence and AI-driven classification, visit the Fingerbank Blog.