Blog

Research Jun 15, 2026

Measuring LLM Consistency: A Hallucination Benchmark for Device Identification

Fingerbank Team

Executive Summary

This benchmark evaluates a simple but critical property of Large Language Models (LLMs): consistency. In this study, each model received the same payload 10 times. A reliable model should produce the same device identification repeatedly; any variability across repeated runs is treated as a form of hallucination.

Opus-4.6 achieved perfect consistency with a 0% hallucination rate across all tested payloads. In contrast, Grok-4.3 showed the highest instability, with an 18% hallucination rate, indicating significant variance when processing identical inputs.
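The summary does not spell out how the hallucination rate is calculated, so the following is only a minimal sketch of one plausible definition: for each payload, take the most common answer across the repeated runs as the reference, and count every run that deviates from it. The function name `hallucination_rate`, the exact-string comparison, and the sample data are illustrative assumptions, not the benchmark's actual method or results.

```python
from collections import Counter


def hallucination_rate(runs_by_payload):
    """Return the fraction of runs that deviate from the modal answer.

    Assumption (not from the benchmark): a run counts as a hallucination
    when its answer differs, by exact string match, from the most common
    answer given for the same payload.
    """
    total = 0
    deviating = 0
    for answers in runs_by_payload.values():
        if not answers:
            continue  # skip payloads with no recorded runs
        # Count of the most frequent answer for this payload
        _, modal_count = Counter(answers).most_common(1)[0]
        total += len(answers)
        deviating += len(answers) - modal_count
    return deviating / total if total else 0.0


# Made-up example: two payloads, 10 runs each
example = {
    "payload-a": ["Apple iPhone"] * 10,
    "payload-b": ["HP Printer"] * 8 + ["Canon Printer", "Generic Printer"],
}
rate = hallucination_rate(example)
```

Under this definition a model that always returns the same identification scores 0, regardless of whether that identification is correct; consistency and accuracy are separate measurements. Exact string matching is also strict, so a real harness would likely normalize answers before comparing them.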

Research Jun 10, 2026

FB vs LLMs V2 (Updated with Grok and Gemini)

Fingerbank Team

Executive Summary

A recurring question in network security and device identification is whether general-purpose Large Language Models (LLMs) can replace specialized, purpose-built engines. To answer it, we benchmarked Fingerbank's production fingerprinting engine against eight leading LLMs: Claude Sonnet 4.5, Claude Opus 4.6, GPT-5.2, GPT-5 mini, Gemini 3.1 Flash Lite, Gemini 2.5 Flash, Grok 4.3, and Grok 4.1.

The benchmark evaluated each engine across 192 distinct scenarios, for a total of 1,728 runs (192 scenarios × 9 engines). We assessed speed (latency), cost, self-reported confidence, and manufacturer accuracy, giving every engine the exact same input telemetry, including MAC addresses, DHCP fingerprints, User-Agents, TCP signatures, and more.

The results are in, and they highlight the enduring value of specialized engineering, even with the introduction of new, powerful LLMs.