Fingerbank · Research

Fingerprint Diagnostics: Benchmarking Fingerbank Against General-Purpose LLMs

Executive Summary

In the rapidly evolving landscape of network security and device identification, the question often arises: Can general-purpose Large Language Models (LLMs) replace specialized, purpose-built engines? To answer this, we conducted a comprehensive benchmark comparing Fingerbank's production fingerprinting engine against eight leading LLMs: Claude Sonnet 4.5, Claude Opus 4.6, GPT-5.2, GPT-5 mini, Gemini 3.1 Flash Lite, Gemini 2.5 Flash, Grok 4.3, and Grok 4.1.

The benchmark evaluated each engine across 192 distinct scenarios, for a total of 1,728 runs (192 scenarios × 9 engines). We assessed speed (latency), cost, self-reported confidence, and manufacturer accuracy, giving every engine the exact same input telemetry, including MAC addresses, DHCP fingerprints, User-Agents, TCP signatures, and more.

The results are in, and they highlight the enduring value of specialized engineering, even with the introduction of new, powerful LLMs.

The Efficiency Frontier: Cost vs. Latency

When deploying a fingerprinting solution at scale, efficiency is paramount. Network traffic waits for no one, and processing millions of requests requires both speed and cost-effectiveness.

Figure 1: Average latency against average cost per call. Up-and-left represents the ideal "faster and cheaper" quadrant.

The data reveals a stark contrast between Fingerbank and the LLMs:

Engine           Average Latency   Average Cost per Call
Fingerbank       339 ms            $0.0004
Gemini 3.1 FL    2.32 s            $0.0005
GPT-5.2          2.97 s            $0.0040
Grok 4.3         8.72 s            $0.0020
Grok 4.1         10.62 s           $0.0003
GPT-5 mini       14.02 s           $0.0025
Sonnet 4.5       14.62 s           $0.0070
Opus 4.6         19.67 s           $0.0323
Gemini 2.5 F     24.90 s           $0.0006

Fingerbank remains the leader in efficiency. At 339 ms, it is roughly 6.8 times faster than the fastest LLM, Gemini 3.1 Flash Lite (2.32 seconds). Notably, Grok 4.1 presents an intriguing cost profile: it is the cheapest engine at $0.0003 per call, slightly below Fingerbank's $0.0004, but at a much higher latency of 10.62 seconds. This highlights a trade-off between ultra-low cost and real-time performance.
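The comparison above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the benchmark's own tooling: the figures are copied from the table, while the function names (`relative_to_baseline`, `pareto_front`) and the choice to define the "efficiency frontier" as the set of engines that no other engine beats on both latency and cost are our assumptions for illustration.

```python
# Average latency (seconds) and average cost per call (USD), copied from the table.
ENGINES = {
    "Fingerbank": (0.339, 0.0004),
    "Gemini 3.1 FL": (2.32, 0.0005),
    "GPT-5.2": (2.97, 0.0040),
    "Grok 4.3": (8.72, 0.0020),
    "Grok 4.1": (10.62, 0.0003),
    "GPT-5 mini": (14.02, 0.0025),
    "Sonnet 4.5": (14.62, 0.0070),
    "Opus 4.6": (19.67, 0.0323),
    "Gemini 2.5 F": (24.90, 0.0006),
}


def relative_to_baseline(engines, baseline="Fingerbank"):
    """Express each engine's latency and cost as a multiple of the baseline's."""
    base_latency, base_cost = engines[baseline]
    return [
        {
            "engine": name,
            "latency_ratio": latency_s / base_latency,
            "cost_ratio": cost_usd / base_cost,
            # Projected spend for one million calls at the average per-call cost.
            "cost_per_million_usd": cost_usd * 1_000_000,
        }
        for name, (latency_s, cost_usd) in engines.items()
    ]


def pareto_front(engines):
    """Return engines that no other engine dominates on both latency and cost.

    Assumption: engine A dominates engine B when A is no worse on both
    metrics and strictly better on at least one.
    """
    front = []
    for name, (lat, cost) in engines.items():
        dominated = any(
            o_lat <= lat and o_cost <= cost and (o_lat < lat or o_cost < cost)
            for other, (o_lat, o_cost) in engines.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


for row in relative_to_baseline(ENGINES):
    print(
        f"{row['engine']:<14} {row['latency_ratio']:6.1f}x latency "
        f"{row['cost_ratio']:6.2f}x cost  ${row['cost_per_million_usd']:,.0f} per 1M calls"
    )
print("Not dominated:", pareto_front(ENGINES))
```

Under this definition only Fingerbank and Grok 4.1 remain on the frontier, which matches the trade-off described above: every other engine is both slower and more expensive than Fingerbank.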

Accuracy and Confidence: Knowing What You Don't Know

Speed and cost mean little if the identification is incorrect. We measured detection correctness across various device categories and evaluated how often each engine successfully matched the officially registered manufacturer.

Figure 2: Detection correctness by test category (left) and manufacturer match rate against Fingerbank’s OUI lookup (right).

Overall Accuracy and Manufacturer Match Rate

Fingerbank continues to significantly outperform all LLMs in both overall accuracy and manufacturer match rate:

Engine           Overall Accuracy   Manufacturer Match Rate   Average Confidence Score
Fingerbank       84%                92%                       40/100
Sonnet 4.5       51%                40%                       75/100
Opus 4.6         51%                50%                       69/100
Grok 4.1         46%                42%                       65/100
Grok 4.3         42%                44%                       66/100
GPT-5 mini       34%                27%                       64/100
GPT-5.2          33%                28%                       58/100
Gemini 3.1 FL    33%                31%                       71/100
Gemini 2.5 F     25%                20%                       73/100

Note on Confidence Score: For Fingerbank, the confidence score is calculated from the weight of the signals that participated in the detection process. For LLMs, it represents their self-estimated confidence.

Fingerbank achieved an overall accuracy of 84%, maintaining a significant lead over all evaluated LLMs. Among the LLMs, Sonnet 4.5 and Opus 4.6 show the highest overall accuracy at 51%, followed by Grok 4.1 at 46%. The confidence-accuracy gap remains a critical observation for LLMs: Sonnet 4.5 reports the highest confidence (75/100) while reaching only 51% accuracy, and Gemini 2.5 Flash reports 73/100 while reaching only 25%.
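One simple way to make the confidence-accuracy gap concrete is to subtract overall accuracy from the average confidence score for each engine. The sketch below does this with the figures from the table. The function name `confidence_gap` is ours, and treating the two values as directly comparable percentages is a simplifying assumption: Fingerbank's score is a signal-weight measure rather than a self-estimate, so its gap is not strictly comparable to the LLMs'.

```python
# (overall accuracy %, average confidence score out of 100), copied from the table.
RESULTS = {
    "Fingerbank": (84, 40),
    "Sonnet 4.5": (51, 75),
    "Opus 4.6": (51, 69),
    "Grok 4.1": (46, 65),
    "Grok 4.3": (42, 66),
    "GPT-5 mini": (34, 64),
    "GPT-5.2": (33, 58),
    "Gemini 3.1 FL": (33, 71),
    "Gemini 2.5 F": (25, 73),
}


def confidence_gap(results):
    """Return confidence minus accuracy, in points, for each engine.

    A positive value means the engine reports more confidence than its
    measured accuracy supports; a negative value means the opposite.
    """
    return {name: conf - acc for name, (acc, conf) in results.items()}


gaps = confidence_gap(RESULTS)
# List engines from the largest positive gap to the smallest.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<14} {gap:+d} points")
```

By this measure every LLM in the table reports more confidence than its accuracy supports, with the widest gap for Gemini 2.5 Flash (+48 points), while Fingerbank's score sits 44 points below its accuracy.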

Category Breakdown

Fingerbank's specialized knowledge continues to shine:

Medical Devices, Video Equipment, and Firewall: Fingerbank achieved 100% accuracy in these critical categories, demonstrating its reliability where precision is paramount. LLMs struggled significantly, often scoring 0% or very low.

Tools (Automotive, Energy, and Tools): This category remains challenging for LLMs, with most scoring around 13-20% accuracy, compared to Fingerbank's 80%.

IoT Devices: While LLMs generally perform better in some IoT subcategories, Fingerbank maintains a strong overall lead in the broader IoT category (85%).

Signal Combinations: The Power of Context

Device fingerprinting relies on combining multiple signals (OUI, DHCP, mDNS, User-Agent, etc.) to form a conclusive identity. We analyzed how detection correctness varies based on which signals appear together in the payload.

Figure 3: Co-occurrence heatmap showing accuracy based on signal combinations.

The heatmap shows that Fingerbank maintains high accuracy across diverse signal combinations, which reflects its ability to synthesize several signals at once. LLMs, while capable of parsing individual signals, often struggle with the interpretation and weighting required for accurate device identification when multiple or conflicting signals are present.
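A co-occurrence heatmap of this kind can be built by grouping runs by every pair of signals present in the payload and computing the share of correct detections per pair. The following is a minimal sketch of that idea, not the code behind Figure 3: the function name, the signal labels, and the sample runs are all hypothetical and chosen only to illustrate the computation.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_accuracy(runs):
    """Compute detection accuracy for each pair of co-occurring signals.

    `runs` is an iterable of (signals, correct) tuples, where `signals` is
    the set of signal names present in the payload and `correct` is True
    when the engine identified the device correctly.
    """
    totals = defaultdict(lambda: [0, 0])  # pair -> [correct count, run count]
    for signals, correct in runs:
        # Sort so that each pair has one canonical key regardless of set order.
        for pair in combinations(sorted(signals), 2):
            totals[pair][0] += int(correct)
            totals[pair][1] += 1
    return {pair: hits / count for pair, (hits, count) in totals.items()}


# Hypothetical sample runs, not benchmark data.
sample_runs = [
    ({"oui", "dhcp", "user_agent"}, True),
    ({"oui", "dhcp"}, True),
    ({"oui", "mdns"}, False),
    ({"dhcp", "user_agent"}, False),
]

for pair, accuracy in sorted(pairwise_accuracy(sample_runs).items()):
    print(pair, f"{accuracy:.0%}")
```

Each key of the returned dictionary corresponds to one cell of the heatmap. A run with three signals contributes to three cells, so cells are not independent samples.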

Failure Rate: The Cost of Uncertainty

Beyond accuracy, the failure rate—how often an engine returned a device name containing "error" or "unknown"—is crucial for real-world deployments. A lower failure rate indicates a more reliable engine.

Figure 4: Share of cases with no confident device name (failure rate).

Opus 4.6 recorded the lowest failure rate at 3%, with Fingerbank close behind at 4%. Gemini 2.5 Flash, however, exhibited a significantly higher failure rate, indicating a greater propensity to return unidentifiable results.
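The failure-rate definition used here (a device name containing "error" or "unknown") is straightforward to express in code. The sketch below is one possible reading of it. The function names and sample names are hypothetical, and matching case-insensitively is our assumption, since the text does not specify it.

```python
def is_failure(device_name):
    """Return True when the device name contains "error" or "unknown".

    Assumption: matching is case-insensitive.
    """
    name = device_name.lower()
    return "error" in name or "unknown" in name


def failure_rate(device_names):
    """Return the share of results that count as failures (0.0 for no results)."""
    names = list(device_names)
    if not names:
        return 0.0
    return sum(is_failure(name) for name in names) / len(names)


# Hypothetical engine outputs, not benchmark data.
sample_names = ["Apple iPhone", "Unknown device", "error: timeout", "Cisco Firewall"]
print(f"Failure rate: {failure_rate(sample_names):.0%}")
```

Note that a substring check like this counts only explicit non-answers. A confident but wrong device name is not a failure under this definition and is captured by the accuracy metric instead.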

Conclusion

The updated benchmark results, now including Grok and Gemini models, reinforce our initial findings: For network device fingerprinting, a specialized engine like Fingerbank remains vastly superior to general-purpose LLMs.

While newer LLMs like Grok and Gemini show varying strengths in cost or speed, none can match Fingerbank's combined performance in the critical areas of real-time latency, cost efficiency, and, most importantly, accuracy across diverse device categories and signal combinations. Fingerbank offers:

Real-time Performance: Significantly faster response times essential for network operations.

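Cost Efficiency: At $0.0004 per call, Fingerbank costs about as much as the cheapest LLMs (only Grok 4.1 is marginally cheaper, at $0.0003) and roughly 80 times less than the most expensive one (Opus 4.6 at $0.0323).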

Superior Accuracy: Consistently higher accuracy and reliability, avoiding the confident hallucinations seen across the LLMs and the high failure rates seen in some of them.

As AI continues to evolve, there may be roles for LLMs in offline analysis or anomaly detection. However, for inline, real-time device fingerprinting, Fingerbank's purpose-built architecture is the undisputed champion.