False Positives in LLM Auditing: Why 20-40% Noise Makes Raw AI Output Unusable — Darkwave Log

The Core Problem: High-Noise Output by Default

There is a persistent assumption in the smart contract security space that adding an AI layer to an audit workflow automatically improves it. The assumption has a surface logic: LLMs can read code quickly, have been trained on millions of code samples, and can reason across vulnerability categories simultaneously. In raw capability terms, they are remarkable instruments. But capability alone is not the same as precision. And in a security audit, precision is everything.

Reported false positive rates for current mainstream LLMs in vulnerability detection are generally higher than 50%. Even when more carefully controlled experiments constrain the measurement to top-deployed contracts with known ground truth, the picture remains sobering. One study using a GPT-4 model with a 32k token window reported a success rate of approximately 59.76% and a failure rate of around 40.24%, with the failures made up of 740 false positives and 41 false negatives.

These are not numbers a practitioner can ignore. A tool for which roughly 40% of reported findings are not real vulnerabilities does not merely add noise; it restructures an audit around managing that noise. Every flagged line must be triaged. Every false alarm consumes investigative time. Every incorrect finding that is inadvertently included in a report damages client trust and the credibility of the team that issued it.

The 20–40% false positive range referenced in production discussions is, if anything, an optimistic framing. It reflects conditions where prompts have been curated, models have been selected carefully, and the contract corpus is relatively conventional. In raw zero-shot usage against complex, real-world protocol code, rates can exceed this substantially. State-of-the-art LLM-based detection solutions are often plagued by high false-positive rates.

This article examines why that happens, which vulnerability classes are most affected, what the operational cost of that noise looks like inside an audit workflow, and what structured approaches exist to bring false positive rates down to a level where AI tooling becomes genuinely useful rather than merely impressive in demos.

Why LLMs Generate False Positives: The Structural Causes

The false positive problem in LLM-assisted auditing is not a bug that will be fixed in the next model release. It emerges from multiple structural properties of how large language models process and reason about code. Understanding these properties is a prerequisite for designing workflows that mitigate them.

Surface Pattern Overfitting

LLMs are trained to predict likely token sequences given prior context. When that prior context includes thousands of vulnerability write-ups, CTF solutions, audit reports, and security papers, the model learns strong associations between certain code patterns and vulnerability labels — even when those associations are not causal.

A model that has been exposed to hundreds of reentrancy examples learns that the pattern external call → state mutation is suspicious. It will flag this pattern even when the contract’s access control, mutex logic, or function ordering makes exploitation impossible. Variation in variable names, formatting, ordering of statements, or syntactic sugar can make it harder for an LLM to generalize vulnerabilities across code that is semantically equivalent but syntactically different. The reverse is also true: semantically dangerous code expressed in unfamiliar syntactic form can be missed entirely, while safe code expressed in familiar “vulnerable-looking” patterns gets flagged.

This is overfitting to surface features rather than semantic reality. The model is not reasoning about whether the vulnerability is exploitable — it is pattern-matching against training data representations of what vulnerable code tends to look like. When safe code happens to look like training examples of unsafe code, a false positive is the inevitable result.

Lack of Deep Semantic Understanding

Smart contract auditing is not fundamentally a pattern-recognition task. It is a semantic reasoning task. Determining whether a given piece of code is exploitable requires understanding: the complete execution context, the call graph, the state machine governing valid state transitions, the access control hierarchy, the trust model for external callers, the economic incentives that would motivate an attacker, and the interaction between multiple contracts in a protocol.

While LLMs excel at detecting explicit code quality violations, they struggle with subtle semantic problems that require a deeper understanding of smart contract design patterns and gas optimization principles. This gap between syntactic fluency and semantic depth is where false positives proliferate. Deep learning methods are typically limited to pattern matching for vulnerabilities and lack a deep understanding of code syntax and semantics, making it challenging to generate detailed vulnerability cause analysis and repair solutions.

Consider an access control check. A function may lack an onlyOwner modifier and yet be perfectly safe because it has internal visibility and is reachable only through an external function that is itself protected. An LLM scanning the helper function in isolation will flag the missing modifier as a vulnerability. A human auditor who has traced the call graph will not. This distinction, between a vulnerability in isolation and a vulnerability in context, is precisely what LLMs are weakest at.
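The call-graph reasoning in that example can be sketched in a few lines. This is a minimal illustration, assuming the call graph, the set of externally callable entry points, and the set of access-controlled functions have already been extracted by some other tool; the function and contract names are hypothetical.

```python
def exposed_entry_points(call_graph, entry_points, protected, target):
    """Return the unprotected entry points that can reach `target`.

    call_graph:   dict mapping a function name to the functions it calls
    entry_points: set of externally callable functions
    protected:    set of functions guarded by an access-control modifier
    A path that passes through a protected function is treated as blocked.
    """
    exposed = set()
    for entry in entry_points - protected:
        stack, seen = [entry], set()
        while stack:
            fn = stack.pop()
            if fn == target:
                exposed.add(entry)
                break
            if fn in seen or fn in protected:
                continue
            seen.add(fn)
            stack.extend(call_graph.get(fn, []))
    return exposed


# Hypothetical contract: _applyFee has no modifier of its own.
graph = {"setFee": ["_applyFee"], "quote": ["_readFee"]}
safe = exposed_entry_points(graph, {"setFee", "quote"}, {"setFee"}, "_applyFee")

# Adding an unprotected caller changes the verdict.
graph_with_leak = dict(graph, sweep=["_applyFee"])
leaky = exposed_entry_points(
    graph_with_leak, {"setFee", "quote", "sweep"}, {"setFee"}, "_applyFee"
)
```

An empty result means the missing modifier is not reachable by an unauthorized caller under this model; a non-empty result names the entry points an auditor should look at first.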

If the LLM auditor is asked to review a contract previously identified as having arithmetic vulnerabilities, there is a risk of misleading feedback: the model might flag the contract as vulnerable to both “Integer Overflow/Underflow” and “Reentrancy”, producing a false positive for reentrancy. This cross-contamination of vulnerability categories is a direct symptom of pattern-based rather than semantic reasoning.

Context Window Limitations

Real-world smart contract protocols are not simple, standalone files. They consist of dozens of interacting contracts, inherited base classes, imported libraries, and cross-contract function calls. The vulnerability surface of any single function may depend on state set by a function in a different file, called through a proxy, six steps earlier in the execution trace.

A context window is the amount of text an LLM can “see” at once, including the system prompt, the user prompt, chat history, tool responses, RAG retrieved chunks, and the model’s own intermediate reasoning tokens. Everything counts as tokens, and once you exceed the context limit, older tokens are either truncated or dropped.

When a contract exceeds the context window — or when the surrounding contracts that define its security invariants are simply not included — the model reasons from incomplete information. It cannot see the modifier defined in the base contract. It cannot trace the cross-contract call that resets the state flag before the external call happens. It evaluates a code fragment as if it were a complete program, which it is not.

The consequence of this incompleteness is systematic overestimation of risk. When context is missing, the model defaults to “this could be vulnerable” rather than “I cannot determine this.” If the correct answer is not strongly represented in the context, or if the prompt encourages guessing, the model will generate an answer that sounds correct. In a vulnerability detection context, the “sounds correct” failure mode is a confident-sounding false positive.
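One inexpensive safeguard is to check, before prompting, whether the files a contract imports are actually present in the context being sent. The sketch below is only a heuristic: it assumes sources are held in a dict keyed by path, matches imports with a regular expression rather than a real Solidity parser, and compares paths by basename.

```python
import re

# Matches `import "./Base.sol";` and `import {Base} from "./Base.sol";`
IMPORT_RE = re.compile(
    r'^\s*import\s+(?:[^"\']*\sfrom\s+)?["\']([^"\']+)["\']', re.MULTILINE
)


def missing_imports(sources):
    """Map each file to the imported paths that are absent from `sources`.

    sources: dict of {path: solidity_source_text}
    Paths are compared by basename only, which is a simplification.
    """
    provided = {path.rsplit("/", 1)[-1] for path in sources}
    gaps = {}
    for path, text in sources.items():
        absent = [
            imp for imp in IMPORT_RE.findall(text)
            if imp.rsplit("/", 1)[-1] not in provided
        ]
        if absent:
            gaps[path] = absent
    return gaps


context = {
    "Vault.sol": (
        "pragma solidity ^0.8.0;\n"
        'import "./Guarded.sol";\n'
        'import {Math} from "./lib/Math.sol";\n'
        "contract Vault is Guarded {}\n"
    ),
    "Guarded.sol": "contract Guarded {}\n",
}
gaps = missing_imports(context)
```

If the result is non-empty, the pipeline can either add the missing files or tell the model explicitly that its view is incomplete, rather than letting it guess.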

Even emerging AI-based methods suffer from hallucinations, context constraints, and a heavy reliance on expensive, proprietary large language models. Context constraint is not a temporary limitation of today’s models — it is a fundamental architectural property that even very large context windows do not eliminate for complex multi-contract systems.

Overconfident Generation Under Uncertainty

A related but distinct failure mode is what researchers describe as hallucination: the model’s tendency to generate plausible-sounding outputs when it lacks the information to produce accurate ones. AI misinterprets execution paths, overestimates risk, or fabricates pathways that don’t exist. In code analysis, this means the model describes a precise attack path — complete with function names, state variable manipulations, and economic impact — for a vulnerability that cannot actually be triggered.

ChatGPT is susceptible to hallucinations when interpreting code semantic structures and fabricating nonexistent facts. These hallucinated vulnerability reports are particularly dangerous in an audit workflow because they can be difficult to disprove quickly. The description sounds authoritative and specific. Refuting it requires tracing the described execution path in detail — consuming exactly the time that AI tooling was supposed to save.

Increased randomness in LLM answers boosts the chance of correct detection but raises false positives. This precision-recall tradeoff is not unique to smart contract auditing, but it has a particularly sharp edge here: the cost of investigating a convincingly described false positive is high, and the cost of accepting a false positive as a real finding in a published audit is higher still.

False Positive Rates by Vulnerability Class

Not all vulnerability categories are equally susceptible to false positives. The rates vary significantly based on how syntactically stereotyped the vulnerability pattern is, how much cross-contract context is required for an accurate determination, and how well-represented the category is in training data.

Reentrancy

Reentrancy is simultaneously one of the most over-represented vulnerabilities in LLM training data and one of the most context-dependent vulnerabilities in practice. Every LLM has been trained on extensive material about the DAO hack, reentrancy patterns, Checks-Effects-Interactions, and related content. This creates aggressive pattern-matching: any external call followed by a state update is treated as suspicious.

In practice, the majority of external calls in modern Solidity are to trusted contracts, and many of the remaining cases are protected by reentrancy guards, function ordering, or access control. A reentrancy analysis is only triggered correctly if the contract contains logic for handling Ether or tokens, as this vulnerability’s callbacks typically occur during currency transfers. LLMs routinely skip this prerequisite check, flagging generic external calls even in contexts where reentrancy is structurally impossible.
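The prerequisite check described above can be approximated with a cheap pre-filter that runs before any reentrancy prompt is issued. This is a rough sketch: the regular expressions are illustrative and not exhaustive, and a guard keyword in the source is treated as a reason to deprioritize, not as proof of safety.

```python
import re

# Heuristic markers of Ether or token movement in Solidity source.
VALUE_TRANSFER_RE = re.compile(
    r"\.call\s*\{[^}]*value\s*:|\.transfer\s*\(|\.send\s*\(|safeTransfer(?:From)?\s*\("
)
# Conventional name of a reentrancy guard modifier.
GUARD_RE = re.compile(r"\bnonReentrant\b")


def reentrancy_worth_reviewing(function_source):
    """Return True only if the function moves value and shows no guard."""
    moves_value = VALUE_TRANSFER_RE.search(function_source) is not None
    guarded = GUARD_RE.search(function_source) is not None
    return moves_value and not guarded


unguarded = (
    "function withdraw() external {"
    ' (bool ok,) = msg.sender.call{value: bal}(""); balances[msg.sender] = 0; }'
)
guarded = unguarded.replace("external", "external nonReentrant")
no_value = "function ping() external { registry.notify(id); count += 1; }"
```

Functions that fail the check are not declared safe; they are simply moved down the queue so that the model and the reviewer spend their effort where the vulnerability is structurally possible.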

Access Control

Access control findings exhibit a different false positive pattern. The LLM identifies functions without explicit modifiers and flags them as “missing access control.” This is correct as a syntactic observation, but incorrect as a security conclusion when the function’s safety is guaranteed by other means — internal call restrictions, state machine invariants, or business logic that makes the function’s execution safe regardless of who calls it.

Reentrancy, complex fallback, and faulty access control policies are challenging for current verification solutions, which often generate false alarms or fail to detect them entirely.

Arithmetic Vulnerabilities

Since Solidity 0.8, arithmetic overflow and underflow revert by default because the compiler inserts checks. LLM-based detectors have adapted poorly to the newer language version: the recall rate for detecting some specific vulnerabilities in Solidity v0.8 has dropped to just 13% compared with earlier versions (v0.4). Further analysis attributes this decline to the reliance of LLMs on identifying changes in newly introduced libraries and frameworks during detection.

Models trained predominantly on older Solidity versions continue to flag ordinary arithmetic in v0.8 code as a potential overflow risk, even though it is protected by the new default checked behavior; only arithmetic inside an explicit unchecked block opts out of those checks. This is training data bias generating false positives: the model’s learned patterns correspond to an older threat landscape.
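A version-aware filter can suppress many of these stale overflow findings. The sketch below assumes the first version number in the pragma is the lower bound of the supported range, and it deliberately ignores narrowing casts and inline assembly, which are not covered by the compiler's checks.

```python
import re
from typing import Optional

PRAGMA_RE = re.compile(r"pragma\s+solidity\s+[^;]*?(\d+)\.(\d+)(?:\.(\d+))?")
UNCHECKED_RE = re.compile(r"\bunchecked\s*\{")


def checked_arithmetic_by_default(source: str) -> Optional[bool]:
    """True if the pragma's first version is >= 0.8, None if no pragma is found."""
    match = PRAGMA_RE.search(source)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2))) >= (0, 8)


def overflow_finding_plausible(source: str) -> bool:
    """Keep overflow findings unless the compiler already rules them out."""
    checked = checked_arithmetic_by_default(source)
    if not checked:  # unknown or pre-0.8 compiler: stay conservative
        return True
    # On 0.8+, only code inside an unchecked block skips the checks.
    return UNCHECKED_RE.search(source) is not None


modern = "pragma solidity ^0.8.19; contract C { function f(uint a) public pure returns (uint) { return a + 1; } }"
opted_out = modern.replace("return a + 1;", "unchecked { return a + 1; }")
legacy = modern.replace("^0.8.19", "^0.4.24")
```

A range pragma such as `>=0.4.22 <0.9.0` is treated as pre-0.8 here, which errs on the side of keeping the finding.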

Business Logic and Oracle Manipulation

At the opposite end of the spectrum, conventional automated tools frequently suffer from high false positive rates and struggle to detect complex, context-dependent vulnerabilities. Such tools often lack the nuanced understanding of business logic and intricate blockchain-specific security patterns required to identify subtle flaws, such as sophisticated logic errors or vulnerabilities arising from inter-contract interactions.

Business logic vulnerabilities — price manipulation, flash loan attacks, governance capture, incentive misalignment — are precisely where LLMs produce the fewest false positives, not because they are accurate, but because they largely fail to detect them at all. The false positive problem concentrates in syntactically stereotyped categories; the false negative problem concentrates in semantically complex ones. This asymmetry is important to understand when designing a hybrid workflow.

When dealing with large inheritance trees, deep mappings, oracle interactions, or state machines, AI often generates incorrect conclusions.

The Cost of False Positives in an Audit Workflow

The false positive problem is not merely an inconvenience. It imposes concrete operational costs that compound across the lifecycle of an audit engagement.

Wasted Triage Time

Every finding flagged by an AI tool must be evaluated before it can be accepted or dismissed. In a workflow where the tool generates 40 findings and 25 of them are false positives, the human auditor is spending the majority of their investigation time on non-issues.

Running a tool, opening the report, manually triaging 200+ findings, and then realizing 180 are false positives before finally missing the real bug because of fatigue — the signal-to-noise ratio kills you. This is not a hypothetical. It describes what production usage of high-false-positive AI tooling actually looks like in practice.

The mathematics are straightforward: if an auditor spends 30 minutes investigating each flagged finding, and a tool generates 50 findings with a 40% false positive rate, 10 hours of investigative time goes to findings that were never real. That is a significant fraction of a typical audit engagement’s total budget, consumed entirely by noise management.
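The same arithmetic as a small helper, useful for plugging in a team's own numbers. The function name is illustrative; the inputs are the ones used in the paragraph above.

```python
def wasted_triage_hours(findings, false_positive_rate, minutes_per_finding):
    """Hours spent investigating findings that turn out to be false positives."""
    false_positives = findings * false_positive_rate
    return false_positives * minutes_per_finding / 60


# 50 findings, 40% false positives, 30 minutes each: about 10 hours.
hours = wasted_triage_hours(50, 0.40, 30)
```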

Alert Fatigue and Degraded Attention

Beyond raw time cost, high false positive rates cause alert fatigue — the progressive degradation of human attention and scrutiny that occurs when a reviewer is exposed to a sustained stream of low-quality findings. False-positive fatigue is one of the primary risks when using any security tooling.

An auditor who has investigated twenty false positives in a row begins to triage more quickly and less thoroughly. The finding that eventually represents a real vulnerability may receive the same cursory review as the preceding false alarms. This is not a failure of professional rigor — it is a well-documented human cognitive response to sustained noise exposure. The operational implication is that high false positive rates do not merely waste time: they actively degrade the quality of the human judgment they were supposed to augment.

Distinguishing a theoretical low-severity nitpick from a critical, protocol-draining exploit requires actual human judgment. That judgment is most reliable when it is applied selectively, to a filtered set of high-probability findings. It degrades when it is applied indiscriminately to everything a noisy tool produces.

Reduced Trust in Tooling

There is a subtler and longer-term cost: false positives erode trust in AI tooling across the team. When an auditor encounters false positive after false positive, they recalibrate their prior about the tool’s outputs. Over time, this rational recalibration can shade into dismissiveness — a tendency to discount tool findings without investigation, increasing the risk that a true positive is missed.

Key risks include false positives, limitations in handling highly complex or novel code structures, and reliance on training data that may not fully reflect emerging threats. Once a tool’s credibility is damaged within a team, recovering it requires substantial evidence of improved precision — not just isolated success stories but systematic performance data.

The practical endpoint of an untreated false positive problem is an AI tool that has technically been “integrated” into the workflow but whose findings are no longer read carefully. This outcome is worse than not using the tool at all, because it provides a false assurance that AI-assisted coverage is in place.

Strategies for Reducing False Positives

The research literature and production deployments have converged on several approaches that demonstrably reduce false positive rates. None of them eliminates the problem entirely, but in combination they can bring raw AI output to a level of precision that is useful in practice.

Prompt Engineering

Prompt design has an outsized effect on false positive rates. A naive prompt that asks “identify all vulnerabilities in this contract” produces dramatically different output than a prompt that asks the model to reason about exploit feasibility, specify the exact attack path with concrete preconditions, and distinguish between theoretical and practical risk. A well-designed prompt can reduce the false-positive rate by over 60%.

Effective prompt engineering for vulnerability detection typically includes several elements: explicit instructions to only report findings that can be demonstrated with a concrete execution trace; requirements to specify the caller, the state preconditions, and the economic conditions that would be necessary for exploitation; and directives to explicitly consider whether the surrounding code — guards, modifiers, state checks — prevents the flagged path from being reachable.
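As a rough illustration, those elements can be encoded in a reusable template that is instantiated once per vulnerability class. The wording, the `CANNOT_DETERMINE` sentinel, and the class list are assumptions made for this sketch, not a tested prompt.

```python
from string import Template

# Hypothetical prompt wording; tune it against your own evaluation corpus.
PROMPT_TEMPLATE = Template(
    "You are reviewing Solidity code for $vuln_class only.\n"
    "Report a finding only if you can give a concrete execution trace.\n"
    "For each finding, state:\n"
    "1. The caller and the exact sequence of calls.\n"
    "2. The state preconditions required.\n"
    "3. The economic conditions required for exploitation.\n"
    "4. Why existing guards, modifiers, and state checks do not block the path.\n"
    "If you cannot complete all four items, answer CANNOT_DETERMINE.\n\n"
    "Code:\n$source\n"
)

VULN_CLASSES = ["reentrancy", "access control", "arithmetic issues"]


def build_prompt(vuln_class, source):
    """Fill the template for a single vulnerability class."""
    return PROMPT_TEMPLATE.substitute(vuln_class=vuln_class, source=source)


def build_prompts(source):
    """One focused prompt per vulnerability class instead of one generic prompt."""
    return {cls: build_prompt(cls, source) for cls in VULN_CLASSES}


prompts = build_prompts("contract C {}")
```

Giving the model an explicit way to decline is the important design choice: it offers an alternative to the confident guess that would otherwise become a false positive.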

A dynamic prompt engineering strategy, in which prompts are iteratively refined throughout the audit lifecycle based on real-time analytical feedback and the evolving audit context, enhances LLM accuracy, improves the reliability of security assessments, and mitigates common issues such as output misinterpretation or factual hallucination.

Vulnerability-specific prompts consistently outperform generic prompts. Rather than asking for all vulnerabilities simultaneously, prompting the model separately for reentrancy, then for access control, then for arithmetic issues — with each prompt designed around the specific semantic requirements of that vulnerability class — produces both higher recall and lower false positive rates. A high negative-recall rate corresponds to a low false-positive rate, which is critical for the practical deployment of vulnerability detectors in real-world auditing workflows, as each false-positive detection typically requires manual verification by security auditors, incurring substantial time and resource costs.

In the error detection task, chain-of-thought and tree-of-thought prompting substantially increase recall, often approaching 95–99%, but typically reduce precision, indicating a more sensitive decision regime with more false positives. This tradeoff is a fundamental design decision in prompt strategy: teams optimizing for recall at the cost of precision will need stronger downstream filtering.

Retrieval-Augmented Generation (RAG)

RAG addresses a key root cause of false positives: the LLM’s reliance on static training data that may not reflect current contract patterns, Solidity version behavior, or protocol-specific context. The successful integration of RAG with LLMs for code security analysis demonstrates how domain-specific knowledge retrieval can overcome limitations of standalone LLMs, particularly their reliance on static training data.

In a RAG-enabled audit pipeline, the model’s reasoning is grounded in retrieved documents at query time — including known vulnerability patterns for the specific contract type being audited, past audit findings for similar protocols, ERC standard specifications, and protocol documentation. SmartAuditFlow incorporates a hybrid approach by integrating inputs from traditional static analysis tools and retrieval-augmented generation. The RAG component leverages external knowledge bases to provide LLMs with up-to-date, authoritative security information and vulnerability patterns, thereby enriching contextual understanding and improving the accuracy and completeness of detected vulnerabilities.

By integrating retrieval-augmented generation, the model retrieves domain-specific knowledge from the ERC documentation during classification, enhancing its contextual understanding. Unlike static code analysis tools or zero-shot prompting, the model generates label names directly as outputs, improving interpretability and adaptability to complex scenarios.

The improvement in false positive rates from RAG comes primarily from two mechanisms. First, the model can retrieve examples of safe code that superficially resembles dangerous code, helping it calibrate against surface-pattern over-triggering. Second, it can retrieve the specific semantic rules governing the vulnerability class — including the conditions under which it is not triggered — providing the grounding that context window limitations would otherwise eliminate.

Proof-of-Concept Generation as a Filter

One of the most effective false positive filters is requiring the model to generate a concrete proof-of-concept exploit before a finding is considered valid. A model that can only identify a pattern but cannot construct a runnable test that demonstrates exploitability is, in effect, flagging a theoretical concern rather than a demonstrated vulnerability.

Within a multiphased workflow, a proof-of-concept writing phase where agents generate PoCs to validate findings can dramatically reduce false positives. This approach forces the model to move from pattern recognition to semantic reasoning. Constructing a working exploit requires specifying: the exact sequence of transactions, the state preconditions, the caller addresses, and the expected state change. Many LLM false positives collapse at this stage because the model cannot specify a coherent attack path — which is precisely the information that would confirm a true positive.

If prerequisite patterns are present, an LLM can be used to generate a concrete proof-of-exploit. To improve the accuracy of this output, the LLM should be guided using “testing patterns” tailored to each defect class. For the final stage, executing the tests using a symbolic or concrete execution engine verifies the presence or absence of a defect.

This chaining of LLM output with formal or concrete execution verification is currently one of the most reliable approaches to precision improvement. Combining LLM predictions with symbolic validators may reduce hallucination while retaining flexibility.
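The gating logic itself is simple once a runner exists. In the sketch below, `run_poc` is a hypothetical callable standing in for whatever executes the generated test (a test framework invocation or a symbolic execution engine); the stub used in the example only checks for a keyword and exists purely to make the snippet runnable.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Finding:
    title: str
    poc_source: Optional[str] = None  # test code written by the model, if any


def poc_gate(
    findings: List[Finding], run_poc: Callable[[str], bool]
) -> Tuple[List[Finding], List[Finding]]:
    """Split findings into those with a passing PoC and everything else."""
    confirmed, unconfirmed = [], []
    for finding in findings:
        if finding.poc_source and run_poc(finding.poc_source):
            confirmed.append(finding)
        else:
            unconfirmed.append(finding)
    return confirmed, unconfirmed


def stub_runner(poc_source: str) -> bool:
    # Placeholder only: a real runner would compile and execute the test.
    return "assert" in poc_source


candidates = [
    Finding("drain via withdraw", "assert attacker_balance > initial"),
    Finding("missing modifier on helper"),
    Finding("overflow in fee math", "TODO: could not construct a trace"),
]
confirmed, unconfirmed = poc_gate(candidates, stub_runner)
```

Unconfirmed findings need not be discarded outright; routing them to a lower-priority queue keeps the filter from hiding a real issue whose exploit the model simply failed to write.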

Human-in-the-Loop Filtering

No automated approach currently achieves the false positive rate required for direct, unreviewed inclusion of AI findings in a published audit report. While large language models can accurately identify some vulnerabilities in smart contract security audits, their high false positive rate still requires the involvement of manual auditors.

The correct framing for AI tooling at current capability levels is as a first-pass triage system, not a final authority. Used this way, a well-filtered AI triage pass can remove many false positives and flag cross-finding interaction patterns. The future of smart contract auditing is AI handling the mechanical triage so humans can focus on the creative, adversarial thinking that catches the bugs that actually get exploited.

Human-in-the-loop filtering works most effectively when the AI tool provides not just a finding but a structured justification — including the specific code path, the preconditions, the attack mechanism, and an explicit confidence assessment. Self-reflective frameworks could be explored to help models estimate their own confidence and suppress low-certainty predictions. Findings accompanied by low-confidence scores or incomplete attack path specifications can be deprioritized or dismissed without full investigation, allowing human attention to concentrate on the findings most likely to be valid.
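A structured justification lends itself to a simple triage queue. The field names and the 0.5 threshold below are assumptions for illustration, and a model's self-reported confidence should be treated as a rough ordering signal, not a calibrated probability.

```python
from dataclasses import dataclass


@dataclass
class Justification:
    title: str
    code_path: str = ""
    preconditions: str = ""
    attack_mechanism: str = ""
    confidence: float = 0.0  # model self-estimate in [0, 1]

    def is_complete(self) -> bool:
        return all([self.code_path, self.preconditions, self.attack_mechanism])


def triage_queue(findings, min_confidence=0.5):
    """Return (review, deprioritized); review is sorted by confidence, highest first."""
    review, deprioritized = [], []
    for finding in findings:
        if finding.is_complete() and finding.confidence >= min_confidence:
            review.append(finding)
        else:
            deprioritized.append(finding)
    review.sort(key=lambda f: f.confidence, reverse=True)
    return review, deprioritized


reported = [
    Justification("A", "withdraw -> call", "balance > 0", "re-enter withdraw", 0.7),
    Justification("B", "setFee", "", "", 0.9),  # incomplete attack path
    Justification("C", "sweep -> _applyFee", "any caller", "set fee to max", 0.95),
    Justification("D", "mint", "owner key", "inflate supply", 0.2),
]
review, deprioritized = triage_queue(reported)
```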

Deep fundamentals are the only thing that makes AI usable. That depth of understanding is what separates auditors who catch AI mistakes from those who ship them as findings.

Multi-Agent Adversarial Architectures

A structural approach to reducing false positives is to use multiple agents in an adversarial configuration: one agent produces findings, another attempts to refute them. GPTLens introduces an adversarial framework dividing the detection process into two phases — generation and discrimination — where the LLM serves as both auditor and critic. This dual-role approach, aiming to expand vulnerability detection while reducing inaccuracies, significantly outperforms traditional methods.

The discriminator agent’s task is specifically to challenge the auditor agent’s findings: to identify whether the described attack path is actually reachable, whether the stated preconditions can be satisfied, and whether the code contains mitigations the auditor agent missed. Findings that survive adversarial challenge are substantially more likely to be true positives than findings that pass through a single-agent pipeline unchallenged.
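The control flow of such a two-role setup can be sketched independently of any particular model. This is not GPTLens's implementation; the `auditor` and `critic` callables are hypothetical stand-ins for LLM calls, and the stubs exist only to make the example runnable.

```python
from typing import Callable, List, Tuple

Auditor = Callable[[str], List[str]]  # source -> candidate findings
Critic = Callable[[str, str], bool]   # (source, finding) -> True if it survives


def adversarial_filter(
    source: str, auditor: Auditor, critic: Critic
) -> Tuple[List[str], List[str]]:
    """Run the auditor, then keep only the findings the critic fails to refute."""
    survived, refuted = [], []
    for finding in auditor(source):
        (survived if critic(source, finding) else refuted).append(finding)
    return survived, refuted


def stub_auditor(source: str) -> List[str]:
    # Placeholder for a generation-phase LLM call.
    return ["reentrancy in withdraw", "missing access control on sweep"]


def stub_critic(source: str, finding: str) -> bool:
    # Placeholder for a discrimination-phase LLM call:
    # refute reentrancy claims when a guard is visible in the source.
    return not ("reentrancy" in finding and "nonReentrant" in source)


src = "function withdraw() external nonReentrant { /* ... */ }"
survived, refuted = adversarial_filter(src, stub_auditor, stub_critic)
```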

Some frameworks build the same idea into a larger pipeline: contextual profiling optimizes context retention via graph clustering; model-agnostic auditing combines semantic and neurosymbolic reasoning for comprehensive detection; and false positive filtration filters hallucinations through adversarial feasibility checks, with the stated aim of high-precision results at minimal cost.

Evaluating an AI Auditing Tool’s False Positive Rate Before Integration

Before committing to any AI auditing tool in a production workflow, teams should conduct a structured false positive rate evaluation. Marketing materials rarely present this metric accurately; vendor-reported numbers are typically measured on benchmark datasets that have been optimized for, and may not reflect the tool’s behavior on real-world production contracts.

Demand Two Separate Metrics: Detection Rate and Overreporting Index

A single accuracy or F1 score is insufficient for evaluating an auditing tool, because it can mask a high false positive rate behind a high detection rate. A two-dimensional analysis is far more informative than a single metric such as accuracy or F1-score, which might obscure how a high score was achieved. It is precisely by decomposing quality into these two metrics that developers can identify a model’s weaknesses and deliberately improve either its sensitivity or its selectivity.

The two key metrics are: detection rate (the fraction of real vulnerabilities the tool finds) and overreporting index (the rate at which the tool generates findings on clean contracts). The overreporting index is a measure of the tendency of an AI auditor to generate false positives in error-free contracts, normalized by the size of the codebase. A tool with a high detection rate but a high overreporting index is not suitable for production use without significant filtering infrastructure.
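Both metrics are easy to compute once the evaluation corpus is labeled. The text above says only that the overreporting index is normalized by codebase size; expressing it as findings per 1,000 lines of code is an assumption made for this sketch.

```python
def detection_rate(true_findings, known_vulnerabilities):
    """Fraction of known vulnerabilities that the tool reported."""
    if known_vulnerabilities == 0:
        raise ValueError("corpus contains no known vulnerabilities")
    return true_findings / known_vulnerabilities


def overreporting_index(findings_on_clean, clean_lines_of_code, per=1000):
    """Findings raised on known-clean code, per `per` lines (assumed unit)."""
    if clean_lines_of_code == 0:
        raise ValueError("clean corpus is empty")
    return findings_on_clean * per / clean_lines_of_code


# Illustrative numbers, not measurements.
dr = detection_rate(18, 24)          # 0.75
ori = overreporting_index(12, 4000)  # 3.0 findings per 1,000 clean lines
```

Reporting the two numbers side by side makes it obvious when a high detection rate has been bought with indiscriminate flagging.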

Test Against Known-Clean and Known-Vulnerable Corpora

The most important evaluation signals are precision (signal-to-noise ratio of findings), recall (coverage of known vulnerabilities), and F1 score, computed both overall and specifically for critical/high/medium findings. Additionally, the tool’s ability to ingest additional context, the quality of recommendations, the level of effort needed to run the tool, and whether the tool can integrate with a CI pipeline should all be factored in. A sound evaluation uses a benchmark dataset of real-world security audits comprising verified vulnerabilities.

The evaluation corpus should contain both vulnerable contracts and clean contracts. Contracts guaranteed to have no vulnerabilities — neither natural nor injected — should be included. Every alert triggered on these error-free contracts is a false positive. Many tools are evaluated only against vulnerable corpora; measuring their behavior on clean contracts is essential for understanding the noise floor.

Stratify Results by Vulnerability Category

Because false positive rates vary substantially by vulnerability class, aggregate precision numbers conceal important information about where a tool is and is not reliable. The dataset analysis may reveal no discernible pattern in the performance of a RAG-LLM system across different vulnerability types. Variability in success rates is evident, with specific contracts having the same vulnerability type demonstrating significant deviations in how effectively they were audited.

A practical evaluation should compute precision and false positive rate separately for reentrancy, access control, arithmetic issues, denial-of-service patterns, and oracle manipulation. This stratified view allows teams to use the tool selectively — relying on it for categories where it has demonstrated precision, and maintaining heavier human review for categories where its false positive rate remains high.
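A stratified precision table takes only a few lines to compute from triaged findings. The sketch assumes each finding has already been labeled as a true or false positive by a reviewer; the sample data is invented.

```python
from collections import defaultdict


def precision_by_category(labeled_findings):
    """labeled_findings: iterable of (category, is_true_positive) pairs."""
    counts = defaultdict(lambda: [0, 0])  # category -> [true positives, total]
    for category, is_true_positive in labeled_findings:
        counts[category][0] += int(is_true_positive)
        counts[category][1] += 1
    return {cat: tp / total for cat, (tp, total) in counts.items()}


sample = [
    ("reentrancy", True), ("reentrancy", False),
    ("reentrancy", False), ("reentrancy", False),
    ("access control", True), ("access control", False),
    ("oracle manipulation", True),
]
table = precision_by_category(sample)
```

Categories with very few findings, such as the single oracle finding here, produce unstable estimates, so the counts should be reported alongside the ratios.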

Evaluate on Contracts Representative of Your Protocol Type

Benchmark performance on simple ERC-20 tokens does not predict performance on complex DeFi protocols with multi-contract architectures, cross-chain interactions, or governance mechanisms. Smart contract code can vary widely in structure and complexity, especially in advanced use cases involving cross-chain interactions or highly customized logic. AI models must be continuously updated to handle these variations effectively, and gaps in training data can limit their ability to detect certain types of vulnerabilities.

Before integrating any tool, evaluate it on contracts that closely resemble the code your team will actually audit. The closer the evaluation corpus to the production use case, the more predictive the resulting false positive rate measurement will be.

Run Multiple Passes and Measure Consistency

It has been shown that each independent run can produce different outputs due to AI being non-deterministic. Allowing an agent to run multiple times within each phase, hunting for additional bugs with fresh context windows, can drive true positives higher. This non-determinism also matters for false positive evaluation: a finding that appears on only one of five independent runs deserves substantially less weight than a finding that appears consistently. Consistency across runs is itself a signal of finding quality.
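Consistency can be measured by counting how often each finding recurs across runs. The sketch assumes findings have been normalized to a stable identifier, here a (function, vulnerability class) pair, which in practice takes some care because the model words the same issue differently from run to run.

```python
from collections import Counter


def finding_consistency(runs):
    """Fraction of runs in which each finding appears.

    runs: list of collections of finding identifiers, one collection per run.
    """
    if not runs:
        return {}
    counts = Counter(fid for run in runs for fid in set(run))
    return {fid: n / len(runs) for fid, n in counts.items()}


stable = ("withdraw", "reentrancy")
flaky = ("setFee", "access control")
five_runs = [{stable}, {stable, flaky}, {stable}, {stable}, {stable}]
scores = finding_consistency(five_runs)
```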

The Practical Threshold: When Does AI Become Useful?

The goal is not zero false positives — no security tool achieves that — but a false positive rate low enough that the net time saved by AI-assisted detection exceeds the time spent managing noise.

A rough operational threshold: if more than one in three of a tool's findings turns out to be a false positive that needed full investigation before it could be dismissed, the tool is increasing auditor workload rather than reducing it. The economics only favor AI integration when the tool’s precision is high enough that the majority of its flagged findings warrant genuine investigation, and its recall is high enough that it surfaces real vulnerabilities the human pass might have missed.

While not yet ready to replace human auditors, the system shows promise as an auxiliary tool that could significantly enhance auditing efficiency. The ability to identify common vulnerabilities automatically could reduce overall auditing costs and time requirements.

Achieving that threshold requires treating the AI tool not as a drop-in vulnerability oracle but as a component in a pipeline — one whose output requires prompt engineering on the input side, RAG augmentation for domain grounding, PoC generation for feasibility filtering, and structured human review on the output side. The raw output of any current LLM, applied directly to production contract code, is not suitable for inclusion in a professional audit. The structured output of a well-designed pipeline, with calibrated filtering at each stage, is.

AI models become reliable only when operating under strict structure, controlled context, and deterministic constraints. Building that structure is the actual engineering work of AI-assisted auditing. The LLM is a powerful component within that system. It is not the system itself.

**Key takeaways for practitioners:**

Raw LLM false positive rates on clean contracts regularly exceed 40–50%. The 20–40% figure cited in production contexts reflects optimized, filtered pipelines — not zero-shot usage.
The primary causes are surface-pattern overfitting, lack of semantic depth, context window incompleteness, and hallucinated attack paths.
False positive rates are highest for syntactically stereotyped categories (reentrancy, basic access control) and lower — but so is recall — for semantic categories (business logic, oracle manipulation).
The cost of false positives is not just wasted time. It is alert fatigue, degraded human judgment, and eroded trust in tooling.
The strongest mitigations are: structured prompt engineering, RAG with domain-specific knowledge bases, PoC generation as a feasibility gate, and human-in-the-loop triage with confidence scoring.
Evaluate any tool before integration using both a vulnerable corpus and a clean corpus, stratified by vulnerability class, with overreporting index measured explicitly alongside detection rate.