Prompt Engineering for Smart Contract Security: Getting Useful Output from LLMs — Darkwave Log

Smart contract auditing with LLMs sits at an uncomfortable intersection: the models are genuinely capable of reasoning about code semantics, execution paths, and attack vectors, but they will also confidently produce generic, low-signal output if you let them. The difference is almost entirely in the prompt.

This article is a practitioner’s guide. It covers how to structure prompts for vulnerability detection from first principles, the trade-offs between prompting strategies, how to work within token constraints on large codebases, and a library of tested templates you can use or adapt immediately.

Why Prompt Quality Determines Everything

Through prompt engineering, LLMs can effectively harness their knowledge of programming patterns to perform complex tasks that traditionally demand human expertise. But that potential is only realized when the prompt is doing real work.

An LLM asked “is this contract secure?” will produce a surface-level review because the task is underspecified. The model has no target vulnerability class, no output format requirements, no severity rubric, and no constraint against hedging with “this code may potentially have issues.” The result looks like an audit but functions as noise.

By carefully defining agent roles and embedding explicit reasoning steps within structured prompts for large language models, the proposed method exploits the inherent reasoning capabilities of LLMs to identify security flaws in smart contracts without extensive model retraining.

Every prompt you send to an LLM for security analysis should answer four questions:

Who is the model? (Role and expertise framing)
What is the target? (Specific vulnerability class or analysis type)
What is the context? (Contract purpose, Solidity version, dependencies)
What is the output shape? (Structured format with severity, location, impact, recommendation)

Failing to answer any of these reliably degrades output quality.

Prompting Strategies: Zero-Shot, Few-Shot, and Chain-of-Thought

The three primary prompting strategies differ in how much reasoning scaffolding you provide upfront. Each has a distinct fit for security analysis tasks.

Zero-Shot Prompting

Zero-shot prompting involves asking the model to detect vulnerabilities without providing any prior examples. It relies entirely on what the model has internalized from training.

Zero-shot works adequately for well-known vulnerability patterns — classic reentrancy, tx.origin misuse, or unprotected selfdestruct — where the model has extensive training signal. It degrades for novel attack surfaces, protocol-specific logic bugs, or cross-contract interaction vulnerabilities that require understanding the broader system.

Zero-shot prompt example:

System: You are a senior smart contract security auditor specializing in Solidity.
Your task is to identify reentrancy vulnerabilities only.
Report findings in this exact format:
- LOCATION: [function name and line range]
- VULNERABILITY: [specific description]
- IMPACT: [what an attacker can achieve]
- SEVERITY: [Critical / High / Medium / Low]
- RECOMMENDATION: [concrete fix]
If no reentrancy vulnerability exists, respond: "No reentrancy vulnerabilities detected."

User: Analyze the following Solidity contract for reentrancy vulnerabilities:

[CONTRACT CODE]

Few-Shot Prompting

Few-shot prompting involves supplying the model with a few examples of vulnerable code snippets and their explanations to improve its understanding of the task.

For security analysis, few-shot examples serve two purposes: they calibrate the model’s sensitivity threshold (what counts as a real finding vs. a non-issue), and they establish the output format by demonstration rather than instruction. Format adherence is significantly more reliable when the model has seen examples of the exact structure you want.

The cost is tokens. Each example consumes context budget that could be occupied by contract code. In practice, two to three examples at 200–400 tokens each is a reasonable trade-off for most analysis tasks.

Few-shot prompt structure:

System: You are a senior smart contract security auditor. Analyze contracts
for access control vulnerabilities using the format shown in the examples below.

--- EXAMPLE 1 ---
Contract:
function setFeeRecipient(address _recipient) external {
    feeRecipient = _recipient;
}

Finding:
- LOCATION: setFeeRecipient()
- VULNERABILITY: Missing access control. Any caller can redirect protocol fees.
- IMPACT: Attacker can point fee accumulation to an address they control.
- SEVERITY: High
- RECOMMENDATION: Add onlyOwner or onlyRole(ADMIN_ROLE) modifier.

--- EXAMPLE 2 ---
Contract:
function initialize(address _owner) external {
    owner = _owner;
}

Finding:
- LOCATION: initialize()
- VULNERABILITY: Unguarded initializer. Can be called by anyone after deployment.
- IMPACT: Attacker can front-run deployment transaction to claim ownership.
- SEVERITY: Critical
- RECOMMENDATION: Add initializer modifier from OpenZeppelin or a one-time-call guard.

--- TARGET CONTRACT ---
[CONTRACT CODE]

Chain-of-Thought Prompting

The chain-of-thought technique operates through two core mechanisms: zero-shot cues and few-shot exemplars. Zero-shot cues are simple phrases like “Let’s think step by step” that nudge models to generate intermediate reasoning without providing examples.

Chain-of-thought transforms opaque predictions into auditable reasoning chains. When a model explains how it arrived at an answer, developers can verify logic, catch errors, and provide users with confidence in AI-generated recommendations.

For smart contract security, CoT is most valuable for complex multi-step vulnerability classes: cross-function reentrancy, price oracle manipulation, or flash loan attack vectors that require the model to trace execution paths across multiple calls. Forcing the model to reason step by step reduces the probability of plausible-sounding but incorrect findings.

To enhance the model’s performance for each vulnerability type, researchers employ three prompt engineering techniques within the zero-shot setting: baseline prompt, Chain-of-Thought (CoT), and Plan-and-Solve.

Chain-of-thought prompt example for arithmetic analysis:

System: You are a senior smart contract security auditor. When analyzing for
arithmetic vulnerabilities, follow this reasoning chain before producing findings:

Step 1 — Identify all arithmetic operations (addition, subtraction, multiplication,
         division, exponentiation, bit shifts).
Step 2 — For each operation, determine the Solidity version and whether overflow/
         underflow protection is active.
Step 3 — Identify any division operations and trace whether the divisor can be zero
         or very small, causing precision loss or reversion.
Step 4 — Trace whether arithmetic results flow into state variables controlling
         balances, shares, or access.
Step 5 — For each unsafe operation found, produce a finding in the required format.

Show your reasoning for each step before producing the final findings list.

User: Analyze this contract for arithmetic vulnerabilities.
[CONTRACT CODE]

The explicit reasoning chain produces findings that are easier to verify and more likely to be correct. The trade-off is output length — CoT responses are longer, which matters if you are processing many contracts programmatically.

Providing Contract Context Within Token Limits

Context management is the engineering problem beneath the prompting problem. GPT models may encounter detection failures in long smart contract code due to inherent token limitations. Dumping an entire codebase into a prompt and asking for a full audit is not an effective strategy.

An appropriate prompt design approach that takes context phrasing and length into account is necessary to assess LLMs for detecting vulnerabilities. A novel context-driven prompting technique for smart contract auditing aims to provide appropriate wording and enough code context based on direct code dependencies to uncover vulnerabilities.

The Lost-in-the-Middle Problem

Bigger context is not always better — cost, latency, and the “lost in the middle” problem all argue for using the minimum context needed. When critical code is buried in the middle of a large prompt, the model’s attention is weaker there than at the beginning or end. Structure your prompts so the most security-sensitive code appears near the top, immediately after the system prompt.

Function-Level Chunking

For large contracts, split analysis by function or logical module rather than sending the entire file. To provide sufficient code context while avoiding including the entire code within the prompt, a chunking approach aims to provide efficient prompting techniques that reduce prompt sizes, especially regarding complex codebases, while providing the model with enough data to analyze targeted functions effectively.

Chunking strategy:

For a 1,200-line DeFi protocol:

Pass 1: Analyze withdraw() and related internal functions (lines 45–180)
         for reentrancy. Include state variable declarations at top of prompt.

Pass 2: Analyze role assignment functions and modifiers (lines 200–310)
         for access control issues.

Pass 3: Analyze price calculation and share-minting functions (lines 400–560)
         for arithmetic and oracle issues.

Final pass: Provide a summary prompt listing all findings from previous passes
            and ask: "Are there any cross-function interactions between these
            findings that compound the risk?"

What Context to Include

Not all contract code is equally relevant to a given vulnerability class. Before sending code to the model, provide:

State variable declarations — Always include. Storage layout is essential for understanding reentrancy and access control.
Modifier definitions — Always include when analyzing access control. A modifier defined elsewhere is invisible without it.
Interface imports — Include when the vulnerability class involves external calls.
Inline comments describing protocol intent — Include selectively. They help the model understand what the code should do, enabling it to identify when behavior diverges from intent.
pragma solidity version — Always include. It determines whether built-in overflow protection is active and affects which patterns are vulnerabilities versus non-issues.

Context header template:

--- CONTRACT CONTEXT ---
Protocol: Lending protocol allowing users to deposit ERC-20 collateral and borrow ETH.
Solidity Version: 0.8.19
External Dependencies: OpenZeppelin ReentrancyGuard, Chainlink AggregatorV3Interface
Key State Variables:
  mapping(address => uint256) public balances;
  mapping(address => uint256) public debt;
  address public priceOracle;
  bool private _locked; // reentrancy flag

Audit Focus: Analyze ONLY the withdraw() function and any functions it calls internally.
--- CONTRACT CODE BELOW ---

In the context of smart contract auditing, a RAG-LLM system can retrieve examples of known vulnerabilities from a vector store, enhancing its ability to identify similar issues in new contracts. For teams running recurring audits, embedding historical vulnerability patterns in a retrieval layer and injecting only the most relevant examples saves significant tokens while improving detection accuracy.

System Prompts: Constraining Output to Actionable Findings

The system prompt is your most powerful tool for controlling output quality. Its job is to prevent the three failure modes that make LLM security analysis useless:

Hedging — “This code may contain a vulnerability…”
Catalog enumeration — Listing every possible vulnerability class with no evidence of any being present
Generic recommendations — “Consider adding access controls” without specifying which function or what guard

A well-constructed system prompt enforces an output contract.

Production system prompt for vulnerability detection:

You are a smart contract security auditor with expertise in Solidity and EVM execution.

RULES:
1. Report only vulnerabilities you have identified with specific evidence in the code.
   Do not report theoretical risks without concrete code evidence.
2. Never use hedging language ("may", "might", "could potentially").
   If you are not certain, do not include the finding.
3. Every finding must include: the exact function name and line number,
   the specific code path that creates the vulnerability, and a concrete fix.
4. Severity definitions:
   - Critical: Attacker can drain funds or gain unauthorized admin access
   - High: Attacker can cause significant financial loss or permanent DoS
   - Medium: Attacker can cause limited financial loss or temporary DoS
   - Low: Code quality issue or defense-in-depth improvement
   - Informational: Style or gas issue, no security impact
5. If you find no vulnerabilities, say: "No [VULNERABILITY_CLASS] vulnerabilities detected."
   Do not produce placeholder findings.
6. Output format is strict JSON:
   {
     "findings": [
       {
         "id": "F-01",
         "title": "",
         "location": "",
         "severity": "",
         "description": "",
         "impact": "",
         "recommendation": "",
         "code_reference": ""
       }
     ]
   }

The JSON output format is particularly valuable when integrating LLM analysis into automated pipelines. It eliminates parsing ambiguity and makes findings programmatically processable.

When defining prompts as scenarios that fulfill specific roles and needs, providing clear role definitions, prior knowledge, and response formats in prompts significantly improves output quality.

Vulnerability-Class-Specific Prompts

Different vulnerability classes require different analysis frames. Here are prompts calibrated for the three highest-impact classes.

Reentrancy

A reentrancy attack is the exploitation of a vulnerability in smart contracts where a function is repeatedly called before its previous execution completes, potentially allowing an attacker to manipulate the contract’s state. Modern reentrancy is more complex than the classic single-function pattern.

Cross-function reentrancy happens when two functions share state but only one has a guard. Cross-contract reentrancy occurs when Protocol A calls Protocol B, which calls back into Protocol A before the first call resolves. Read-only reentrancy through view functions that read stale state mid-callback is increasingly common in DeFi protocols that integrate with lending markets or AMMs.

Your prompt must cover all three patterns:

System: [Standard security auditor system prompt with JSON output]

User: Analyze the following Solidity contract for reentrancy vulnerabilities.
Check for all three patterns:

1. SINGLE-FUNCTION REENTRANCY: External call made before state update in the same function.
   Look for: call/transfer/send followed by storage writes, or storage writes that occur
   after external calls to user-controlled addresses.

2. CROSS-FUNCTION REENTRANCY: Two or more functions share a state variable (e.g., balances),
   only one has a reentrancy guard, and the unguarded function can be called mid-execution.

3. READ-ONLY REENTRANCY: A view function reads a state variable that could be in an
   inconsistent state during a reentrant callback. Flag if any external protocol could
   call a view function on this contract during a callback to obtain stale values.

For each finding, trace the exact execution path from external call to exploitable
state inconsistency. Show the call graph.

[CONTRACT CODE]

Access Control

Critical, admin-only functions such as setOwner and withdraw can have weak or missing checks, allowing any user to call them. The fix is to implement robust ownership and role-based controls using patterns like Ownable, and to correctly use function visibility modifiers.

System: [Standard security auditor system prompt with JSON output]

User: Audit this contract for access control vulnerabilities. Check each of
the following systematically:

1. MISSING MODIFIERS: Any external or public function that writes to privileged state
   (owner, fees, treasury address, paused state, upgrade logic) without an ownership
   or role check.

2. INCORRECT AUTHENTICATION: Use of tx.origin instead of msg.sender for authorization.
   Flag every occurrence.

3. UNPROTECTED INITIALIZERS: Any initialize() or init() function callable more than once
   or callable by any address after deployment.

4. ROLE ESCALATION: Any function that can grant roles to arbitrary addresses without
   requiring the caller to hold a higher privilege level.

5. VISIBILITY ISSUES: Functions declared public that should be internal or private,
   where the unintended exposure creates a security risk.

For each finding, state exactly what the attacker can do if they exploit it.

[CONTRACT CODE]

Arithmetic

Solidity 0.8.x and later has built-in overflow/underflow checks that automatically revert on errors. This does not eliminate arithmetic vulnerabilities — it shifts them to precision loss, division-by-zero, and rounding manipulation patterns.

System: [Standard security auditor system prompt with JSON output]

User: Analyze this contract for arithmetic vulnerabilities. The Solidity version is
[VERSION]. Follow this checklist:

1. OVERFLOW/UNDERFLOW: If Solidity < 0.8.0 and SafeMath is not used, flag every
   arithmetic operation on user-controlled values. If >= 0.8.0, skip this check.

2. PRECISION LOSS FROM DIVISION: Find all division operations. Flag cases where:
   - Integer division truncates a result that is then used to calculate a user's
     token balance or share entitlement
   - Division result is multiplied after truncation (multiply before divide pattern
     not followed)
   - The divisor can be zero or very close to zero

3. ROUNDING DIRECTION: In vault/share calculations, identify whether rounding
   consistently favors the protocol or the user. Flag cases where rounding favors
   the user in a way that enables profit extraction through dust attacks.

4. CASTING: Flag any unsafe downcasts from uint256 to smaller types that could
   silently truncate values on Solidity < 0.8.0.

[CONTRACT CODE]

Prompting for Differential Analysis Between Contract Versions

Differential analysis — comparing a patched contract against a previous version — is one of the highest-value LLM security tasks. The model can identify not only whether a patch correctly addresses the original vulnerability, but also whether it introduces new issues.

Differential analysis prompt:

System: You are a smart contract security auditor reviewing a patch.
Your task has two parts:
PART 1 — Patch Effectiveness: Verify whether the changes in VERSION B correctly
          remediate the vulnerability present in VERSION A.
PART 2 — Regression Analysis: Identify any new vulnerabilities or unintended
          behavioral changes introduced by the patch.

Output format:
{
  "patch_effectiveness": {
    "target_vulnerability": "",
    "is_remediated": true/false,
    "remediation_analysis": "",
    "residual_risk": ""
  },
  "regressions": [
    {
      "id": "R-01",
      "title": "",
      "location": "",
      "severity": "",
      "description": "",
      "introduced_by_change": ""
    }
  ],
  "behavior_changes": [
    {
      "function": "",
      "old_behavior": "",
      "new_behavior": "",
      "security_relevance": ""
    }
  ]
}

User:
--- VERSION A (original) ---
[ORIGINAL CONTRACT CODE]

--- VERSION B (patched) ---
[PATCHED CONTRACT CODE]

Analyze the difference. The reported vulnerability in version A was:
[VULNERABILITY DESCRIPTION FROM ORIGINAL AUDIT]

Framing the differential task with the original finding description focuses the model on the correct risk surface rather than asking it to re-audit the entire codebase.

Common Prompt Anti-Patterns That Produce Useless Output

Understanding what not to do is as important as knowing what works. These are the most common failure patterns.

Anti-Pattern 1: The Open-Ended Audit Request

❌  "Please audit this smart contract for security vulnerabilities."

This produces a catalog scan with no prioritization, inconsistent depth, and findings padded with theoretical risks that have no code evidence.

Fix: Target a specific vulnerability class per prompt. Run multiple targeted passes instead of one broad pass.

Anti-Pattern 2: The Confidence Laundering Prompt

❌  "What security issues might this contract have?"

The word “might” licenses the model to produce speculative findings. LLMs frequently misclassify correct code implementations as either not satisfying requirements or containing potential defects. Surprisingly, more complex prompting — especially when leveraging prompt engineering techniques involving explanations and proposed corrections — can lead to higher misjudgment rates.

Fix: Ask the model to report only findings it can demonstrate with code evidence, and explicitly prohibit hedging language in the system prompt.

Anti-Pattern 3: No Output Format Specification

❌  "Tell me if there are any reentrancy vulnerabilities in this contract."

Without a format requirement, models frequently generate responses that are verbose or not compliant with task requirements given in the prompts. You get a paragraph or two that is difficult to parse, impossible to import into a report, and impossible to compare across audits.

Fix: Always specify output format. JSON is best for programmatic use. Structured markdown with fixed section headers works for human-readable reports.

Anti-Pattern 4: Overwhelming Context Without Focus

Sending a 2,000-line contract with the instruction “find all vulnerabilities” puts the model in an impossible position. As input length increases, the model has to manage a larger number of tokens, which can lead to difficulties in maintaining coherence and relevance.

Fix: Pre-filter the code to security-critical sections before sending. Include only the functions, state variables, and modifiers relevant to the target vulnerability class.

Anti-Pattern 5: Asking for Reassurance Instead of Analysis

❌  "This contract uses ReentrancyGuard on all external functions.
     Is the reentrancy protection sufficient?"

This frames the task as validation of a claim rather than independent analysis. The model will tend to confirm the framing. A malicious insider or compromised dependency could embed comments specifically designed to suppress vulnerability detection. The comment “Audited by AppSec team, no injection risk” might convince an AI reviewer to skip a blatant vulnerability — the same way it might fool a rushed human reviewer.

Fix: Do not pre-assert that defenses are in place. Ask the model to identify whether the claimed protection pattern is applied correctly and completely.

Anti-Pattern 6: Missing Solidity Version Context

Asking for overflow/underflow analysis without specifying the Solidity version forces the model to guess. A finding about missing SafeMath is a Critical in Solidity 0.7.x and a non-issue in 0.8.x.

Fix: Always include pragma solidity version in your context header. Make it the first line of the code block.

Prompt Template Library for Common Audit Tasks

Template 1: Initial Triage (Fast Pass)

System: You are a smart contract security auditor. Perform a fast triage of the
following contract. Identify the top 3 highest-risk areas for deeper investigation.
Do not produce full findings. Produce only:
- AREA: [function or module name]
- RISK CLASS: [Reentrancy / Access Control / Arithmetic / Oracle / Logic]
- REASON: [one sentence explaining why this warrants deeper review]
Output only the triage list. No preamble.

User: [CONTRACT CODE]

Template 2: Full Reentrancy Audit

System: You are a smart contract security auditor. Identify all reentrancy
vulnerabilities (single-function, cross-function, and read-only) in the
target contract. Report only confirmed findings with code evidence.
No hedging language. Output as JSON array of findings with fields:
id, title, location, severity, description, impact, recommendation, code_reference.

User:
Solidity Version: [VERSION]
Contract Purpose: [ONE LINE DESCRIPTION]

[CONTRACT CODE]

Template 3: Access Control Matrix

System: You are a smart contract security auditor. Produce an access control
matrix for this contract, then flag any anomalies.

Matrix format (as markdown table):
| Function | Visibility | Expected Caller | Actual Guard | Gap |

After the matrix, list any functions where Expected Caller ≠ Actual Guard
as security findings in JSON format.

User: [CONTRACT CODE]

Template 4: Oracle and Price Manipulation Analysis

System: You are a smart contract security auditor specializing in DeFi price
oracle vulnerabilities.

Check for:
1. Spot-price reads from AMM pools exploitable via flash loans
2. Missing staleness checks on external price feeds
3. Single-oracle dependency with no fallback or circuit breaker
4. Price values used as-is for liquidation or collateral calculation
   without TWAP or other manipulation resistance

For each finding, describe a concrete single-transaction attack that
exploits the issue. Output as JSON.

User:
External Oracle Addresses Used: [LIST]
[CONTRACT CODE]

Template 5: Upgrade Safety Check

System: You are a smart contract security auditor reviewing a proxy-upgradeable
contract for upgrade safety.

Check:
1. Storage layout conflicts between proxy and implementation
2. Uninitialized implementation contract exploitable via delegatecall
3. Constructor logic in implementation that does not run via proxy
4. Missing storage gaps in base contracts that could corrupt derived contract layout
5. Unrestricted upgrade function callable by non-admin roles

Flag each issue with the specific storage slot or function name affected.
Output as JSON.

User:
Proxy Pattern: [UUPS / Transparent / Beacon]
[CONTRACT CODE]

Template 6: Finding Severity Calibration

When you need to calibrate findings from an automated tool or previous analysis pass:

System: You are a senior smart contract security auditor performing severity
calibration. You will receive a list of potential findings. For each finding,
determine whether it is a true positive or false positive, and if true positive,
assign final severity using this rubric:
- Critical: Direct fund loss with no preconditions
- High: Fund loss requiring specific on-chain state
- Medium: Griefing, temporary DoS, or bounded fund loss
- Low: Defense-in-depth, no direct loss path
- Informational: Style issue

Output: JSON array with fields: id, verdict (true_positive/false_positive),
final_severity, rationale.

User:
[LIST OF UNVERIFIED FINDINGS]

[CONTRACT CODE FOR REFERENCE]

Integrating These Techniques: A Practical Audit Workflow

The highest-signal workflow combines all of the above:

Context preparation — Extract state variables, modifiers, and interfaces. Annotate the Solidity version and protocol purpose. Build a context header.
Fast triage pass — Send the full contract (or a stripped version with function signatures and natspec) through Template 1. Identify the top three risk areas.
Targeted deep passes — Run one vulnerability-class-specific prompt per risk area identified in triage. Use the appropriate template. Use CoT for complex logic; use few-shot for output format consistency.
Cross-function synthesis — After all targeted passes, send a final prompt: “Given these findings [paste JSON], identify any interactions between them that compound the overall risk. Are there multi-step attack paths combining more than one finding?”
Differential review — If reviewing a new version of an existing contract, run the differential analysis template against the previous version with the original findings list.
Severity calibration — Run all findings through Template 6 to reduce false positives before delivering.

The resulting structured context incorporated into a carefully engineered prompt assembly stage, further enhanced through few-shot chain-of-thought prompting, strengthens the LLM’s reasoning capabilities and produces structured and interpretable vulnerability reports.

Limitations and Honest Expectations

LLM-based security analysis is not a replacement for manual review or formal verification. Chaining instruments effectively validates true positives, significantly reducing the manual verification burden, though inconsistency and the cost of LLMs remain potential limitations.

Every finding produced by an LLM should be verified by a human before it enters an audit report. The LLM’s contribution is coverage and speed — it can scan many vulnerability classes quickly, surface patterns that warrant manual investigation, and produce structured findings that a human auditor verifies and refines. It does not replace the adversarial mindset of an experienced auditor tracing execution paths by hand.

What prompt engineering does is close the gap between what the model is capable of and what your prompts actually elicit. The templates and patterns in this article represent the floor, not the ceiling. Every codebase has its own logic and its own failure modes. The best audit prompts are always the ones you write specifically for the protocol in front of you — using the structure above as a starting point.