Fine-Tuned Models vs Zero-Shot: What the Benchmarks Actually Show — Darkwave Log

The leaderboard narrative is seductive: a fine-tuned model trained specifically on smart contract vulnerabilities outperforms a general-purpose zero-shot LLM, therefore fine-tuning is the right strategy for production security tooling. The evidence is more complicated than that. Understanding why fine-tuned models win on benchmarks — and where those wins evaporate — requires pulling apart what fine-tuning actually does, what the benchmarks actually measure, and what both approaches fundamentally cannot do.

What Fine-Tuning Means in a Security Context

Fine-tuning is the process of adjusting a pre-trained model to specialize its capabilities for a specific task, involving further training on a narrower dataset for a specific domain or application. In the security context, this means starting from a general-purpose code model — something like CodeLlama, Llama 3, or Qwen — and continuing to train it on a labeled corpus of smart contract vulnerabilities.

The process encodes vulnerability knowledge directly into the model’s weights. After training, the model does not need to reason from first principles about what a reentrancy attack looks like; it has been shaped to pattern-match against thousands of prior examples. Fine-tuning locks knowledge into the model, which boosts accuracy but reduces adaptability compared to dynamic approaches.

There are two dominant fine-tuning approaches in the literature. Full-Parameter Fine-Tuning (FFT) updates every weight in the network, producing the strongest specialization but requiring the most compute and the largest training dataset. Parameter-Efficient Fine-Tuning (PEFT), most commonly via Low-Rank Adaptation (LoRA) or its quantized variant QLoRA, adds a small number of trainable low-rank adapter matrices while leaving the base model frozen. A 7B model that needs 100–120GB of VRAM for full fine-tuning can be fine-tuned on a single 24GB RTX 4090 with LoRA.
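The memory gap follows from simple bookkeeping. The sketch below is a back-of-the-envelope estimate, not a measurement. It assumes mixed-precision training with Adam at roughly 16 bytes per trainable parameter (half-precision weights and gradients plus full-precision master weights and two optimizer moments) and ignores activations. The layer shapes used for the LoRA count (hidden size 4096, 32 layers, rank 8, two adapted projections per layer) are illustrative assumptions, not figures from any cited system.

```python
def full_finetune_gb(n_params: float, bytes_per_param: int = 16) -> float:
    """Weights + gradients + Adam state in GB; activations excluded.

    16 bytes/param is an assumed rule of thumb for mixed-precision Adam.
    """
    return n_params * bytes_per_param / 1e9


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA factorizes the weight update as B (d_out x r) times A (r x d_in)."""
    return rank * (d_in + d_out)


n_params = 7e9
print(full_finetune_gb(n_params))  # 112.0, inside the 100-120GB range quoted above

# Hypothetical adapter layout: rank-8 adapters on two square projections per layer.
hidden, layers, rank, adapted_per_layer = 4096, 32, 8, 2
trainable = layers * adapted_per_layer * lora_trainable_params(hidden, hidden, rank)
print(trainable)                        # 4194304 trainable parameters
print(f"{trainable / n_params:.4%}")    # share of the 7B base model that is trained
```

Under these assumptions the adapters amount to a few million trainable parameters, so the optimizer state that dominates full fine-tuning nearly disappears; what remains is mostly the frozen base weights and activations.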

In reported comparisons, FFT significantly outperforms LoRA in both precision and recall, because freezing the base parameters limits how much the model can learn for complex vulnerability detection tasks. This tradeoff (cheaper to train but weaker on hard cases) shows up repeatedly in the benchmark literature and is worth keeping in mind when evaluating headline numbers from PEFT-based systems.

The Training Datasets: What They Contain and What They Miss

Fine-tuned models are only as good as the data they were trained on. The datasets used for smart contract security fine-tuning share a common structure: labeled Solidity files annotated with vulnerability types, usually drawn from one or more of the canonical classification registries.

Currently there are two popular taxonomies in the Ethereum community: the Decentralized Application Security Project (DASP) Top 10 and the Smart Contract Weakness Classification (SWC) registry. The DASP Top 10 lists 10 types of vulnerabilities and is used by the SmartBugs framework for analyzing Ethereum smart contracts. The SWC registry extends this to 37 weakness categories. Most training sets are structured around these classifications.

The vulnerability types that appear most frequently in training corpora are also the best understood: reentrancy, integer overflow/underflow, timestamp dependency, access control failures, and dangerous delegatecall. Fine-tuning an LLM specifically for smart contract code classification should therefore improve detection of these well-known classes.

Several datasets have been assembled explicitly for this purpose. The EVuLLM dataset was designed to address the scarcity of diverse evaluation resources. One approach collected over 26,727 labeled smart contract vulnerabilities to fine-tune a 13B parameter Llama-2 model. Another introduced a comprehensive dataset of 215 real-world DApp projects (4,998 contracts), including hard-to-detect logical errors like token price manipulation, explicitly addressing the limitations of existing simplified benchmarks.

But dataset quality has a deeper problem than scale. The majority of contracts in many existing datasets are toy contracts with significantly fewer lines of code than real-world DApp projects. Achieving good results on a dataset predominantly composed of toy contracts cannot guarantee the tool’s effectiveness for real-world DApps.

Labeling itself is contested. Creating a reasonably sized dataset manually is time-consuming and prone to human error. Tagging data with analysis tools has its own issues: different tools name the same vulnerabilities differently, and there is no public registry for mapping tool tags to widely recognized vulnerability labels.

The coverage gap is especially pronounced for real-world audit findings. Because labeling capacity is limited, most datasets only label code with SWC weaknesses, yet 82.3% of the weaknesses found in real audit reports are not included in the SWC Registry at all. A fine-tuned model trained exclusively on SWC-labeled data has therefore never seen the vast majority of issues that professional auditors actually flag.

The SWC registry mainly includes classical and general weaknesses in smart contracts, such as Reentrancy and Integer Overflow. The majority of weaknesses reported in audits, however, are either non-security issues (such as code optimization suggestions) or non-classical issues (like functional bugs specific to each DApp), which are hard to classify with general rules.

The manual process of dataset construction is labor-intensive and error-prone, which limits the scale, quality, and ability to keep up with the rapidly changing vulnerability landscape. One recent smart contract vulnerability dataset construction required 44 person-months of manual effort to compile 1,618 vulnerabilities across 682 DApps.

What the Benchmarks Actually Measure

Before accepting any performance claim, it is worth asking: what is the benchmark testing, exactly?

Most smart contract security benchmarks are structured as binary or multi-class classification tasks over curated, labeled corpora. A common structure evaluates LLMs on Solidity smart contract analysis using a balanced dataset of contracts under two tasks: (i) Error Detection, where the model performs binary classification to decide whether a contract is vulnerable, and (ii) Error Classification, where the model must assign the predicted issue to a specific vulnerability category.

The standard evaluation metrics (precision, recall, F1-score, and accuracy) are computed against these labels. They are internally consistent and reproducible, but they measure performance on the benchmark distribution, not on the distribution of vulnerabilities that appear in production audits.
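To make the distribution point concrete, here is a minimal sketch of how these metrics fall out of confusion counts. All counts are hypothetical: the same detector (90% true positive rate, 10% false positive rate) is scored once on a balanced benchmark and once on an assumed production mix where only 1% of contracts are vulnerable.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # vulnerable contracts flagged as vulnerable
    fp: int  # safe contracts flagged as vulnerable
    fn: int  # vulnerable contracts missed
    tn: int  # safe contracts correctly passed


def metrics(c: Confusion) -> dict:
    """Standard classification metrics, guarding against empty denominators."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical balanced benchmark: 100 vulnerable and 100 safe contracts.
benchmark = Confusion(tp=90, fp=10, fn=10, tn=90)

# Same detector, hypothetical production mix: 100 vulnerable and 9,900 safe contracts.
production = Confusion(tp=90, fp=990, fn=10, tn=8910)

for name, counts in [("benchmark", benchmark), ("production", production)]:
    print(name, {k: round(v, 3) for k, v in metrics(counts).items()})
```

In this illustration accuracy and recall stay at 0.9 in both settings, while precision falls from 0.9 to roughly 0.083 once safe contracts dominate, which is one way a strong benchmark score can coexist with a noisy production tool.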

Three structural limitations recur across essentially all existing benchmarks.

Label source contamination. When training labels are generated by the same static analysis tools used to build the benchmark, a fine-tuned model is learning to replicate the behavior of Slither or Mythril rather than developing independent security judgment. Empirical evidence shows that existing tools like Slither and Mythril achieve 0% recall on diverse bad randomness patterns — meaning a model trained on their labels inherits their blind spots.

Pattern presence versus exploitability. A key limitation of existing detection tools is their inability to distinguish between pattern presence and actual exploitability. For example, a tool may flag a contract as vulnerable if it uses block.timestamp in randomness generation, then flag it as protected if an onlyOwner modifier exists — without verifying whether that modifier actually guards the relevant function.
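The failure mode is easy to reproduce. The sketch below is a deliberately flawed toy heuristic, not the logic of any real tool: `naive_flag` checks keywords across the whole contract, while `function_is_guarded` (a hypothetical helper based on a simple regular expression, not a real Solidity parser) checks whether the modifier actually appears on the function in question. The contract is an invented example.

```python
import re


def naive_flag(source: str) -> str:
    """Deliberately flawed heuristic: contract-wide keyword checks only."""
    uses_timestamp = "block.timestamp" in source
    has_guard = "onlyOwner" in source
    if uses_timestamp and has_guard:
        return "protected"
    if uses_timestamp:
        return "vulnerable"
    return "clean"


def function_is_guarded(source: str, fn_name: str) -> bool:
    """Check whether onlyOwner appears in the named function's own header."""
    pattern = r"function\s+" + re.escape(fn_name) + r"\s*\([^)]*\)([^{]*)\{"
    header = re.search(pattern, source)
    return bool(header) and "onlyOwner" in header.group(1)


CONTRACT = """
contract Lottery {
    address owner;
    modifier onlyOwner() { require(msg.sender == owner); _; }
    function setFee(uint fee) external onlyOwner { }
    function draw() external returns (uint) {
        return uint(keccak256(abi.encodePacked(block.timestamp))) % 100;
    }
}
"""

print(naive_flag(CONTRACT))                     # protected
print(function_is_guarded(CONTRACT, "setFee"))  # True
print(function_is_guarded(CONTRACT, "draw"))    # False
```

The modifier exists and guards `setFee`, but the function that consumes `block.timestamp` is unguarded, so the contract-level verdict of "protected" is wrong. A benchmark label produced this way would teach a model the same mistake.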

Outdated taxonomy scope. The DASP taxonomy has version limitations: since DASP was last updated, vulnerabilities such as Short Addresses have been addressed by Solidity 0.5.0 and later versions of the compiler. Despite the SWC Registry not being updated with new entries since 2020, the sustained development of smart contract analysis tools for detecting SWC-listed weaknesses highlights their ongoing significance in the field — even as the ecosystem has moved on.

The practical consequence is that a model achieving 94% accuracy on a benchmark built from 2020-era labeled contracts may be solving a problem that is only partially representative of the contracts being audited today.

Where Fine-Tuned Models Actually Win

Within the constraints described above, fine-tuned models do demonstrate genuine advantages in specific, well-scoped scenarios.

Known vulnerability pattern detection. The clearest win for fine-tuning is on the same categories of vulnerabilities that appear in training data. Fine-tuned LLMs reach over 90% accuracy there, surpassing other models, which suggests that LLMs can capture subtle code patterns that traditional ML models miss.

Hard-to-detect logical errors in DeFi. Fine-tuned LLMs exhibit outstanding performance in detecting price manipulation vulnerabilities, indicating that LLMs can be effectively applied to scenarios involving machine-unauditable contract vulnerabilities. This suggests that fine-tuned LLMs possess strong adaptability and generalization capabilities, enabling them to address complex and challenging-to-audit smart contract security issues.

Outperforming zero-shot baselines on structured tasks. Zero-shot prompting with ChatGPT-3.5 achieved an accuracy of 39.1%, recall of 38.7%, precision of 37.5%, and F1 score of 38.1%. ChatGPT-4 demonstrated slightly better precision at 42.3% and F1 score of 39.3%, despite having lower recall at 36.7%. Against these baselines, fine-tuned models show substantial improvements.

Smaller model footprint for equivalent task performance. A fine-tuned model based on CodeGemma achieved 94.78% accuracy on the EVuLLM benchmark and 92.52% on TrustLLM, surpassing GPT-4o despite using far fewer parameters. This matters operationally: a smaller, specialized model can deliver competitive benchmark performance at a fraction of the inference cost of a frontier general-purpose model.

Where Fine-Tuned Models Do Not Win

The performance inversion is sharp and consistent once the evaluation moves outside the training distribution.

Novel attack paths. Fine-tuned models are pattern-matchers. In the context of developing secure smart contracts, a detector that relies on variable names rather than execution logic is not reliable. This reliance makes an LLM a “stochastic auditor”: it can be effective when the code matches its training distribution, but it may fail on novel or sanitized logic that lacks descriptive identifiers.
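One way to probe for this is an identifier-sanitization check: rename the descriptive identifiers, leave the execution logic untouched, and see whether the verdict changes. The sketch below assumes a hypothetical `name_based_detector` standing in for a model that keys on names; a real check would call the actual model instead. The Solidity snippet is an invented reentrancy-style example.

```python
import re


def sanitize_identifiers(source: str, names: list[str]) -> str:
    """Replace each listed identifier with a neutral placeholder (v0, v1, ...)."""
    for i, name in enumerate(names):
        source = re.sub(r"\b" + re.escape(name) + r"\b", f"v{i}", source)
    return source


def name_based_detector(source: str) -> bool:
    """Hypothetical stand-in for a model that relies on identifier names."""
    lowered = source.lower()
    return "withdraw" in lowered and "balances" in lowered


SNIPPET = """
function withdraw(uint amount) public {
    (bool ok, ) = msg.sender.call{value: amount}("");
    require(ok);
    balances[msg.sender] -= amount;
}
"""

# Same control flow (external call before the state update), different names.
renamed = sanitize_identifiers(SNIPPET, ["withdraw", "balances", "amount", "ok"])

print(name_based_detector(SNIPPET))  # True
print(name_based_detector(renamed))  # False
```

A detector whose verdict flips under this transformation is reading names, not logic, which is exactly the behavior the "stochastic auditor" label describes.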

Solidity version drift. Recall for some specific vulnerabilities drops to just 13% on Solidity v0.8, compared with earlier versions such as v0.4. The decline is attributed to how LLMs depend on recognizing changes in newly introduced libraries and frameworks during detection. A model trained predominantly on Solidity 0.4–0.6 contracts will systematically underperform on modern codebases.
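Version drift only becomes visible when results are broken out by compiler version rather than averaged. A minimal sketch of that stratification follows, assuming evaluation samples arrive as hypothetical `(source, is_vulnerable, was_flagged)` tuples and that the first `pragma solidity` line is a good enough proxy for the version. The sample data is invented.

```python
import re
from collections import defaultdict

# Captures major and minor version after any range operator such as ^ or >=.
PRAGMA = re.compile(r"pragma\s+solidity\s+[^0-9]*(\d+)\.(\d+)")


def solidity_minor(source: str) -> str:
    """Return 'major.minor' from the first pragma line, or 'unknown'."""
    m = PRAGMA.search(source)
    return f"{m.group(1)}.{m.group(2)}" if m else "unknown"


def recall_by_version(samples) -> dict:
    """samples: iterable of (source, is_vulnerable, was_flagged) tuples."""
    hits, positives = defaultdict(int), defaultdict(int)
    for source, is_vulnerable, was_flagged in samples:
        if not is_vulnerable:
            continue  # recall only considers truly vulnerable samples
        version = solidity_minor(source)
        positives[version] += 1
        hits[version] += int(was_flagged)
    return {v: hits[v] / positives[v] for v in positives}


samples = [
    ("pragma solidity ^0.4.24; contract A {}", True, True),
    ("pragma solidity ^0.4.24; contract B {}", True, True),
    ("pragma solidity >=0.8.0 <0.9.0; contract C {}", True, False),
    ("pragma solidity 0.8.19; contract D {}", True, True),
    ("pragma solidity ^0.8.0; contract E {}", False, False),
]

print(recall_by_version(samples))  # {'0.4': 1.0, '0.8': 0.5}
```

Reporting a single pooled recall for this toy set would give 0.75 and hide the fact that the newer contracts are where the misses occur.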

Business logic vulnerabilities. This is where both fine-tuned and zero-shot models struggle most severely. Business logic flaws are protocol-specific: a vulnerability in a lending protocol’s liquidation logic is not a syntactic pattern but a semantic deviation from the intended invariants of that particular system. Existing static analysis techniques struggle to capture such high-level logic, while recent LLM-based approaches often suffer from unstable outputs and low accuracy due to hallucination and limited verification.

False positive rates at project scale. The paramount challenge is that LLMs can produce a large number of false positives. A measurement study on project-level vulnerability detection demonstrates that GPT-4 can only identify 32 out of 73 vulnerabilities on 52 DeFi attacks but produces 740 false positive cases, leading to an extremely low precision of 4.15%. A similar conclusion can be drawn from Claude, which achieves a precision of 4.3%. Fine-tuned models reduce but do not eliminate this problem. Higher detection capabilities often result in a higher false positive rate.
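The GPT-4 figures above can be checked directly, and the same arithmetic shows what they mean for a human reviewer. The sketch uses only the three numbers quoted in the text; the "alerts per true finding" ratio is a derived illustration, not a statistic reported by the study.

```python
detected = 32          # true vulnerabilities identified
total_vulns = 73       # vulnerabilities present across the 52 attacks
false_positives = 740  # reported issues that were not real

precision = detected / (detected + false_positives)
recall = detected / total_vulns
alerts_per_true_finding = (detected + false_positives) / detected

print(f"precision = {precision:.2%}")  # precision = 4.15%
print(f"recall    = {recall:.2%}")     # recall    = 43.84%
print(f"alerts reviewed per true finding = {alerts_per_true_finding:.1f}")  # 24.1
```

At that ratio an auditor works through roughly 24 reports to reach one real issue, while more than half of the real issues are never reported at all.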

Precision on unseen real-world functions. Even for rigorously evaluated fine-tuned models, results on out-of-distribution data are sobering. An evaluation on 1,000 unseen functions shows precision of only 31–36% in predicting vulnerability categories. The fine-tuned LLM demonstrates potential as an auxiliary tool to identify vulnerable code and assist auditors, with the implication that unassisted deployment in a production pipeline is premature.

The Cost and Maintenance Burden of Fine-Tuned Models

A fine-tuned model does not ship once and run forever. It accumulates three distinct cost categories that are routinely underestimated in benchmark papers but are central to any production deployment decision.

Initial training costs. Even a relatively short fine-tuning run can cost several hundred to a few thousand dollars in compute alone. Full-parameter fine-tuning of larger models escalates sharply: fine-tuning requires GPUs with large amounts of memory such as NVIDIA A100 or H100, and a single A100 GPU can cost over $15 per hour on-demand from cloud providers like AWS or GCP.

Ongoing inference costs. Once a model is fine-tuned, it must be deployed on a server so the application can use it — and this is the largest ongoing expense. An always-on endpoint with a GPU capable of running the model 24/7 represents a significant monthly cost, and the cost of running LLMs for inference often dwarfs the initial fine-tuning cost over time. Running a 7B parameter model on a self-administered bare-metal server costs around $953/month. Scaling to 70B models can raise that cost to over $3,200/month.

Personnel and iteration costs. Teams need skilled ML engineers or data scientists who manage the entire lifecycle: setting up the infrastructure, preparing the data, running experiments, deploying the model, and monitoring its performance. Their salaries are a major part of the total cost. Most enterprises are still transitioning from LLM experimentation to production, discovering that fine-tuning costs can spiral quickly. Without deliberate optimization, GPU compute, data preparation, and iteration cycles compound into budgets that exceed initial projections by 2–5x.

The contrast with API-based zero-shot is stark. A zero-shot approach using a commercial API incurs only per-token inference costs, requires no infrastructure ownership, automatically benefits from the provider’s model improvements, and can be updated simply by modifying the prompt. The tradeoff is vendor dependency, data privacy considerations when sending contract code to external APIs, and the precision ceiling described in the section above.
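A rough break-even calculation helps frame that contrast. The sketch below uses the two monthly hosting figures quoted above; the API price and the tokens-per-contract figure are placeholder assumptions chosen for illustration and should be replaced with real quotes and measurements. It also ignores training runs, personnel, and throughput limits on the self-hosted side.

```python
def break_even_tokens(monthly_fixed_usd: float, api_usd_per_million_tokens: float) -> float:
    """Monthly token volume at which API spend equals the fixed hosting cost."""
    return monthly_fixed_usd / api_usd_per_million_tokens * 1_000_000


SELF_HOSTED_USD = {"7B": 953.0, "70B": 3200.0}  # monthly figures quoted in the text
API_USD_PER_MILLION = 5.0                       # hypothetical blended token price
TOKENS_PER_CONTRACT = 4_000                     # hypothetical prompt + response size

for size, monthly_cost in SELF_HOSTED_USD.items():
    tokens = break_even_tokens(monthly_cost, API_USD_PER_MILLION)
    contracts = tokens / TOKENS_PER_CONTRACT
    print(f"{size}: {tokens / 1e6:.1f}M tokens/month, about {contracts:,.0f} contracts/month")
```

Under these placeholder numbers a team would need to analyze tens of thousands of contracts a month before hosting alone pays for itself, which is why volume, not benchmark accuracy, often decides the question.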

Unlike pre-training, fine-tuning can be conducted in a resource-constrained environment, typically on one or a few GPUs. That makes it a compelling option for applications such as specialized question answering within enterprises, legal document analysis, and technical support, but only when the organization has the engineering capacity to sustain it.

Retrieval Augmentation as an Alternative to Fine-Tuning

Retrieval-Augmented Generation (RAG) occupies a conceptually different position from fine-tuning. Rather than baking domain knowledge into model weights, RAG keeps a base model frozen and provides relevant context at inference time by retrieving from an external knowledge store.

A RAG pipeline searches for information relevant to the user’s prompt, integrates what it retrieves into an enhanced prompt, and then invokes the LLM via its API to deliver a more accurate and relevant response.

In the smart contract security context, this typically means maintaining a vector database of vulnerability patterns, past audit reports, ERC standards, known exploit signatures, or annotated code examples. When a new contract is submitted for analysis, the system retrieves the most semantically similar entries and injects them into the model’s context window.
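The retrieve-and-inject step can be sketched in a few lines. This is a toy under stated assumptions: `embed` is a bag-of-words counter standing in for a learned embedding model, `KNOWLEDGE_BASE` is a three-entry hypothetical list standing in for a vector database, and `build_prompt` only assembles text without calling any model.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would use a learned embedding model."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


KNOWLEDGE_BASE = [  # hypothetical entries
    "Reentrancy: external call via msg.sender.call before the balance update.",
    "Bad randomness: block.timestamp or blockhash used as an entropy source.",
    "Access control: privileged function missing an onlyOwner style modifier.",
]


def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k knowledge-base entries most similar to the query."""
    q = embed(query)
    ranked = sorted(KNOWLEDGE_BASE, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]


def build_prompt(contract_source: str) -> str:
    """Inject retrieved entries ahead of the contract under review."""
    context = "\n".join(f"- {doc}" for doc in retrieve(contract_source))
    return f"Known patterns:\n{context}\n\nContract under review:\n{contract_source}"


code = ("function withdraw() public { msg.sender.call{value: balance[msg.sender]}(''); "
        "balance[msg.sender] = 0; }")
print(retrieve(code, k=1)[0])
print(build_prompt(code))
```

Here the reentrancy entry ranks first because it shares the most terms with the contract. Adding a newly published exploit pattern means appending one entry to the store, with no retraining involved.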

The empirical evidence for RAG in this domain is encouraging. An approach integrating ERC documentation into the vulnerability detection process addresses gaps in static analysis tools and zero-shot LLMs, with the proposed model achieving significant advancements over traditional tools like Mythril and Slither, and over zero-shot LLM prompting with ChatGPT-3.5 and ChatGPT-4.

Hybrid architectures that combine fine-tuning and RAG also show promise. ParaVul proposes a parallel LLM and retrieval-augmented framework to improve reliability and accuracy of smart contract vulnerability detection. It develops a sparse Low-Rank Adaptation (SLoRA) approach for LLM fine-tuning, then constructs a hybrid RAG system that integrates dense retrieval with BM25, assisting in verifying the results generated by the LLM. A meta-learning model then fuses the outputs of the RAG system and the LLM to generate the final detection results.
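The dense-plus-BM25 idea can be illustrated without reproducing ParaVul itself. The sketch below is not that system's implementation: it scores documents with a standard BM25 formula, takes dense similarities as given inputs (the values here are hypothetical), and fuses the two with a plain weighted sum, which is a simple stand-in for the meta-learning fusion model described above.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9_]+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with BM25 (non-negative IDF variant)."""
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    doc_freq = Counter(term for t in tokenized for term in set(t))
    n_docs = len(docs)
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def min_max(values: list[float]) -> list[float]:
    """Rescale scores to [0, 1] so sparse and dense scores are comparable."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]


def hybrid_rank(query, docs, dense_scores, alpha: float = 0.5) -> list[int]:
    """Return document indices ordered by a weighted sum of dense and BM25 scores."""
    sparse = min_max(bm25_scores(query, docs))
    dense = min_max(dense_scores)
    fused = [alpha * d + (1 - alpha) * s for d, s in zip(dense, sparse)]
    return sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)


docs = [
    "reentrancy external call before state update",
    "timestamp dependence block timestamp used for randomness",
    "delegatecall to untrusted address",
]
dense_scores = [0.20, 0.70, 0.10]  # hypothetical similarities from an embedding model

print(hybrid_rank("block timestamp randomness", docs, dense_scores))  # [1, 0, 2]
```

The appeal of combining the two signals is that BM25 rewards exact identifier matches that embeddings can blur, while dense retrieval catches paraphrases that share no tokens with the query.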

The comparative advantages of each approach map onto different operational needs:

Unlike fine-tuning, which requires retraining for new information, RAG leverages external updates in real-time without modifying the model. This makes RAG substantially better suited to a domain where new attack vectors emerge constantly. A RAG system’s knowledge base can be updated with a new exploit pattern the day after it is published; a fine-tuned model requires a full retraining cycle to incorporate the same information.

Fine-tuned models excel at specific tasks or domains, producing highly relevant outputs tailored to specific use cases. Unlike RAG, fine-tuned models do not require external retrieval processes, making their responses faster at inference time.

RAG systems can struggle with the nuances of different industry-specific use cases, and retrieval quality is a real bottleneck: if the semantic search fails to surface the most relevant vulnerability patterns, the model’s reasoning is ungrounded. Fine-tuned models, by contrast, have internalized the domain knowledge and do not depend on retrieval quality.

In practice, the most effective solution often lies not in choosing between RAG and fine-tuning, but in combining both. A hybrid architecture merges the contextual adaptability of retrieval-augmented generation with the task-specific accuracy of tuned models. For smart contract security specifically, this means a model fine-tuned on the canonical vulnerability taxonomy for fast, reliable pattern detection, augmented by a retrieval system that surfaces protocol-specific invariants and recent exploit signatures at query time.

What the Evidence Actually Shows

Pulling together the research landscape produces a more nuanced picture than either the pro-fine-tuning or pro-zero-shot advocates tend to present.

Fine-tuned models reliably outperform zero-shot on benchmark tasks involving known vulnerability patterns. Comparative experiments demonstrate significant improvements over prompt-based LLMs and state-of-the-art tools like GPTLens and GPTScan. These improvements are real, reproducible, and meaningful within their scope.

Zero-shot models with advanced prompting strategies close the gap significantly. In the Chain-of-Thought setting, LLaMA-3 demonstrates the strongest overall performance with 96% accuracy and an F1-score of 0.82. CoT and Tree-of-Thought prompting substantially increase recall (often approaching 95–99%), but typically reduce precision, indicating a more sensitive decision regime with more false positives. The gap between a well-prompted zero-shot model and a fine-tuned model on clean classification benchmarks is often smaller than headline numbers suggest.

Both approaches share a fundamental ceiling. Although LLM-powered detection offers promising advantages, empirical study has exposed limitations inherent in current LLMs that inhibit them from reaching their full potential in practice. Existing vulnerability detection methods, including static and dynamic analysis as well as machine learning-based approaches, often struggle with emerging threats and rely heavily on large, labeled datasets.

The precision problem is severe at project scale. The 4.15% precision figure for GPT-4 on real DeFi attack analysis is not a prompting failure — it reflects the inherent difficulty of distinguishing true vulnerabilities from false alarms in complex, interacting multi-contract systems. Fine-tuned models reduce false positives within the training distribution; they do not solve the problem for out-of-distribution code.

Benchmark accuracy does not predict audit utility. As noted above, an LLM can be effective when the code matches its training distribution yet fail on novel or sanitized logic that lacks descriptive identifiers. The literature consistently frames the fine-tuned LLM as an auxiliary tool that helps auditors identify vulnerable code, a framing that is importantly different from “autonomous vulnerability detection.”

Practical Implications

The research evidence supports a few concrete conclusions for teams deciding how to deploy AI in smart contract security workflows.

Fine-tuning makes sense for specific, stable vulnerability classes where labeled data is abundant and the production distribution closely matches the training distribution. If the goal is automated triage of reentrancy, integer overflow, and access control issues across a large backlog of contracts, a fine-tuned classifier is the right tool. It will be faster, cheaper at inference time, and more precise than zero-shot on that narrow task.

Zero-shot or RAG-augmented approaches make more sense for novel codebases, complex DeFi protocols, and anything involving business logic. LLM-based frameworks can detect complex logic vulnerabilities that traditional tools have previously overlooked — but this capability requires the model to reason, not just classify, and reasoning quality is better supported by rich context injection than by fine-tuning alone.

Neither approach replaces human review for high-stakes audits. The false positive rates documented across the literature mean that any AI-assisted workflow must route model outputs through human triage before conclusions are acted upon. The appropriate framing is augmentation: reducing the surface area a human auditor needs to cover, surfacing candidate issues for review, and flagging unusual patterns — not producing a final vulnerability report autonomously.

The maintenance burden of fine-tuned models is routinely underweighted in decision-making. A model trained on contracts from a previous protocol generation needs retraining as Solidity versions, DeFi primitives, and attack patterns evolve. The absence of standardized classification rules leads to inconsistent vulnerability categories and labeling results across different datasets. Existing datasets employ different vulnerability classifications such as SWC or the DASP Top 10, which may overlap or describe the same vulnerability under different names. This fragmentation of ground truth makes retraining an ongoing effort rather than a one-time investment.

The most defensible production architecture combines a fine-tuned model for known-pattern detection with a retrieval-augmented layer for novel-pattern reasoning, both feeding into human-in-the-loop triage. A hybrid approach works best: fine-tune a smaller model for common tasks and use RAG to handle long-tail queries. This reflects both the empirical evidence and the operational realities of maintaining AI tooling in a domain where the threat landscape evolves continuously.
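The routing logic of such an architecture can be sketched independently of any particular model. Everything below is hypothetical: the `Finding` record, the confidence threshold, and the example findings are illustrative, and the two input lists stand in for the outputs of a fine-tuned classifier and a retrieval-augmented layer.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    location: str      # e.g. "Contract.function"
    category: str      # vulnerability class assigned by the producing layer
    source: str        # "classifier" or "rag"
    confidence: float  # 0..1, as reported by the producing layer


def build_triage_queue(classifier_findings, rag_findings, min_confidence=0.3):
    """Merge both layers into one review queue; nothing is auto-confirmed."""
    merged = {}
    for finding in [*classifier_findings, *rag_findings]:
        if finding.confidence < min_confidence:
            continue  # hypothetical noise floor to keep the queue reviewable
        key = (finding.location, finding.category)
        # When both layers flag the same issue, keep the higher-confidence report.
        if key not in merged or finding.confidence > merged[key].confidence:
            merged[key] = finding
    return sorted(merged.values(), key=lambda f: f.confidence, reverse=True)


queue = build_triage_queue(
    classifier_findings=[
        Finding("Vault.withdraw", "reentrancy", "classifier", 0.92),
        Finding("Vault.setFee", "access-control", "classifier", 0.20),
    ],
    rag_findings=[
        Finding("Vault.withdraw", "reentrancy", "rag", 0.60),
        Finding("Vault.liquidate", "business-logic", "rag", 0.55),
    ],
)

for item in queue:
    print(f"{item.confidence:.2f}  {item.location}  {item.category}  ({item.source})")
```

The output of the function is a queue for a human auditor, not a verdict, which keeps the design consistent with the augmentation framing above.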

The benchmarks are useful. They are not the finish line.