RAG for Smart Contract Auditing: Building a Vulnerability Knowledge Base — Darkwave Log

Smart contracts are immutable once deployed. A single logic flaw can drain millions in seconds, and no patch can be pushed after the fact. As adoption of economically incentivized contracts grows, these vulnerabilities keep translating into significant financial losses. The pressure on auditors is asymmetric: attackers need to find one path in; defenders need to find them all.

Large language models have entered the audit workflow, and they bring genuine value: they can reason over code, articulate attack scenarios in natural language, and surface edge cases that rule-based static analyzers miss. The problem is what they do with vulnerability classes they have never seen, or have seen too rarely in training data to generalize reliably. This is where Retrieval-Augmented Generation (RAG) changes the equation.

What RAG Is and Why It Matters for Security

Retrieval-Augmented Generation is an architectural pattern that inserts an information-retrieval step between a user’s query and the LLM’s generation step. Instead of relying solely on the model’s built-in knowledge, the system first fetches relevant reference data from a database, document store, or API and feeds it into the prompt. The LLM’s answer is then “grounded” in that external data.

The mechanism is straightforward. A query — in our context, a Solidity function or a description of suspicious behavior — is converted into a vector embedding. That embedding is compared against a pre-indexed knowledge base. The top-k most similar chunks are retrieved and stuffed into the LLM’s context window alongside the original query. The model then reasons over both its parametric knowledge and the retrieved evidence simultaneously.
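The retrieval step above can be sketched in a few lines. This is a minimal illustration, assuming toy three-dimensional vectors in place of real embeddings and an in-memory dictionary in place of a vector store; the names cosine, top_k, and the chunk ids are hypothetical.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Return the k (chunk_id, similarity) pairs most similar to the query vector."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy "embeddings" (assumption): a real system would get these from an embedding model.
index = {
    "reentrancy-note": [0.9, 0.1, 0.0],
    "overflow-note": [0.1, 0.9, 0.0],
    "oracle-note": [0.0, 0.2, 0.9],
}
hits = top_k([1.0, 0.2, 0.0], index, k=2)
```

The chunks behind the returned ids are what gets placed into the prompt next to the original query.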

The RAG component leverages external knowledge bases to provide LLMs with up-to-date, authoritative security information and vulnerability patterns, thereby enriching contextual understanding and improving the accuracy and completeness of detected vulnerabilities.

For domain-specific tasks like smart contract auditing, the benefit is qualitative, not just quantitative. With techniques like Retrieval-Augmented Generation and fine-tuning, LLMs can be adapted to the rapidly changing landscape of smart contract security. A general-purpose LLM trained on public code has broad coverage of common Solidity patterns, but it may have encountered only a handful of examples of a novel flash-loan manipulation variant or a cross-chain bridge re-entrancy pattern. RAG lets you close that gap without retraining.

By fine-tuning an open-source LLM and employing RAG, the model can dynamically incorporate domain-specific external knowledge during inference, significantly improving threat identification.

The key insight is the direction of trust: rather than hoping the model memorized the right security knowledge at training time, you supply that knowledge explicitly at inference time.

Architecture Overview

Before diving into each layer, it helps to have a mental model of the full pipeline. There are two distinct phases: ingestion and inference.

┌─────────────────────────────────────────────────────────────────┐
│                        INGESTION PHASE                          │
│                                                                 │
│  Raw Sources                                                    │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
│  │ Audit Reports│  │  Post-Mortems│  │  Research Papers /   │  │
│  │  (PDF, MD)   │  │  (blog, MD)  │  │  CWE / SWC entries   │  │
│  └──────┬───────┘  └──────┬───────┘  └──────────┬───────────┘  │
│         └─────────────────┴──────────────────────┘             │
│                           │                                     │
│                    Document Parser                              │
│                           │                                     │
│                    Chunking Engine                              │
│                  (function-level / semantic)                    │
│                           │                                     │
│                    Embedding Model                              │
│                  (code-aware / text)                            │
│                           │                                     │
│                    Vector Store + BM25 Index                    │
│                  (Pinecone / Weaviate / pgvector)               │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│                        INFERENCE PHASE                          │
│                                                                 │
│  Incoming Contract / Query                                      │
│         │                                                       │
│   Query Encoder (same embedding model)                         │
│         │                                                       │
│   ┌─────┴──────┐                                               │
│   │  Hybrid    │ ← Vector similarity (semantic)                │
│   │  Retriever │ ← BM25 keyword search                         │
│   └─────┬──────┘                                               │
│         │  top-k chunks                                        │
│   Cross-Encoder Reranker (optional but recommended)            │
│         │  reranked top-n                                      │
│   Context Assembly + Prompt Template                           │
│         │                                                       │
│         LLM (GPT-4 / Llama / Claude)                           │
│         │                                                       │
│   Structured Audit Output + Source Citations                   │
└─────────────────────────────────────────────────────────────────┘

Each box above represents a real engineering decision. The following sections work through them in order.

Building the Vulnerability Knowledge Base

The quality of your retrieval system is bounded by the quality of what you put in. Garbage in, hallucinated vulnerabilities out.

Source Categories

Audit reports. Professional firms publish post-audit disclosures. These typically contain the vulnerable code snippet, a natural-language description of the flaw, severity classification, and a recommended fix. They are gold-standard examples: paired (vulnerable code, explanation) tuples. Parsing them requires some care because PDF formatting destroys whitespace in Solidity code blocks, but the informational density is high.

Post-mortems. Exploit post-mortems, often published on Mirror, GitHub, or firm blogs, describe attacks that succeeded against deployed contracts. They frequently include transaction hashes, attacker flow diagrams, and root-cause analysis. These documents are especially valuable because they capture novel attack patterns — the kind that static analyzers have no rules for yet.

Research papers. Academic papers on smart contract security introduce formal classifications, new vulnerability categories, and empirical datasets. They are slower-moving but more precise in their definitions. Abstracts and vulnerability taxonomy sections are the highest-signal parts.

SWC Registry and CWE entries. The Smart Contract Weakness Classification registry provides concise, canonical descriptions of known vulnerability classes with identifiers (SWC-107 for re-entrancy, SWC-101 for integer overflow, and so on). These short entries serve as excellent anchors for query expansion and for labeling chunks elsewhere in the knowledge base.

Metadata Schema

Every document chunk in your vector store should carry structured metadata alongside its embedding vector. At minimum:

{
  "chunk_id": "uuid",
  "source_type": "audit_report | post_mortem | research_paper | swc",
  "vulnerability_class": ["reentrancy", "price_manipulation"],
  "severity": "critical | high | medium | low | informational",
  "protocol_type": "lending | dex | bridge | staking | nft",
  "language": "solidity | vyper | rust",
  "content": "...",
  "embedding": [...]
}

This metadata serves two purposes. First, it enables filtered retrieval — if an auditor is analyzing a lending protocol, you can bias retrieval toward chunks tagged protocol_type: lending. Second, it lets you quantify knowledge base coverage: how many post-mortems do you have per vulnerability class? Where are the gaps?
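Both uses of the metadata can be sketched with plain Python. This is an illustration only: the Chunk dataclass mirrors a subset of the schema above, and filter_by_protocol and coverage are hypothetical helper names. A real vector store would apply the filter server-side.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Chunk:
    chunk_id: str
    source_type: str
    protocol_type: str
    vulnerability_class: list[str] = field(default_factory=list)

def filter_by_protocol(chunks: list[Chunk], protocol_type: str) -> list[Chunk]:
    """Filtered retrieval: keep only chunks tagged with the given protocol type."""
    return [c for c in chunks if c.protocol_type == protocol_type]

def coverage(chunks: list[Chunk], source_type: str) -> Counter:
    """Coverage report: number of chunks of one source type per vulnerability class."""
    counts: Counter = Counter()
    for c in chunks:
        if c.source_type == source_type:
            counts.update(c.vulnerability_class)
    return counts

chunks = [
    Chunk("a1", "post_mortem", "lending", ["reentrancy"]),
    Chunk("a2", "post_mortem", "dex", ["price_manipulation"]),
    Chunk("a3", "audit_report", "lending", ["reentrancy", "price_manipulation"]),
    Chunk("a4", "post_mortem", "lending", ["reentrancy"]),
]
lending = filter_by_protocol(chunks, "lending")
post_mortem_coverage = coverage(chunks, "post_mortem")
```

A class that shows up with a count of zero or one in the coverage report is a candidate gap.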

Chunking and Embedding Strategies

The Fundamental Tension

Chunking is the process of breaking documents into smaller segments before embedding and indexing them for vector search. It sounds like a preprocessing detail, but it’s one of the more important decisions in your RAG pipeline, affecting retrieval precision, index size, query latency, and the quality of LLM answers. Split documents too aggressively, and you get fragments stripped of context. Split them too conservatively, and your vector embeddings dilute multiple topics into a single representation that matches nothing well.

This tension is especially acute for smart contract security documents, which mix two very different content types: Solidity code and natural-language prose. Each requires a different chunking approach.

Chunking Prose Security Documentation

For audit reports, post-mortems, and research papers, semantic chunking outperforms fixed-size splitting. Rather than chopping every 512 tokens regardless of topic boundaries, semantic chunking uses embedding similarity between consecutive sentences to detect topic shifts and place chunk boundaries there.
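A minimal sketch of the idea follows. It assumes a toy bag-of-words vector as a stand-in for a real sentence embedding model, and the 0.2 threshold is an arbitrary illustrative value; embed, cosine, and semantic_chunks are hypothetical names.

```python
import math
import re
from collections import Counter

def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model (assumption): bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk when similarity between consecutive sentences drops below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(cur)) < threshold:
            chunks.append([cur])      # topic shift: open a new chunk
        else:
            chunks[-1].append(cur)    # same topic: extend the current chunk
    return chunks

sentences = [
    "The withdraw function sends ETH before updating the balance.",
    "An attacker can re-enter the withdraw function and drain the balance.",
    "Separately, the price oracle reads a spot price from a single pool.",
    "A flash loan can move that spot price within one transaction.",
]
chunks = semantic_chunks(sentences)
```

With a real embedding model the threshold needs tuning on your own corpus; the boundary logic stays the same.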

Chunking determines what the retriever can find, which directly affects answer quality, hallucination risk, and token usage across the entire RAG pipeline. Good chunking improves retrieval precision by returning exactly the passage the query needs instead of unrelated text, and it preserves context by keeping related ideas together rather than splitting them across segments. When the model receives complete and relevant context, it is less likely to hallucinate.

For structured audit reports that follow a predictable schema (executive summary → findings → recommendations), structure-aware chunking often works even better. It has been reported to yield higher retrieval effectiveness, particularly in top-k metrics, at significantly lower computational cost than semantic chunking or fixed-size baselines. You parse the document hierarchy and treat each finding as a self-contained chunk, keeping the vulnerability description, the code snippet, and the remediation advice together within a single retrievable unit.
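As a rough sketch, treating each finding as one chunk can be as simple as splitting at finding headings. The heading convention used here ("## [H-01] Title") is an assumption for illustration; real reports vary by firm, and split_findings is a hypothetical helper.

```python
import re

# Assumed heading convention: "## [H-01] Title". Adjust the pattern per report format.
FINDING_HEADING = re.compile(r"^## \[(?P<id>[A-Z]+-\d+)\] (?P<title>.+)$", re.MULTILINE)

def split_findings(report_md: str) -> list[dict]:
    """Return one chunk per finding, spanning from its heading to the next finding heading."""
    matches = list(FINDING_HEADING.finditer(report_md))
    findings = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report_md)
        findings.append({
            "finding_id": m.group("id"),
            "title": m.group("title").strip(),
            "content": report_md[m.start():end].strip(),
        })
    return findings

report = """# Example Audit

## [H-01] Reentrancy in withdraw
Description: external call precedes the balance update.
Recommendation: apply checks-effects-interactions.

## [M-01] Missing zero-address check
Description: setOwner accepts address(0).
"""
findings = split_findings(report)
```

Each resulting chunk carries its description and recommendation together, so a retrieved finding is usable on its own.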

Fixed-size chunking, which splits every N tokens, is convenient but destructive. It ignores document structure and splits sentences mid-thought. Avoid it for anything beyond prototyping.

Chunking Smart Contract Source Code

Code has its own natural boundaries: functions, modifiers, and events. The correct chunking unit for Solidity source is the function or function group, not an arbitrary token window.

A function-level chunker should:

Parse the AST (using solc’s JSON output or tools like solidity-parser-antlr, now maintained as @solidity-parser/parser) to identify function boundaries.
Emit one chunk per function, including its signature, NatSpec comments, state variable declarations it reads/writes, and any modifier names.
For very large functions, apply a secondary split at the statement level with a 20% overlap to preserve inter-statement context.
For contracts with heavy inheritance, include a brief inheritance chain summary in the chunk’s metadata so the retriever can surface parent-contract context.
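To make the first two steps concrete, here is a deliberately simplified sketch. It uses a regex plus brace matching instead of a real AST, which is an assumption made to keep the example self-contained: it will misfire on braces inside strings or comments and it omits NatSpec and state variables. A production chunker should walk the solc AST. The split_functions name is hypothetical.

```python
import re

FUNCTION_START = re.compile(r"\bfunction\s+(\w+)\s*\(")

def split_functions(source: str) -> list[dict]:
    """Emit one chunk per function that has a body, found by matching braces."""
    chunks = []
    for m in FUNCTION_START.finditer(source):
        open_idx = source.find("{", m.end())
        semi_idx = source.find(";", m.end())
        # A ';' before any '{' means a body-less declaration (interface or abstract).
        if open_idx == -1 or (semi_idx != -1 and semi_idx < open_idx):
            continue
        depth = 0
        for i in range(open_idx, len(source)):
            if source[i] == "{":
                depth += 1
            elif source[i] == "}":
                depth -= 1
                if depth == 0:  # closing brace of the function body
                    chunks.append({"name": m.group(1), "code": source[m.start():i + 1]})
                    break
    return chunks

source = """
contract Vault {
    mapping(address => uint256) public balances;

    function deposit() external payable {
        balances[msg.sender] += msg.value;
    }

    function withdraw(uint256 amount) external {
        require(balances[msg.sender] >= amount, "insufficient");
        (bool ok, ) = msg.sender.call{value: amount}("");
        require(ok, "transfer failed");
        balances[msg.sender] -= amount;
    }
}
"""
chunks = split_functions(source)
```

Note that the call{value: amount} syntax adds a nested brace pair, which is why the sketch tracks depth rather than stopping at the first closing brace.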

Graph-based detectors go further: they turn the key functions related to a vulnerability, together with their dependencies, into graph features that represent the contract. You do not need to go as far as full graph embedding for a RAG knowledge base, but the principle holds: the function’s call relationships matter, and excluding them from the chunk leaves the embedding model without the context it needs to distinguish a safe transfer() call from a re-entrant one.

Embedding Model Selection

Open-source LLMs are primarily pretrained on general code corpora without specific adaptation to Solidity, the programming language of smart contracts. The same limitation applies to embedding models. A general-purpose embedding model (e.g., text-embedding-ada-002 or all-MiniLM-L6-v2) will cluster semantically similar natural-language descriptions well, but it may not map vulnerable Solidity patterns close to their textual descriptions in the embedding space.

Your options, in order of complexity:

General-purpose text embeddings (OpenAI text-embedding-3, Cohere embed-v3): Easy to use, good for prose documentation. Weaker on raw Solidity code.
Code-aware embeddings (CodeBERT, UniXcoder, StarEncoder): Pretrained on multi-language code corpora. Better at capturing structural similarity between code snippets.
Fine-tuned code embeddings: Take a code-aware base model and fine-tune it on a contrastive dataset of (vulnerable function, vulnerability description) pairs. This is the highest-effort approach but directly optimizes the retrieval task.

Frameworks like SCPatcher use the static analysis tool Slither to extract contract relationships, build a Knowledge Graph in Neo4j, and embed it using CodeBERT. This knowledge base guides LLMs to generate accurate repairs. You can adopt the same embedding approach without necessarily building a full knowledge graph — CodeBERT embeddings on function-level chunks are a practical starting point.

A practical compromise is a dual embedding strategy: embed prose chunks with a text model and code chunks with a code model, maintaining two sub-indexes that are queried independently and whose results are merged before reranking.

Retrieval Strategies for Vulnerability Detection

Why Pure Semantic Search Is Insufficient

Semantic search excels when the query and the document express the same idea in different words. That matters when an auditor describes a pattern conversationally: “the function sends ETH before updating the user’s balance” should retrieve documents about re-entrancy even if the word “reentrancy” never appears in the query.

But vulnerability detection also involves precise technical identifiers: function names, opcode sequences, ERC standard numbers, CVE or SWC identifiers. BM25 dominates when queries contain specific terminology, industry-specific jargon and acronyms, or exact names with consistent naming conventions. Pure semantic search can rank a thematically adjacent but practically irrelevant document higher than a document that mentions the exact function name being analyzed.

Hybrid Retrieval: The Right Default

A hybrid Retrieval-Augmented Generation system that integrates dense retrieval with BM25 assists in verifying and contextualizing vulnerability detection results. This architecture runs both retrievers in parallel and merges their result sets before a final ranking pass.

Traditional methods like BM25 have proven effective for keyword-based searches but often struggle with semantic understanding and capturing contextual nuances. Conversely, vector search, powered by embedding models, excels at semantic similarity but can miss exact keyword matches. A hybrid approach combines the strengths of both BM25 and vector search to achieve superior retrieval performance.

The standard fusion algorithm is Reciprocal Rank Fusion (RRF):

RRF_score(d) = Σ_i 1 / (k + rank_i(d))

where k is a smoothing constant (typically 60) and rank_i(d) is the position of document d in the i-th retriever’s result list. Documents that appear high in both ranked lists receive the highest fused scores.
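A minimal sketch of the fusion step, assuming each retriever returns an ordered list of chunk ids and that ranks are 1-based; the chunk ids are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of ids into one list sorted by RRF score (highest first)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

dense = ["c7", "c2", "c9"]    # order returned by vector search
sparse = ["c2", "c4", "c7"]   # order returned by BM25
fused = reciprocal_rank_fusion([dense, sparse])
# c2 scores 1/62 + 1/61 and c7 scores 1/61 + 1/63, so c2 edges ahead of c7;
# c4 (1/62) and c9 (1/63) each appear in only one list and trail behind.
```

Because RRF uses only ranks, it needs no score normalization between the dense and sparse retrievers.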

After fusion, a cross-encoder reranker scores each candidate against the original query using full attention. The bi-encoder that produced the vector similarity scores embeds the query and each chunk separately, so it can only approximate relevance; the cross-encoder reads the query and the chunk together and provides a more precise relevance signal. The result is a two-stage filter: a cheap first pass over the whole index using dense vector similarity and BM25, then a more expensive token-level attention pass over a short candidate list.

Query Formulation for Audit Contexts

The query sent to the retrieval system should not be the raw contract source. Instead, construct a structured query that combines:

The function signature and NatSpec comments — providing the semantic description of intent.
Extracted features from static analysis — output from Slither or Aderyn flagging potential hotspots (e.g., “external call before state update in withdraw()”).
Protocol context — the type of contract being audited (e.g., “AMM liquidity pool”) to enable metadata filtering.
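The three ingredients above could be combined along these lines. This is a sketch under assumptions: RetrievalQuery and build_structured_query are hypothetical names, and the static-analysis flags are taken as plain strings rather than any particular tool's output format.

```python
from dataclasses import dataclass

@dataclass
class RetrievalQuery:
    text: str       # sent to both the dense and the BM25 retriever
    filters: dict   # applied as metadata filters in the vector store

def build_structured_query(signature: str, natspec: str,
                           static_flags: list[str], protocol_type: str) -> RetrievalQuery:
    """Combine signature, intent, and static-analysis hints into one retrieval query."""
    parts = [f"Function: {signature}"]
    if natspec:
        parts.append(f"Intent: {natspec}")
    parts.extend(f"Static analysis flag: {flag}" for flag in static_flags)
    return RetrievalQuery(text="\n".join(parts), filters={"protocol_type": protocol_type})

query = build_structured_query(
    signature="withdraw(uint256 amount) external",
    natspec="Withdraws the caller's deposited ETH.",
    static_flags=["external call before state update in withdraw()"],
    protocol_type="lending",
)
```

Keeping the filters separate from the query text lets the retriever apply them as hard constraints or soft biases, whichever the vector store supports.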

To improve reliability, an LLM-based audit pipeline can incorporate structured reasoning, prompt optimization, and external tools such as static analyzers alongside retrieval. Static analyzers and RAG are not competing approaches; they are complementary. The static analyzer narrows the search space; RAG provides the contextual explanation and historical precedent.

How RAG Reduces Hallucination in Security Contexts

Hallucination is the pathology where a model produces confident, fluent, and wrong output. In security contexts, a hallucinated vulnerability is worse than no finding at all — it wastes auditor time chasing phantom bugs and erodes trust in the tooling.

In high-stakes domains such as law, healthcare, or enterprise knowledge systems, hallucinations can be harmful. RAG was introduced as a way to ground LLM outputs in external data sources, reducing the risk of such fabrications.

The grounding mechanism works through citation pressure. When the prompt template instructs the model to cite the retrieved chunks that support each finding, the model must map its claims to specific evidence. If no retrieved chunk supports a claimed vulnerability, the model either acknowledges uncertainty or invents a citation — the latter is detectable if your evaluation harness checks citation validity.

By grounding the generation process in factual information from reliable sources, RAG can reduce the likelihood of hallucinating incorrect or made-up content, thereby enhancing the factual accuracy and reliability of the generated responses.

However, RAG does not eliminate hallucination. RAG-based LLMs still hallucinate even when provided with correct and sufficient context. A growing line of work suggests that this stems from an imbalance between how models use external context and their internal knowledge.

Three failure modes deserve specific attention in the security context:

Poor retrieval precision. Irrelevant retrieval — where the retriever surfaces documents that don’t answer the query — leads the LLM to “fill in the gaps.” When evidence is insufficient, the model may speculate instead of admitting uncertainty. The mitigation is a confidence threshold: if the top retrieved chunk’s similarity score falls below a threshold, the pipeline should return a low-confidence signal rather than proceeding to generation.

Terminology mismatch. Vulnerability descriptions in older audit reports may use different terminology than the query. An auditor asking about “read-only re-entrancy” may not retrieve documents that describe “view function re-entrancy” if the embedding space does not bridge that gap. Synonym expansion at query time — injecting known aliases for vulnerability classes — partially addresses this.

Context window dilution. If too many retrieved chunks are concatenated into the prompt, the LLM’s attention is spread too thin and the most relevant evidence may be “lost in the middle.” Limit the context to 5–8 well-reranked chunks and use the cross-encoder scores to prioritize the highest-signal evidence closest to the query in the prompt template.
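The synonym expansion suggested for terminology mismatch can be sketched as a lookup table applied before retrieval. The alias table below is a small illustrative assumption, not a canonical vocabulary, and expand_query is a hypothetical helper.

```python
# Illustrative alias table (assumption): maintain your own from the knowledge base tags.
ALIASES = {
    "read-only re-entrancy": ["view function re-entrancy", "read-only reentrancy"],
    "price oracle manipulation": ["spot price manipulation"],
}

def expand_query(query: str, aliases: dict[str, list[str]] = ALIASES) -> str:
    """Append known aliases for any vulnerability class mentioned in the query."""
    lowered = query.lower()
    extra: list[str] = []
    for term, synonyms in aliases.items():
        group = [term, *synonyms]
        if any(name in lowered for name in group):
            # Add every alias from the matched group that the query does not already contain.
            extra.extend(n for n in group if n not in lowered and n not in extra)
    return query if not extra else f"{query} ({'; '.join(extra)})"

expanded = expand_query("Is this pool exposed to read-only re-entrancy?")
unchanged = expand_query("How is the fee computed?")
```

The expansion mainly helps the BM25 side of a hybrid retriever, where exact term overlap decides the ranking.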

Building the RAG Pipeline for Audit Assistance

Here is a concrete, end-to-end sketch of a production audit assistance pipeline.

Ingestion Service

Raw Document → Document Parser → Metadata Extractor
                                       │
                              ┌────────┴──────────┐
                              │                   │
                         Prose Chunks        Code Chunks
                         (semantic split)    (function-level AST split)
                              │                   │
                         Text Embedder       Code Embedder
                         (e.g., ada-002)     (e.g., CodeBERT)
                              │                   │
                         Vector Index A     Vector Index B
                              │                   │
                              └─────────┬─────────┘
                                        │
                                   BM25 Index
                               (shared full-text)

Run this ingestion service as a background job. New audit reports, post-mortems, and research papers are pulled from monitored sources (GitHub feeds, RSS, manual upload) and queued for processing. The embedding step is the bottleneck; batching document chunks with async API calls or a local GPU keeps throughput reasonable.

Inference Service

def audit_function(fn_source: str, fn_metadata: dict) -> AuditResult:
    # 1. Feature extraction
    static_flags = slither_analyze(fn_source)

    # 2. Query construction
    query = build_query(fn_source, fn_metadata, static_flags)

    # 3. Hybrid retrieval
    dense_results  = vector_search(query, top_k=20, index="both")
    sparse_results = bm25_search(query, top_k=20)
    fused_results  = reciprocal_rank_fusion(dense_results, sparse_results)

    # 4. Reranking
    reranked = cross_encoder_rerank(query, fused_results, top_n=6)

    # 5. Confidence gate (also covers the case where nothing was retrieved)
    if not reranked or reranked[0].score < CONFIDENCE_THRESHOLD:
        return AuditResult(low_confidence=True)

    # 6. Prompt assembly and generation
    prompt = build_audit_prompt(fn_source, reranked)
    response = llm.generate(prompt)

    # 7. Citation validation
    validated = validate_citations(response, reranked)
    return validated

The validate_citations step cross-checks every finding reference against the retrieved chunks. A finding that cites “chunk_id: abc123” but whose content is not supported by that chunk’s text is flagged for human review.
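One way such a check might look is sketched below. It assumes findings arrive as dictionaries with description and chunk_id keys and uses crude lexical overlap as the support test; a stronger implementation could use an entailment model or a second LLM pass. The 0.3 threshold and the status labels are illustrative assumptions.

```python
import re

def _terms(text: str) -> set[str]:
    """Lowercased words of four or more letters, a cheap proxy for content terms."""
    return set(re.findall(r"[a-z]{4,}", text.lower()))

def validate_citations(findings: list[dict], retrieved: dict[str, str],
                       min_overlap: float = 0.3) -> list[dict]:
    """Label each finding as supported, needs_review, or invalid_citation."""
    checked = []
    for finding in findings:
        chunk_text = retrieved.get(finding.get("chunk_id"))
        if chunk_text is None:
            status = "invalid_citation"   # cited chunk was never retrieved
        else:
            claim = _terms(finding["description"])
            overlap = len(claim & _terms(chunk_text)) / len(claim) if claim else 0.0
            status = "supported" if overlap >= min_overlap else "needs_review"
        checked.append({**finding, "status": status})
    return checked

retrieved = {
    "abc123": "The withdraw function performs an external call before updating "
              "the balance, enabling reentrancy.",
}
findings = [
    {"description": "External call before balance update enables reentrancy in withdraw.",
     "chunk_id": "abc123"},
    {"description": "Spot price oracle can be manipulated with a flash loan.",
     "chunk_id": "abc123"},
    {"description": "Return value ignored.", "chunk_id": "zzz999"},
]
results = validate_citations(findings, retrieved)
```

Anything not labeled supported goes to the human review queue rather than being silently dropped.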

Prompt Template Design

The prompt template for audit generation should:

Establish the model’s role and constraints explicitly: “You are a smart contract security auditor. Base every finding on the provided reference documents. Do not introduce vulnerability claims that are not supported by the references. If you are uncertain, say so.”
Present the retrieved chunks in descending relevance order, with their source metadata visible.
Request structured output: finding name, SWC or CVE identifier if applicable, severity, affected lines, and the reference chunk ID that supports the claim.
Include an explicit instruction to return an empty finding list if no vulnerabilities are supported by the retrieved evidence.
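A sketch of a template that follows these four points is shown below. The exact wording, the JSON field names, and the chunk dictionary keys (chunk_id, source_type, score, content) are assumptions for illustration.

```python
RULES = (
    "You are a smart contract security auditor. Base every finding on the "
    "provided reference documents. Do not introduce vulnerability claims that "
    "are not supported by the references. If you are uncertain, say so. "
    "If no vulnerability is supported by the references, return an empty list."
)

OUTPUT_FORMAT = (
    "Return a JSON list of findings. Each finding has: name, identifier "
    "(SWC or CVE, or null), severity, affected_lines, reference_chunk_id."
)

def build_audit_prompt(fn_source: str, chunks: list[dict]) -> str:
    """Assemble role, references (most relevant first), code under review, and output format."""
    ordered = sorted(chunks, key=lambda c: c["score"], reverse=True)
    references = "\n\n".join(
        f"[{c['chunk_id']}] source={c['source_type']} score={c['score']:.2f}\n{c['content']}"
        for c in ordered
    )
    return (
        f"{RULES}\n\n"
        f"REFERENCES (most relevant first):\n{references}\n\n"
        f"FUNCTION UNDER REVIEW:\n{fn_source}\n\n"
        f"{OUTPUT_FORMAT}"
    )

prompt = build_audit_prompt(
    "function withdraw() external {}",
    [
        {"chunk_id": "c1", "source_type": "swc", "score": 0.41, "content": "SWC entry text"},
        {"chunk_id": "c2", "source_type": "post_mortem", "score": 0.87, "content": "Post-mortem text"},
    ],
)
```

Printing the chunk id in brackets gives the model a stable token to cite, which is what the citation check later keys on.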

Evaluating Retrieval Quality for Security Tasks

Evaluating a RAG pipeline requires evaluating both stages independently and then end-to-end.

Retrieval-Stage Metrics

Build a small golden dataset: 50–100 (query, relevant_chunk_ids) pairs, where a security expert manually identifies which knowledge base chunks should be retrieved for a given vulnerable function. Then measure:

Recall@k: What fraction of the relevant chunks appear in the top k results? For security tasks, recall matters more than precision — a missed vulnerability is more costly than a false positive.
Mean Reciprocal Rank (MRR): How high does the first relevant chunk rank? A relevant chunk buried at position 15 is less useful than one at position 2.
Normalized Discounted Cumulative Gain (nDCG): A graded measure that accounts for partial relevance — a chunk that describes a similar but distinct vulnerability class has some value even if it is not the best match.
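The three metrics are short enough to implement directly. The sketch below assumes retrieved results are ordered lists of chunk ids, relevance labels are sets, and graded relevance for nDCG is a dictionary of gains; MRR is the mean of reciprocal_rank over all golden queries. The ids and gain values are made up.

```python
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], gains: dict[str, float], k: int) -> float:
    """DCG of the top k results divided by the DCG of an ideal ordering."""
    dcg = sum(gains.get(cid, 0.0) / math.log2(rank + 1)
              for rank, cid in enumerate(retrieved[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

retrieved = ["c5", "c1", "c9", "c3"]
relevant = {"c1", "c3", "c8"}
gains = {"c1": 2.0, "c3": 1.0, "c8": 2.0}   # graded relevance: 2 = exact, 1 = related

recall = recall_at_k(retrieved, relevant, k=3)   # only c1 is in the top 3
rr = reciprocal_rank(retrieved, relevant)        # first relevant chunk is at rank 2
ndcg = ndcg_at_k(retrieved, gains, k=4)
```

Averaging each metric per vulnerability class, not just overall, makes coverage gaps visible.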

Generation-Stage Metrics

Key end-to-end evaluation dimensions include: Faithfulness — whether the answer is factually grounded in the provided context, which directly measures the reduction in hallucinations; Answer Relevancy — whether the answer directly addresses the user’s question; and Answer Correctness — whether the generated answer matches a ground-truth correct answer.

Frameworks like RAGAs, LlamaIndex, and DeepEval can automate this evaluation.

For security-specific evaluation, supplement automated metrics with a human expert review on a sampled subset. Ask an auditor to rate each generated finding on:

Precision: Is this a real vulnerability in the analyzed code?
Evidence quality: Is the cited reference actually relevant to the finding?
Actionability: Is the description specific enough to guide remediation?

Track these metrics per vulnerability class. If your pipeline has high precision for re-entrancy but poor recall for price oracle manipulation, the gap points to a coverage problem in the knowledge base for that class.

Adversarial Retrieval Testing

A retrieval system that performs well on average can still fail catastrophically on novel patterns. Construct adversarial test cases:

Vocabulary mismatch queries: Describe a reentrancy attack without using the word “reentrancy” — only the control flow pattern.
Paraphrased code: Functionally identical vulnerable code with renamed variables — does the embedding capture semantic similarity or just token overlap?
Near-miss queries: Vulnerable code that is similar but not identical to a known exploit pattern — does the system retrieve the closest match, or does it return irrelevant chunks?

Maintaining a Current Vulnerability Knowledge Base

Building the initial knowledge base is a bounded, one-off project. Keeping it current is an ongoing operational commitment that teams systematically underestimate.

The Staleness Problem

The training data of an LLM is a static snapshot that only contains information that was true on the cutoff date. A retrieval index has the same property between ingestion runs. RAG’s advantage is that the external data can be as current or specialized as needed, which addresses the problem of static or out-of-date training data, but that advantage disappears if the knowledge base itself falls behind the threat landscape.

DeFi attack patterns evolve quickly. A flash-loan attack variant published last month may not be in your knowledge base if you only run ingestion quarterly. New ERC standards introduce new contract interaction patterns that create new vulnerability surfaces. A knowledge base that was comprehensive at launch can become a false-confidence generator within months.

Ingestion Pipeline Automation

Automate ingestion from high-signal sources:

GitHub repositories of audit contest and bug bounty platforms (Code4rena, Sherlock, Immunefi disclosures): monitor for new reports via the GitHub API and webhook triggers.
Rekt.news and similar post-mortem aggregators — RSS feeds or scrapers that trigger on new entries.
ArXiv cs.CR category — automated queries for smart contract security papers, filtered by abstract content.
SWC Registry — watch for additions and modifications to canonical vulnerability descriptions.

Each new document goes through the same ingestion pipeline: parse, chunk, embed, index. Metadata tagging should be automated where possible (classify vulnerability class with a lightweight classifier) and human-reviewed for high-stakes sources like new SWC entries.

Versioning and Deprecation

Knowledge bases need versioning. When a vulnerability class is reclassified — for example, when a pattern previously considered low-severity is elevated to critical based on new attack data — you need to update the affected chunks without silently corrupting retrieval for older queries.

Practical approach:

Use a document-level valid_from and valid_to timestamp. Mark deprecated entries as inactive rather than deleting them; they may still be historically relevant for understanding the evolution of an attack class.
Maintain an embeddings version alongside the model version. When you upgrade your embedding model, re-embed the full corpus — retrieval quality degrades when some chunks are embedded with an older model and some with a newer one.
Run your golden retrieval evaluation set after every batch ingestion to catch regressions before they affect live audits.
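The first two points can be sketched as a pair of filters over chunk records. The ChunkRecord fields follow the names used above; treating valid_to as exclusive and using None for "still valid" are assumptions of this sketch.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class ChunkRecord:
    chunk_id: str
    valid_from: date
    valid_to: Optional[date]      # None means the entry is still valid
    embedding_version: str

def active_chunks(records: list[ChunkRecord], on: date, embedding_version: str) -> list[ChunkRecord]:
    """Chunks valid on the given date and embedded with the given model version."""
    return [
        r for r in records
        if r.valid_from <= on
        and (r.valid_to is None or on < r.valid_to)   # valid_to is exclusive
        and r.embedding_version == embedding_version
    ]

def needs_reembedding(records: list[ChunkRecord], current_version: str) -> list[str]:
    """Ids of chunks still embedded with an older model version."""
    return [r.chunk_id for r in records if r.embedding_version != current_version]

records = [
    ChunkRecord("c1", date(2023, 1, 1), None, "v2"),
    ChunkRecord("c2", date(2022, 1, 1), date(2023, 6, 1), "v2"),   # deprecated, kept for history
    ChunkRecord("c3", date(2023, 1, 1), None, "v1"),               # stale embedding
]
live = active_chunks(records, on=date(2024, 1, 1), embedding_version="v2")
stale = needs_reembedding(records, current_version="v2")
```

Querying with an earlier date reproduces what the retriever would have seen at that time, which helps when revisiting an old audit.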

Curation vs. Coverage

The temptation is to ingest everything. A 10,000-chunk knowledge base of noisy, poorly structured documents will underperform a 2,000-chunk knowledge base of high-quality, well-tagged entries. One of the biggest challenges is the lack of clear, practical guidance on how to effectively structure and segment source documents to maximize retrieval quality and LLM performance.

Curation heuristics for smart contract security sources:

Prefer primary sources over secondary summaries. An original post-mortem has higher fidelity than a blog post summarizing it.
Require code snippets for vulnerability-describing chunks. Natural-language-only descriptions of code vulnerabilities embed poorly and retrieve unreliably.
Deduplicate aggressively. The same re-entrancy pattern described in 40 audit reports does not benefit from 40 copies in the index. Cluster similar chunks and retain the canonical representative — usually the most detailed, most recent, or most authoritative source.
Tag confidence levels. Unverified reports, rumors, or speculative analyses should be tagged with lower confidence and filtered out of high-stakes queries.
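The deduplication heuristic can be approximated with token-set Jaccard similarity and a greedy pass that keeps the longest chunk of each near-duplicate group. Using length as the proxy for "most detailed" and 0.8 as the threshold are assumptions of this sketch; embedding-based clustering is a common alternative.

```python
import re

def token_set(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a chunk."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union (1.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def deduplicate(chunks: list[str], threshold: float = 0.8) -> list[str]:
    """Greedy near-duplicate removal; longer chunks are considered first and win."""
    kept: list[str] = []
    for chunk in sorted(chunks, key=len, reverse=True):
        tokens = token_set(chunk)
        if all(jaccard(tokens, token_set(k)) < threshold for k in kept):
            kept.append(chunk)
    return kept

chunks = [
    "withdraw sends ETH via call before updating balances, allowing reentrancy",
    "Withdraw sends ETH via call before updating balances, allowing reentrancy.",
    "setOwner lacks an access control modifier",
]
unique = deduplicate(chunks)
```

The pairwise comparison is quadratic, which is acceptable for a few thousand chunks but would need an approximate method such as MinHash at larger scale.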

Limitations and What RAG Does Not Fix

RAG is not a substitute for deep protocol-specific expertise. Reported results for RAG-LLM auditing are promising, but they also underline the need for human review. Such systems are best understood as a proof of concept for a cost-effective smart contract auditing process that moves towards democratic access to security.

Several gaps remain:

Novel attack classes. If a vulnerability class has no representation in your knowledge base, the retriever will return irrelevant results, and the model will either hallucinate or produce generic output. RAG only helps when the answer exists somewhere in the knowledge base.

Cross-contract interactions. Many modern exploits span multiple contracts and transactions. Chunking at the function level discards inter-contract call context. Knowledge graph approaches — where contract relationships are modeled as edges — partially address this, but add significant implementation complexity.

Business logic flaws. The most expensive vulnerabilities in DeFi are not reentrancy or integer overflow — they are misaligned economic incentives, incorrect fee accounting, or protocol-specific oracle dependencies. These do not pattern-match against generic vulnerability descriptions in a knowledge base. They require auditors to understand the specific protocol’s intended behavior.

Latency under load. A full hybrid retrieval + reranking pass adds 200–800ms to each function analysis, depending on knowledge base size and whether the reranker runs locally or via API. At the scale of a full protocol audit (tens of thousands of lines), this accumulates. Caching retrieval results for repeated function signatures and batching embedding requests mitigates the overhead.
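The caching idea can be sketched as a dictionary keyed by a hash of the normalized function source. Normalizing by stripping comments and whitespace is an assumption of this sketch (the naive comment regex would also strip "//" inside string literals), and cached_retrieve and the in-memory cache are hypothetical stand-ins for whatever cache the pipeline uses.

```python
import hashlib
import re
from typing import Callable

_cache: dict[str, list[str]] = {}

def normalize(fn_source: str) -> str:
    """Strip comments and collapse whitespace so trivially different copies share a key."""
    no_comments = re.sub(r"//[^\n]*|/\*.*?\*/", "", fn_source, flags=re.DOTALL)
    return " ".join(no_comments.split())

def cached_retrieve(fn_source: str, retrieve: Callable[[str], list[str]]) -> list[str]:
    """Run the expensive retrieval only the first time a normalized function is seen."""
    key = hashlib.sha256(normalize(fn_source).encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = retrieve(fn_source)
    return _cache[key]

calls: list[str] = []

def fake_retrieve(src: str) -> list[str]:
    """Stand-in for the hybrid retrieval and reranking pass."""
    calls.append(src)
    return ["c1", "c2"]

first = cached_retrieve("function f() external { x = 1; } // note", fake_retrieve)
second = cached_retrieve("function f()  external {\n  x = 1;\n}", fake_retrieve)
```

Forked protocols tend to reuse library functions verbatim, so this kind of cache can skip a meaningful share of retrieval passes on a large audit.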

Putting It Together

The core thesis is simple: in the context of smart contract auditing, a RAG-LLM system can retrieve examples of known vulnerabilities from a vector store, enhancing its ability to identify similar issues in new contracts. By integrating this technology, it becomes possible to develop a scalable and cost-effective solution that democratizes access to smart contract security auditing.

The engineering reality is more demanding. A knowledge base that consolidates audit reports, post-mortems, and research papers into a well-indexed, hybrid-searchable corpus, maintained with continuous ingestion, versioning, and regular quality evaluation, is a significant artifact. It needs dedicated ownership: someone responsible for what goes in, how it is tagged, and whether retrieval quality is trending up or down.

Teams that treat the knowledge base as a one-time setup and the retrieval pipeline as a commodity component will find that their RAG audit assistant produces increasingly stale, increasingly imprecise results over time. Teams that invest in the maintenance loop — the evaluation harness, the ingestion automation, the curation discipline — will find that the system compounds: every new post-mortem ingested makes the next audit slightly sharper.

The retrieval pipeline is infrastructure. Build it like infrastructure.