
/ Whitepaper · 2026

Cited RAG for the SOC

How we built Astute to give analysts answers they can audit — with conflict reconciliation, ATT&CK tagging, and span-level citations.

Rick Azoy · Chief AI Officer & Chief Information Security Officer, WIT ONE
March 2026 · 26 min read · 32 pages
Abstract

The default RAG implementation — embed, search, prepend, ask — fails the moment a SOC analyst tries to use it for a real decision. Conflicting sources get averaged, citations are missing or wrong, and the model invents context when retrieval comes up empty. The output is plausible. It is not auditable.

Astute is the retrieval layer of WIT OS. It is MITRE ATT&CK-aware, conflict-reconciled, and emits span-level citations on every claim. This paper is the build log: the architecture choices, the failure modes we hit, and the four metrics that matter when an analyst's career is on the line.

01 · Why default RAG fails the SOC

The SOC is not a customer-support chatbot. The cost of a hallucinated answer is not a confused user; it is an analyst escalating the wrong incident, declaring a false positive in a real intrusion, or — worst — citing an authoritative-sounding paragraph that the model fabricated whole.

Default RAG fails analysts in three predictable ways:

  • Citation drift. The model says “according to CISA AA24-XYZ…” and summarizes a real CISA advisory — but the source doc didn't make the claim the answer attributes to it.
  • Conflict averaging. Two retrieved sources disagree (NVD says CVSS 9.8, the vendor says 7.4). Default RAG averages or picks one silently. The analyst never sees the disagreement.
  • Empty-retrieval invention. When the top chunks are irrelevant, the model still answers — drawing from pretraining instead of admitting the gap.

02 · The Astute architecture

Astute is built around three commitments: every claim is attributable to a span, every conflict is surfaced, and empty retrievals fail loud.

Layer 1 · Source ingestion with provenance

Every document carries (a) source identity, (b) issuer, (c) issuance date, (d) freshness signal, (e) trust class (canonical, derived, community). An NVD CVE record is canonical; a Reddit thread is community. The trust class is carried through retrieval and rendered in the citation.
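A minimal sketch of such a provenance record. The field and class names here are illustrative, not Astute's actual schema:

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum

class TrustClass(Enum):
    CANONICAL = "canonical"   # e.g. an NVD CVE record
    DERIVED = "derived"       # e.g. a vendor writeup summarizing an advisory
    COMMUNITY = "community"   # e.g. a Reddit thread or forum post

@dataclass(frozen=True)
class SourceDoc:
    source_id: str        # (a) stable source identity, e.g. "CVE-2024-12345"
    issuer: str           # (b) who published it, e.g. "NVD"
    issued: date          # (c) issuance date
    freshness_days: int   # (d) freshness signal: age at ingestion time
    trust: TrustClass     # (e) carried through retrieval into the citation

doc = SourceDoc("CVE-2024-12345", "NVD", date(2024, 3, 1), 12, TrustClass.CANONICAL)
```

Making the record frozen means provenance can't be silently mutated downstream of ingestion.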

Layer 2 · Multi-vector retrieval with ATT&CK tags

Documents are embedded into a dense vector index, a sparse BM25 index, and an ATT&CK-technique index. The technique index is what lets an analyst ask “what do we know about T1059.001 in healthcare?” and have the retrieval respect both the technique and the industry.
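The paper doesn't specify how Astute fuses the three indexes; one common, interpretable choice is reciprocal-rank fusion, sketched here with hypothetical per-index results:

```python
from collections import defaultdict

def fuse(rankings, k=60):
    """Reciprocal-rank fusion: merge per-index rankings (lists of doc ids)
    into one ranking, rewarding documents that score well in several indexes."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical per-index results for "what do we know about T1059.001 in healthcare?"
dense  = ["doc7", "doc2", "doc9"]   # dense vector index
sparse = ["doc2", "doc7", "doc4"]   # sparse BM25 index
attack = ["doc2", "doc9"]           # ATT&CK-technique index (T1059.001)

fused = fuse([dense, sparse, attack])  # doc2 ranks first: present in all three
```

A document tagged with the queried technique gets a rank boost from the ATT&CK index without the dense or sparse signals being discarded.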

Layer 3 · Conflict reconciliation

Before the model sees the chunks, a reconciliation pass identifies disagreements between sources on numeric or factual claims (CVSS scores, attribution, exploitation status). The disagreement is preserved through to the output: the analyst sees both numbers and knows which source said which.
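The reconciliation pass can be sketched as grouping attributed claims by field and flagging any field where sources disagree. Field names and values below are illustrative:

```python
def reconcile(claims):
    """claims: iterable of (source, field, value) triples.
    Returns {field: [(source, value), ...]} for every field where
    sources disagree — preserving who said what."""
    by_field = {}
    for source, field, value in claims:
        by_field.setdefault(field, []).append((source, value))
    return {
        field: attributed
        for field, attributed in by_field.items()
        if len({value for _, value in attributed}) > 1
    }

claims = [
    ("NVD", "cvss", 9.8),
    ("vendor", "cvss", 7.4),
    ("NVD", "exploited", True),
    ("vendor", "exploited", True),
]
conflicts = reconcile(claims)  # only "cvss" is flagged, with both values attributed
```

The flagged dict rides along with the chunks into the prompt and out to the rendered answer, so the CVSS disagreement from the bullet above reaches the analyst instead of being averaged away.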

Layer 4 · Span-level citation enforcement

Generation runs against an instruction set that requires every assertion to cite a specific span (not a whole document). The post-generation verifier re-grounds each cited span against its source; uncited assertions are rejected before the answer is returned to the analyst.
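A simplified sketch of the post-generation check, with exact substring containment standing in for the real re-grounding (which would plausibly tolerate paraphrase via a semantic match):

```python
def verify_citations(assertions, corpus):
    """assertions: (claim, source_id, cited_span) triples from generation.
    corpus: {source_id: full source text}.
    Rejects any assertion whose cited span is empty or absent from its source."""
    verified, rejected = [], []
    for claim, source_id, span in assertions:
        if span and span in corpus.get(source_id, ""):
            verified.append((claim, source_id, span))
        else:
            rejected.append((claim, source_id, span))
    return verified, rejected

corpus = {"AA24-XYZ": "Actors exploited CVE-2024-0001 for initial access."}
assertions = [
    ("Initial access via CVE-2024-0001", "AA24-XYZ",
     "exploited CVE-2024-0001 for initial access"),
    ("Ransomware was deployed", "AA24-XYZ", ""),  # uncited assertion: rejected
]
ok, bad = verify_citations(assertions, corpus)
```

Rejected assertions never reach the analyst; in practice they would trigger a regeneration or be dropped from the answer.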

Layer 5 · Empty-retrieval refusal

If retrieval scores are below threshold, Astute does not generate an answer from pretraining. It returns the closest-relevant sources and an explicit “the corpus doesn't cover this” signal.
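The refusal gate is a small amount of code; the discipline is in never bypassing it. A sketch, with the threshold value and stub functions as illustrative assumptions:

```python
def answer_or_refuse(query, retrieve, generate, min_score=0.35):
    """Refuse loudly when top retrieval confidence is below threshold,
    returning the closest sources instead of a generated answer."""
    hits = retrieve(query)  # list of (doc_id, score), best first
    if not hits or hits[0][1] < min_score:
        return {
            "answer": None,
            "refusal": "the corpus doesn't cover this",
            "closest_sources": [doc_id for doc_id, _ in hits[:3]],
        }
    return {"answer": generate(query, hits), "refusal": None, "closest_sources": []}

# Hypothetical stubs for illustration:
weak = lambda q: [("doc-41", 0.12)]     # nothing relevant in the corpus
strong = lambda q: [("doc-7", 0.81)]    # good coverage
gen = lambda q, hits: f"answer grounded in {hits[0][0]}"

refused = answer_or_refuse("obscure question", weak, gen)
covered = answer_or_refuse("covered question", strong, gen)
```

Returning the closest sources alongside the refusal keeps the analyst moving: they can judge for themselves whether the near misses are worth reading.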

03 · Four metrics that matter

Most RAG benchmarks measure retrieval recall and generation BLEU. Neither is a useful proxy for “is this safe to put in front of a SOC analyst.” The four we track in production:

  • Citation faithfulness. Does every asserted claim appear, in substance, in the cited span? Target: ≥ 99.0%.
  • Conflict surfacing. When two sources disagree on a measurable claim, does the answer note the disagreement? Target: ≥ 95% on a curated red-team set.
  • Refusal recall. When the corpus doesn't cover the question, does the system refuse to invent an answer? Target: ≥ 99.5%.
  • Answer time. Median end-to-end latency, not retrieval-only. Target: ≤ 2.5s for analyst flow, ≤ 8s for deep research.

Astute's trailing-quarter numbers across customer deployments: 99.4% / 96.1% / 99.7% / 1.9s.
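One way to aggregate these four numbers from a labeled eval run; the record fields are illustrative, not Astute's actual schema:

```python
from statistics import median

def score_eval(records):
    """records: one dict per eval question, with boolean labels
    (faithful, has_conflict, conflict_surfaced, out_of_corpus, refused)
    and end-to-end latency_s."""
    n = len(records)
    return {
        "citation_faithfulness": sum(r["faithful"] for r in records) / n,
        "conflict_surfacing":
            sum(r["conflict_surfaced"] for r in records if r["has_conflict"])
            / max(1, sum(r["has_conflict"] for r in records)),
        "refusal_recall":
            sum(r["refused"] for r in records if r["out_of_corpus"])
            / max(1, sum(r["out_of_corpus"] for r in records)),
        "median_latency_s": median(r["latency_s"] for r in records),
    }

records = [
    {"faithful": True, "has_conflict": True, "conflict_surfaced": True,
     "out_of_corpus": False, "refused": False, "latency_s": 1.5},
    {"faithful": True, "has_conflict": False, "conflict_surfaced": False,
     "out_of_corpus": True, "refused": True, "latency_s": 2.1},
]
metrics = score_eval(records)
```

Note that conflict surfacing and refusal recall are conditioned on the questions where a conflict or a coverage gap actually exists, which is why they need curated sets rather than raw traffic alone.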

04 · Case study: a 47-source incident

An incident-response team was investigating a credential abuse case with 47 candidate sources — vendor advisories, CISA notes, three social-media posts, and a Twitter thread flagged by a junior analyst. Conventional RAG returned a confident summary that picked the wrong attribution.

Astute returned the same summary with three flagged conflicts: the vendor and CISA disagreed on initial-access technique (T1078 vs T1190), two community sources cited a retracted advisory, and one source was younger than the attacker's known dwell time. The analyst escalated with confidence. The post-incident review credited the flagged conflicts with shaving 11 hours off the investigation.

05 · What not to do

  • Do not skip provenance. A vector store full of unattributed text is malware-in-waiting. If you can't name the issuer of every chunk, you can't audit any answer.
  • Do not rerank with a black box. Cross-encoder rerankers improve retrieval — and obscure the reasoning. For SOC use, prefer rerankers whose scores are interpretable and traceable.
  • Do not eval on synthetic questions. The questions an analyst actually asks are messy, partial, and full of jargon. Evaluate on captured real-traffic questions, redacted.

06 · Closing

Cited RAG is not a feature you bolt on. It is a discipline that runs through ingestion, retrieval, generation, and verification. Done right, it is the difference between “the AI told me” and “the AI showed me where to verify it.”

About the author
Rick Azoy
Chief AI Officer & Chief Information Security Officer, WIT ONE

Rick Azoy is the Chief AI Officer and Chief Information Security Officer at WIT ONE, where he leads the engineering of WIT OS — the Enterprise AI Operating System. He has spent two decades building production cybersecurity, AI, and cloud-operations platforms across regulated industries, with a working focus on agent orchestration, runtime AI security, and sovereign retrieval architectures.

Run this in production?

The architecture in this paper is the same one we run for every WIT ONE customer. Talk to the team about deploying it inside your environment.