/ Whitepaper · 2026
Runtime AI Security: A Reference Architecture
MITRE ATLAS-aligned blueprint for securing production AI agents — from prompt to tool to output. The controls, the failure modes, and the patterns that actually work in production.
Most enterprise AI projects ship without a runtime security model. Security checks live in the build pipeline; once an agent is live, it sees raw user input, calls real tools with real credentials, and writes outputs back to systems of record. That is exactly the surface MITRE ATLAS spent two years cataloguing.
This paper specifies a six-layer reference architecture for defending the runtime path of an enterprise AI agent. Each layer is mapped to specific ATLAS techniques, evaluated against measured production traffic, and traced to a control you can ship today. The architecture is the same one we run inside Sentinel, the runtime guardrail layer of WIT OS.
An agent in production has a wider blast radius than any classic web service. It interprets natural language at the edge, writes to a knowledge fabric in the middle, and calls authenticated tools at the back. A classical web app gets attacked at the request layer; an agent gets attacked at the semantic layer.
We anchor on MITRE ATLAS's tactics — initial access, execution, persistence, exfiltration — but the techniques take AI-specific forms: prompt injection (AML.T0051), jailbreak via roleplay (AML.T0054), malicious context insertion through retrieved documents (AML.T0070), tool poisoning, and output exfiltration that reads as legitimate JSON.
The layers are ordered the way runtime traffic actually flows. Each layer can fail closed; each layer is auditable independently of the others.
Layer 1: Identity binding. Every request is bound to an authenticated workspace identity before any model sees a token: SSO/OIDC at the door, a signed identity claim carried through every hop. No identity claim, no agent.
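A minimal sketch of the Layer 1 check, using an HMAC-signed claim purely for illustration; a real deployment verifies OIDC tokens against the IdP's published keys. The key, claim fields, and function names here are assumptions, not a WIT OS API.

```python
# Illustrative Layer 1 sketch: verify a signed identity claim before any
# model sees a token. HMAC stands in for real OIDC/JWKS verification.
import base64, hashlib, hmac, json, time

SIGNING_KEY = b"per-workspace-secret"  # hypothetical; never hard-code in production

def _b64d(s: str) -> bytes:
    # urlsafe base64 with padding restored
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def verify_identity_claim(raw: str) -> dict:
    """Reject before any model sees a token: no identity claim, no agent."""
    try:
        body_b64, sig_b64 = raw.rsplit(".", 1)
        expected = hmac.new(SIGNING_KEY, body_b64.encode(), hashlib.sha256).digest()
        if not hmac.compare_digest(expected, _b64d(sig_b64)):
            raise PermissionError("identity claim signature mismatch")
        claim = json.loads(_b64d(body_b64))
        if claim["exp"] < time.time():
            raise PermissionError("identity claim expired")
        return claim  # carried, still signed, through every downstream hop
    except (ValueError, KeyError) as exc:
        raise PermissionError("malformed or missing identity claim") from exc
```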
Layer 2: Prompt firewall. The user's input is classified by intent, scope, and sensitivity. Known jailbreak patterns are rejected. Inputs that would invoke high-risk tools are routed to a stricter policy lane. This is where direct prompt injection stops.
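A minimal sketch of the Layer 2 lanes, assuming a regex denylist plus a high-risk-tool router. The pattern list, tool names, and `screen_prompt` function are illustrative; a production firewall puts an ML intent classifier in front of rules like these, and tool-intent extraction is assumed to happen upstream.

```python
# Illustrative Layer 2 sketch: reject known jailbreak patterns, route
# high-risk tool invocations to a stricter policy lane.
import re
from dataclasses import dataclass

JAILBREAK_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"\byou are now in developer mode\b", re.I),  # example pattern only
]
HIGH_RISK_TOOLS = {"wire_transfer", "delete_records"}  # hypothetical tool names

@dataclass
class Verdict:
    action: str  # "reject" | "strict_lane" | "default_lane"
    reason: str

def screen_prompt(text: str, requested_tools: set[str]) -> Verdict:
    for pattern in JAILBREAK_PATTERNS:
        if pattern.search(text):
            return Verdict("reject", f"known jailbreak pattern: {pattern.pattern}")
    if requested_tools & HIGH_RISK_TOOLS:
        return Verdict("strict_lane", "input would invoke a high-risk tool")
    return Verdict("default_lane", "no findings")
```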
Layer 3: Retrieval inspection. Every retrieved chunk is scanned for embedded instructions, policy-attribute mismatches, and source provenance. A chunk with no provenance is treated as untrusted. A chunk that contains the string “ignore previous instructions” is flagged.
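A minimal sketch of the three Layer 3 checks, assuming each chunk carries `source`, `text`, and `labels` fields; those field names and the instruction pattern are assumptions for illustration.

```python
# Illustrative Layer 3 sketch: provenance, embedded-instruction, and
# policy-attribute checks on every retrieved chunk.
import re

EMBEDDED_INSTRUCTION = re.compile(
    r"ignore (all )?previous instructions|disregard the system prompt", re.I
)

def inspect_chunk(chunk: dict, workspace_labels: set[str]) -> list[str]:
    findings = []
    if not chunk.get("source"):
        findings.append("no provenance: treat as untrusted")
    if EMBEDDED_INSTRUCTION.search(chunk.get("text", "")):
        findings.append("embedded instruction detected")
    if not set(chunk.get("labels", [])) <= workspace_labels:
        findings.append("policy-attribute mismatch with requesting workspace")
    return findings  # empty list means the chunk may enter the context window
```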
Layer 4: Tool policy enforcement. Tool invocations pass through a policy decision point that checks: (a) is the user authorized to invoke this tool, (b) is the workspace authorized for these arguments, and (c) does the argument shape conform to the published schema. The default policy is deny.
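A minimal sketch of the policy decision point, default deny on every branch. The registry shape and the jsonschema check are assumptions, not Sentinel's actual API.

```python
# Illustrative Layer 4 sketch: deny unless (a) role, (b) workspace, and
# (c) argument schema all pass.
from jsonschema import ValidationError, validate  # pip install jsonschema

TOOL_REGISTRY = {  # hypothetical published tool schemas
    "create_ticket": {
        "allowed_roles": {"support_agent"},
        "allowed_workspaces": {"acme-prod"},
        "schema": {
            "type": "object",
            "properties": {"title": {"type": "string"}},
            "required": ["title"],
            "additionalProperties": False,  # undocumented arguments are denied
        },
    },
}

def authorize_tool_call(roles: set[str], workspace: str, tool: str, args: dict) -> bool:
    entry = TOOL_REGISTRY.get(tool)
    if entry is None:
        return False                                     # default deny: unknown tool
    if not roles & entry["allowed_roles"]:
        return False                                     # (a) user not authorized
    if workspace not in entry["allowed_workspaces"]:
        return False                                     # (b) workspace not authorized
    try:
        validate(instance=args, schema=entry["schema"])  # (c) argument shape
    except ValidationError:
        return False
    return True
```

The `additionalProperties: False` line is what makes "default deny on undocumented arguments" mechanical rather than a review-time convention.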
Layer 5: Output validation. Before the agent's response leaves the perimeter, it is scanned for PII, secrets, IP, and policy-violating content. Redaction is preferred to outright blocking, but every redaction is logged.
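A minimal sketch of redact-and-log. The two patterns are illustrative stand-ins for a full PII/secrets detector, and the logger name is hypothetical.

```python
# Illustrative Layer 5 sketch: redact rather than block, and log every
# redaction so the audit fabric sees it.
import logging
import re

log = logging.getLogger("sentinel.output")  # hypothetical logger name

REDACTORS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "aws_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}

def validate_output(text: str) -> str:
    for label, pattern in REDACTORS.items():
        text, count = pattern.subn(f"[REDACTED:{label}]", text)
        if count:
            # every redaction is logged, never silently applied
            log.warning("redacted %d %s occurrence(s)", count, label)
    return text
```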
Layer 6: Audit fabric. Every action — user, agent, tool, output — is signed, hashed, and written to an append-only audit fabric. The audit fabric is what an auditor walks; it is also what the detection-as-code pipeline reads to catch behavioral drift.
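A minimal sketch of the append-only property as a hash chain; a real deployment signs each record and writes to WORM storage rather than an in-memory list, and the record fields here are assumptions.

```python
# Illustrative Layer 6 sketch: hash-chained, append-only audit records.
# verify() is the walk an auditor (or the detection pipeline) performs.
import hashlib, json, time

GENESIS = "0" * 64

class AuditFabric:
    def __init__(self):
        self._records: list[dict] = []
        self._prev_hash = GENESIS

    def append(self, actor: str, action: str, payload: dict) -> dict:
        record = {
            "ts": time.time(), "actor": actor, "action": action,
            "payload": payload, "prev": self._prev_hash,
        }
        record["hash"] = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()
        ).hexdigest()
        self._prev_hash = record["hash"]
        self._records.append(record)  # append-only: no update or delete path
        return record

    def verify(self) -> bool:
        """Recompute the chain end to end; any tampered record breaks it."""
        prev = GENESIS
        for rec in self._records:
            body = {k: v for k, v in rec.items() if k != "hash"}
            digest = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()
            ).hexdigest()
            if rec["prev"] != prev or rec["hash"] != digest:
                return False
            prev = rec["hash"]
        return True
```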
The reference architecture is testable: each layer has an explicit set of ATLAS techniques it is designed to counter, and each layer is exercised against a published red-team battery before it ships.
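A minimal sketch of what "exercised against a published red-team battery" can look like in practice, using pytest. The battery file format and the `screen_prompt` import (the Layer 2 sketch above) are assumptions, not the published battery's actual shape.

```python
# Illustrative test harness: every battery case must produce the expected
# verdict before the layer ships.
import json
import pytest
from prompt_firewall import screen_prompt  # hypothetical module holding the Layer 2 sketch

with open("redteam_battery.json") as f:    # hypothetical battery file
    BATTERY = json.load(f)                 # e.g. [{"prompt": "...", "expected": "reject"}]

@pytest.mark.parametrize("case", BATTERY, ids=lambda c: c["prompt"][:40])
def test_layer_blocks_battery_case(case):
    verdict = screen_prompt(case["prompt"], requested_tools=set())
    assert verdict.action == case["expected"], verdict.reason
```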
Across 14 customer deployments and a trailing six months of traffic, the prompt firewall (Layer 2) carries the heaviest load: 71% of all blocked or redacted actions originate there. The retrieval inspection layer (Layer 3) catches the most novel attacks; attackers rarely send raw injection in a prompt today, but they will plant it in a wiki page the agent retrieves.
The most expensive class of incident is not blocked prompts — it is missed retrievals. A poisoned chunk that passes Layer 3 will steer the agent for the entire conversation. Detection-as-code at Layer 6 closes the loop by flagging session-level behavior the layers above did not catch in flight.
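A minimal sketch of a session-level detection-as-code rule over the audit fabric: flag sessions where a retrieval was followed by an unusual burst of tool calls. The threshold, field names, and rule logic are assumptions chosen for illustration, not a shipped Sentinel detection.

```python
# Illustrative detection-as-code rule: catch in a session what the in-flight
# layers missed, by reading the audit fabric after the fact.
from collections import Counter

TOOL_CALL_BURST = 10  # hypothetical per-session threshold

def flag_sessions(audit_records: list[dict]) -> set[str]:
    """audit_records: fabric records with 'session' and 'action' fields."""
    calls = Counter(
        r["session"] for r in audit_records if r["action"] == "tool_call"
    )
    retrievals = {r["session"] for r in audit_records if r["action"] == "retrieval"}
    return {
        session
        for session, n in calls.items()
        if n >= TOOL_CALL_BURST and session in retrievals
    }
```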
Phase 1: Bind every agent action to an SSO identity. Stand up the audit fabric. Turn on output validation in flag-only mode to map the surface area before you start blocking. This alone resolves the “we don't know what our agents did” problem most security teams have today.
Phase 2: Bring up Layer 2 with a starter ruleset (the published ATLAS-derived set is a good baseline) and a tenant-specific overlay. Move every tool behind a policy decision point; default deny on undocumented arguments.
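A minimal sketch of layering a tenant overlay on the starter ruleset; the rule shape is illustrative, not the published ATLAS-derived format.

```python
# Illustrative ruleset layering: the tenant overlay wins on conflict, and a
# base rule can only be overridden to another explicit action, never dropped.
BASE_RULESET = {
    "jailbreak_roleplay": {"action": "reject"},
    "high_risk_tools": {"action": "strict_lane"},
}

def apply_overlay(base: dict, overlay: dict) -> dict:
    merged = {name: dict(rule) for name, rule in base.items()}
    for name, rule in overlay.items():
        merged.setdefault(name, {}).update(rule)
    return merged

# e.g. a tenant that wants high-risk tool invocations rejected outright:
ruleset = apply_overlay(BASE_RULESET, {"high_risk_tools": {"action": "reject"}})
```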
Phase 3: Add provenance and instruction scanning to every retrieval hop. Wire the audit fabric into your SIEM. Run the red-team battery weekly; promote any net-new finding into detection-as-code.
Three patterns we've watched fail repeatedly:
Runtime AI security is a six-layer engineering problem with a one-line organizational ask: someone has to own it. In the customers we run inside WIT OS, that owner is usually the CISO; in others it's the CTO. Either way works. The wrong answer is “we'll figure it out later.”
This architecture is built into Sentinel and ESOS today. We publish it openly so that even if you choose to build it yourself, you start from the same threat model we do.
The architecture in this paper is the same one we run for every WIT ONE customer. Talk to the team about deploying it inside your environment.