/ Whitepaper · 2026
Runtime AI Security: A Reference Architecture
MITRE ATLAS-aligned blueprint for securing production AI agents — from prompt to tool to output. The controls, the failure modes, and the patterns that actually work in production.
Most enterprise AI projects ship without a runtime security model. Security checks live in the build pipeline; once an agent is live, it sees raw user input, calls real tools with real credentials, and writes outputs back to systems of record. That is exactly the surface MITRE ATLAS spent two years cataloguing.
This paper specifies a six-layer reference architecture for defending the runtime path of an enterprise AI agent. Each layer is mapped to specific ATLAS techniques, evaluated against measured production traffic, and traced to a control you can ship today. The architecture is the same one we run inside Sentinel, the runtime guardrail layer of WIT OS.
An agent in production has a wider blast radius than any classic web service. It interprets natural language at the edge, writes to a knowledge fabric in the middle, and calls authenticated tools at the back. A classical web app gets attacked at the request layer; an agent gets attacked at the semantic layer.
We anchor on MITRE ATLAS's tactics — initial access, execution, persistence, exfiltration — but the techniques take AI-specific forms: prompt injection (AML.T0051), jailbreak via roleplay (AML.T0054), malicious context insertion through retrieved documents (AML.T0070), tool poisoning, and output exfiltration that reads as legitimate JSON.
The layers are ordered the way runtime traffic actually flows. Each layer can fail closed; each layer is auditable independently of the others.
Layer 1: Identity binding. Every request is bound to an authenticated workspace identity before any model sees a token: SSO/OIDC at the door, a signed identity claim carried through every hop. No identity claim, no agent.
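A minimal sketch of the Layer 1 check, using an HMAC-signed claim purely for illustration; a real deployment verifies OIDC tokens against the IdP's published keys. The key, claim fields, and function names here are assumptions, not a WIT OS API.

```python
# Illustrative Layer 1 sketch: verify a signed identity claim before any
# model sees a token. HMAC stands in for real OIDC/JWKS verification.
import base64, hashlib, hmac, json, time

SIGNING_KEY = b"per-workspace-secret"  # hypothetical; never hard-code in production

def _b64d(s: str) -> bytes:
    # urlsafe base64 with padding restored
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def verify_identity_claim(raw: str) -> dict:
    """Reject before any model sees a token: no identity claim, no agent."""
    try:
        body_b64, sig_b64 = raw.rsplit(".", 1)
        expected = hmac.new(SIGNING_KEY, body_b64.encode(), hashlib.sha256).digest()
        if not hmac.compare_digest(expected, _b64d(sig_b64)):
            raise PermissionError("identity claim signature mismatch")
        claim = json.loads(_b64d(body_b64))
        if claim["exp"] < time.time():
            raise PermissionError("identity claim expired")
        return claim  # carried, still signed, through every downstream hop
    except (ValueError, KeyError) as exc:
        raise PermissionError("malformed or missing identity claim") from exc
```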
Layer 2: Prompt firewall. The user's input is classified by intent, scope, and sensitivity. Known jailbreak patterns are rejected. Inputs that would invoke high-risk tools are routed to a stricter policy lane. This is where direct prompt injection stops.
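A minimal sketch of the Layer 2 lanes, assuming a regex denylist plus a high-risk-tool router. The pattern list, tool names, and `screen_prompt` function are illustrative; a production firewall puts an ML intent classifier in front of rules like these, and tool-intent extraction is assumed to happen upstream.

```python
# Illustrative Layer 2 sketch: reject known jailbreak patterns, route
# high-risk tool invocations to a stricter policy lane.
import re
from dataclasses import dataclass

JAILBREAK_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"\byou are now in developer mode\b", re.I),  # example pattern only
]
HIGH_RISK_TOOLS = {"wire_transfer", "delete_records"}  # hypothetical tool names

@dataclass
class Verdict:
    action: str  # "reject" | "strict_lane" | "default_lane"
    reason: str

def screen_prompt(text: str, requested_tools: set[str]) -> Verdict:
    for pattern in JAILBREAK_PATTERNS:
        if pattern.search(text):
            return Verdict("reject", f"known jailbreak pattern: {pattern.pattern}")
    if requested_tools & HIGH_RISK_TOOLS:
        return Verdict("strict_lane", "input would invoke a high-risk tool")
    return Verdict("default_lane", "no findings")
```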
Layer 3: Retrieval inspection. Every retrieved chunk is scanned for embedded instructions, policy-attribute mismatches, and source provenance. A chunk with no provenance is treated as untrusted. A chunk that contains the string “ignore previous instructions” is flagged.
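A minimal sketch of the three Layer 3 checks, assuming each chunk carries `source`, `text`, and `labels` fields; those field names and the instruction pattern are assumptions for illustration.

```python
# Illustrative Layer 3 sketch: provenance, embedded-instruction, and
# policy-attribute checks on every retrieved chunk.
import re

EMBEDDED_INSTRUCTION = re.compile(
    r"ignore (all )?previous instructions|disregard the system prompt", re.I
)

def inspect_chunk(chunk: dict, workspace_labels: set[str]) -> list[str]:
    findings = []
    if not chunk.get("source"):
        findings.append("no provenance: treat as untrusted")
    if EMBEDDED_INSTRUCTION.search(chunk.get("text", "")):
        findings.append("embedded instruction detected")
    if not set(chunk.get("labels", [])) <= workspace_labels:
        findings.append("policy-attribute mismatch with requesting workspace")
    return findings  # empty list means the chunk may enter the context window
```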
Layer 4: Tool policy enforcement. Tool invocations pass through a policy decision point that checks: (a) is the user authorized to invoke this tool, (b) is the workspace authorized for these arguments, and (c) does the argument shape conform to the published schema. The default policy is deny.
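A minimal sketch of the policy decision point, default deny on every branch. The registry shape and the jsonschema check are assumptions, not Sentinel's actual API.

```python
# Illustrative Layer 4 sketch: deny unless (a) role, (b) workspace, and
# (c) argument schema all pass.
from jsonschema import ValidationError, validate  # pip install jsonschema

TOOL_REGISTRY = {  # hypothetical published tool schemas
    "create_ticket": {
        "allowed_roles": {"support_agent"},
        "allowed_workspaces": {"acme-prod"},
        "schema": {
            "type": "object",
            "properties": {"title": {"type": "string"}},
            "required": ["title"],
            "additionalProperties": False,  # undocumented arguments are denied
        },
    },
}

def authorize_tool_call(roles: set[str], workspace: str, tool: str, args: dict) -> bool:
    entry = TOOL_REGISTRY.get(tool)
    if entry is None:
        return False                                     # default deny: unknown tool
    if not roles & entry["allowed_roles"]:
        return False                                     # (a) user not authorized
    if workspace not in entry["allowed_workspaces"]:
        return False                                     # (b) workspace not authorized
    try:
        validate(instance=args, schema=entry["schema"])  # (c) argument shape
    except ValidationError:
        return False
    return True
```

The `additionalProperties: False` line is what makes "default deny on undocumented arguments" mechanical rather than a review-time convention.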
Layer 5: Output validation. Before the agent's response leaves the perimeter, it is scanned for PII, secrets, IP, and policy-violating content. Redaction is preferred to outright blocking, but every redaction is logged.
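A minimal sketch of redact-and-log. The two patterns are illustrative stand-ins for a full PII/secrets detector, and the logger name is hypothetical.

```python
# Illustrative Layer 5 sketch: redact rather than block, and log every
# redaction so the audit fabric sees it.
import logging
import re

log = logging.getLogger("sentinel.output")  # hypothetical logger name

REDACTORS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "aws_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}

def validate_output(text: str) -> str:
    for label, pattern in REDACTORS.items():
        text, count = pattern.subn(f"[REDACTED:{label}]", text)
        if count:
            # every redaction is logged, never silently applied
            log.warning("redacted %d %s occurrence(s)", count, label)
    return text
```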
Layer 6: Audit fabric. Every action — user, agent, tool, output — is signed, hashed, and written to an append-only audit fabric. The audit fabric is what an auditor walks; it is also what the detection-as-code pipeline reads to catch behavioral drift.
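A minimal sketch of the append-only property as a hash chain; a real deployment signs each record and writes to WORM storage rather than an in-memory list, and the record fields here are assumptions.

```python
# Illustrative Layer 6 sketch: hash-chained, append-only audit records.
# verify() is the walk an auditor (or the detection pipeline) performs.
import hashlib, json, time

GENESIS = "0" * 64

class AuditFabric:
    def __init__(self):
        self._records: list[dict] = []
        self._prev_hash = GENESIS

    def append(self, actor: str, action: str, payload: dict) -> dict:
        record = {
            "ts": time.time(), "actor": actor, "action": action,
            "payload": payload, "prev": self._prev_hash,
        }
        record["hash"] = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()
        ).hexdigest()
        self._prev_hash = record["hash"]
        self._records.append(record)  # append-only: no update or delete path
        return record

    def verify(self) -> bool:
        """Recompute the chain end to end; any tampered record breaks it."""
        prev = GENESIS
        for rec in self._records:
            body = {k: v for k, v in rec.items() if k != "hash"}
            digest = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()
            ).hexdigest()
            if rec["prev"] != prev or rec["hash"] != digest:
                return False
            prev = rec["hash"]
        return True
```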
The reference architecture is testable: each layer has an explicit set of ATLAS techniques it is designed to counter, and each layer is exercised against a published red-team battery before it ships.
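A minimal sketch of what "exercised against a published red-team battery" can look like in practice, using pytest. The battery file format and the `screen_prompt` import (the Layer 2 sketch above) are assumptions, not the published battery's actual shape.

```python
# Illustrative test harness: every battery case must produce the expected
# verdict before the layer ships.
import json
import pytest
from prompt_firewall import screen_prompt  # hypothetical module holding the Layer 2 sketch

with open("redteam_battery.json") as f:    # hypothetical battery file
    BATTERY = json.load(f)                 # e.g. [{"prompt": "...", "expected": "reject"}]

@pytest.mark.parametrize("case", BATTERY, ids=lambda c: c["prompt"][:40])
def test_layer_blocks_battery_case(case):
    verdict = screen_prompt(case["prompt"], requested_tools=set())
    assert verdict.action == case["expected"], verdict.reason
```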
Across 14 customer deployments and a trailing six months of traffic, the prompt firewall (Layer 2) carries the heaviest load: 71% of all blocked or redacted actions originate there. The retrieval inspection layer (Layer 3) catches the most novel attacks; attackers rarely send raw injection in a prompt today, but they will plant it in a wiki page the agent retrieves.
The most expensive class of incident is not blocked prompts — it is missed retrievals. A poisoned chunk that passes Layer 3 will steer the agent for the entire conversation. Detection-as-code at Layer 6 closes the loop by flagging session-level behavior the layers above did not catch in flight.
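A minimal sketch of a session-level detection-as-code rule over the audit fabric: flag sessions where a retrieval was followed by an unusual burst of tool calls. The threshold, field names, and rule logic are assumptions chosen for illustration, not a shipped Sentinel detection.

```python
# Illustrative detection-as-code rule: catch in a session what the in-flight
# layers missed, by reading the audit fabric after the fact.
from collections import Counter

TOOL_CALL_BURST = 10  # hypothetical per-session threshold

def flag_sessions(audit_records: list[dict]) -> set[str]:
    """audit_records: fabric records with 'session' and 'action' fields."""
    calls = Counter(
        r["session"] for r in audit_records if r["action"] == "tool_call"
    )
    retrievals = {r["session"] for r in audit_records if r["action"] == "retrieval"}
    return {
        session
        for session, n in calls.items()
        if n >= TOOL_CALL_BURST and session in retrievals
    }
```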
Phase 1: Bind every agent action to an SSO identity. Stand up the audit fabric. Turn on output validation in flag-only mode to map the surface area before you start blocking. This alone resolves the “we don't know what our agents did” problem most security teams have today.
Phase 2: Bring up Layer 2 with a starter ruleset (the published ATLAS-derived set is a good baseline) and a tenant-specific overlay. Move every tool behind a policy decision point; default deny on undocumented arguments.
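A minimal sketch of layering a tenant overlay on the starter ruleset; the rule shape is illustrative, not the published ATLAS-derived format.

```python
# Illustrative ruleset layering: the tenant overlay wins on conflict, and a
# base rule can only be overridden to another explicit action, never dropped.
BASE_RULESET = {
    "jailbreak_roleplay": {"action": "reject"},
    "high_risk_tools": {"action": "strict_lane"},
}

def apply_overlay(base: dict, overlay: dict) -> dict:
    merged = {name: dict(rule) for name, rule in base.items()}
    for name, rule in overlay.items():
        merged.setdefault(name, {}).update(rule)
    return merged

# e.g. a tenant that wants high-risk tool invocations rejected outright:
ruleset = apply_overlay(BASE_RULESET, {"high_risk_tools": {"action": "reject"}})
```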
Phase 3: Add provenance and instruction scanning to every retrieval hop. Wire the audit fabric into your SIEM. Run the red-team battery weekly; promote any net-new finding into detection-as-code.
Three patterns we've watched fail repeatedly:
Runtime AI security is a six-layer engineering problem with a one-line organizational ask: someone has to own it. In the customers we run inside WIT OS, that owner is usually the CISO; in others it's the CTO. Either way works. The wrong answer is “we'll figure it out later.”
This architecture is built into Sentinel and ESOS today. We publish it openly so that even if you choose to build it yourself, you start from the same threat model we do.
The architecture in this paper is the same one we run for every WIT ONE customer. Talk to the team about deploying it inside your environment.