// investigation guide

LLM prompt injection — methodology

llm prompt injection is not ai agent runaway and not insider threat. the failure locus is the input stream — user prompt · retrieved RAG chunk · MCP tool result · uploaded attachment — not autonomous replanning on a benign deploy and not a human exfiltrating without a model in the path. evidence is the attempt log, the matched pattern cluster, the indirect-injection carrier artifact, and the guardrail bypass score — heuristic counts, not verdicts. see ai-agent-runaway vs llm-prompt-injection for the bright-line carveout.

intake — what this looks like in the wild

the alert is rarely labeled prompt injection. it arrives as a copilot answer that cites a poisoned internal doc, a chatbot that ignored its system prompt after a pdf upload, a red-team eval export showing a jailbreak template cluster, or an mcp session where a tool result contained imperative override text the model executed.

typical shapes:

  • RAG-poisoned doc made the model answer in-character as an attacker — the user asked a normal question; the retrieved chunk carried the jailbreak.
  • jailbreak template cluster surfaced in red-team eval — DAN · roleplay bypass · system override strings in attempt logs.
  • indirect injection from pdf or email body the user uploaded — hidden delimiter · html comment block · zero-width run in the carrier artifact.
  • MCP tool returning text the model executed as instructions — bridges llm-prompt-injection and api-agentic-action.
  • system-prompt-exfil attempt with structured output — reveal your system message · print hidden instructions.
  • distinct from ai agent runaway: the original deploy prompt was bounded. you see matched injection patterns or adversarial carriers — not prompt-vs-action divergence on a benign instruction with no jailbreak in the log.
  • distinct from insider threat: no human actor exfiltrating via file share without a model interpreting adversarial input. the model is in the path; the manipulation is in what it read.
  • distinct from generic malware: no binary payload required — text in context is the weapon.

preservation — what to collect first

stop serving from the suspect retrieval source before you analyze. every query against a poisoned vector index or kb article re-injects the carrier into live sessions. quarantine the doc · disable the chunk · pause the mcp tool — then preserve.

artifact | volatility | time to loss
full multi-turn transcript (system prompt + tool definitions + user/assistant turns) | rolling buffer | hours to days; export before session expires
attempt log export | rolling | vendor retention varies; see below
carrier artifact (poisoned doc · retrieved chunk · mcp tool result · email body) | persistent if saved | overwritten if kb re-indexes or doc is edited; snapshot hash now
guardrail evaluation scores + moderation flags | rolling | often 30–90 days; export with attempt log
vendor portal red-team / safety logs (anthropic · openai · google · azure) | vendor-side · rolling | retention windows are vendor-controlled; export immediately and do not assume parity across tenants
model version + safety-policy snapshot at incident time | persistent if logged | policy updates roll forward; capture version string from export metadata
rag retrieval log (chunk id · source uri · similarity rank) | rolling | often 7–30 days unless forwarded to siem

attempt-log retention is vendor-side and rolling — anthropic enterprise · openai api usage · google vertex ai · azure openai content filter logs each expose different fields and windows. fatcousin tools run on what you export; they cannot pull vendor logs for you.

the first 10 minutes

  1. stop serving from suspect retrieval source — quarantine doc · disable chunk · pause mcp tool.
  2. export full multi-turn transcript with system prompt and tool definitions — not chat ui scrollback alone.
  3. preserve the carrier artifact — the doc · chunk · tool result that contained the injection.
  4. export attempt log for the session id — llm-prompt-injection-attempt-log-forensic-analyzer input shape.
  5. export guardrail evaluation scores and moderation flags for the same window.
  6. pull vendor portal red-team / safety logs if the deployment exposes them.
  7. snapshot model id + safety-policy version string from export metadata.
  8. hash every file at collection time — sha-256 before any edit or re-index.
  9. notify platform owner + counsel — injection incidents touch data classification and breach notification.
  10. begin the path below on frozen exports — files never leave your device in fatcousin tools.
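
step 8 above can be sketched as a small manifest builder. this is a minimal sketch under stated assumptions, not a fatcousin tool: the function names and the manifest field names are hypothetical, and only the python standard library is used. it hashes every file under an export directory before anything is edited or re-indexed.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large exports do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def build_manifest(export_dir: Path) -> list[dict]:
    """One row per file: relative path, size, sha-256, collection time (UTC)."""
    # Field names below are assumptions for this sketch, not a vendor schema.
    collected_at = datetime.now(timezone.utc).isoformat()
    rows = []
    for path in sorted(p for p in export_dir.rglob("*") if p.is_file()):
        rows.append({
            "file": str(path.relative_to(export_dir)),
            "bytes": path.stat().st_size,
            "sha256": sha256_file(path),
            "collected_at_utc": collected_at,
        })
    return rows
```

calling build_manifest(Path("frozen_exports")) on a directory of frozen exports (the directory name is a placeholder) returns rows that can be written into the preservation log. keep the manifest outside the hashed directory so it does not change the next run.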

analysis — the path (llm-prompt-injection vertical)

the spine follows the llm-prompt-injection vertical: attempt log → uploaded doc → indirect carrier → mcp tool result → rag retrieval → jailbreak transcript → guardrail score. when mcp is in the path, cross-read the api-agentic-action vertical for tool-call context — but the primary question here is what adversarial text entered the model, not whether the agent autonomously replanned. merge exports with fatcousin-multi-tool-super-timeline-correlator when you have multiple vendor formats.

  1. llm prompt injection attempt log forensic analyzer

    drop vendor attempt-log export: csv · json · jsonl. enumerates matched injection patterns across user turns, model responses, and safety flags. this is the spine: every other step confirms or narrows the carrier.

    why first: prompt injection is defined by adversarial input in the log. you need matched-pattern enumeration before chasing tool calls or retrieval metadata.

    honest limit: today the engine uses the shared rule-scan template (vendor-fidelity: template-misfit), emitting one finding per matched row or template hit. high counts are expected, not a defect. structured user_turn / matched_rule parsing is on the rebuild track.

  2. prompt injection attempt detector in uploaded doc

    drop the file the user uploaded: pdf text extract · docx · html · markdown. scans for DAN-style jailbreak phrasing, hidden delimiters, zero-width runs, and html comment injection blocks.

    why second: when the attack vector is a user attachment, the carrier lives in the doc, not in the attempt log alone.

    honest limit: document scan emits one finding per matched rule occurrence (capped at 100 per file). a single poisoned pdf can produce dozens of rows. that is the engine shape for pattern-match tools, not over-detection.

  3. indirect prompt injection document artifact detector

    drop the server-side carrier: email body export · pdf in a shared drive · poisoned kb article. flags imperative override text embedded where the model will read it as context, not as user intent.

    why third: indirect injection arrives through content the platform ingested (retrieved chunks, forwarded email, synced docs) before any user typed a jailbreak.

    honest limit: same rule-scan family as step 1, so template-misfit on vendor-native chunk metadata. counts scale with matched templates, not with confirmed model compliance.

  4. mcp prompt injection via tool result detector

    drop mcp tool-result export or ndjson where a server-side tool returned imperative text the model then treated as instructions: ignore previous rules · call this api · exfil the session token.

    why fourth: mcp bridges the llm-prompt-injection vertical and the api-agentic-action vertical. the attack is still input-stream manipulation, but the carrier is a tool payload, not a user message.

    honest limit: today the engine does not yet read tool_call_id or result_payload structurally (vendor-fidelity: template-misfit). regex hits on result text can multiply, so cross-check with mcp server logs when available.

  5. rag prompt injection via retrieved doc detector

    drop retrieval log or vector-db export showing which chunk was injected into context. surfaces poisoned kb entries, similarity-ranked injection blocks, and retrieval-time override language.

    why fifth: this step is rag-specific. the model never saw the attacker type a jailbreak; it saw a retrieved chunk that contained one.

    honest limit: chunk_id · retrieval_rank · source_uri are not faithfully parsed yet. rule-scan findings on retrieved text can be high, so pair them with the raw chunk artifact from step 3.

  6. llm jailbreak conversation artifact detector

    drop chatgpt conversations.json · claude export · generic prompt log. multi-turn jailbreak templates (DAN · roleplay bypass · system override · recursive injection) are matched per message.

    why sixth: chat-mode incidents need turn-by-turn jailbreak clustering, not a single api attempt row.

    honest limit: multiple rules can fire on one message; one rule can fire on many messages. pattern match proves the adversarial text was present, not that the model complied or that a human attacker typed it.

  7. llm guardrail bypass score anomaly detector

    drop moderation or safety-score export with blocked_reason · confidence_score · policy_id fields where the vendor provides them. flags score spikes and bypass anomalies aligned to injection turns.

    why last: quantitative guardrail signal corroborates pattern hits after you already know what text was in the input stream.

    honest limit: vendor-fidelity is template-misfit; confidence_score and policy_id are not structurally evaluated today. treat score rows as corroboration, not standalone proof of injection.
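
the rule-scan shape described in the steps above (one finding per matched rule occurrence, capped per file) can be sketched as follows. this is a minimal illustration, not the fatcousin engine: the four rules, the Finding record, and the scan_text name are assumptions made for the sketch, and a real rule set would be far larger.

```python
import re
from dataclasses import dataclass

# Illustrative rules only: assumptions for this sketch,
# not the rule set any fatcousin engine ships.
RULES = {
    "system_override": re.compile(
        r"ignore (?:all )?(?:previous|prior) (?:rules|instructions)", re.IGNORECASE
    ),
    "system_prompt_exfil": re.compile(
        r"reveal your system message|print hidden instructions", re.IGNORECASE
    ),
    "html_comment_block": re.compile(r"<!--.*?-->", re.DOTALL),
    # Two or more consecutive zero-width characters (ZWSP, ZWNJ, ZWJ, word joiner, BOM).
    "zero_width_run": re.compile("[\u200b\u200c\u200d\u2060\ufeff]{2,}"),
}

MAX_FINDINGS = 100  # mirrors the per-file cap described in step 2


@dataclass(frozen=True)
class Finding:
    rule: str
    offset: int
    excerpt: str


def scan_text(text: str, max_findings: int = MAX_FINDINGS) -> list[Finding]:
    """Emit one finding per matched rule occurrence, ordered by offset."""
    findings = []
    for rule, pattern in RULES.items():
        for match in pattern.finditer(text):
            # Escaping keeps zero-width characters visible in the excerpt.
            excerpt = match.group(0)[:80].encode("unicode_escape").decode("ascii")
            findings.append(Finding(rule, match.start(), excerpt))
    findings.sort(key=lambda f: (f.offset, f.rule))
    return findings[:max_findings]
```

a single html comment that carries an override phrase produces one row for the comment and one row for each phrase inside it, which is why high counts on one carrier are expected. a hit shows the text was present in the input; it says nothing about whether the model complied.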

common false leads

  • the model gave a wrong answer so it must be jailbroken — hallucination and injection are different failures. you need matched adversarial patterns or a poisoned carrier, not a bad fact.
  • high finding count means the tool is broken — rule-scan engines emit one row per matched template. that is expected shape, not over-detection.
  • tool calls happened so this is agent runaway — check attempt logs first. injection can drive tool calls; runaway is scope creep with no adversarial input in the log.
  • a human uploaded the doc so this is insider threat — uploading is not exfil. the question is whether adversarial text in the upload manipulated model behavior.
  • guardrail blocked the request so no incident — partial bypass · score anomaly · subsequent turn compliance still matter. export the full multi-turn transcript.
  • RAG retrieved benign docs — the poison may be one chunk in a large index. preserve retrieval rank and chunk text, not just the user question.
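
for the last false lead, a short sketch of keeping rank and chunk text together while searching a retrieval log. assumptions are loud here: the log is jsonl, the chunk body lives in a field called text, and the single override pattern is illustrative. chunk_id, source_uri, and retrieval_rank follow the field names used in step 5, but real vendor exports may name them differently.

```python
import hashlib
import json
import re

# Illustrative pattern only; a real scan would use a broader rule set.
OVERRIDE = re.compile(
    r"ignore (?:all )?(?:previous|prior) (?:rules|instructions)|system override",
    re.IGNORECASE,
)


def suspicious_chunks(jsonl_lines):
    """Yield rank, chunk id, source and text hash for chunks with override language."""
    for line_number, line in enumerate(jsonl_lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        record = json.loads(line)
        text = record.get("text", "")  # "text" is an assumed field name
        match = OVERRIDE.search(text)
        if match is None:
            continue
        yield {
            "log_line": line_number,
            "chunk_id": record.get("chunk_id"),
            "source_uri": record.get("source_uri"),
            "retrieval_rank": record.get("retrieval_rank"),
            "text_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
            "matched": match.group(0),
        }
```

the point of the design is that a low-ranked chunk is still reported with its rank and a hash of its text, so the poisoned entry can be tied back to the preserved carrier even when the top results were benign.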

reporting — what the report says · what it does not claim

use case-report-generator or the case binder on the case-type page to assemble a deterministic html/pdf package with sha-256 of every input. the report should read as a vector-and-carrier timeline, not a model psychology essay.

the report should state:

  • vector — user turn · uploaded doc · retrieved chunk · mcp tool result · email body
  • carrier artifact — file name, hash, and excerpt of imperative override text
  • pattern cluster — matched rule kinds (DAN · roleplay bypass · system override · hidden delimiter · etc.)
  • model behavior — what the assistant output after the carrier entered context (quoted, not paraphrased)
  • guardrail evasion — score anomaly rows and whether moderation flagged or missed the turn
  • attempt count — raw matched-pattern rows from each tool, with engine version noted
  • session id · model version · safety-policy snapshot from export metadata
  • which fatcousin tool produced each finding row
  • sha-256 of each export in the preservation log
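
the list above reads naturally as a record schema. the sketch below covers a subset of those fields. the class and field names are hypothetical and are not the case-report-generator format. sorted keys keep the serialized output stable for the same inputs.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FindingRow:
    tool: str            # which tool produced the row
    engine_version: str  # noted alongside every count
    rule_kind: str       # e.g. "system override", "hidden delimiter"
    excerpt: str         # quoted, not paraphrased


@dataclass
class InjectionReport:
    vector: str                  # user turn, uploaded doc, retrieved chunk, ...
    carrier_file: str
    carrier_sha256: str
    session_id: str
    model_version: str
    safety_policy_version: str
    model_output_quote: str      # what the assistant output, verbatim
    findings: list[FindingRow] = field(default_factory=list)

    def attempt_count(self) -> dict[str, int]:
        """Raw matched-pattern rows per tool: counts, not verdicts."""
        counts: dict[str, int] = {}
        for row in self.findings:
            counts[row.tool] = counts.get(row.tool, 0) + 1
        return counts

    def to_json(self) -> str:
        # sort_keys keeps the output stable for identical inputs.
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

keeping the tool name and engine version on every finding row means the attempt count can always be traced to the engine that produced it, which matters when a rule-scan engine is later rebuilt and counts change.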

the report must not claim:

  • intent of the attacker beyond what the artifact shows — pattern match is not attribution
  • model consciousness · sentience · or internal reasoning — you have transcript text, not weights
  • whether the model decided to comply — compliance is inferred from output, not proven from logits
  • complete enumeration of all injection variants — heuristic rules miss novel templates
  • vendor-native fidelity for any engine the vendor-fidelity audit marks template-misfit; cite honest limits inline
  • chain-of-custody admissibility — fatcousin is a local triage workbench, not records management software

handing it off

  • platform / ml ops: quarantined carrier, retrieval index diff, mcp tool allowlist, model + policy version rollback steps.
  • security / ir: attempt log, pattern cluster export, session id, guardrail score table.
  • privacy / counsel: data categories exposed in model output, notification timeline, preservation memo with hashes.
  • vendor support: session id, model version, export timestamps — for anthropic · openai · google · microsoft ticket escalation.

court — declaration outline · expert-witness language

no plug-and-play citation snippet on this page — unlike bec, prompt injection declarations vary too much by vendor, carrier type, and jurisdiction. below is an outline counsel can adapt. this is not legal advice.

  • qualifications — dfir practice, llm deployment familiarity, tools used
  • preservation order — what was quarantined, when, and sha-256 of each export at collection
  • matched pattern artifact — rule kinds, excerpts, and source file for each carrier
  • transcript with system prompt — full multi-turn export including tool definitions
  • model-version + safety-policy snapshot — version strings from export metadata at incident time
  • guardrail bypass score table — moderation flags aligned to injection turns
  • methods — local browser analysis, deterministic tools, no upload of evidence to fatcousin servers
  • limits — heuristic pattern match, template-misfit engines, no live vendor portal access, no logit inspection
  • conclusion — adversarial text entered the input stream via [vector]; matched patterns [kinds]; model output [quoted behavior]; counts are evidence rows, not verdicts on intent

further reading

reference investigation

flagship synthetic fixture for this case type is in progress on the atlas track. until it ships, run the seven primary tools on your own frozen exports and compare counts against vendor-fidelity audits on each tool page — expect high row counts from rule-scan engines.

case playbook: case type tools · vertical: llm-prompt-injection · compare: ai-agent-runaway vs llm-prompt-injection · stack: llm-prompt-injection-integrity-kit
