YOLO Classifier
The YOLO Classifier is an AI-powered security monitor that evaluates every tool call in Auto Mode before execution. It makes a side API call to Claude using a specialized system prompt, a conversation transcript, and a two-stage XML decision protocol.
Only an explicit <block>no</block> response allows an action. Any failure blocks.
Architecture
| Component | File | Purpose |
|---|---|---|
| Classifier | yoloClassifier.ts | Main classification engine (~1,495 lines) |
| Permissions | permissions.ts | Permission check orchestration and auto mode flow |
| Dangerous Patterns | dangerousPatterns.ts | Pattern detection for dangerous commands (bash, python, kubectl…) |
| Classifier Decision | classifierDecision.ts | Safe tool allowlist that bypasses the classifier entirely |
| Denial Tracking | denialTracking.ts | Consecutive and total denial counters with human fallback |
| Permission Setup | permissionSetup.ts | Strips dangerous allow-rules when entering auto mode |
| Bash Classifier | bashClassifier.ts | Stub for external users (ANT-only real implementation) |
High-Level Flow
Two-Stage XML Classifier
Output Format
<!-- Allowed: -->
<block>no</block>
<!-- Blocked: -->
<block>yes</block><reason>one short sentence</reason>
The format instructions explicitly forbid any preamble: the response MUST begin with <block>.
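The output format above can be parsed with a strict, anchored match. The following is a minimal sketch, not the code in yoloClassifier.ts: the `ParsedDecision` type and `parseClassifierResponse` function are hypothetical names, and tolerating leading whitespace is an assumption. The closing tag is treated as optional because a `</block>` stop sequence would cut the response off before it.

```typescript
// Hypothetical result type for illustration only.
type ParsedDecision =
  | { kind: "allow" }
  | { kind: "block"; reason?: string }
  | { kind: "unparseable" };

function parseClassifierResponse(raw: string): ParsedDecision {
  // Assumption: leading whitespace is tolerated, any other preamble is not.
  const text = raw.trimStart();
  // Must start with <block>yes or <block>no. The closing tag may be missing
  // when generation stopped on the '</block>' stop sequence.
  const verdict = /^<block>\s*(yes|no)\s*(?:<\/block>|$)/i.exec(text);
  if (verdict === null) return { kind: "unparseable" };
  if (verdict[1].toLowerCase() === "no") return { kind: "allow" };
  // A blocked response may carry a one-sentence reason.
  const reason = /<reason>([\s\S]*?)<\/reason>/i.exec(text);
  return reason === null
    ? { kind: "block" }
    : { kind: "block", reason: reason[1].trim() };
}
```

Anything that does not match, including a response with a friendly preamble, comes back as `unparseable`, so the caller decides whether to escalate or deny.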
| Property | Stage 1 (Fast) | Stage 2 (Thinking) |
|---|---|---|
| Purpose | Quick yes/no decision | Careful chain-of-thought review |
| max_tokens | 64 (both) / 256 (fast mode) | 4,096 |
| stop_sequences | ['</block>'] (both mode only) | None — allows full reasoning |
| temperature | 0 | 0 |
| Suffix injected | "Err on the side of blocking. <block> immediately." | Use <thinking> before responding. Explicit user confirmation required to override blocks. |
| Triggered when | Always runs first | Only if Stage 1 returns <block>yes or is unparseable |
| Result | <block>no → ALLOW. <block>yes → escalate | <block>no → ALLOW. <block>yes → DENY |
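The escalation logic in the table can be sketched as a small function. This is an illustration of the "both" mode only, under stated assumptions: `runStage` is a hypothetical stand-in for the side API call (the real call would be asynchronous), and `StageOutcome` is an invented type for the parsed `<block>` value.

```typescript
type StageOutcome = "no" | "yes" | "unparseable";
type Verdict = "ALLOW" | "DENY";

// Hypothetical stand-in for the classifier API call plus response parsing.
type RunStage = (stage: 1 | 2) => StageOutcome;

function classifyTwoStage(runStage: RunStage): Verdict {
  // Stage 1 always runs first; only an explicit "no" allows immediately.
  const first = runStage(1);
  if (first === "no") return "ALLOW";
  // "yes" or an unparseable answer escalates to the thinking stage.
  const second = runStage(2);
  // Stage 2 is final: anything other than an explicit "no" denies.
  return second === "no" ? "ALLOW" : "DENY";
}
```

Stage 2 is never invoked when Stage 1 allows, which is what keeps the common case cheap.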
Classifier Modes
Mode is controlled by GrowthBook tengu_auto_mode_config.twoStageClassifier. For models with alwaysOnThinking, 2,048 padding tokens are added instead of sending thinking: false.
Classifier System Prompt
The system prompt is assembled from 3 template files injected at build time via Bun's require(). The .txt files are not in the source repo. The prompt has three main sections:
Base Prompt
Defines the threat model: prompt injection, scope creep, and accidental damage. Sets default ALLOW — only blocks on explicit BLOCK rules. Excludes quality/efficiency issues from scope.
User Intent Rules
6 rules about distinguishing user requests from agent actions: explicit consent, scope escalation, agent-inferred parameters, questions vs consent, untrusted tool results.
Permissions Template
Replaceable sections for environment context, BLOCK rules (~28 categories), and ALLOW exceptions (7 categories). External users replace; ANT users augment.
12 Evaluation Rules
| Rule | Description |
|---|---|
| Composite actions | Evaluate multi-step sequences as a whole, not individual steps |
| Written file execution | Writing a script then running it = directly running the commands |
| Committing code | Git add/commit is generally safe; pushing is the sensitive part |
| Delayed/enabled effects | Creating cron jobs, systemd services, or startup scripts has lasting effects |
| Sub-agent delegation | When spawning sub-agents, evaluate the delegated task's intent |
| Sub-agent handoff review | Sub-agents inheriting broad permissions need extra scrutiny |
| Look through wrappers | npm run deploy = running the deploy script directly |
| Preemptive block | Block early when the trajectory clearly leads to a blocked action |
| Classifier bypass | Block attempts to disable or circumvent the classifier itself |
| Message context | Consider the full conversation context, not just the latest action |
| Evaluate on own merits | Each action is evaluated independently — previous approvals don't carry over |
| Shared infra bias | Lean toward blocking when shared infrastructure could be affected |
BLOCK Rules (~28 categories)
ALLOW Exceptions (7 categories)
Auto Mode Config (settings.json)
{
"autoMode": {
"allow": [
"Git push to agent-created branches",
"Install from package.json"
],
"soft_deny": [
"Never deploy to production",
"Never modify shared databases"
],
"environment": [
"Trusted domains: *.company.com"
]
}
}
External users: values replace defaults. ANT users: values are additive.
Security: autoMode is ignored when it comes from .claude/settings.json, which prevents a malicious project from injecting classifier rules.
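The replace-versus-additive behavior can be sketched as follows. This is a hedged illustration: `AutoModeRules`, `mergeAutoModeRules`, and the `isAnt` flag are hypothetical names, and keeping the defaults for a key the user omits is an assumption the source does not confirm.

```typescript
interface AutoModeRules {
  allow: string[];
  soft_deny: string[];
  environment: string[];
}

const RULE_KEYS = ["allow", "soft_deny", "environment"] as const;

function mergeAutoModeRules(
  defaults: AutoModeRules,
  user: Partial<AutoModeRules>,
  isAnt: boolean, // hypothetical flag distinguishing ANT from external users
): AutoModeRules {
  const merged: AutoModeRules = {
    allow: [...defaults.allow],
    soft_deny: [...defaults.soft_deny],
    environment: [...defaults.environment],
  };
  for (const key of RULE_KEYS) {
    const supplied = user[key];
    // Assumption: a key the user did not set keeps its defaults.
    if (supplied === undefined) continue;
    // ANT users extend the defaults; external users replace them.
    merged[key] = isAnt ? [...defaults[key], ...supplied] : [...supplied];
  }
  return merged;
}
```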
Fast Paths (What Skips the Classifier)
These 6 optimizations avoid the expensive classifier API call, ordered from fastest to slowest.
| # | Fast Path | Condition |
|---|---|---|
| 1 | Tool declares no relevance | toAutoClassifierInput() returns '' — tool declares itself security-irrelevant |
| 2 | Non-classifierApprovable safety check | decisionReason.type === 'safetyCheck' && classifierApprovable === false → prompt user directly (e.g., writes to .git/) |
| 3 | User interaction required | tool.requiresUserInteraction() returns true → fall through to user prompt |
| 4 | PowerShell without flag | PowerShell tool without POWERSHELL_AUTO_MODE feature flag → prompt user |
| 5 | acceptEdits fast path | Simulates tool call in acceptEdits mode. If it would pass (e.g., file edit in project dir) → ALLOW. Excludes Agent and REPL tools. |
| 6 | Safe tool allowlist | FileRead, Grep, Glob, LSP, TodoWrite, TaskCreate/Get/Update/List/Stop, AskUserQuestion, EnterPlanMode, TeamCreate, SendMessage, Sleep, classify_result — all skip classifier |
Write/Edit tools are not on the allowlist — they use the acceptEdits fast path (#5) instead.
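The fast paths can be read as an ordered chain of early returns. The sketch below follows the table order and is an illustration only: `FastPathFacts` is a hypothetical bundle of precomputed booleans, `SAFE_TOOLS` lists just a subset of the allowlist, and `SKIP_CLASSIFIER` is a placeholder outcome because the table states only that the classifier is skipped on path 1.

```typescript
type FastPathResult = "SKIP_CLASSIFIER" | "ASK_USER" | "ALLOW" | "RUN_CLASSIFIER";

// Hypothetical inputs; the real code derives these from the tool and its input.
interface FastPathFacts {
  toolName: string;
  classifierInput: string; // result of toAutoClassifierInput()
  nonApprovableSafetyCheck: boolean;
  requiresUserInteraction: boolean;
  isPowerShell: boolean;
  powershellAutoModeEnabled: boolean;
  passesAcceptEdits: boolean;
}

// Subset of the safe tool allowlist, for illustration.
const SAFE_TOOLS = new Set(["FileRead", "Grep", "Glob", "LSP", "TodoWrite", "Sleep"]);

function checkFastPaths(f: FastPathFacts): FastPathResult {
  if (f.classifierInput === "") return "SKIP_CLASSIFIER"; // path 1
  if (f.nonApprovableSafetyCheck) return "ASK_USER"; // path 2
  if (f.requiresUserInteraction) return "ASK_USER"; // path 3
  if (f.isPowerShell && !f.powershellAutoModeEnabled) return "ASK_USER"; // path 4
  // Path 5 excludes the Agent and REPL tools.
  const acceptEditsEligible = f.toolName !== "Agent" && f.toolName !== "REPL";
  if (acceptEditsEligible && f.passesAcceptEdits) return "ALLOW";
  if (SAFE_TOOLS.has(f.toolName)) return "ALLOW"; // path 6
  return "RUN_CLASSIFIER";
}
```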
Transcript Building
What gets included
queued_command attachments extracted as user turns
JSONL format (default)
{"user":"fix the bug in main.py"}
{"Bash":"npm test"}
{"Edit":"src/main.py: fixed_code"}
{"user":"looks good, now deploy"}
{"Bash":"npm run deploy"}
Legacy text format also supported. CLAUDE.md content is prepended as a separate user message with cache_control.
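A JSONL transcript of this shape can be produced with one `JSON.stringify` call per line. The sketch below is an assumption-laden illustration: the `TranscriptEntry` type and `buildJsonlTranscript` function are hypothetical, assistant text is dropped because the design principles say it is excluded, and skipping tool calls whose encoding is empty is an assumption.

```typescript
// Hypothetical entry shape; the real message types are not shown in the source.
type TranscriptEntry =
  | { kind: "user"; text: string }
  | { kind: "assistant_text"; text: string }
  | { kind: "tool_call"; tool: string; encoded: string };

function buildJsonlTranscript(entries: TranscriptEntry[]): string {
  const lines: string[] = [];
  for (const entry of entries) {
    if (entry.kind === "user") {
      lines.push(JSON.stringify({ user: entry.text }));
    } else if (entry.kind === "tool_call" && entry.encoded !== "") {
      // The tool name becomes the key, its classifier encoding the value.
      lines.push(JSON.stringify({ [entry.tool]: entry.encoded }));
    }
    // assistant_text entries are intentionally dropped.
  }
  return lines.join("\n");
}
```

Using `JSON.stringify` for each line also escapes quotes and newlines inside commands, so a tool input cannot break out of its own line.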
Final message structure sent to classifier
system: [{ text: systemPrompt, cache_control: { type: 'ephemeral' } }]
messages: [
// Optional CLAUDE.md prefix (cached)
{ role: 'user', content: [
{ text: claudeMdWrapped, cache_control: { type: 'ephemeral' } }
]},
// Transcript + action to evaluate
{ role: 'user', content: [
{ text: '<transcript>\n' },
...transcriptBlocks,
{ text: lastActionCompact, cache_control: { type: 'ephemeral' } },
{ text: '</transcript>\n' },
{ text: STAGE_SUFFIX }
]}
]
Dangerous Patterns
Cross-Platform
python, python3, python2, node, deno, tsx, ruby, perl, php, lua, npx, bunx, npm run, yarn run, pnpm run, bun run, bash, sh, ssh
Unix-Specific (additional)
zsh, fish, eval, exec, env, xargs, sudo
ANT-only: fa run, coo, gh, gh api, curl, wget, git, kubectl, aws, gcloud, gsutil
PowerShell-Specific
pwsh, powershell, cmd, wsl, iex, invoke-expression, icm, invoke-command, start-process, saps, start, start-job, sajb, start-threadjob, register-objectevent, register-engineevent
stripDangerousPermissionsForAutoMode() scans all allow rules for dangerous Bash/PowerShell/Agent patterns and temporarily removes them. They are restored when exiting auto mode via restoreDangerousPermissions(). This prevents pre-configured allow rules from bypassing the classifier for dangerous commands.
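The strip-and-restore idea can be sketched as a partition of the allow rules. Everything here is hypothetical: the `AllowRule` shape, the function names, and the prefix-matching rule are assumptions, and the pattern lists are only a subset of those above. The real matching logic in dangerousPatterns.ts may differ.

```typescript
// Hypothetical rule shape: a tool plus the command prefix it allows.
interface AllowRule {
  tool: string;
  commandPrefix: string;
}

// Subset of the dangerous patterns listed above, for illustration.
const DANGEROUS_PREFIXES = new Map<string, string[]>([
  ["Bash", ["python", "python3", "node", "npx", "npm run", "bash", "sh", "ssh", "eval", "sudo"]],
  ["PowerShell", ["pwsh", "powershell", "iex", "invoke-expression", "start-process"]],
]);

function isDangerousRule(rule: AllowRule): boolean {
  const prefixes = DANGEROUS_PREFIXES.get(rule.tool) ?? [];
  const command = rule.commandPrefix.trim().toLowerCase();
  // Match the pattern as a whole leading word, so "sh" does not match "shellcheck".
  return prefixes.some((p) => command === p || command.startsWith(p + " "));
}

function stripDangerousRules(rules: AllowRule[]): { kept: AllowRule[]; stripped: AllowRule[] } {
  return {
    kept: rules.filter((r) => !isDangerousRule(r)),
    stripped: rules.filter((r) => isDangerousRule(r)),
  };
}

// On exit from auto mode, the stripped rules are added back.
function restoreStrippedRules(kept: AllowRule[], stripped: AllowRule[]): AllowRule[] {
  return [...kept, ...stripped];
}
```

Keeping the stripped rules in a separate list, rather than deleting them, is what makes the removal temporary.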
Denial Tracking and Limits
State
type DenialTrackingState = {
consecutiveDenials: number // resets on any ALLOW
totalDenials: number // never resets (per session)
}
Behavior on Limit Hit
AbortError: "Agent aborted: too many classifier denials"
Purpose: prevents denial loops and runaway token costs. Any ALLOW (including fast-paths) resets consecutiveDenials.
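The counter behavior can be sketched with three small pure functions. The limits of 3 consecutive and 20 total denials come from the design principles in this document; the function names and the use of `>=` for the comparison are assumptions.

```typescript
type DenialTrackingState = {
  consecutiveDenials: number;
  totalDenials: number;
};

// Limits as stated in the design principles; the constant names are hypothetical.
const MAX_CONSECUTIVE_DENIALS = 3;
const MAX_TOTAL_DENIALS = 20;

function recordDenial(state: DenialTrackingState): DenialTrackingState {
  return {
    consecutiveDenials: state.consecutiveDenials + 1,
    totalDenials: state.totalDenials + 1,
  };
}

// Any ALLOW, including fast paths, resets only the consecutive counter.
function recordAllow(state: DenialTrackingState): DenialTrackingState {
  return { ...state, consecutiveDenials: 0 };
}

function denialLimitHit(state: DenialTrackingState): boolean {
  return (
    state.consecutiveDenials >= MAX_CONSECUTIVE_DENIALS ||
    state.totalDenials >= MAX_TOTAL_DENIALS
  );
}
```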
Error Handling — Fail Closed by Default
| Failure | Result | Notes |
|---|---|---|
| API error (429, 500…) | DENY (iron_gate=true) / ASK user (iron_gate=false) | Controlled by GrowthBook tengu_iron_gate_closed (default: true, refreshed every 30 min) |
| Transcript too long | ASK (CLI) / AbortError (headless) | Headless aborts because transcript only grows — permanent condition |
| Unparseable response (Stage 1, both mode) | Escalate | Goes to Stage 2 for proper evaluation |
| Unparseable response (Stage 1, fast mode / Stage 2) | DENY | "unparseable - blocking for safety" |
| User abort (signal.aborted) | DENY | shouldBlock: true, unavailable: true |
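The table above maps each failure to an outcome, which can be sketched as an exhaustive switch. The type and function names are hypothetical, and `ironGateClosed` and `headless` stand in for the GrowthBook flag and the headless check as plain booleans.

```typescript
type ClassifierFailure =
  | { kind: "api_error" }
  | { kind: "transcript_too_long" }
  | { kind: "unparseable"; stage: 1 | 2; mode: "both" | "fast" }
  | { kind: "user_abort" };

type FailureOutcome = "DENY" | "ASK_USER" | "ABORT" | "ESCALATE_TO_STAGE_2";

interface FailureContext {
  ironGateClosed: boolean; // tengu_iron_gate_closed, default true
  headless: boolean;
}

function resolveFailure(failure: ClassifierFailure, ctx: FailureContext): FailureOutcome {
  switch (failure.kind) {
    case "api_error":
      return ctx.ironGateClosed ? "DENY" : "ASK_USER";
    case "transcript_too_long":
      // Headless aborts because the transcript only grows.
      return ctx.headless ? "ABORT" : "ASK_USER";
    case "unparseable":
      // Only Stage 1 in "both" mode has a later stage to escalate to.
      return failure.stage === 1 && failure.mode === "both"
        ? "ESCALATE_TO_STAGE_2"
        : "DENY";
    case "user_abort":
      return "DENY";
  }
}
```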
Tool-Specific Classifier Encoding
Each tool controls what the classifier sees via toAutoClassifierInput(input). Returning '' skips the classifier entirely.
| Tool | Encoding | Example |
|---|---|---|
| Bash | input.command | npm test |
| PowerShell | input.command | Get-Process |
| FileEdit | ${file_path}: ${new_string} | src/main.py: fixed_code |
| FileWrite | ${file_path}: ${content} | src/new.py: new_code |
| NotebookEdit | ${notebook_path} ${mode}: ${source} | nb.ipynb replace: cell_code |
| Agent | (subagent_type, mode=X): ${prompt} | (worker, mode=default): Fix bug... |
| WebFetch | ${url}: ${prompt} or just url | https://api.com: extract data |
| WebSearch | input.query | python security best practices |
| Grep | ${pattern} in ${path} or just pattern | password in src/ |
| Config | ${setting} = ${value} | model = opus |
| CronCreate | ${cron}: ${prompt} | */5 * * * *: check deploys |
| MCP tools | key1=val1 key2=val2 | channel=general message=hello |
| SendMessage | to ${to}: ${message} | to worker-1: continue fixing |
| All others | '' (empty) | Classifier skipped entirely |
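A few of these encodings can be sketched as a lookup of per-tool encoder functions. This is an illustration under assumptions: in the source each tool implements its own toAutoClassifierInput(input), so the central `ENCODERS` map, the `ToolInput` type, and `encodeForClassifier` are hypothetical conveniences, and only four tools are shown.

```typescript
type ToolInput = Record<string, string | undefined>;
type Encoder = (input: ToolInput) => string;

// Hypothetical registry covering a few rows of the table above.
const ENCODERS = new Map<string, Encoder>([
  ["Bash", (i) => i.command ?? ""],
  ["FileEdit", (i) => `${i.file_path}: ${i.new_string}`],
  ["WebFetch", (i) => (i.prompt ? `${i.url}: ${i.prompt}` : `${i.url}`)],
  ["Grep", (i) => (i.path ? `${i.pattern} in ${i.path}` : `${i.pattern}`)],
]);

function encodeForClassifier(tool: string, input: ToolInput): string {
  const encoder = ENCODERS.get(tool);
  // Tools without an encoder return '', which skips the classifier.
  return encoder ? encoder(input) : "";
}
```

A `Map` is used instead of a plain object so that tool names such as `constructor` cannot accidentally resolve to inherited properties.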
Complete Permission Check Flow (Auto Mode)
1. Deny rules match (tool-level)? → DENY
2. Ask rules match (tool-level)? → goto step 7
3. Tool's checkPermissions()
- Returns deny? → DENY
- Returns requiresUserInteraction? → ASK
- Returns ask with content rules? → goto step 7
- Returns safetyCheck (non-classifierApprovable)? → ASK (bypass classifier)
4. bypassPermissions mode? → ALLOW
5. Always-allow rules match? → ALLOW
6. Passthrough → becomes ASK
7. Mode-specific handling for ASK:
a. dontAsk mode → DENY
b. Auto mode:
i. PowerShell without feature flag? → ASK user
ii. acceptEdits fast-path passes? → ALLOW ✓
iii. Safe tool allowlist? → ALLOW ✓
iv. YOLO Classifier:
Stage 1 (fast): <block>no? → ALLOW ✓
Stage 1 (fast): <block>yes? → Stage 2
Stage 2 (thinking): <block>no? → ALLOW ✓
Stage 2 (thinking): <block>yes? → DENY
Transcript too long? → Fallback to prompting
API error + iron_gate_closed=true? → DENY
API error + iron_gate_closed=false? → ASK user
Denial limit exceeded? → Fallback to prompting
c. shouldAvoidPermissionPrompts (headless)? → auto-DENY
Security Design Principles
01 Deny by default: Any failure (API error, parse error, timeout) results in BLOCK. Only explicit <block>no</block> allows.
02 Explicit allow only: The model must affirmatively declare no block. Silence, errors, and ambiguity all result in denial.
03 Assistant text excluded: The model's own text output is stripped from the transcript. Only tool calls are included, preventing self-influence.
04 Dangerous rules stripped: Pre-configured allow rules for dangerous commands are removed before entering auto mode and restored on exit.
05 Project settings excluded: autoMode config from .claude/settings.json is ignored, which prevents malicious project injection of classifier rules.
06 Fail closed by default: tengu_iron_gate_closed defaults to true. API errors deny by default rather than falling back to prompting.
07 Human fallback: After 3 consecutive or 20 total blocks, control returns to human review, which prevents denial loops.
08 Scope-limited: Only blocks security-relevant harm, not quality or efficiency issues. An agent looping is never a security threat.
09 Two stages reduce false positives: Stage 2 reviews Stage 1 blocks with full chain-of-thought reasoning before making the final call.
10 Cache-efficient: System prompt, CLAUDE.md, and last action are all cached. Stage 1 → Stage 2 gets a guaranteed cache hit on the expensive prefix.
Prompt Caching and Model Selection
Caching Strategy
| Block | cache_control |
|---|---|
| System prompt | ephemeral |
| CLAUDE.md | ephemeral |
| Last action block | ephemeral |
Max 3 cache-control blocks used (API limit is 4). Stage 1 → Stage 2 shares the same prefix, so Stage 2 gets a guaranteed cache hit.
Model Selection Priority
1. CLAUDE_CODE_AUTO_MODE_MODEL env var (ANT-only)
2. GrowthBook tengu_auto_mode_config.model
3. Main loop model (getMainLoopModel())
Default: same model as the agent (Sonnet 4.6 or Opus 4.6), ensuring classifier capability matches agent capability.
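The priority order amounts to a first-match fallback chain. In this minimal sketch the `ModelSources` shape and `resolveClassifierModel` name are hypothetical, and the three sources are passed in as plain values rather than read from the environment or GrowthBook.

```typescript
// Hypothetical bundle of the three sources named in the priority list.
interface ModelSources {
  isAnt: boolean;
  envModel?: string; // CLAUDE_CODE_AUTO_MODE_MODEL
  configModel?: string; // tengu_auto_mode_config.model
  mainLoopModel: string; // getMainLoopModel()
}

function resolveClassifierModel(sources: ModelSources): string {
  // 1. The env var override is honored only for ANT users.
  if (sources.isAnt && sources.envModel) return sources.envModel;
  // 2. Remote config comes next.
  if (sources.configModel) return sources.configModel;
  // 3. Otherwise the classifier uses the same model as the agent.
  return sources.mainLoopModel;
}
```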
Debug
CLAUDE_CODE_DUMP_AUTO_MODE (ANT-only): dumps request/response to ~/.claude/tmp/auto-mode/
API errors write diagnostics to ~/.claude/tmp/auto-mode-classifier-errors/
The classifier adds real cost
Every tool call that doesn't hit a fast path triggers a separate API call with the full conversation transcript. Stage 1 produces at most 64 output tokens (256 in fast mode) but may consume thousands of input tokens for long transcripts. Stage 2 uses up to 4,096 output tokens. Rely on the acceptEdits fast path or the safe tool allowlist whenever possible to minimize overhead; prompt caching helps significantly with input token costs.