YOLO Classifier

The YOLO Classifier is an AI-powered security monitor that evaluates tool calls in Auto Mode before execution, unless a permission rule or fast path has already settled the decision. It makes a side API call to Claude using a specialized system prompt, a conversation transcript, and a two-stage XML decision protocol.

2 classifier stages · ~28 block rule categories · 6 fast paths · 10 security principles

Side-query, not inline reasoning

The classifier is a separate API call to Claude, not part of the agent's own reasoning. By default it runs on the same model as the agent and has a dedicated system prompt built from 3 template files injected at build time. Only an explicit <block>no</block> response allows an action. Failures block by default; the Error Handling section lists the cases that fall back to a user prompt instead.

Architecture

Component	File	Purpose
Classifier	yoloClassifier.ts	Main classification engine (~1,495 lines)
Permissions	permissions.ts	Permission check orchestration and auto mode flow
Dangerous Patterns	dangerousPatterns.ts	Pattern detection for dangerous commands (bash, python, kubectl…)
Classifier Decision	classifierDecision.ts	Safe tool allowlist that bypasses the classifier entirely
Denial Tracking	denialTracking.ts	Consecutive and total denial counters with human fallback
Permission Setup	permissionSetup.ts	Strips dangerous allow-rules when entering auto mode
Bash Classifier	bashClassifier.ts	Stub for external users (ANT-only real implementation)

High-Level Flow

Step 1

Always-deny rules?

→ DENY immediately

Step 2

Always-allow rules?

→ ALLOW immediately

Step 3

Fast paths pass?

→ ALLOW (no API call)

Step 4

2-stage XML classifier

→ ALLOW or DENY

Two-Stage XML Classifier

Output Format

<!-- Allowed: -->
<block>no</block>

<!-- Blocked: -->
<block>yes</block><reason>one short sentence</reason>

The format instructions explicitly forbid any preamble: the response MUST begin with <block>.
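A minimal sketch of how a response in this format could be parsed. The helper name parseBlockResponse and the Verdict shape are hypothetical and not taken from yoloClassifier.ts. The sketch tolerates a missing closing tag, since Stage 1 in both mode stops at </block>, and treats anything that does not start with <block> as unparseable.

```typescript
// Hypothetical parser for the classifier's XML verdict; names are illustrative.
type Verdict =
  | { kind: "allow" }
  | { kind: "block"; reason: string }
  | { kind: "unparseable" };

function parseBlockResponse(raw: string): Verdict {
  // The format forbids preamble, so the text must start with <block>.
  const verdict = /^<block>(yes|no)\b/.exec(raw.trim());
  if (verdict === null) return { kind: "unparseable" };
  if (verdict[1] === "no") return { kind: "allow" };
  // The reason may be absent when a stop sequence cuts the response short.
  const reason = /<reason>([\s\S]*?)<\/reason>/.exec(raw);
  return { kind: "block", reason: (reason?.[1] ?? "").trim() };
}
```

Returning a distinct unparseable case, rather than folding it into a block, lets the caller decide whether to escalate or deny.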

Property	Stage 1 (Fast)	Stage 2 (Thinking)
Purpose	Quick yes/no decision	Careful chain-of-thought review
max_tokens	64 (both) / 256 (fast mode)	4,096
stop_sequences	`['</block>']` (both mode only)	None — allows full reasoning
temperature	0	0
Suffix injected	"Err on the side of blocking. <block> immediately."	Use <thinking> before responding. Explicit user confirmation required to override blocks.
Triggered when	Always runs first	Only if Stage 1 returns <block>yes or is unparseable
Result	<block>no → ALLOW. <block>yes → escalate	<block>no → ALLOW. <block>yes → DENY

Classifier Modes

both (default)

Stage 1 first; Stage 2 only if Stage 1 blocks or returns an unparseable response. Best balance of cost vs accuracy.

fast

Stage 1 only (max_tokens=256, no stop_sequences). Fastest and cheapest.

thinking

Stage 2 only (skips Stage 1). Most thorough, highest latency and cost.

Mode is controlled by GrowthBook tengu_auto_mode_config.twoStageClassifier. For models with alwaysOnThinking, 2,048 padding tokens are added instead of sending thinking: false.
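The three modes can be sketched as one small driver. This is an illustration under stated assumptions, not the code in yoloClassifier.ts: callModel is a hypothetical, synchronous stand-in for the real API call (which would be asynchronous), and the type names are invented. The token limits and stop sequence come from the stage table above.

```typescript
// Illustrative two-stage driver; callModel is a stand-in for the real API call.
type Mode = "both" | "fast" | "thinking";
type StageResult = "no" | "yes" | "unparseable";
type StageParams = { maxTokens: number; stopSequences?: string[]; temperature: number };
type CallModel = (stage: 1 | 2, params: StageParams) => StageResult;

function classify(mode: Mode, callModel: CallModel): "ALLOW" | "DENY" {
  if (mode !== "thinking") {
    const params: StageParams =
      mode === "both"
        ? { maxTokens: 64, stopSequences: ["</block>"], temperature: 0 }
        : { maxTokens: 256, temperature: 0 };
    const stage1 = callModel(1, params);
    if (stage1 === "no") return "ALLOW";
    // Fast mode has no second stage: a block or a parse failure denies.
    if (mode === "fast") return "DENY";
  }
  // Stage 2 runs after a Stage 1 block or parse failure (both mode),
  // or directly (thinking mode). Its verdict is final.
  const stage2 = callModel(2, { maxTokens: 4096, temperature: 0 });
  return stage2 === "no" ? "ALLOW" : "DENY";
}
```

In both mode the cheap first stage can only allow, so every denial has been through the slower chain-of-thought review.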

Classifier System Prompt

The system prompt is assembled from 3 template files injected at build time via Bun's require(). The .txt files are not in the source repo. The prompt has three main sections:

Base Prompt

Defines the threat model: prompt injection, scope creep, and accidental damage. Sets default ALLOW — only blocks on explicit BLOCK rules. Excludes quality/efficiency issues from scope.

User Intent Rules

6 rules about distinguishing user requests from agent actions: explicit consent, scope escalation, agent-inferred parameters, questions vs consent, untrusted tool results.

Permissions Template

Replaceable sections for environment context, BLOCK rules (~28 categories), and ALLOW exceptions (7 categories). External users replace; ANT users augment.

12 Evaluation Rules

Rule	Description
Composite actions	Evaluate multi-step sequences as a whole, not individual steps
Written file execution	Writing a script then running it = directly running the commands
Committing code	Git add/commit is generally safe; pushing is the sensitive part
Delayed/enabled effects	Creating cron jobs, systemd services, or startup scripts has lasting effects
Sub-agent delegation	When spawning sub-agents, evaluate the delegated task's intent
Sub-agent handoff review	Sub-agents inheriting broad permissions need extra scrutiny
Look through wrappers	`npm run deploy` = running the deploy script directly
Preemptive block	Block early when the trajectory clearly leads to a blocked action
Classifier bypass	Block attempts to disable or circumvent the classifier itself
Message context	Consider the full conversation context, not just the latest action
Evaluate on own merits	Each action is evaluated independently — previous approvals don't carry over
Shared infra bias	Lean toward blocking when shared infrastructure could be affected

BLOCK Rules (~28 categories)

✕ Git destructive (force push, delete remote branches)

✕ Git push to default branch (main/master)

✕ Download and execute code from untrusted URLs

✕ Cloud storage mass delete (S3, GCS)

✕ Production deploy without explicit user request

✕ Remote shell writes (SSH commands)

✕ Blind apply patches/diffs without review

✕ Logging/audit tampering

✕ Permission escalation or admin account creation

✕ TLS/auth weakening

✕ Security feature disabling

✕ Spawn agents with broad permissions

✕ Interfere with other users' processes

✕ Modify shared resources (configs, databases, queues)

✕ Irreversible local destruction (rm -rf, drop tables)

✕ Create RCE surfaces (reverse shells, web shells)

✕ Expose local services to 0.0.0.0 or tunnels

✕ Credential leakage (print/log/send secrets)

✕ Credential store browsing beyond task needs

✕ Data exfiltration to external services

✕ Unauthorized persistence (cron jobs, systemd services)

✕ Self-modification of agent code or config

✕ External API writes without user request

✕ Content integrity violations (impersonate users)

✕ Real-world transactions (purchases, emails, posts)

ALLOW Exceptions (7 categories)

✓ Test artifacts (fixtures, mock data)

✓ Local operations within project directory

✓ Read-only operations (ls, cat, git status)

✓ Declared dependencies (package.json, requirements.txt)

✓ Toolchain bootstrap (dev tool setup)

✓ Standard credentials (configured API keys for the task)

✓ Git push to feature branches (not main/master)

Auto Mode Config (settings.json)

{
  "autoMode": {
    "allow": [
      "Git push to agent-created branches",
      "Install from package.json"
    ],
    "soft_deny": [
      "Never deploy to production",
      "Never modify shared databases"
    ],
    "environment": [
      "Trusted domains: *.company.com"
    ]
  }
}

External users: values replace defaults. ANT users: values are additive.

Security: autoMode settings in the project-level .claude/settings.json are ignored, which prevents a malicious project from injecting classifier rules.
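The replace-versus-additive behavior can be sketched as follows. The function resolveRules, the AutoModeRules type, and the per-section fallback to defaults are assumptions made for illustration; the source only states that external values replace defaults and ANT values are additive.

```typescript
// Hypothetical merge of user-supplied autoMode rules with built-in defaults.
type AutoModeRules = { allow: string[]; soft_deny: string[]; environment: string[] };

function resolveRules(
  defaults: AutoModeRules,
  user: Partial<AutoModeRules>,
  isAnt: boolean,
): AutoModeRules {
  const pick = (key: keyof AutoModeRules): string[] => {
    const custom = user[key];
    // Assumption: a section the user did not set keeps its defaults.
    if (custom === undefined) return defaults[key];
    // ANT users augment the defaults; external users replace them.
    return isAnt ? [...defaults[key], ...custom] : custom;
  };
  return {
    allow: pick("allow"),
    soft_deny: pick("soft_deny"),
    environment: pick("environment"),
  };
}
```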

Fast Paths (What Skips the Classifier)

These 6 optimizations avoid the expensive classifier API call, ordered from fastest to slowest.

#	Fast Path	Condition
1	Tool declares no relevance	`toAutoClassifierInput()` returns `''` — tool declares itself security-irrelevant
2	Non-classifierApprovable safety check	`decisionReason.type === 'safetyCheck' && classifierApprovable === false` → prompt user directly (e.g., writes to `.git/`)
3	User interaction required	`tool.requiresUserInteraction()` returns true → fall through to user prompt
4	PowerShell without flag	PowerShell tool without `POWERSHELL_AUTO_MODE` feature flag → prompt user
5	acceptEdits fast path	Simulates tool call in acceptEdits mode. If it would pass (e.g., file edit in project dir) → ALLOW. Excludes `Agent` and `REPL` tools.
6	Safe tool allowlist	FileRead, Grep, Glob, LSP, TodoWrite, TaskCreate/Get/Update/List/Stop, AskUserQuestion, EnterPlanMode, TeamCreate, SendMessage, Sleep, classify_result — all skip classifier

Write/Edit tools are not on the allowlist — they use the acceptEdits fast path (#5) instead.
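A rough sketch of the six checks in table order. The ToolFacts shape and the fastPath function are hypothetical, SAFE_TOOLS is only a subset of the allowlist above, and the ALLOW outcome for path 1 is an assumption (the table says only that the classifier is skipped).

```typescript
// Facts about a pending tool call, precomputed by the caller (illustrative shape).
type ToolFacts = {
  name: string;
  classifierInput: string; // result of toAutoClassifierInput()
  nonApprovableSafetyCheck: boolean;
  requiresUserInteraction: boolean;
  isPowerShell: boolean;
  powershellFlagEnabled: boolean;
  passesAcceptEdits: boolean; // outcome of the simulated acceptEdits check
};

// Subset of the safe tool allowlist, for illustration only.
const SAFE_TOOLS = new Set(["FileRead", "Grep", "Glob", "LSP", "TodoWrite", "Sleep"]);

function fastPath(t: ToolFacts): "ALLOW" | "ASK" | "CLASSIFY" {
  if (t.classifierInput === "") return "ALLOW"; // 1 (assumed outcome)
  if (t.nonApprovableSafetyCheck) return "ASK"; // 2
  if (t.requiresUserInteraction) return "ASK"; // 3
  if (t.isPowerShell && !t.powershellFlagEnabled) return "ASK"; // 4
  // 5: Agent and REPL are excluded from the acceptEdits fast path.
  const acceptEditsEligible = t.name !== "Agent" && t.name !== "REPL";
  if (acceptEditsEligible && t.passesAcceptEdits) return "ALLOW";
  if (SAFE_TOOLS.has(t.name)) return "ALLOW"; // 6
  return "CLASSIFY"; // no fast path applied: make the classifier API call
}
```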

Transcript Building

What gets included

✓ User messages: text blocks only (no images, no tool results)

✓ Assistant messages: tool_use blocks only (what the model DID)

✓ Queued commands: queued_command attachments extracted as user turns

✕ Assistant text: excluded — model-authored text could influence the classifier

JSONL format (default)

{"user":"fix the bug in main.py"}
{"Bash":"npm test"}
{"Edit":"src/main.py: fixed_code"}
{"user":"looks good, now deploy"}
{"Bash":"npm run deploy"}

Legacy text format also supported. CLAUDE.md content is prepended as a separate user message with cache_control.
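The inclusion rules and the JSONL shape can be sketched together. The Block and Message types below are simplified stand-ins assumed for illustration, and buildTranscript is a hypothetical name; queued commands and the CLAUDE.md prefix are left out.

```typescript
// Simplified message shapes, assumed for illustration only.
type Block =
  | { type: "text"; text: string }
  | { type: "image" }
  | { type: "tool_result"; content: string }
  | { type: "tool_use"; name: string; classifierInput: string };
type Message = { role: "user" | "assistant"; content: Block[] };

function buildTranscript(messages: Message[]): string {
  const lines: string[] = [];
  for (const msg of messages) {
    for (const block of msg.content) {
      if (msg.role === "user" && block.type === "text") {
        lines.push(JSON.stringify({ user: block.text }));
      } else if (msg.role === "assistant" && block.type === "tool_use") {
        // Only what the model did is recorded; its prose is dropped.
        lines.push(JSON.stringify({ [block.name]: block.classifierInput }));
      }
      // Images, tool results, and assistant text fall through and are excluded.
    }
  }
  return lines.join("\n");
}
```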

Final message structure sent to classifier

system: [{ text: systemPrompt, cache_control: { type: 'ephemeral' } }]
messages: [
  // Optional CLAUDE.md prefix (cached)
  { role: 'user', content: [
    { text: claudeMdWrapped, cache_control: { type: 'ephemeral' } }
  ]},
  // Transcript + action to evaluate
  { role: 'user', content: [
    { text: '<transcript>\n' },
    ...transcriptBlocks,
    { text: lastActionCompact, cache_control: { type: 'ephemeral' } },
    { text: '</transcript>\n' },
    { text: STAGE_SUFFIX }
  ]}
]

Dangerous Patterns

Cross-Platform

python, python3, python2
node, deno, tsx
ruby, perl, php, lua
npx, bunx
npm run, yarn run
pnpm run, bun run
bash, sh, ssh

Unix-Specific (additional)

zsh, fish
eval, exec, env
xargs, sudo
--- ANT-only ---
fa run, coo, gh
gh api, curl, wget
git, kubectl
aws, gcloud, gsutil

PowerShell-Specific

pwsh, powershell, cmd
wsl, iex
invoke-expression, icm
invoke-command
start-process, saps
start, start-job, sajb
start-threadjob
register-objectevent
register-engineevent

Dangerous rules stripped on auto mode entry

When entering auto mode, stripDangerousPermissionsForAutoMode() scans all allow rules for dangerous Bash/PowerShell/Agent patterns and temporarily removes them. They are restored when exiting auto mode via restoreDangerousPermissions(). This prevents pre-configured allow rules from bypassing the classifier for dangerous commands.
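A minimal sketch of the strip-and-restore idea. Everything here is an assumption for illustration: the rule syntax "Bash(<command prefix>)", the helper names, and the short pattern list (a subset of the cross-platform list above). It is not the logic of stripDangerousPermissionsForAutoMode().

```typescript
// Subset of the cross-platform patterns listed above, for illustration.
const DANGEROUS_COMMANDS = ["python", "python3", "node", "bash", "sh", "ssh", "npx", "npm run"];

// Assumed rule syntax: "Bash(<command prefix>)", e.g. "Bash(python:*)".
function isDangerousAllowRule(rule: string): boolean {
  const m = /^Bash\((.*)\)$/.exec(rule);
  if (m === null) return false;
  const command = (m[1] ?? "").trim();
  return DANGEROUS_COMMANDS.some(
    (p) => command === p || command.startsWith(p + " ") || command.startsWith(p + ":"),
  );
}

type SplitRules = { active: string[]; stashed: string[] };

// On entering auto mode: keep safe rules active, stash the dangerous ones.
function stripForAutoMode(allowRules: string[]): SplitRules {
  return {
    active: allowRules.filter((r) => !isDangerousAllowRule(r)),
    stashed: allowRules.filter((r) => isDangerousAllowRule(r)),
  };
}

// On leaving auto mode: put the stashed rules back (original order is not preserved).
function restoreAfterAutoMode(split: SplitRules): string[] {
  return [...split.active, ...split.stashed];
}
```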

Denial Tracking and Limits

State

type DenialTrackingState = {
  consecutiveDenials: number  // resets on any ALLOW
  totalDenials: number        // never resets (per session)
}

Max consecutive: 3

Max total per session: 20

Behavior on Limit Hit

CLI: Fall back to interactive prompting — human reviews and decides

Headless: AbortError: "Agent aborted: too many classifier denials"

Purpose: prevents denial loops and runaway token costs. Any ALLOW (including fast-paths) resets consecutiveDenials.
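The counters can be sketched as pure state transitions. The function names are hypothetical, and treating the limit as reached at exactly 3 consecutive or 20 total denials (>= rather than >) is an assumption based on the limits stated above.

```typescript
type DenialTrackingState = {
  consecutiveDenials: number; // resets on any ALLOW
  totalDenials: number; // never resets (per session)
};

const MAX_CONSECUTIVE = 3;
const MAX_TOTAL = 20;

function recordDenial(s: DenialTrackingState): DenialTrackingState {
  return { consecutiveDenials: s.consecutiveDenials + 1, totalDenials: s.totalDenials + 1 };
}

// Any ALLOW, including one from a fast path, clears the consecutive counter.
function recordAllow(s: DenialTrackingState): DenialTrackingState {
  return { ...s, consecutiveDenials: 0 };
}

// Assumption: the limit is hit when a counter reaches its maximum.
function denialLimitHit(s: DenialTrackingState): boolean {
  return s.consecutiveDenials >= MAX_CONSECUTIVE || s.totalDenials >= MAX_TOTAL;
}
```

For example, two denials, one allow, and another denial leave the state at 1 consecutive and 3 total, so no limit is hit.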

Error Handling — Fail Closed by Default

Failure	Result	Notes
API error (429, 500…)	DENY (iron_gate=true) / ASK user (iron_gate=false)	Controlled by GrowthBook `tengu_iron_gate_closed` (default: true, refreshed every 30 min)
Transcript too long	ASK (CLI) / `AbortError` (headless)	Headless aborts because transcript only grows — permanent condition
Unparseable response (Stage 1, both mode)	Escalate	Goes to Stage 2 for proper evaluation
Unparseable response (Stage 1, fast mode / Stage 2)	DENY	"unparseable - blocking for safety"
User abort (signal.aborted)	DENY	`shouldBlock: true, unavailable: true`
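The table above reads naturally as a single mapping from failure to outcome. The sketch below uses invented type and function names; only the outcomes themselves come from the table.

```typescript
type Failure =
  | "api_error"
  | "transcript_too_long"
  | "unparseable_stage1"
  | "unparseable_stage2"
  | "user_abort";
type FailureContext = {
  ironGateClosed: boolean; // GrowthBook tengu_iron_gate_closed
  headless: boolean;
  mode: "both" | "fast" | "thinking";
};
type Outcome = "DENY" | "ASK" | "ABORT" | "ESCALATE_TO_STAGE_2";

function onClassifierFailure(failure: Failure, ctx: FailureContext): Outcome {
  switch (failure) {
    case "api_error":
      return ctx.ironGateClosed ? "DENY" : "ASK";
    case "transcript_too_long":
      // Headless aborts because the transcript only grows.
      return ctx.headless ? "ABORT" : "ASK";
    case "unparseable_stage1":
      return ctx.mode === "both" ? "ESCALATE_TO_STAGE_2" : "DENY";
    case "unparseable_stage2":
      return "DENY";
    case "user_abort":
      return "DENY";
  }
}
```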

Tool-Specific Classifier Encoding

Each tool controls what the classifier sees via toAutoClassifierInput(input). Returning '' skips the classifier entirely.

Tool	Encoding	Example
Bash	`input.command`	npm test
PowerShell	`input.command`	Get-Process
FileEdit	`${file_path}: ${new_string}`	src/main.py: fixed_code
FileWrite	`${file_path}: ${content}`	src/new.py: new_code
NotebookEdit	`${notebook_path} ${mode}: ${source}`	nb.ipynb replace: cell_code
Agent	`(subagent_type, mode=X): ${prompt}`	(worker, mode=default): Fix bug...
WebFetch	`${url}: ${prompt}` or just url	https://api.com: extract data
WebSearch	`input.query`	python security best practices
Grep	`${pattern} in ${path}` or just pattern	password in src/
Config	`${setting} = ${value}`	model = opus
CronCreate	`${cron}: ${prompt}`	*/5 * * * *: check deploys
MCP tools	`key1=val1 key2=val2`	channel=general message=hello
SendMessage	`to ${to}: ${message}`	to worker-1: continue fixing
All others	`''` (empty)	Classifier skipped entirely
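A few of these encodings written out as plain functions, to show the shape of what the classifier sees. The function names and input types are assumptions for illustration; the real tools implement toAutoClassifierInput(input) themselves.

```typescript
// Illustrative encoders mirroring rows of the table above.
const encodeBash = (input: { command: string }): string => input.command;

const encodeFileEdit = (input: { file_path: string; new_string: string }): string =>
  `${input.file_path}: ${input.new_string}`;

// WebFetch falls back to the bare URL when no prompt is given.
const encodeWebFetch = (input: { url: string; prompt?: string }): string =>
  input.prompt ? `${input.url}: ${input.prompt}` : input.url;

// MCP tools flatten their arguments into space-separated key=value pairs.
const encodeMcp = (input: Record<string, string>): string =>
  Object.entries(input)
    .map(([key, value]) => `${key}=${value}`)
    .join(" ");

// Tools with nothing security-relevant to report return the empty string.
const encodeIrrelevant = (): string => "";
```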

Complete Permission Check Flow (Auto Mode)

1. Deny rules match (tool-level)?                    → DENY
2. Ask rules match (tool-level)?                     → goto step 7
3. Tool's checkPermissions()
   - Returns deny?                                   → DENY
   - Returns requiresUserInteraction?                → ASK
   - Returns ask with content rules?                 → goto step 7
   - Returns safetyCheck (non-classifierApprovable)? → ASK (bypass classifier)
4. bypassPermissions mode?                           → ALLOW
5. Always-allow rules match?                         → ALLOW
6. Passthrough                                       → becomes ASK

7. Mode-specific handling for ASK:
   a. dontAsk mode                                   → DENY
   b. Auto mode:
      i.   PowerShell without feature flag?          → ASK user
      ii.  acceptEdits fast-path passes?             → ALLOW ✓
      iii. Safe tool allowlist?                      → ALLOW ✓
      iv.  YOLO Classifier:
           Stage 1 (fast): <block>no?                → ALLOW ✓
           Stage 1 (fast): <block>yes?               → Stage 2
           Stage 2 (thinking): <block>no?            → ALLOW ✓
           Stage 2 (thinking): <block>yes?           → DENY
           Transcript too long?                      → Fallback to prompting
           API error + iron_gate_closed=true?        → DENY
           API error + iron_gate_closed=false?       → ASK user
           Denial limit exceeded?                    → Fallback to prompting
   c. shouldAvoidPermissionPrompts (headless)?       → auto-DENY
 
 
Security Design Principles

01 Deny by default

Failures (API error, parse error, timeout) result in BLOCK by default. Only an explicit <block>no</block> allows.

02 Explicit allow only

The model must affirmatively declare no block. Silence, errors, and ambiguity all result in denial.

03 Assistant text excluded

The model's own text output is stripped from the transcript; only tool calls are included, which prevents self-influence.

04 Dangerous rules stripped

Pre-configured allow rules for dangerous commands are removed before entering auto mode and restored on exit.

05 Project settings excluded

autoMode config from .claude/settings.json is ignored, which prevents malicious project injection of classifier rules.

06 Fail closed by default

tengu_iron_gate_closed defaults to true. API errors deny by default rather than falling back to prompting.

07 Human fallback

After 3 consecutive or 20 total blocks, control returns to human review, which prevents denial loops.

08 Scope-limited

Only blocks security-relevant harm, not quality or efficiency issues. An agent looping is never a security threat.

09 Two stages reduce false positives

Stage 2 reviews Stage 1 blocks with full chain-of-thought reasoning before making the final call.

10 Cache-efficient

System prompt, CLAUDE.md, and last action are all cached. Stage 1 → Stage 2 gets a guaranteed cache hit on the expensive prefix.
 
 
Prompt Caching and Model Selection

Caching Strategy

Block	cache_control
System prompt	ephemeral
CLAUDE.md	ephemeral
Last action block	ephemeral

Max 3 cache-control blocks used (API limit is 4). Stage 1 → Stage 2 shares the same prefix, so Stage 2 gets a guaranteed cache hit.
 
Model Selection Priority

1. CLAUDE_CODE_AUTO_MODE_MODEL env var (ANT-only)
2. GrowthBook tengu_auto_mode_config.model
3. Main loop model (getMainLoopModel())

Default: same model as the agent (Sonnet 4.6 or Opus 4.6), ensuring classifier capability matches agent capability.
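The priority order amounts to a simple fallback chain. In this sketch the function name and the option fields are hypothetical; the caller is assumed to have already read the environment variable, the GrowthBook value, and the main loop model.

```typescript
// Hypothetical resolver for the classifier model; inputs are supplied by the caller.
type ModelSources = {
  envModel?: string; // value of CLAUDE_CODE_AUTO_MODE_MODEL, if set
  isAnt: boolean; // the env override applies to ANT users only
  growthBookModel?: string; // tengu_auto_mode_config.model, if set
  mainLoopModel: string; // what getMainLoopModel() would return
};

function resolveClassifierModel(src: ModelSources): string {
  if (src.isAnt && src.envModel) return src.envModel;
  if (src.growthBookModel) return src.growthBookModel;
  // Default: the classifier uses the same model as the agent.
  return src.mainLoopModel;
}
```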
 
Debug

CLAUDE_CODE_DUMP_AUTO_MODE (ANT-only): dumps request/response to ~/.claude/tmp/auto-mode/

API errors write diagnostics to ~/.claude/tmp/auto-mode-classifier-errors/
 
 
 
 
 
The classifier adds real cost

Every tool call that doesn't hit a fast path triggers a separate API call with the full conversation transcript. Stage 1 produces at most 64 output tokens (256 in fast mode) but may consume thousands of input tokens on long transcripts. Stage 2 uses up to 4,096 output tokens. Rely on the acceptEdits fast path or the safe tool allowlist whenever possible to minimize overhead; prompt caching significantly reduces input token costs.
