
Agent Inference: The Shift from Model to Agent Optimization


When we talk about AI inference today, we mostly mean model inference—the infrastructure for making LLMs fast and efficient. We discuss GPUs, KV cache optimization, transformer layers, batch processing, and frameworks like vLLM and SGLang. These are critical concerns for serving models at scale, and naturally, this is where the industry focus currently lies.

But as agents become the basic unit of AI applications—with standardized lifecycles and building practices—the focus will shift to a higher abstraction layer: Agent Inference.

Just as model inference sits above hardware and optimizes token generation, agent inference sits above model inference and optimizes agent behavior. This is the infrastructure layer that makes agents production-ready, reliable, and scalable. As agent technology matures, this layer will become the primary engineering concern.


Two Layers, One System

The AI inference stack has two distinct layers:

┌─────────────────────────────────────────────┐
│         Production Requirements              │
│    (What "working correctly" means)          │
├─────────────────────────────────────────────┤
│         AGENT INFERENCE LAYER                │
│  • Agent Role & System Prompts               │
│  • Skill Discovery & SOPs                    │
│  • Tool Discovery & Routing                  │
│  • Memory Management                         │
│  • Lifecycle Orchestration                   │
│  • Sub-Agent Delegation                      │
├─────────────────────────────────────────────┤
│         MODEL INFERENCE LAYER                │
│  • Token Generation                          │
│  • KV Cache Management                       │
│  • Batch Processing                          │
│  • Hardware Optimization (GPU/TPU)           │
│  • Request Scheduling                        │
└─────────────────────────────────────────────┘

Model inference focuses on making models fast: token generation, KV cache management, batch processing, and hardware utilization.

Agent inference focuses on making agents work: roles and prompts, skills, tools, memory, lifecycle orchestration, and sub-agent delegation.

Both layers are essential. Fast token generation is useless if the agent can’t complete tasks. Perfect task completion is impractical if it takes too long.


Agents as Stateless Job Units

When agent inference becomes standardized, agents transform into template-based, stateless job units—similar to how containers standardized application deployment.

The Standardized Agent Pattern

A standardized agent is:

  1. Template-based: Generic agent spec that can handle a class of tasks
    • Agent role defining identity
    • System prompt defining behavior
    • Skill/SOP for domain expertise
    • Tool set for capabilities
    • Memory configuration
    • Success criteria
  2. Stateless: Each invocation is independent
    • Task sent → Agent loop executes → Response returned
    • No persistent state between invocations (unless explicitly designed)
    • Clean lifecycle: start → process → complete
  3. On-demand: Ready to be triggered whenever a task arrives
    • Skills auto-discovered and loaded based on task context
    • Spins up, processes, returns result
    • Scales horizontally like serverless functions
  4. Uniform handling: Standardized interface for all tasks
    • Input: Task description + context
    • Process: Agent loop (LLM call → tool call → continue/break)
    • Output: Result + execution trace

Example: Code Review Agent as Job Unit

agent_spec:
  name: "code-review-agent"
  type: "template"
  
  role: "code_reviewer"
  
  system_prompt: |
    You are a code review agent. For each code change:
    1. Check for syntax errors
    2. Verify test coverage
    3. Assess security issues
    4. Provide improvement suggestions
  
  skills:
    - name: "code-review-sop"
      version: "2.3"
      source: "mcp://skills/code-review"
    - name: "security-audit-sop"
      version: "1.5"
      source: "mcp://skills/security-audit"
  
  tools:
    - run_linter
    - run_tests
    - check_coverage
    - security_scan
  
  memory:
    type: "summary_based"
    context_window: 32000
  
  lifecycle:
    max_iterations: 10
    break_condition: "all_checks_complete"

Usage:

# Agent is just a job unit waiting for tasks
result = agent_inference_runtime.invoke(
    agent="code-review-agent",
    task={
        "files": ["src/auth.py", "src/db.py"],
        "diff": "git diff HEAD~1"
    }
)

# Agent inference runtime handles:
# 1. Auto-discovers and loads relevant skills (code-review-sop, security-audit-sop)
# 2. Constructs dynamic context from skills + system prompt
# 3. Runs standardized agent loop:
#    - LLM reads diff with skill-enhanced context
#    - Calls run_linter tool
#    - Calls run_tests tool
#    - Calls check_coverage tool
#    - Calls security_scan tool
#    - LLM synthesizes results using skill guidelines
# 4. Returns review + trace

print(result.review)
print(result.execution_trace)
print(result.skills_used)  # Which skills were loaded

The agent has no memory of previous reviews. It’s invoked, executes, returns results. Stateless. Uniform. Ready.

Benefits of Standardization

When agents become standardized job units, they can be deployed, scaled, and monitored like any other production workload.

This is the promise of agent inference as infrastructure: agents become first-class job units in production systems, not ad-hoc scripts wrapped around model calls. The key differentiator is skill auto-discovery and dynamic context construction—agents adapt their expertise on-demand.


The Core Components of Agent Inference

Agent inference requires infrastructure for six key areas:

1. Agent Role & Prompt Management

Agent identity comes from the system prompt. The prompt defines what the agent is (“You are a code review agent”, “You are a data analyst”) and how it behaves. Skills and tools provide capabilities, but the core identity is in the prompt. System prompts are configuration, not code, and agent inference engines must treat them that way: versioned, stored outside application code, and resolved at runtime.

Example:

# Agent inference engine resolves role and prompts at runtime
agent_runtime.load_agent_spec(
    agent_id="code-review-agent",
    role="code_reviewer",
    prompt_source="mcp://prompts/code-review/v2.3"
)

2. Tool Infrastructure & Distribution

Tools are the “hands” of agents. Tools will become standard distributable packages—this is a core area of agent inference.

Agent inference engines need a tool registry and a uniform execution interface. From agent-gen.md, the tool interface:

Tool = {
    "name": str,
    "description": str,
    "parameters": Dict[str, Type],
    "execute": Callable
}
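
As a concrete illustration, here is a minimal sketch of a tool record that satisfies this interface. The pylint subprocess body and the dict-based registry are assumptions for the example, not part of any specific runtime:

import subprocess
from typing import Callable, Dict

def _run_linter(path: str) -> str:
    # Hypothetical implementation: shell out to pylint and return its report text
    proc = subprocess.run(["pylint", path], capture_output=True, text=True)
    return proc.stdout or proc.stderr

# A tool record matching the interface above; the runtime registers it by name
# and calls `execute` whenever the LLM emits a run_linter tool call.
run_linter_tool = {
    "name": "run_linter",
    "description": "Run a static linter on a source file and return the report",
    "parameters": {"path": str},
    "execute": _run_linter,
}

tool_registry: Dict[str, Callable] = {run_linter_tool["name"]: run_linter_tool["execute"]}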

Package-like distribution model:

Just like npm packages, Python wheels, or Docker images, tools will become installable, versioned packages. Agent specs would reference tool packages:

agent_spec:
  tools:
    - from: toolset://linters@3.2.1
      select: [run_pylint, run_prettier]  # Use subset
    - from: toolset://test-runners@latest
    - custom_tool  # Local tool

This enables dependency resolution, version management, and distribution through public registries, private repos, or enterprise marketplaces. Agent inference engines would handle the entire tool lifecycle—from package installation to execution.
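
A minimal sketch of how a runtime might resolve those toolset:// references, assuming a hypothetical in-memory registry keyed by toolset name and version; a real engine would resolve against a remote registry and handle installation:

from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical in-memory registry: (toolset, version) -> {tool name: callable}
TOOL_REGISTRY: Dict[Tuple[str, str], Dict[str, Callable]] = {
    ("linters", "3.2.1"): {"run_pylint": lambda path: "lint ok", "run_prettier": lambda path: "formatted"},
    ("test-runners", "1.0.0"): {"run_pytest": lambda path: "tests passed"},
}

def resolve_toolset(ref: str, select: Optional[List[str]] = None) -> Dict[str, Callable]:
    """Resolve a 'toolset://name@version' reference into concrete tools (sketch)."""
    name, _, version = ref.removeprefix("toolset://").partition("@")
    if version == "latest":  # naive 'latest': highest registered version string
        version = max(v for (n, v) in TOOL_REGISTRY if n == name)
    tools = TOOL_REGISTRY[(name, version)]
    return {k: v for k, v in tools.items() if select is None or k in select}

# Mirrors the agent spec above: a subset of one toolset plus all of another.
tools = {**resolve_toolset("toolset://linters@3.2.1", ["run_pylint", "run_prettier"]),
         **resolve_toolset("toolset://test-runners@latest")}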

3. Memory Orchestration

Agents need state management across invocations. Agent inference provides this as configurable policy: summarization, retention, and persistence are enforced by the runtime rather than hand-rolled per agent.

Example: Automatic summarization when threshold reached

# Agent inference runtime monitors conversation length
agent_config = {
    "memory": {
        "type": "conversation",
        "max_tokens": 30000,
        "summarize_threshold": 25000,  # Trigger summarization at 80% capacity
        "strategy": "auto_summarize"
    }
}

# During agent loop:
# Iteration 1-10: conversation = 15,000 tokens (normal operation)
# Iteration 11: conversation = 26,000 tokens (threshold exceeded)
#   → Runtime automatically calls summarize_messages tool
#   → Compresses messages 1-8 into single summary
#   → New conversation = 18,000 tokens (continues processing)
# Iteration 12-15: conversation stays under threshold

The agent inference layer automatically manages memory. Developers configure policies; runtime enforces them.
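
A minimal sketch of what that enforcement might look like inside the runtime, assuming hypothetical count_tokens and summarize helpers supplied by the engine and a policy dict shaped like the memory config above:

from typing import Callable, Dict, List

def enforce_memory_policy(
    messages: List[Dict],
    policy: Dict,
    count_tokens: Callable[[List[Dict]], int],
    summarize: Callable[[List[Dict]], str],
) -> List[Dict]:
    """Apply the configured memory policy before the next LLM call (sketch)."""
    if policy.get("strategy") != "auto_summarize":
        return messages
    if count_tokens(messages) < policy["summarize_threshold"]:
        return messages
    # Compress the oldest messages into a single summary and keep the recent tail.
    head, tail = messages[:-4], messages[-4:]
    summary = {"role": "system", "content": f"Summary of earlier turns: {summarize(head)}"}
    return [summary] + tail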

4. Lifecycle Orchestration

The agent loop is standardized infrastructure:

┌─────────────────────────────────────┐
│ 1. Receive Task                     │
├─────────────────────────────────────┤
│ 2. Load Agent Spec                   │
│    • Discover & inject skills        │
│    • Load system prompt + role       │
│    • Register tools                  │
├─────────────────────────────────────┤
│ 3. LLM Call (with role + prompt +   │
│    skills + conversation + tools)    │
├─────────────────────────────────────┤
│ 4. Parse Response                    │
│    • Text message? → Return          │
│    • Tool call? → Execute (goto 5)   │
├─────────────────────────────────────┤
│ 5. Execute Tool Call                 │
│    • Invoke tool function            │
│    • Capture result                  │
├─────────────────────────────────────┤
│ 6. Add Tool Result to Conversation   │
│    → Loop back to step 3             │
├─────────────────────────────────────┤
│ 7. Break Condition                   │
│    • No more tool calls              │
│    • Max iterations reached          │
│    • Success criteria met            │
├─────────────────────────────────────┤
│ 8. Return Result + Trace             │
└─────────────────────────────────────┘

From agent-gen.md:

The agent continues its loop if there’s a tool call in the LLM response, and stops if there isn’t.

Agent inference engines implement this loop as first-class infrastructure, not application code.
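
A minimal sketch of that loop, assuming a hypothetical call_llm client that returns either plain content or a structured tool call:

from typing import Callable, Dict, List

def agent_loop(
    task: str,
    system_prompt: str,
    tools: Dict[str, Callable],
    call_llm: Callable[[List[Dict]], Dict],  # assumed runtime-provided LLM client
    max_iterations: int = 10,
) -> Dict:
    """Standardized agent loop (sketch): LLM call -> tool call -> continue/break."""
    conversation = [{"role": "system", "content": system_prompt},
                    {"role": "user", "content": task}]
    trace = []
    for _ in range(max_iterations):                      # step 7: iteration cap
        response = call_llm(conversation)                 # step 3: LLM call
        tool_call = response.get("tool_call")             # step 4: parse response
        if tool_call is None:                             # no tool call -> done
            return {"result": response["content"], "trace": trace}
        output = tools[tool_call["name"]](**tool_call["args"])  # step 5: execute tool
        trace.append({"tool": tool_call["name"], "result": output})
        conversation.append({"role": "tool", "content": str(output)})  # step 6: add result
    return {"result": "max_iterations_reached", "trace": trace}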

5. Sub-Agent Delegation

Complex tasks require task decomposition. Sub-agents are just agents invoked through the same agent inference infrastructure—there’s no special “sub-agent” type. Agent inference provides the delegation machinery: spawning sub-agents, isolating their context, and folding their results back into the parent conversation.

From context-offload-via-subagent.md:

Main agent dynamically generates task instructions → Sub-agent runs complete isolated lifecycle → Full sub-agent conversation compresses to single result message

Key insight: A “sub-agent” is simply another agent invocation. They all share the same infrastructure, but can be configured differently:

# Main agent configuration
main_agent = {
    "role": "project_manager",
    "skills": ["task_decomposition@2.0"],
    "tools": ["create_subagent", "aggregate_results"],
    "memory": {"type": "persistent", "backend": "git"}
}

# Sub-agent 1: Code reviewer (different skills, tools, memory)
code_reviewer = {
    "role": "code_reviewer",
    "skills": ["code_review_sop@2.3", "security_audit@1.5"],
    "tools": ["run_linter", "run_tests", "check_coverage"],
    "memory": {"type": "stateless"}  # Different memory config
}

# Sub-agent 2: Test engineer (different configuration again)
test_engineer = {
    "role": "test_engineer",
    "skills": ["test_coverage_analysis@3.0"],
    "tools": ["pytest", "coverage_report", "mutation_testing"],
    "memory": {"type": "conversation", "max_tokens": 20000}
}

All of these agents share the same infrastructure (the agent loop, tool routing, memory management, and tracing); what differs per agent is configuration: role, skills, tools, and memory policy.

This unified-yet-configurable approach makes agent inference scalable—from one agent to hundreds of specialized sub-agents.
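
To make the delegation concrete, here is a minimal sketch of a create_subagent tool the main agent could expose. It reuses the agent_inference_runtime object from the earlier usage example; the summary attribute on the result is an assumption for this sketch:

# The sub-agent runs through the same runtime as any other agent; only its
# compressed result message flows back, so the sub-agent's full conversation
# stays isolated from the main agent's context.
def create_subagent(agent: str, task: dict) -> str:
    result = agent_inference_runtime.invoke(agent=agent, task=task)
    return result.summary  # assumed single-message summary of the sub-agent run

review = create_subagent(
    agent="code-review-agent",
    task={"files": ["src/auth.py"], "diff": "git diff HEAD~1"},
)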

6. Skill Discovery & Distribution

Skills are containers for SOPs (Standard Operating Procedures)—the “programs” for agents. A skill packages domain expertise (instructions, workflows, tools, and success criteria) that agents load on-demand. Skills follow the same package distribution model as tools—this is a core area of agent inference.

Agent inference engines provide skill discovery, loading, and version resolution as part of the runtime.

From agent-intelligence-sop.md:

SOPs are first-class artifacts: Tools + Workflow + Instructions + Success Criteria

Package distribution for skills:

Skills would follow the same package distribution model as tools (see section 2), be installable and versioned, declare dependencies on other skills and toolsets, and be referenced in agent specs:

agent_spec:
  skills:
    - code-review@2.3.0      # From registry
    - security-audit@1.5.0   # From registry
    - ./custom-sop.md        # Local skill
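
A minimal sketch of resolving such skill references, assuming local paths are read from disk and registry skills are fetched through a hypothetical fetch_mcp client that talks to sources like mcp://skills/code-review:

from pathlib import Path
from typing import Callable, Dict, List

def load_skills(skill_refs: List[str], fetch_mcp: Callable[..., str]) -> Dict[str, str]:
    """Resolve skill references from an agent spec into SOP text (sketch)."""
    skills: Dict[str, str] = {}
    for ref in skill_refs:
        if ref.startswith("./"):          # local skill file
            skills[ref] = Path(ref).read_text()
        else:                             # registry skill, e.g. code-review@2.3.0
            name, _, version = ref.partition("@")
            skills[ref] = fetch_mcp(f"mcp://skills/{name}", version=version)
    return skills

# The resolved SOP text is what gets injected into the agent's context
# alongside the system prompt during spec loading.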

Skill systems along these lines are already emerging; the agent spec earlier in this post, for example, resolves its skills from MCP sources (mcp://skills/...).


Agent Inference: Optimization and Production Concerns

Just as model inference optimizes token generation, agent inference optimizes task completion. But agent optimization is more complex—it must balance efficiency, reliability, observability, and cost.

Parallel Concerns with Model Inference

Model Inference                                        Agent Inference
Batching: Process multiple requests together           Parallel tool execution: Run independent tools concurrently
Model routing: Send requests to appropriate model      Sub-agent routing: Delegate subtasks to specialized agents

Efficiency: Agent-Specific Optimizations

Model inference efficiency: Tokens/second, GPU utilization
Agent inference efficiency: Tasks/hour, context utilization, success rate

  1. Conversation Management
    • Edit/delete messages to stay within context limits
    • Summarize long conversations (as shown in Memory Orchestration example)
    • From conversation-manage.md: edit_message, delete_message, summarize_messages
  2. Tool Call Efficiency (see the sketch after this list)
    • Batch independent tool calls
    • Cache tool results for identical calls
    • Parallel execution where possible
  3. Context Compression
    • Sub-agent offloading to isolate context
    • Git snapshots for long-term state
    • Selective message retention
  4. Validation Loops
    • Factory → Runtime → Observer pattern from agent-base.md
    • Iterative improvement based on success criteria
    • Process supervision: Design discussion → Implementation
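
The sketch referenced in item 2: a hedged example of how a runtime might cache identical tool calls and run independent ones concurrently. The check_coverage body is a placeholder, not a real tool implementation:

import asyncio
from functools import lru_cache

@lru_cache(maxsize=256)
def check_coverage(path: str) -> str:
    """Cached tool: identical calls within a task return the memoized result."""
    return f"coverage report for {path}"  # placeholder for the real tool

async def run_independent_tools(paths):
    """Run independent tool calls concurrently instead of one per loop iteration."""
    loop = asyncio.get_running_loop()
    # run_in_executor lets synchronous tools execute in parallel threads
    results = await asyncio.gather(
        *(loop.run_in_executor(None, check_coverage, p) for p in paths)
    )
    return dict(zip(paths, results))

# asyncio.run(run_independent_tools(["src/auth.py", "src/db.py"]))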

Observability: Tracing Agent Decisions

Model inference traces: Token generation, latency, throughput
Agent inference traces: Decision rationale, tool calls, conversation flow, skill loading

Agent inference runtimes must capture all of this as structured, queryable traces.

Example trace output:

{
  "agent_id": "code-review-agent",
  "task_id": "review-123",
  "skills_loaded": ["code-review-sop:v2.3", "security-audit-sop:v1.5"],
  "iterations": 5,
  "tool_calls": [
    {"tool": "run_linter", "duration_ms": 340, "result": "3 issues"},
    {"tool": "run_tests", "duration_ms": 2100, "result": "all passed"},
    {"tool": "check_coverage", "duration_ms": 180, "result": "87%"}
  ],
  "memory_events": [{"type": "summarize", "iteration": 3, "tokens_saved": 8000}],
  "completion_status": "success",
  "total_duration_ms": 4500
}

Reliability: Predictable Agent Behavior

Model inference reliability: Model availability, response consistency
Agent inference reliability: Task success rate, predictable behavior, graceful degradation

Agent inference runtimes must make these properties enforceable as policy: iteration limits, break conditions, and fallback behavior belong in the runtime, not in each agent’s application code.
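
As one illustration of graceful degradation, here is a minimal sketch of retry-with-fallback around an agent invocation. The invoke parameter stands in for the runtime’s invoke function, and the backoff and status fields are assumptions:

import time

def invoke_with_fallback(invoke, agent: str, task: dict, retries: int = 2):
    """Retry a failed invocation, then degrade to a structured partial result (sketch)."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return invoke(agent=agent, task=task)
        except Exception as err:          # broad catch is acceptable only in a sketch
            last_error = err
            time.sleep(2 ** attempt)      # simple exponential backoff
    return {"completion_status": "degraded", "error": str(last_error)}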

Cost Model: Multi-Dimensional Pricing

Model inference costs: Input tokens + output tokens
Agent inference costs: (Multiple LLM calls × tokens) + tool execution + memory storage + sub-agent overhead + skill loading

Agent inference runtimes must therefore attribute cost across all of these dimensions (per task, per tool call, per sub-agent) so that optimization targets the real cost drivers.
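
A back-of-the-envelope sketch of that roll-up, using the trace format shown in the observability section; the per-token and per-second rates are made-up example numbers:

def estimate_task_cost(trace: dict, input_tokens: int, output_tokens: int,
                       price_per_1k_in: float = 0.003,    # assumed example rate
                       price_per_1k_out: float = 0.015,   # assumed example rate
                       tool_rate_per_s: float = 0.0001) -> float:
    """Aggregate a task's cost across LLM calls and tool execution time (sketch)."""
    llm_cost = (input_tokens / 1000) * price_per_1k_in + (output_tokens / 1000) * price_per_1k_out
    tool_seconds = sum(call["duration_ms"] for call in trace["tool_calls"]) / 1000
    return llm_cost + tool_seconds * tool_rate_per_s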


Agent Inference as Production Standard

As agent technology matures, agent inference will become the primary engineering concern—just as model inference is today for serving LLMs.

The Evolution

Phase 1 (Current): Ad-hoc agent implementations

Phase 2 (Emerging): Standardized agent patterns

Phase 3 (Future): Agent inference as infrastructure

Key Observations

As agent inference becomes production standard, several critical patterns emerge:

1. Package Distribution as Core Infrastructure

As described in sections 2 and 6, the package distribution model for tools and skills is the defining characteristic of mature agent inference. This creates network effects: as more skills are packaged and shared, every agent becomes more capable. The ecosystem compounds—much like npm transformed JavaScript development, agent inference will create a marketplace of reusable capabilities where domain experts package their expertise for millions of agents.

2. Unified Infrastructure with Configurable Specialization

All agents share the same infrastructure, but configure it differently. This is the key to scalable agent orchestration:

A project manager agent and a code reviewer agent run on the same runtime—they just load different configurations. Sub-agents aren’t special; they’re just agent invocations with different specs. This unified-yet-configurable model is what makes agent inference scalable from 1 agent to 1000+ concurrent agents.

Why This Matters

As agent inference standardizes:

  1. Infrastructure commoditization: Agent loops, tool routing, and conversation management become standard—developers focus on SOPs and task design
  2. Ecosystem acceleration: Skill marketplaces emerge, enabling capability sharing at scale
  3. Production simplification: Deploy agent specs like Kubernetes manifests—scale horizontally, monitor uniformly

Conclusion

As agents become the basic unit of AI applications—standardized, stateless, ready-to-trigger job units—infrastructure will evolve to treat agent inference as a first-class concern.

The infrastructure exists in pieces across different frameworks today. As the agent ecosystem matures, we’ll see convergence and unification: agent inference engines that provide production-grade runtime for agents, just as vLLM and SGLang provide for models. This is the same transformation that containers brought to application deployment—agent inference will be the Kubernetes moment for AI agents.

The focus will naturally shift from model inference to agent inference—not because model inference isn’t important, but because it becomes commoditized infrastructure while agent inference becomes the differentiator. Model inference makes models fast. Agent inference makes agents work. Both are essential, but as agent technology matures, optimizing agent behavior becomes the primary engineering concern.

The intelligence is in the model, the capability is in the agent, and the production readiness is in the agent inference layer.


References

From this blog series:

  • agent-gen.md
  • agent-base.md
  • agent-intelligence-sop.md
  • conversation-manage.md
  • context-offload-via-subagent.md