Agent Inference: The Shift from Model to Agent Optimization
When we talk about AI inference today, we mostly mean model inference—the infrastructure for making LLMs fast and efficient. We discuss GPUs, KV cache optimization, transformer layers, batch processing, and frameworks like vLLM and SGLang. These are critical concerns for serving models at scale, and naturally, this is where the industry focus currently lies.
But as agents become the basic unit of AI applications—with standardized lifecycles and building practices—the focus will shift to a higher abstraction layer: Agent Inference.
Just as model inference sits above hardware and optimizes token generation, agent inference sits above model inference and optimizes agent behavior. This is the infrastructure layer that makes agents production-ready, reliable, and scalable. As agent technology matures, this layer will become the primary engineering concern.
Two Layers, One System
The AI inference stack has two distinct layers:
┌──────────────────────────────────────┐
│       Production Requirements        │
│   (What "working correctly" means)   │
├──────────────────────────────────────┤
│        AGENT INFERENCE LAYER         │
│  • Agent Role & System Prompts       │
│  • Skill Discovery & SOPs            │
│  • Tool Discovery & Routing          │
│  • Memory Management                 │
│  • Lifecycle Orchestration           │
│  • Sub-Agent Delegation              │
├──────────────────────────────────────┤
│        MODEL INFERENCE LAYER         │
│  • Token Generation                  │
│  • KV Cache Management               │
│  • Batch Processing                  │
│  • Hardware Optimization (GPU/TPU)   │
│  • Request Scheduling                │
└──────────────────────────────────────┘
Model inference focuses on making models fast:
- Throughput (tokens/second)
- Latency (time to first token, time per token)
- Hardware utilization (GPU memory, compute)
- Batch efficiency
Agent inference focuses on making agents work:
- Correctness (tasks completed successfully)
- Reliability (predictable behavior)
- Maintainability (SOPs as versioned artifacts)
- Observability (traceable decisions)
Both layers are essential. Fast token generation is useless if the agent can’t complete tasks. Perfect task completion is impractical if it takes too long.
Agents as Stateless Job Units
When agent inference becomes standardized, agents transform into template-based, stateless job units—similar to how containers standardized application deployment.
The Standardized Agent Pattern
A standardized agent is:
- Template-based: Generic agent spec that can handle a class of tasks
  - Agent role defining identity
  - System prompt defining behavior
  - Skill/SOP for domain expertise
  - Tool set for capabilities
  - Memory configuration
  - Success criteria
- Stateless: Each invocation is independent
  - Task sent → Agent loop executes → Response returned
  - No persistent state between invocations (unless explicitly designed)
  - Clean lifecycle: start → process → complete
- On-demand: Ready to run whenever a task arrives
  - Skills auto-discovered and loaded based on task context
  - Spins up, processes, returns result
  - Scales horizontally like serverless functions
- Uniform handling: Standardized interface for all tasks
  - Input: Task description + context
  - Process: Agent loop (LLM call → tool call → continue/break)
  - Output: Result + execution trace
Example: Code Review Agent as Job Unit
agent_spec:
  name: "code-review-agent"
  type: "template"
  role: "code_reviewer"
  system_prompt: |
    You are a code review agent. For each code change:
    1. Check for syntax errors
    2. Verify test coverage
    3. Assess security issues
    4. Provide improvement suggestions
  skills:
    - name: "code-review-sop"
      version: "2.3"
      source: "mcp://skills/code-review"
    - name: "security-audit-sop"
      version: "1.5"
      source: "mcp://skills/security-audit"
  tools:
    - run_linter
    - run_tests
    - check_coverage
    - security_scan
  memory:
    type: "summary_based"
    context_window: 32000
  lifecycle:
    max_iterations: 10
    break_condition: "all_checks_complete"
Usage:
# Agent is just a job unit waiting for tasks
result = agent_inference_runtime.invoke(
    agent="code-review-agent",
    task={
        "files": ["src/auth.py", "src/db.py"],
        "diff": "git diff HEAD~1"
    }
)
# Agent inference runtime handles:
# 1. Auto-discovers and loads relevant skills (code-review-sop, security-audit-sop)
# 2. Constructs dynamic context from skills + system prompt
# 3. Runs standardized agent loop:
# - LLM reads diff with skill-enhanced context
# - Calls run_linter tool
# - Calls run_tests tool
# - Calls check_coverage tool
# - Calls security_scan tool
# - LLM synthesizes results using skill guidelines
# 4. Returns review + trace
print(result.review)
print(result.execution_trace)
print(result.skills_used) # Which skills were loaded
The agent has no memory of previous reviews. It’s invoked, executes, returns results. Stateless. Uniform. Ready.
Benefits of Standardization
When agents become standardized job units:
- Scalability: Deploy N instances, route tasks uniformly
- Testability: Same input always produces same behavior
- Composability: Chain agents like function calls
- Observability: Standard execution traces
- Versioning: Upgrade agent specs independently
- Cost control: Pay only for execution time
- Skill reusability: Share SOPs across different agents and teams
- Dynamic adaptation: Auto-discover and load relevant skills based on task context
- Unified orchestration: Main agents and sub-agents use the same infrastructure—no special cases
This is the promise of agent inference as infrastructure: agents become first-class job units in production systems, not ad-hoc scripts wrapped around model calls. The key differentiator is skill auto-discovery and dynamic context construction—agents adapt their expertise on-demand.
The Core Components of Agent Inference
Agent inference requires infrastructure for six key areas:
1. Agent Role & Prompt Management
Agent identity comes from the system prompt. The prompt defines what the agent is (“You are a code review agent”, “You are a data analyst”) and how it behaves. Skills and tools provide capabilities, but the core identity is in the prompt. System prompts are configuration, not code. Agent inference engines must support:
- Role definition: Declarative agent identity
- Runtime injection: Load prompts from external sources (MCP-style)
- Template rendering: Construct prompts from task context
- Factory generation: Dynamically create prompts from requirements
- Version control: Track prompt changes, roll back
Example:
# Agent inference engine resolves role and prompts at runtime
agent_runtime.load_agent_spec(
    agent_id="code-review-agent",
    role="code_reviewer",
    prompt_source="mcp://prompts/code-review/v2.3"
)
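Template rendering and runtime injection can be sketched in a few lines. The prompt store, template syntax, and function names below are assumptions for illustration, not a defined API:

# Minimal sketch of runtime prompt resolution, assuming a key-value prompt
# store and string.Template syntax. PROMPT_STORE and render_system_prompt
# are illustrative names, not part of any real runtime.
from string import Template

PROMPT_STORE = {
    # In practice this could sit behind an MCP endpoint or a versioned registry.
    "mcp://prompts/code-review/v2.3": (
        "You are a $role for the '$project' project. "
        "Review each change and report issues with severity levels."
    ),
}

def render_system_prompt(prompt_source: str, role: str, task_context: dict) -> str:
    """Load a prompt template from an external source and fill it from task context."""
    template = Template(PROMPT_STORE[prompt_source])
    return template.safe_substitute(role=role, **task_context)

print(render_system_prompt(
    prompt_source="mcp://prompts/code-review/v2.3",
    role="code reviewer",
    task_context={"project": "auth-service"},
))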
2. Tool Infrastructure & Distribution
Tools are the “hands” of agents. Tools will become standard distributable packages—this is a core area of agent inference.
Tool registry and execution:
- Schema registration: Declare available tools with types
- Discovery: Agents query tool registry
- Routing: Parse tool calls from LLM output
- Execution: Invoke tools and return results
- Validation: Ensure tool calls match schemas
From agent-gen.md, the tool interface:
Tool = {
    "name": str,
    "description": str,
    "parameters": Dict[str, Type],
    "execute": Callable
}
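Building on that interface, a registry that registers schemas, validates parsed tool calls, and routes them to executors might look roughly like this. The ToolRegistry class and its methods are illustrative assumptions, not a reference implementation:

# Hypothetical tool registry: schema registration, validation, and routing.
from typing import Any, Callable, Dict

class ToolRegistry:
    def __init__(self) -> None:
        self._tools: Dict[str, dict] = {}

    def register(self, name: str, description: str,
                 parameters: Dict[str, type], execute: Callable) -> None:
        self._tools[name] = {"description": description,
                             "parameters": parameters, "execute": execute}

    def invoke(self, name: str, arguments: Dict[str, Any]) -> Any:
        tool = self._tools[name]
        # Validation: every declared parameter must be present with the right type.
        for param, expected_type in tool["parameters"].items():
            if not isinstance(arguments.get(param), expected_type):
                raise TypeError(f"{name}: '{param}' must be {expected_type.__name__}")
        return tool["execute"](**arguments)

registry = ToolRegistry()
registry.register(
    name="run_linter",
    description="Run the linter on a file and return the issue count",
    parameters={"path": str},
    execute=lambda path: f"0 issues in {path}",  # stub executor for illustration
)
print(registry.invoke("run_linter", {"path": "src/auth.py"}))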
Package-like distribution model:
Just like npm packages, Python wheels, or Docker images, tools will become installable, versioned packages. Agent specs would reference tool packages:
agent_spec:
  tools:
    - from: toolset://linters@3.2.1
      select: [run_pylint, run_prettier]  # Use subset
    - from: toolset://test-runners@latest
    - custom_tool  # Local tool
This enables dependency resolution, version management, and distribution through public registries, private repos, or enterprise marketplaces. Agent inference engines would handle the entire tool lifecycle—from package installation to execution.
3. Memory Orchestration
Agents need state management across invocations. Agent inference provides:
- Conversation state: Working memory for current task
- Persistent memory: Long-term storage (e.g., Git-based as in git-context-memory.md)
- Context window management: Stay within token limits
- Memory isolation: Sub-agents have independent context
Example: Automatic summarization when threshold reached
# Agent inference runtime monitors conversation length
agent_config = {
    "memory": {
        "type": "conversation",
        "max_tokens": 30000,
        "summarize_threshold": 25000,  # Trigger summarization at ~83% of max_tokens
        "strategy": "auto_summarize"
    }
}
# During agent loop:
# Iteration 1-10: conversation = 15,000 tokens (normal operation)
# Iteration 11: conversation = 26,000 tokens (threshold exceeded)
# → Runtime automatically calls summarize_messages tool
# → Compresses messages 1-8 into single summary
# → New conversation = 18,000 tokens (continues processing)
# Iteration 12-15: conversation stays under threshold
The agent inference layer automatically manages memory. Developers configure policies; runtime enforces them.
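As a rough sketch of how a runtime might enforce such a policy: the token counter and summarizer below are stubs standing in for a real tokenizer and an LLM call, and the function name is an assumption.

# Hypothetical enforcement of the summarization policy above.
def count_tokens(messages: list[str]) -> int:
    return sum(len(m.split()) for m in messages)  # rough stand-in for a tokenizer

def summarize(messages: list[str]) -> str:
    return f"[summary of {len(messages)} earlier messages]"  # would be an LLM call

def enforce_memory_policy(messages: list[str], policy: dict) -> list[str]:
    """Compress older messages once the conversation crosses the threshold."""
    if policy["strategy"] != "auto_summarize":
        return messages
    if count_tokens(messages) < policy["summarize_threshold"]:
        return messages
    keep_recent = 4  # keep the most recent turns verbatim (illustrative choice)
    summary = summarize(messages[:-keep_recent])
    return [summary] + messages[-keep_recent:]

policy = {"type": "conversation", "max_tokens": 30000,
          "summarize_threshold": 25000, "strategy": "auto_summarize"}
conversation = ["tool output " * 100] * 130   # ~26,000 "tokens" under this rough counter
conversation = enforce_memory_policy(conversation, policy)
print(len(conversation), count_tokens(conversation))  # 5 messages, ~800 tokens plus the summary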
4. Lifecycle Orchestration
The agent loop is standardized infrastructure:
┌───────────────────────────────────────┐
│ 1. Receive Task                       │
├───────────────────────────────────────┤
│ 2. Load Agent Spec                    │
│    • Discover & inject skills         │
│    • Load system prompt + role        │
│    • Register tools                   │
├───────────────────────────────────────┤
│ 3. LLM Call (with role + prompt +     │
│    skills + conversation + tools)     │
├───────────────────────────────────────┤
│ 4. Parse Response                     │
│    • Text message? → Return           │
│    • Tool call? → Execute (goto 5)    │
├───────────────────────────────────────┤
│ 5. Execute Tool Call                  │
│    • Invoke tool function             │
│    • Capture result                   │
├───────────────────────────────────────┤
│ 6. Add Tool Result to Conversation    │
│    → Loop back to step 3              │
├───────────────────────────────────────┤
│ 7. Break Condition                    │
│    • No more tool calls               │
│    • Max iterations reached           │
│    • Success criteria met             │
├───────────────────────────────────────┤
│ 8. Return Result + Trace              │
└───────────────────────────────────────┘
From agent-gen.md:
The agent continues its loop if there’s a tool call in the LLM response, and stops if there isn’t.
Agent inference engines implement this loop as first-class infrastructure, not application code.
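A compact sketch of that loop as runtime code, following steps 3 through 7 above; llm_call and execute_tool are placeholders for the model client and tool registry an engine would actually wire in:

# Minimal sketch of the standardized agent loop (steps 3-7 above).
# llm_call and execute_tool are placeholders supplied by the runtime.
def run_agent_loop(system_prompt, task, llm_call, execute_tool, max_iterations=10):
    conversation = [{"role": "system", "content": system_prompt},
                    {"role": "user", "content": task}]
    trace = []
    for iteration in range(max_iterations):
        response = llm_call(conversation)                  # step 3: LLM call
        if response.get("tool_call") is None:              # step 4: plain text -> done
            return {"result": response["content"], "trace": trace}
        call = response["tool_call"]                       # step 5: execute the tool
        result = execute_tool(call["name"], call["arguments"])
        conversation.append({"role": "assistant", "tool_call": call})
        conversation.append({"role": "tool", "content": str(result)})   # step 6
        trace.append({"iteration": iteration, "tool": call["name"], "result": result})
    return {"result": "max iterations reached", "trace": trace}         # step 7: break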
5. Sub-Agent Delegation
Complex tasks require task decomposition. Sub-agents are just agents invoked through the same agent inference infrastructure—there’s no special “sub-agent” type. Agent inference provides:
- Dynamic launching: Main agent spawns other agents for subtasks using the same runtime
- Context isolation: Each agent (main or sub) has independent conversation
- Result aggregation: Sub-agent output becomes single message in main agent’s conversation
- Resource management: Limit concurrent agents, set max iterations
- Recursive composition: Sub-agents can spawn their own sub-agents—all using the same infrastructure
From context-offload-via-subagent.md:
Main agent dynamically generates task instructions → Sub-agent runs complete isolated lifecycle → Full sub-agent conversation compresses to single result message
Key insight: A “sub-agent” is simply another agent invocation. They all share the same infrastructure, but can be configured differently:
# Main agent configuration
main_agent = {
    "role": "project_manager",
    "skills": ["task_decomposition@2.0"],
    "tools": ["create_subagent", "aggregate_results"],
    "memory": {"type": "persistent", "backend": "git"}
}

# Sub-agent 1: Code reviewer (different skills, tools, memory)
code_reviewer = {
    "role": "code_reviewer",
    "skills": ["code_review_sop@2.3", "security_audit@1.5"],
    "tools": ["run_linter", "run_tests", "check_coverage"],
    "memory": {"type": "stateless"}  # Different memory config
}

# Sub-agent 2: Test engineer (different configuration again)
test_engineer = {
    "role": "test_engineer",
    "skills": ["test_coverage_analysis@3.0"],
    "tools": ["pytest", "coverage_report", "mutation_testing"],
    "memory": {"type": "conversation", "max_tokens": 20000}
}
Shared infrastructure across all agents:
- Lifecycle orchestration (same agent loop)
- Tool registry (all tools available, agents select subset)
- Memory management primitives (different policies, same implementation)
- Skill discovery (same registry, different selections)
- Observability infrastructure (unified tracing)
Different configurations per agent:
- Role and system prompts (defines behavior)
- Skill sets (domain expertise)
- Tool sets (available actions)
- Memory policies (stateless vs. persistent, different thresholds)
- Resource limits (iterations, timeout, cost)
This unified-yet-configurable approach makes agent inference scalable—from one agent to hundreds of specialized sub-agents.
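One way to picture this, purely as a sketch: delegation is exposed to the main agent as an ordinary tool that calls back into the same runtime with a different spec. The runtime object and its invoke() signature are assumptions for illustration.

# Hypothetical delegation tool: a "sub-agent" is just another invocation of the
# same runtime with a different spec. `runtime` and invoke() are assumed names;
# code_reviewer/test_engineer are the spec dicts defined above.
SUB_AGENT_SPECS = {"code_reviewer": code_reviewer, "test_engineer": test_engineer}

def create_subagent(runtime, role: str, task: str) -> str:
    """Run a sub-agent in its own isolated context and return only its final
    result, so the full sub-agent conversation never enters the caller's context."""
    spec = SUB_AGENT_SPECS[role]
    result = runtime.invoke(agent_spec=spec, task=task)
    return result.summary  # one compressed message back to the main agent

# From the main agent's point of view this is an ordinary tool call, e.g.:
# create_subagent(runtime, role="code_reviewer", task="Review the auth.py diff")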
6. Skill Discovery & Distribution
Skills are containers for SOPs (Standard Operating Procedures)—the “programs” for agents. A skill packages domain expertise (instructions, workflows, tools, and success criteria) that agents load on-demand. Skills follow the same package distribution model as tools—this is a core area of agent inference.
Agent inference engines provide:
- Skill registry: Store SOPs as versioned artifacts
- Runtime loading: Inject skills based on task requirements (e.g., “code review skill”, “data analysis skill”)
- Selection logic: Choose appropriate SOP for current task
- Composition: Combine multiple skills for complex tasks
- Skill marketplace: Discover and install community-contributed skills
From agent-intelligence-sop.md:
SOPs are first-class artifacts: Tools + Workflow + Instructions + Success Criteria
Package distribution for skills:
Skills would follow the same package distribution model as tools (see section 2), be installable and versioned, declare dependencies on other skills and toolsets, and be referenced in agent specs:
agent_spec:
  skills:
    - code-review@2.3.0       # From registry
    - security-audit@1.5.0    # From registry
    - ./custom-sop.md         # Local skill
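A minimal sketch of how a runtime might resolve such references at invocation time: match task context against a registry, then inject the selected SOPs into the system prompt. The registry layout and keyword matching below are simplifying assumptions.

# Hypothetical skill resolution: match task keywords against a registry,
# then inject the selected SOP text into the context.
SKILL_REGISTRY = {
    "code-review@2.3.0": {"keywords": {"diff", "review", "lint"},
                          "sop": "Code review SOP: check syntax, tests, coverage..."},
    "security-audit@1.5.0": {"keywords": {"security", "auth", "secrets"},
                             "sop": "Security audit SOP: scan for injection, secrets..."},
}

def discover_skills(task_description: str) -> list[str]:
    words = set(task_description.lower().split())
    return [name for name, skill in SKILL_REGISTRY.items()
            if skill["keywords"] & words]

def build_context(base_prompt: str, skill_names: list[str]) -> str:
    sops = "\n\n".join(SKILL_REGISTRY[name]["sop"] for name in skill_names)
    return f"{base_prompt}\n\n# Loaded skills\n{sops}"

skills = discover_skills("Review this auth diff for security issues")
print(skills)  # ['code-review@2.3.0', 'security-audit@1.5.0']
print(build_context("You are a code review agent.", skills))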
Examples of skill systems:
- Claude MCP: Skills injected at runtime based on context
- GitHub Copilot: SKILL.md files loaded per workspace for specialized workflows
- Cursor: .cursorrules as project-specific SOPs
Agent Inference: Optimization and Production Concerns
Just as model inference optimizes token generation, agent inference optimizes task completion. But agent optimization is more complex—it must balance efficiency, reliability, observability, and cost.
Parallel Concerns with Model Inference
| Model Inference | Agent Inference |
|---|---|
| Batching: Process multiple requests together | Parallel tool execution: Run independent tools concurrently |
| Model routing: Send requests to appropriate model | Sub-agent routing: Delegate subtasks to specialized agents |
Efficiency: Agent-Specific Optimizations
Model inference efficiency: Tokens/second, GPU utilization
Agent inference efficiency: Tasks/hour, context utilization, success rate
- Conversation Management
  - Edit/delete messages to stay within context limits
  - Summarize long conversations (as shown in the Memory Orchestration example)
  - From conversation-manage.md: edit_message, delete_message, summarize_messages
- Tool Call Efficiency (see the sketch after this list)
  - Batch independent tool calls
  - Cache tool results for identical calls
  - Parallel execution where possible
- Context Compression
  - Sub-agent offloading to isolate context
  - Git snapshots for long-term state
  - Selective message retention
- Validation Loops
  - Factory → Runtime → Observer pattern from agent-base.md
  - Iterative improvement based on success criteria
  - Process supervision: Design discussion → Implementation
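The tool-call optimizations above can be sketched with a small result cache plus concurrent execution. The tool functions below are stubs and the latency is simulated; this is an illustration of the pattern, not a runtime's actual scheduler.

# Hypothetical sketch: run independent tool calls concurrently and cache
# results for identical calls.
import asyncio

_cache: dict[tuple, str] = {}

async def run_tool(name: str, arg: str) -> str:
    key = (name, arg)
    if key in _cache:                  # cache hit: skip re-execution
        return _cache[key]
    await asyncio.sleep(0.1)           # stand-in for real tool latency
    result = f"{name}({arg}) ok"
    _cache[key] = result
    return result

async def review_checks(path: str) -> list[str]:
    # Independent checks are batched into one concurrent round trip.
    return await asyncio.gather(
        run_tool("run_linter", path),
        run_tool("run_tests", path),
        run_tool("check_coverage", path),
    )

print(asyncio.run(review_checks("src/auth.py")))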
Observability: Tracing Agent Decisions
Model inference traces: Token generation, latency, throughput
Agent inference traces: Decision rationale, tool calls, conversation flow, skill loading
Requirements:
- Conversation replay (inspect full message history)
- Decision trees (why did agent choose this tool? which skill influenced the decision?)
- Tool call logs (parameters, results, timing)
- Memory state inspection (what context was available?)
- Skill injection tracking (which SOPs were loaded and when?)
Example trace output:
{
  "agent_id": "code-review-agent",
  "task_id": "review-123",
  "skills_loaded": ["code-review-sop:v2.3", "security-audit-sop:v1.5"],
  "iterations": 5,
  "tool_calls": [
    {"tool": "run_linter", "duration_ms": 340, "result": "3 issues"},
    {"tool": "run_tests", "duration_ms": 2100, "result": "all passed"},
    {"tool": "check_coverage", "duration_ms": 180, "result": "87%"}
  ],
  "memory_events": [{"type": "summarize", "iteration": 3, "tokens_saved": 8000}],
  "completion_status": "success",
  "total_duration_ms": 4500
}
Reliability: Predictable Agent Behavior
Model inference reliability: Model availability, response consistency
Agent inference reliability: Task success rate, predictable behavior, graceful degradation
Requirements:
- SOPs as tested artifacts (version controlled, regression tested)
- Success criteria validation (did agent actually complete task?)
- Error handling and retry logic (what happens when tool fails?)
- Graceful degradation (fallback strategies when skills unavailable)
- Skill version pinning (ensure consistent behavior across deployments)
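One way a runtime might wire success-criteria validation to retry logic, sketched with placeholder functions rather than a real API:

# Hypothetical retry wrapper: validate the agent's output against success
# criteria and retry with feedback when validation fails. All names are
# illustrative assumptions.
def run_with_validation(invoke_agent, task: dict, success_criteria, max_attempts: int = 3):
    feedback = None
    for attempt in range(1, max_attempts + 1):
        result = invoke_agent(task, feedback=feedback)
        ok, reason = success_criteria(result)
        if ok:
            return {"status": "success", "attempts": attempt, "result": result}
        feedback = f"Previous attempt failed validation: {reason}"
    return {"status": "failed", "attempts": max_attempts, "result": result}

# Example criteria for a code review task: the review must report coverage.
def review_criteria(result: str):
    return ("coverage" in result.lower(), "review did not report test coverage")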
Cost Model: Multi-Dimensional Pricing
Model inference costs: Input tokens + output tokens
Agent inference costs: (Multiple LLM calls × tokens) + tool execution + memory storage + sub-agent overhead + skill loading
Requirements:
- Budget constraints per task
- Cost tracking per agent invocation
- Tool execution pricing
- Sub-agent spawn limits
- Skill caching to reduce reload costs
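A rough sketch of tallying this multi-dimensional cost from an execution trace; the prices and trace fields are made-up placeholders, not real pricing:

# Hypothetical cost accounting over an execution trace.
PRICE_PER_1K_INPUT = 0.003     # assumed $/1K input tokens
PRICE_PER_1K_OUTPUT = 0.015    # assumed $/1K output tokens
PRICE_PER_TOOL_SECOND = 0.0001 # assumed $/second of tool execution

def invocation_cost(trace: dict) -> float:
    llm_cost = sum(c["input_tokens"] / 1000 * PRICE_PER_1K_INPUT +
                   c["output_tokens"] / 1000 * PRICE_PER_1K_OUTPUT
                   for c in trace["llm_calls"])
    tool_cost = sum(t["duration_ms"] / 1000 * PRICE_PER_TOOL_SECOND
                    for t in trace["tool_calls"])
    sub_cost = sum(invocation_cost(s) for s in trace.get("sub_agents", []))
    return llm_cost + tool_cost + sub_cost

trace = {
    "llm_calls": [{"input_tokens": 12000, "output_tokens": 800}] * 5,
    "tool_calls": [{"duration_ms": 340}, {"duration_ms": 2100}],
    "sub_agents": [],
}
print(f"${invocation_cost(trace):.4f}")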
Agent Inference as Production Standard
As agent technology matures, agent inference will become the primary engineering concern—just as model inference is today for serving LLMs.
The Evolution
Phase 1 (Current): Ad-hoc agent implementations
- Every team writes their own agent loop
- Prompts hardcoded in application
- Tools tightly coupled to business logic
- No standard observability
Phase 2 (Emerging): Standardized agent patterns
- Libraries like LangChain, AutoGen provide abstractions
- SOPs emerge as shareable artifacts (MCP, SKILL.md)
- Tool interfaces standardize
- Observability through logging
Phase 3 (Future): Agent inference as infrastructure
- Agent inference engines (like vLLM for agents)
- Declarative agent specs (like Kubernetes manifests)
- Agents as stateless job units
- Production-grade observability, reliability, scalability
Key Observations
As agent inference becomes production standard, several critical patterns emerge:
1. Package Distribution as Core Infrastructure
As described in sections 2 and 6, the package distribution model for tools and skills is the defining characteristic of mature agent inference. This creates network effects: as more skills are packaged and shared, every agent becomes more capable. The ecosystem compounds—much like npm transformed JavaScript development, agent inference will create a marketplace of reusable capabilities where domain experts package their expertise for millions of agents.
2. Unified Infrastructure with Configurable Specialization
All agents share the same infrastructure, but configure it differently. This is the key to scalable agent orchestration:
- Shared: Lifecycle engine, tool registry, memory primitives, skill discovery, observability
- Configured: Role, skill sets, tool sets, memory policies, resource limits
A project manager agent and a code reviewer agent run on the same runtime—they just load different configurations. Sub-agents aren’t special; they’re just agent invocations with different specs. This unified-yet-configurable model is what makes agent inference scalable from 1 agent to 1000+ concurrent agents.
Why This Matters
As agent inference standardizes:
- Infrastructure commoditization: Agent loops, tool routing, and conversation management become standard—developers focus on SOPs and task design
- Ecosystem acceleration: Skill marketplaces emerge, enabling capability sharing at scale
- Production simplification: Deploy agent specs like Kubernetes manifests—scale horizontally, monitor uniformly
Conclusion
As agents become the basic unit of AI applications—standardized, stateless, ready-to-trigger job units—infrastructure will evolve to treat agent inference as a first-class concern:
- Agent roles as declarative identity
- System prompts as configuration
- Skills (SOPs) as loadable domain expertise
- Tools as registered capabilities
- Memory as managed state
- Lifecycles as orchestrated processes
- Sub-agents as delegated tasks
The infrastructure exists in pieces across different frameworks today. As the agent ecosystem matures, we’ll see convergence and unification: agent inference engines that provide production-grade runtime for agents, just as vLLM and SGLang provide for models. This is the same transformation that containers brought to application deployment—agent inference will be the Kubernetes moment for AI agents.
The focus will naturally shift from model inference to agent inference—not because model inference isn’t important, but because it becomes commoditized infrastructure while agent inference becomes the differentiator. Model inference makes models fast. Agent inference makes agents work. Both are essential, but as agent technology matures, optimizing agent behavior becomes the primary engineering concern.
The intelligence is in the model, the capability is in the agent, and the production readiness is in the agent inference layer.
References
From this blog series:
- agent-intelligence-sop.md: Intelligence Engineering and SOPs as first-class artifacts
- agent-gen.md: Standardized agent lifecycle (message types, tool interface, agent loop)
- agent-base.md: Factory → Runtime → Observer pattern
- context-management.md: Three-tier memory strategy
- git-context-memory.md: Git as persistent memory backend
- context-offload-via-subagent.md: Sub-agent delegation for context isolation
- conversation-manage.md: Real-time conversation optimization tools
- agent-three-stage.md: Agent evolution stages