Multi-Agent Systems in Production

Part 2 of 2: Cost analysis, observability, framework choices, and knowing when multi-agent is overkill.

January 23, 2026 · 10-12 min read · By Jaffar Kazi

Tags: AI Development · Cost Analysis · Production · Python · C# · Azure

In Part 1, we built the case for multi-agent architecture. Now comes the hard part: making it work in production without burning through your budget.

Demo code is easy. You spin up a LangGraph workflow, wire up some agents, and show a stakeholder a successful conversation. Everyone's impressed. Then someone asks: "What does this cost to run?"

Multi-agent systems have a cost profile that surprises teams used to single-prompt chatbots. Each conversation turn might trigger 3-4 LLM calls instead of 1. Each call adds latency. State persistence needs infrastructure. Debugging requires tracing through multiple agents.

None of this means multi-agent is wrong. It means you need to understand the trade-offs before your first invoice arrives.

What You'll Learn

  • Real cost breakdown: tokens, latency, infrastructure
  • Observability patterns for multi-agent debugging
  • LangGraph vs Semantic Kernel: which to choose
  • Azure services and pricing considerations
  • When to avoid multi-agent entirely

Cost Reality: Token Counts & $/Conversation

Let's trace the token usage through a typical 3-turn shopping conversation.

Table showing token usage and cost for multiple agents over a 3-turn conversation

Token Breakdown per Turn

| Component | Input Tokens | Output Tokens | Notes |
| --- | --- | --- | --- |
| Intent Classifier | ~150 | ~10 | Runs every turn |
| Search Agent | ~500 | ~300 | Includes RAG context |
| Cart Agent | ~400 | ~200 | Includes recently_shown |
| Cross-Sell Agent | ~300 | ~150 | Triggered after cart ops |
| Per Turn (avg) | ~600 | ~300 | ~900 tokens/turn; lower than the column sums because only a subset of agents runs on any given turn |

Monthly Cost Estimate

Using Azure OpenAI GPT-4o pricing (as of January 2026):

| Scenario | Conversations/Day | Turns/Conversation | Monthly Cost |
| --- | --- | --- | --- |
| Low Volume | 100 | 3 | ~$25-40 |
| Medium Volume | 1,000 | 4 | ~$300-450 |
| High Volume | 10,000 | 5 | ~$4,000-6,000 |
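These estimates fall out of a back-of-envelope model you can adapt to your own traffic. The per-1K-token prices below are illustrative placeholders, not current Azure OpenAI rates; substitute your actual pricing before budgeting.

```python
# Back-of-envelope monthly cost model for a multi-agent chatbot.
# Prices are ILLUSTRATIVE placeholders, not real Azure OpenAI rates.
PRICE_PER_1K_INPUT = 0.0025   # assumed $ per 1K input tokens
PRICE_PER_1K_OUTPUT = 0.01    # assumed $ per 1K output tokens

def monthly_cost(convs_per_day: int, turns_per_conv: int,
                 input_per_turn: int = 600, output_per_turn: int = 300,
                 days: int = 30) -> float:
    """Estimate monthly LLM spend from average per-turn token counts."""
    turn_cost = (input_per_turn / 1000 * PRICE_PER_1K_INPUT
                 + output_per_turn / 1000 * PRICE_PER_1K_OUTPUT)
    return convs_per_day * turns_per_conv * turn_cost * days

# Medium volume: 1,000 conversations/day, 4 turns each
print(f"${monthly_cost(1000, 4):,.0f}/month")  # → $540/month with the placeholder prices
```

Rerun it with your negotiated rates and observed token counts; the shape of the curve (linear in turns and conversations) is what matters, not the placeholder numbers.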

Compare this to a single-prompt chatbot at roughly 1/3 the cost. The question is whether the improved conversion rate justifies the 3x token spend.

Cost Optimization Tips

  • Use GPT-4o-mini for classification. The intent classifier doesn't need GPT-4's reasoning power.
  • Cache common queries. "What's in my cart?" doesn't need an LLM call every time.
  • Batch vector searches. One search call returning 5 products is cheaper than 5 calls returning 1 each.

Observability: Tracing Through Agents

When a conversation goes wrong, you need to answer: Which agent failed? Why? What was the state at that point?

Diagram showing an observability trace through multiple agents in a conversation

Correlation IDs

Every request needs a unique ID that follows it through all agents.

import uuid
import logging
from functools import wraps

# ConversationState is the shared state type defined in Part 1

def trace_agent(agent_name: str):
    """Decorator to add tracing to agent methods."""
    def decorator(func):
        @wraps(func)
        async def wrapper(state: ConversationState, *args, **kwargs):
            # Reuse the conversation's correlation ID; otherwise mint one
            # and persist it so every downstream agent logs the same ID.
            correlation_id = state.get("correlation_id") or str(uuid.uuid4())
            state["correlation_id"] = correlation_id
            logger = logging.getLogger(agent_name)

            logger.info(f"[{correlation_id}] ENTER {agent_name}")

            try:
                result = await func(state, *args, **kwargs)
                logger.info(f"[{correlation_id}] EXIT {agent_name} - success")
                return result
            except Exception as e:
                logger.error(f"[{correlation_id}] EXIT {agent_name} - error: {e}")
                raise

        return wrapper
    return decorator

# Usage
@trace_agent("cart_agent")
async def handle_cart_operation(state: ConversationState):
    pass

public class TracingMiddleware
{
    private readonly ILogger _logger;

    public async Task<T> TraceAsync<T>(
        string agentName,
        string correlationId,
        Func<Task<T>> agentAction)
    {
        _logger.LogInformation(
            "[{CorrelationId}] ENTER {AgentName}",
            correlationId, agentName);

        try
        {
            var result = await agentAction();
            _logger.LogInformation(
                "[{CorrelationId}] EXIT {AgentName} - success",
                correlationId, agentName);
            return result;
        }
        catch (Exception ex)
        {
            _logger.LogError(ex,
                "[{CorrelationId}] EXIT {AgentName} - error",
                correlationId, agentName);
            throw;
        }
    }
}

// Usage
await _tracing.TraceAsync("cart_agent", correlationId,
    () => HandleCartOperationAsync(state));

Structured Logging

Log the right data at each step:

# Intent classification
logger.info({
    "event": "intent_classified",
    "correlation_id": correlation_id,
    "input_message": last_message[:100],
    "classified_intent": intent,
    "duration_ms": duration
})

# Context resolution
logger.info({
    "event": "context_resolved",
    "correlation_id": correlation_id,
    "reference": "the blue one",
    "resolved_to": "product_123",
    "candidates_count": len(recently_shown),
    "duration_ms": duration
})

// Intent classification
_logger.LogInformation(
    "intent_classified {CorrelationId} {InputMessage} {ClassifiedIntent} {DurationMs}",
    correlationId,
    message[..Math.Min(100, message.Length)],
    intent,
    duration);

// Context resolution
_logger.LogInformation(
    "context_resolved {CorrelationId} {Reference} {ResolvedTo} {CandidatesCount} {DurationMs}",
    correlationId,
    "the blue one",
    "product_123",
    recentlyShown.Count,
    duration);

What to Monitor

  • Intent classification accuracy: Sample and manually review 1% of classifications
  • Context resolution success rate: How often does "the blue one" actually resolve?
  • Agent latency percentiles: P50, P95, P99 per agent
  • Escalation triggers: What patterns lead to "talk to human"?
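Per-agent latency percentiles fall straight out of the structured logs. A sketch, assuming each event carries the agent name and a duration_ms field as in the examples above; in production you would more likely compute this with a KQL query over Azure Monitor than in-process:

```python
import math
from collections import defaultdict

def latency_percentiles(events: list[dict], ps=(50, 95, 99)) -> dict:
    """Nearest-rank latency percentiles per agent, from structured-log
    events shaped like {"agent": ..., "duration_ms": ...}."""
    by_agent = defaultdict(list)
    for e in events:
        by_agent[e["agent"]].append(e["duration_ms"])
    return {
        agent: {f"p{p}": sorted(samples)[math.ceil(p / 100 * len(samples)) - 1]
                for p in ps}
        for agent, samples in by_agent.items()
    }

events = [{"agent": "cart_agent", "duration_ms": d} for d in range(1, 101)]
print(latency_percentiles(events))
# → {'cart_agent': {'p50': 50, 'p95': 95, 'p99': 99}}
```

Track P95/P99 rather than averages: one slow agent in a sequential chain sets the floor for the whole conversation turn.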

Framework Choice: LangGraph vs Semantic Kernel

Two frameworks dominate multi-agent orchestration: LangGraph (Python) and Semantic Kernel (C#/.NET). Here's how they compare.

| Aspect | LangGraph | Semantic Kernel |
| --- | --- | --- |
| Language | Python | C#, Python (beta) |
| Paradigm | Graph-based state machine | Plugin-based orchestration |
| State Management | Built-in, typed | Manual via context |
| Routing | Declarative edges | Planner-based (LLM decides) |
| Azure Integration | Via langchain-azure | Native, first-party |
| Learning Curve | Moderate (graphs) | Moderate (plugins) |
| Production Maturity | Good | Excellent (Microsoft-backed) |

LangGraph (Python)

Why choose LangGraph: If your team writes Python, LangGraph is the natural choice for multi-agent orchestration.

  • Explicit routing control — you define the state machine, not the LLM
  • Graph-based mental model — visualize complex flows as nodes and edges
  • LangChain ecosystem — integrates with existing LangChain tools and chains
langgraph_example.py
# LangGraph: Explicit routing
workflow.add_conditional_edges(
    "classify_intent",
    lambda state: state["current_intent"],
    {
        "product_search": "search",
        "add_to_cart": "cart",
    },
)

Semantic Kernel (C#/.NET)

Why choose Semantic Kernel: If your backend is C#/.NET, Semantic Kernel is purpose-built for your stack with first-party Microsoft support.

  • Native Azure integration — first-party support, Microsoft-maintained
  • Plugin architecture — familiar pattern for .NET developers
  • Enterprise-ready — production-proven at scale with Microsoft backing
semantic_kernel_example.cs
// Semantic Kernel: Plugin-based
public class CartPlugin
{
    [KernelFunction("add_to_cart")]
    [Description("Adds a product to the shopping cart")]
    public async Task<string> AddToCartAsync(
        Kernel kernel,
        [Description("Product ID")] string productId)
    {
        // Cart logic here (call the cart service, then confirm)
        return $"Added product {productId} to the cart.";
    }
}

// Register and let the planner decide
kernel.Plugins.AddFromType<CartPlugin>();

The Bottom Line

This is primarily a language decision. Both frameworks are production-ready and well-supported.

Python team? Use LangGraph. C#/.NET team? Use Semantic Kernel. Don't fight your stack.

Azure Infrastructure

Here's what you need to run multi-agent in Azure, with approximate pricing.

| Service | Purpose | Starting Price |
| --- | --- | --- |
| Azure OpenAI | LLM calls (GPT-4o, GPT-4o-mini) | Pay-per-token |
| Azure AI Search | Vector + keyword search for RAG | ~$75/month (Basic) |
| Azure Cache for Redis | Conversation state persistence | ~$16/month (Basic) |
| Azure Container Apps | Hosting the orchestrator | ~$20/month (low traffic) |
| Azure Monitor | Logging, tracing, alerts | Pay-per-GB ingested |

Azure AI Foundry Agent Service

Azure AI Foundry Agent Service is now generally available, providing managed orchestration for multi-agent systems—an alternative to self-hosting LangGraph or building custom orchestration.

Key features:

  • Built-in agent routing and multi-agent workflows
  • Managed state persistence
  • Native Azure OpenAI integration
  • Observability through Azure Monitor
  • Computer Use and Browser Automation tools (preview)

Check Azure AI Foundry Agent Service for current pricing and features.

When NOT to Use Multi-Agent

Multi-agent isn't always the answer. Here's when simpler approaches win.

Decision diagram to use multi-agent or not

Skip Multi-Agent When:

  • You're building FAQ/support chat. Single prompt with good retrieval handles this fine.
  • Conversations are single-turn. No context to maintain means no need for state management.
  • Tasks don't overlap. If users either search OR check out but never both, separate simple bots work better.
  • Volume is under 100 conversations/day. The engineering investment won't pay off.
  • Latency is critical (<500ms). Multi-agent adds 2-3x latency from sequential LLM calls.

The Single-Agent-With-Memory Alternative

For many use cases, a single agent with proper conversation memory handles 80% of what multi-agent does, at 50% of the cost and complexity.

# Sometimes this is enough
response = await llm.chat([
    {"role": "system", "content": COMPREHENSIVE_PROMPT},
    *conversation_history,  # Last 10 messages
    {"role": "user", "content": user_message}
])

// Sometimes this is enough
var chatHistory = new ChatHistory(COMPREHENSIVE_PROMPT);
foreach (var msg in conversationHistory.TakeLast(10))
    chatHistory.Add(msg);
chatHistory.AddUserMessage(userMessage);

var response = await chatCompletion.GetChatMessageContentAsync(
    chatHistory);

Multi-agent shines when tasks are fundamentally different (search vs cart vs recommendations), not just different topics within one task.

Interactive Demo Concept

Want to show stakeholders how multi-agent works without building a full backend? Here's a client-side demo approach.

Demo Architecture

A static HTML page with hardcoded conversation flows. User selects from pre-defined messages, and the page shows:

  • Which agent handled the request
  • The state changes that occurred
  • The response generated
// Simplified demo logic (no actual LLM calls)
const DEMO_FLOWS = {
    "show me running shoes": {
        agent: "Search Agent",
        stateChanges: {
            recently_shown: ["Nike Air Zoom", "Adidas Ultra", "Brooks Ghost"]
        },
        response: "Here are 3 options: 1) Nike Air Zoom..."
    },
    "add the first one": {
        agent: "Cart Agent",
        stateChanges: {
            cart: { items: ["Nike Air Zoom"], total: 129 }
        },
        response: "Added Nike Air Zoom to your cart."
    }
};

function handleUserMessage(message) {
    const flow = DEMO_FLOWS[message.toLowerCase()];
    if (!flow) {
        displayResponse("Please choose one of the suggested messages.");
        return;
    }
    showAgentBadge(flow.agent);
    animateStateChanges(flow.stateChanges);
    displayResponse(flow.response);
}

This isn't a real implementation—it's a communication tool. It helps non-technical stakeholders understand the flow without requiring backend infrastructure.

Key Takeaways

  • Cost scales with complexity. Budget 3x tokens compared to single-prompt. Use cheaper models for classification.
  • Observability is non-negotiable. Correlation IDs and structured logging are the difference between debugging and guessing.
  • Framework choice depends on ecosystem. LangGraph for explicit Python control, Semantic Kernel for Azure/.NET shops.
  • Know when to stop. Single-agent-with-memory handles 80% of cases. Multi-agent is for fundamentally different tasks that need isolation.

The Honest Truth

Multi-agent is powerful but not magic. It's more expensive, more complex, and harder to debug than simple approaches. Use it when the business value justifies the investment—not because it's architecturally elegant.


This concludes the two-part series on multi-agent AI systems. For questions or feedback, reach out on LinkedIn.

Want More Practical AI Tutorials?

I write about building production AI systems with Azure, Python, and C#. Subscribe for practical tutorials delivered twice a month.

Subscribe to Newsletter →

Written by Jaffar Kazi, a software engineer in Sydney building AI-powered applications. Connect on LinkedIn.