From Working Demo to Running System
In Part 1, we built the core pipeline: a three-agent system that classifies incident root causes, clusters them against historical data using vector similarity, and generates prioritised recommendations. The code works. The architecture makes sense. Now for the questions that actually determine whether you ship this to production or shelve it after the demo.
This part covers the four things that matter most for production deployment: what it actually costs to run per incident, how to debug when an agent gives you a wrong answer, which technology stack you should choose based on your team's reality, and — most importantly — the honest assessment of when this system is the wrong tool entirely.
Cost Analysis: What This Actually Costs to Run
Every incident processed by the pipeline touches three LLM calls: the classifier, the cluster summariser, and the risk recommendation generator. Here's what that looks like in real token numbers against Azure OpenAI GPT-4o pricing (as of 2026):
| Agent | Input Tokens (avg) | Output Tokens (avg) | Cost / incident |
|---|---|---|---|
| Root Cause Classifier | ~800 | ~250 | ~$0.008 |
| Cluster Summariser | ~600 | ~100 | ~$0.005 |
| Risk & Recommendation | ~1,200 | ~400 | ~$0.012 |
| Total per incident | ~2,600 | ~750 | ~$0.025 |
At 2.5 cents per incident, a team running 200 incidents per month spends $5/month on LLM calls. A team running 2,000 incidents per month spends $50/month. The embeddings for vector search add another ~$0.001 per incident (text-embedding-3-large is very cheap).
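As a sanity check, the per-incident figure can be reproduced from the token counts and list prices. A minimal sketch, assuming GPT-4o list pricing of $5 per 1M input tokens and $15 per 1M output tokens (substitute your negotiated rates):

```python
# Assumed list rates: $5 / 1M input tokens, $15 / 1M output tokens
INPUT_RATE = 5.00 / 1_000_000
OUTPUT_RATE = 15.00 / 1_000_000

# (avg input tokens, avg output tokens) per agent, from the table above
AGENTS = {
    "classifier": (800, 250),
    "cluster_summariser": (600, 100),
    "risk_recommendation": (1_200, 400),
}

def cost_per_incident(agents: dict[str, tuple[int, int]]) -> float:
    """Sum LLM cost across all agent calls for one incident."""
    return sum(i * INPUT_RATE + o * OUTPUT_RATE for i, o in agents.values())

per_incident = cost_per_incident(AGENTS)
print(f"${per_incident:.3f} per incident")  # ≈ $0.024
print(f"${per_incident * 2_000:.2f}/month at 2,000 incidents")
```

Rerun this whenever you change models or prompt sizes; the monthly bill is linear in both.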
Where Cost Scales Up
The numbers above assume incidents with descriptions under ~500 words. If your team links full runbooks or Slack thread transcripts into incident tickets, input tokens can jump 5–10x. Either trim inputs at ingestion time (recommended) or switch to GPT-4o-mini for the classifier step, which handles structured classification just as well at roughly one-tenth the cost.
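Trimming at ingestion is a one-function job. A sketch with an illustrative 500-word budget and truncation marker (both are choices to tune, not requirements):

```python
def trim_description(text: str, max_words: int = 500) -> str:
    """Cap an incident description at a word budget before it reaches the LLM.

    Keeps the start of the description, which in practice carries the symptom
    summary; linked runbooks and pasted transcripts tend to pile up at the end.
    """
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words]) + " [truncated at ingestion]"
```

Apply it once, at the ingestion boundary, so every downstream agent sees the same bounded input.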
Infrastructure cost beyond LLM calls:
| Service | Tier | Monthly Cost (est.) |
|---|---|---|
| Azure Service Bus | Standard (1M ops/mo) | ~$10 |
| Azure Cosmos DB | Serverless (10 RU/s avg) | ~$5–$15 |
| Azure AI Search | Basic (1 replica) | ~$75 |
| Azure Container Apps (pipeline runtime) | Consumption plan | ~$5–$20 |
| Infrastructure total | | ~$95–$120/month |
The AI Search cost is the dominant line item. If you already have an Azure AI Search instance running for another purpose, this entire system becomes nearly free on the infrastructure side. The tricky decision is whether the Basic tier (1 replica, no SLA) is acceptable. For an internal tooling system where a few minutes of downtime doesn't cascade into user-facing issues, yes. For anything that feeds real-time on-call workflows, step up to Standard.
Observability & Debugging
The most common debugging scenario: an engineer looks at a recommendation and says "that's wrong, the root cause is clearly X not Y." How do you trace back through an agent pipeline to find where the reasoning went off track?
The answer is structured logging at every state transition. Each agent writes its full input, output, confidence score, and elapsed time to Application Insights. When you're debugging, you pull the trace for a specific incident ID and you can see exactly what each agent received and produced.
In Python, a tracing decorator keeps the instrumentation out of the agent logic:

```python
import os
import time
from functools import wraps

from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

configure_azure_monitor(connection_string=os.environ["APPINSIGHTS_CONNECTION_STRING"])
tracer = trace.get_tracer("incident-pipeline")

def trace_agent(agent_name: str):
    """Decorator that wraps any agent function with tracing."""
    def decorator(fn):
        @wraps(fn)
        async def wrapper(state: IncidentState) -> IncidentState:
            with tracer.start_as_current_span(agent_name) as span:
                span.set_attribute("incident.id", state.signal.id)
                span.set_attribute("incident.severity", state.signal.severity)
                start = time.monotonic()
                try:
                    result = await fn(state)
                    elapsed = time.monotonic() - start
                    # Log what the agent produced
                    if result.classification:
                        span.set_attribute("classification.category", result.classification.root_cause_category.value)
                        span.set_attribute("classification.confidence", result.classification.confidence)
                    if result.cluster_match:
                        span.set_attribute("cluster.id", result.cluster_match.cluster_id)
                        span.set_attribute("cluster.score", result.cluster_match.similarity_score)
                    span.set_attribute("agent.elapsed_ms", int(elapsed * 1000))
                    return result
                except Exception as e:
                    span.record_exception(e)
                    state.errors.append(f"{agent_name}: {e}")
                    return state
        return wrapper
    return decorator

# Usage
@trace_agent("classify")
async def classify_agent(state: IncidentState) -> IncidentState:
    # ... implementation
    pass
```
The same instrumentation in C# with Semantic Kernel:

```csharp
using System.Diagnostics;
using Azure.Monitor.OpenTelemetry.Exporter;
using OpenTelemetry.Trace;

// In Program.cs / Startup
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddSource("IncidentPipeline")
        .AddAzureMonitorTraceExporter(o =>
            o.ConnectionString = config["AppInsights:ConnectionString"]));

// In each step
public class ClassifyStep : KernelProcessStep
{
    private static readonly ActivitySource _tracer = new("IncidentPipeline");

    [KernelFunction("Run")]
    public async Task RunAsync(KernelProcessStepContext context, IncidentState state)
    {
        using var activity = _tracer.StartActivity("classify");
        activity?.SetTag("incident.id", state.Signal.Id);
        activity?.SetTag("incident.severity", state.Signal.Severity);
        var sw = Stopwatch.StartNew();
        try
        {
            // ... classification logic ...
            activity?.SetTag("classification.category", state.Classification?.RootCauseCategory.ToString());
            activity?.SetTag("classification.confidence", state.Classification?.Confidence);
            activity?.SetTag("agent.elapsed_ms", sw.ElapsedMilliseconds);
        }
        catch (Exception ex)
        {
            activity?.RecordException(ex);
            state.Errors.Add($"classify: {ex.Message}");
        }
    }
}
```
The Debugging Query You'll Use Constantly
In Application Insights, this KQL query pulls the full agent trace for any incident:
```kusto
dependencies
| where customDimensions["incident.id"] == "INC-1234"
| project timestamp, name, duration,
    category = customDimensions["classification.category"],
    confidence = customDimensions["classification.confidence"],
    cluster = customDimensions["cluster.id"]
| order by timestamp asc
```
Beyond per-incident tracing, you'll want two Azure Monitor alerts set up from day one: one for pipeline latency (alert if p95 exceeds 30 seconds — that usually means the LLM is throttling), and one for classification confidence (alert if average confidence drops below 0.6 over a sliding window — that's a signal your incident descriptions have changed format and the prompt needs updating).
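The confidence alert can also be checked in-process, before the metric ever reaches Azure Monitor. A minimal sliding-window monitor; the window size of 50 is an illustrative choice, and 0.6 matches the threshold above:

```python
from collections import deque

class ConfidenceMonitor:
    """Track a rolling mean of classification confidence and flag drift."""

    def __init__(self, window: int = 50, threshold: float = 0.6):
        self.scores: deque[float] = deque(maxlen=window)
        self.threshold = threshold

    def record(self, confidence: float) -> bool:
        """Add a score; return True when the rolling mean drops below threshold."""
        self.scores.append(confidence)
        mean = sum(self.scores) / len(self.scores)
        # Only alert once the window is full, so a few early low scores
        # don't trigger a false alarm
        return len(self.scores) == self.scores.maxlen and mean < self.threshold
```

Wire the `True` return into whatever paging or Slack notification path your team already uses.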
Technology Choices: Python vs C#
Python Implementation
Why choose Python: If your team writes Python, you get access to the richest AI/ML ecosystem available anywhere.
- Library ecosystem — LangChain, LangGraph, thousands of community tools; new AI patterns land in Python first
- Rapid iteration — prompt changes and agent rewiring can be tested in Jupyter without a full rebuild cycle
- Async support — LangGraph's async graph execution maps cleanly to the Service Bus event model
- Debugging ergonomics — LangGraph's built-in state inspection makes it easy to step through graph execution and see exactly where an agent's reasoning went wrong
C#/.NET Implementation
Why choose C#: If your incident tooling and existing backend runs on .NET, Semantic Kernel gives you first-party Microsoft support and enterprise integration patterns.
- Native Azure integration — Semantic Kernel is Microsoft-maintained and ships with Azure OpenAI, Azure AI Search, and Cosmos DB connectors out of the box
- Enterprise patterns — dependency injection, strong typing, and structured logging integrate with whatever ASP.NET Core stack you already have
- Deployment familiarity — if your ops team knows how to deploy .NET services to Azure Container Apps, you don't add a new runtime to the mix
- Process model — Semantic Kernel's Process abstraction maps naturally to the incident pipeline's sequential-with-conditional-routing pattern
The Bottom Line
Python team? Use LangGraph. C#/.NET team? Use Semantic Kernel. Don't fight your stack — the right framework is the one your team can debug at 2am during an actual incident.
Azure Infrastructure
The minimal Azure footprint for a production deployment:
- Azure OpenAI Service — GPT-4o deployment (gpt-4o, 2024-08-06 or later) + text-embedding-3-large. Use PTU (provisioned throughput) if you're processing more than 500 incidents/day to avoid rate-limit interruptions.
- Azure Service Bus — Standard tier for reliable event queuing. Create one topic per source system (pagerduty-incidents, jira-tickets, azure-alerts) so you can pause ingest from one source without affecting others.
- Azure Cosmos DB — Serverless for most teams. Move to provisioned throughput only when you're querying the incident store heavily for reporting — serverless handles write-heavy, read-light patterns very efficiently.
- Azure AI Search — Basic tier for up to ~500K incident documents. The vector index is the key component; use semantic ranking (semantic search add-on) to improve cluster search quality beyond pure cosine similarity.
- Azure Container Apps — Host the pipeline worker on the Consumption plan. Use KEDA's Service Bus trigger for scale-to-zero behaviour; the pipeline only runs when there are messages in the queue.
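Below the PTU threshold, the worker has to tolerate 429 throttling from Azure OpenAI. A minimal exponential-backoff retry sketch; the delays and attempt count are illustrative, and `RateLimitError` is a stand-in for the SDK's actual throttling exception:

```python
import time

class RateLimitError(Exception):
    """Stand-in for the SDK's throttling exception."""

def with_backoff(fn, attempts: int = 4, base_delay: float = 1.0, sleep=time.sleep):
    """Call fn, retrying on throttling with exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == attempts - 1:
                raise  # out of retries; let the message return to the queue
            sleep(base_delay * (2 ** attempt))
```

The injectable `sleep` keeps the function unit-testable; in production, rely on Service Bus redelivery as the backstop when all retries are exhausted.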
Azure AI Foundry Agent Service
Azure AI Foundry Agent Service is now generally available and worth considering as an alternative to self-managing LangGraph or Semantic Kernel orchestration.
- Built-in multi-agent routing and workflow management
- Managed state persistence — you don't manage Cosmos DB schemas for agent state
- Native Azure OpenAI integration with automatic retry and throttling handling
- Observability through Azure Monitor without custom instrumentation
The trade-off: less control over the exact prompt structure and routing logic compared to rolling your own LangGraph pipeline. For teams that want to move fast and don't need precise control over the classification prompt format, Foundry Agent Service removes significant infrastructure overhead.
Check the Azure AI Foundry pricing page for current Agent Service rates.
When NOT to Build This
The honest part: most of the teams asking about this system don't actually need it yet.
Don't Build This If...
- You have fewer than 20 incidents per month. At that volume, a human reading the incidents once a week in 30 minutes is faster, cheaper, and more accurate than an AI pipeline. The overhead of building and maintaining the system isn't justified.
- Your incident descriptions are too short to classify. If your team writes "server down, fixed" as the entire incident description, there's nothing for the classifier to work with. The prerequisite is structured incident writing — the AI amplifies existing process quality, it doesn't substitute for it.
- You don't have a baseline process. If there's no post-mortem process, no action item tracking, and no regular review of incident patterns — adding AI doesn't fix that. It surfaces patterns nobody will act on. Fix the process first.
- Your incidents are extremely high-sensitivity. Security incidents or compliance-related failures may have restrictions on what data can be sent to an LLM. Check before building.
A simpler alternative worth considering first: build a static report that groups incidents by service and severity, runs weekly, and sends it to a Slack channel. Takes a day to build. If the team actually reads it and acts on it, then you've validated the underlying value and you're ready to invest in AI-powered analysis. If nobody reads the simple report, nobody will act on the AI recommendations either.
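That baseline report needs nothing more than a group-by. A sketch, where the incident dict shape and field names are assumptions about your ticket export:

```python
from collections import Counter

def weekly_report(incidents: list[dict]) -> str:
    """Group incidents by (service, severity) and render a plain-text summary."""
    counts = Counter((i["service"], i["severity"]) for i in incidents)
    lines = ["Incident summary (last 7 days)", "-" * 32]
    for (service, severity), n in counts.most_common():
        lines.append(f"{service:<20} {severity:<8} {n:>3}")
    return "\n".join(lines)
```

Post the output to a Slack webhook on a weekly schedule and watch whether anyone reacts; that reaction is the signal that justifies the AI build.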
Key Takeaways
The Incident & Quality Intelligence Assistant works well in practice because it's solving a real bottleneck — the gap between incident data accumulating and pattern analysis actually happening. The AI doesn't replace engineering judgment; it does the data-correlation work that humans are too slow and too busy to do manually.
Production numbers to carry with you:
- ~$0.025 per incident in LLM costs — negligible against the engineering time it saves
- ~$95–$120/month in base infrastructure, dominated by Azure AI Search
- Semantic threshold of 0.82 with structural overlap filter — tune this against your own data before going live
- Classification confidence below 0.6 as a monitoring threshold — that's your canary for prompt drift
The technology choice is simpler than it looks: use the stack your team already knows. LangGraph and Semantic Kernel both deliver the full feature set. The differentiator isn't the framework — it's whether your team can maintain it at 2am.
If you haven't read Part 1 yet, start there — it covers the architecture diagram, the state model, and the three core agent implementations in full.
Want More Practical AI Tutorials?
I write about building production AI systems with Azure, Python, and C#. Subscribe for practical tutorials delivered twice a month.
Subscribe to Newsletter →