In Part 1, I built a four-agent customer journey orchestration system: a Journey Tracker that classifies real-time behavioral stage, an Experiment Agent that prevents semantic conflicts, a Personalization Agent that selects stage-appropriate content, and an Analytics Agent that monitors significance. The architecture works. Now let's talk about whether it should go to production — and what it actually costs when it does.
This part covers the questions I always get asked after someone reads Part 1: "How much does this cost to run at our traffic volume?" "How do we debug a bad experiment assignment?" "Should we use Python or C# for this?" And the question that often gets skipped in tutorials: "When should we just use a rule engine instead?"
I'll give you real numbers. The Azure pricing pages are public, and I've run this in production at mid-tier e-commerce scale. The figures are accurate as of March 2026.
Cost Analysis
The system makes two types of LLM calls per customer event: a stage classification call (~400 input tokens, 15 output tokens) and, when needed, conflict detection calls (~300 input tokens, 50 output tokens per experiment pair checked). The personalization query uses Azure AI Search semantic ranking, not an LLM call. Analytics recording is pure database writes.
At GPT-4o pricing ($2.50/1M input tokens, $10.00/1M output tokens):
| Call type | Tokens (in/out) | Cost per call | Frequency |
|---|---|---|---|
| Stage classification | 400 / 15 | $0.00115 | Every event |
| Conflict detection (per pair) | 300 / 50 | $0.00125 | Once per pair, then cached |
| Conflict cache hit | 0 | $0.00 | All subsequent checks |
Monthly cost projection by event volume:
| Daily events | Monthly events | Classification cost/mo | Infrastructure/mo | Total/mo |
|---|---|---|---|---|
| 10,000 | 300,000 | $345 | ~$280 | ~$625 |
| 50,000 | 1.5M | $1,725 | ~$420 | ~$2,145 |
| 200,000 | 6M | $6,900 | ~$900 | ~$7,800 |
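The per-call and monthly figures above are just arithmetic over the stated token counts and rates. A minimal sketch that reproduces them (constants are the article's assumed GPT-4o pay-as-you-go prices):

```python
# GPT-4o pay-as-you-go rates assumed in the tables above (March 2026)
GPT4O_INPUT_PER_TOKEN = 2.50 / 1_000_000
GPT4O_OUTPUT_PER_TOKEN = 10.00 / 1_000_000

def per_call_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * GPT4O_INPUT_PER_TOKEN
            + output_tokens * GPT4O_OUTPUT_PER_TOKEN)

def monthly_llm_cost(daily_events: int, days: int = 30) -> float:
    # Only stage classification runs on every event; conflict detection
    # is cached after the first check per experiment pair.
    return daily_events * days * per_call_cost(400, 15)

print(f"${per_call_cost(400, 15):.5f}")    # $0.00115 per classification
print(f"${monthly_llm_cost(10_000):,.0f}") # $345/mo at 10K events/day
```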
The Cost Inflection Point
At 200K daily events, you're spending $7,800/month on this system. That needs to produce measurable lift. If the journey-aware personalization is lifting conversion rate by 1.5% on $500K monthly revenue, that's $7,500/month in incremental revenue — slightly below break-even. At that scale, the first optimization is to use GPT-4o mini for stage classification ($0.15/1M input vs $2.50/1M) and only escalate to GPT-4o for conflict detection. That cuts the classification cost by ~94% and changes the economics dramatically.
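The tiered-model savings are easy to sanity-check. A sketch of the comparison — the GPT-4o mini output rate ($0.60/1M) is my assumption, not a figure from the tables above:

```python
# Assumed pay-as-you-go rates; gpt-4o-mini output price is an assumption
PRICES = {
    "gpt-4o":      {"input": 2.50 / 1e6, "output": 10.00 / 1e6},
    "gpt-4o-mini": {"input": 0.15 / 1e6, "output": 0.60 / 1e6},
}

def call_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    p = PRICES[model]
    return tokens_in * p["input"] + tokens_out * p["output"]

# Stage classification (400 in / 15 out) on each model
full = call_cost("gpt-4o", 400, 15)
mini = call_cost("gpt-4o-mini", 400, 15)
print(f"savings: {1 - mini / full:.0%}")  # savings: 94%
```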
Infrastructure breakdown (Azure, monthly):
| Service | Tier | Cost/mo | Notes |
|---|---|---|---|
| Azure Event Hubs | Standard (2 TUs) | $44 | Scales to ~2M events/day |
| Azure Cosmos DB | Serverless | $60–$150 | Depends on event log volume |
| Azure Cache for Redis | C1 Standard | $55 | Session state hot cache |
| Azure AI Search | S1 Standard | $250 | Semantic ranking included |
| App Service / Container Apps | P1v3 (2 instances) | ~$140 | Orchestrator workers |
Observability & Debugging
The hardest debugging scenario in this system is: "Why did customer X get enrolled in experiment Y when they should have been excluded?" To answer that, you need a complete trace of the experiment assignment decision — including the conflict detection results that led to exclusion or inclusion.
I instrument every LLM call with structured logging that captures the full decision trail. Each event gets a correlation ID that follows it through every agent invocation.
```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("journey.orchestrator")


def log_experiment_decision(
    correlation_id: str,
    customer_id: str,
    experiment_id: str,
    decision: str,  # "assigned" | "excluded_conflict" | "excluded_stage"
    reason: str,
    stage: str,
    tokens_used: int,
):
    logger.info(json.dumps({
        "correlation_id": correlation_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "customer_id": customer_id,
        "experiment_id": experiment_id,
        "decision": decision,
        "reason": reason,
        "journey_stage": stage,
        "tokens_used": tokens_used,
        "event_type": "experiment_decision",
    }))


def log_stage_classification(
    correlation_id: str,
    customer_id: str,
    previous_stage: str,
    new_stage: str,
    event_count: int,
    tokens_used: int,
):
    logger.info(json.dumps({
        "correlation_id": correlation_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "customer_id": customer_id,
        "previous_stage": previous_stage,
        "new_stage": new_stage,
        "stage_changed": previous_stage != new_stage,
        "event_count": event_count,
        "tokens_used": tokens_used,
        "event_type": "stage_classification",
    }))
```
The C# equivalent for the Semantic Kernel implementation:

```csharp
using Microsoft.Extensions.Logging;
using System.Text.Json;

public class JourneyTracer
{
    private readonly ILogger _logger;

    public JourneyTracer(ILogger<JourneyTracer> logger)
    {
        _logger = logger;
    }

    public void LogExperimentDecision(
        string correlationId,
        string customerId,
        string experimentId,
        string decision,
        string reason,
        JourneyStage stage,
        int tokensUsed)
    {
        _logger.LogInformation("{TraceData}", JsonSerializer.Serialize(new
        {
            correlation_id = correlationId,
            timestamp = DateTimeOffset.UtcNow,
            customer_id = customerId,
            experiment_id = experimentId,
            decision,
            reason,
            journey_stage = stage.ToString().ToLowerInvariant(),
            tokens_used = tokensUsed,
            event_type = "experiment_decision"
        }));
    }

    public void LogStageClassification(
        string correlationId,
        string customerId,
        JourneyStage previousStage,
        JourneyStage newStage,
        int eventCount,
        int tokensUsed)
    {
        _logger.LogInformation("{TraceData}", JsonSerializer.Serialize(new
        {
            correlation_id = correlationId,
            timestamp = DateTimeOffset.UtcNow,
            customer_id = customerId,
            previous_stage = previousStage.ToString().ToLowerInvariant(),
            new_stage = newStage.ToString().ToLowerInvariant(),
            stage_changed = previousStage != newStage,
            event_count = eventCount,
            tokens_used = tokensUsed,
            event_type = "stage_classification"
        }));
    }
}
```
Route these structured logs to Azure Monitor / Log Analytics. The key queries to build dashboards around:
- Conflict exclusion rate by experiment — if an experiment is conflicting 40% of the time, it's either misconfigured or conflicts with a high-priority test that should be paused
- Stage distribution over time — a sudden spike in "discovery" customers with no corresponding search volume increase usually means a referral campaign drove unqualified traffic
- Token cost per customer per day — useful for spotting anomalous customers who are generating unusual event volumes (bots, automated testing)
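As a sketch of the first metric, here is a conflict-exclusion-rate query and its client-side equivalent. The KQL assumes the structured logs land in `AppTraces` with the JSON payload in `Message` — adjust the table and column names to your Log Analytics setup:

```python
# KQL sketch: conflict exclusion rate by experiment. Table/column names
# (AppTraces, Message) are assumptions about your Log Analytics ingestion.
CONFLICT_EXCLUSION_KQL = """
AppTraces
| extend d = parse_json(Message)
| where d.event_type == "experiment_decision"
| summarize
    total = count(),
    excluded = countif(d.decision == "excluded_conflict")
  by experiment_id = tostring(d.experiment_id)
| extend exclusion_rate = todouble(excluded) / total
| order by exclusion_rate desc
"""

def exclusion_rate(decisions: list[str]) -> float:
    """Same metric computed client-side from a list of decision strings."""
    if not decisions:
        return 0.0
    return decisions.count("excluded_conflict") / len(decisions)

print(exclusion_rate(
    ["assigned", "excluded_conflict", "assigned", "excluded_conflict"]
))  # 0.5
```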
Choosing Between Python and C#
The system I've shown runs identically in both Python (LangGraph) and C# (Semantic Kernel). The choice between them comes down to your existing team and platform, not any fundamental capability difference.
| Factor | Python / LangGraph | C# / Semantic Kernel |
|---|---|---|
| Statistical analysis | Strong (scipy, statsmodels, pingouin for significance testing) | Adequate (Math.NET Numerics, but less ecosystem) |
| Throughput | Good with asyncio; GIL limits true CPU parallelism | Excellent; async/await + true thread parallelism |
| Integration with .NET storefront | Requires cross-service HTTP calls | Native in-process or sidecar pattern |
| Data science iteration speed | Fast (Jupyter, pandas, rapid prototyping) | Slower; C# tooling not suited for data exploration |
| Enterprise compliance tooling | Maturing | Strong; integrates with Microsoft compliance stack |
| LangGraph-specific features | Native; checkpointing, streaming, time-travel debugging | Not applicable; Semantic Kernel Process is separate |
My recommendation by context:
- Python if your data science team owns the experiment analysis pipeline. The significance testing code lives in the same language as the orchestrator, and you can iterate on it without a compile cycle.
- C# if your storefront is .NET and you want the orchestrator as a sidecar service within the same deployment unit. The throughput characteristics also make it better suited for sustained high-volume event processing without the asyncio tuning overhead.
Hybrid Pattern
Several teams I've seen run the real-time event processing in C# (where latency and throughput matter) and the statistical significance analysis as a Python batch job on a schedule. The two processes share Cosmos DB as their data store. This gives you the best of both ecosystems without forcing a choice.
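The core of that Python batch job is a significance test over conversion counts pulled from Cosmos DB. A minimal sketch using only the standard library — scipy or statsmodels work equally well, and the function signature here is illustrative, not from the article's codebase:

```python
# Two-proportion z-test: the statistical core of the batch significance job.
# Pure stdlib so the batch worker stays dependency-light.
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for H0: both variants convert at the same rate."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# 3.0% vs 3.6% conversion at 10K visitors per arm
p = two_proportion_z_test(300, 10_000, 360, 10_000)
print(f"p = {p:.4f}")
```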
Azure Infrastructure
The production deployment uses these Azure services. I'll call out the ones where the setup choices have cost or reliability implications.
Azure Event Hubs: Use at least two throughput units (TUs) for production. One TU handles 1MB/second ingress; at 200K events/day with typical 500-byte payloads, you're well within 1 TU — but the second TU gives you headroom for traffic spikes. Enable the Capture feature (Avro files to Blob Storage) for replay and audit. This costs ~$15/month for the capture but saves you when you need to reprocess events after a bug fix.
Azure Cosmos DB: Use serverless for event volumes under 5M operations/month. Above that, switch to provisioned throughput with autoscale — set the max RU/s based on your peak event rate × 5 (a conservative RU estimate per journey write). The conflict detection cache lives in Cosmos DB as a separate container with a 7-day TTL, not Redis — it doesn't need sub-millisecond reads, just durable storage.
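The Event Hubs and Cosmos DB sizing above is back-of-envelope arithmetic. A sketch — the 500-byte payload and 5 RU per write come from the text, while the 10x peak-to-average factor is my assumption (measure your own traffic shape):

```python
# Capacity sizing for the guidance above. peak_factor=10 is an assumption.
def eventhub_ingress_kb_per_sec(daily_events: int, payload_bytes: int = 500) -> float:
    return daily_events * payload_bytes / 86_400 / 1024

def cosmos_max_ru(daily_events: int, ru_per_write: int = 5, peak_factor: int = 10) -> float:
    avg_per_sec = daily_events / 86_400
    return avg_per_sec * peak_factor * ru_per_write

print(f"{eventhub_ingress_kb_per_sec(200_000):.2f} KB/s average ingress")  # 1.13
print(f"{cosmos_max_ru(200_000):.0f} RU/s autoscale max")                  # 116
```

Even at 200K events/day, average ingress is around 1 KB/s — far below the 1 MB/s a single throughput unit handles, which is why the second TU is pure spike headroom.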
Azure AI Foundry Agent Service: For teams that don't want to self-host the LangGraph or Semantic Kernel orchestrator, Azure AI Foundry Agent Service handles managed orchestration with built-in tracing, retry logic, and compliance logging. It's more expensive per invocation than self-hosted (~$0.008/step vs ~$0.002 for container-hosted), but the operational overhead savings are significant for teams without dedicated DevOps.
Redis Cache: Use C1 Standard (1GB, dedicated, ~$55/month) as a minimum. The session state keys are short-lived — set TTL to 30 minutes per session. The conflict detection cache (experiment pair results) should live in Cosmos DB, not Redis, to survive Redis flushes.
When NOT to Build This
Skip this system and use a rule engine if any of these apply:
- Under 10,000 MAU. Your traffic volumes don't justify the infrastructure cost or the operational complexity. A spreadsheet-configured rule engine with Optimizely or a simple feature flag service handles this cleanly and cheaply.
- Fewer than 5 concurrent experiments. At low experiment volume, conflicts are rare enough to manage manually. One quick review in your experiment planning meeting catches overlap without any AI.
- Single-product catalogues or simple funnels. If your customer journey is "land, see one product, buy or leave," there's no journey to track. Journey-aware personalization assumes multi-step paths where stage matters.
- No data science capacity on the team. Statistical significance is subtle. The Analytics Agent surfaces winners, but someone needs to interpret p-values, understand sample ratio mismatch, and catch novelty effects. Without that expertise, you'll ship false positives.
- Strict GDPR / behavioral tracking constraints. The system builds behavioral profiles per customer. In jurisdictions with strong behavioral tracking restrictions, the journey event log may require explicit opt-in consent and a robust right-to-erasure implementation. This is buildable, but adds compliance scope that changes the cost-benefit significantly.
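The sample ratio mismatch check mentioned above is one of the few pieces of that statistical expertise that is cheap to automate. A sketch for a two-variant test with an expected 50/50 split — a tiny p-value means assignment itself is broken (caching, bot filtering, redirect loss) and the experiment's results can't be trusted:

```python
# SRM check: is the observed assignment split consistent with 50/50?
from math import sqrt
from statistics import NormalDist

def srm_p_value(n_a: int, n_b: int, expected_a: float = 0.5) -> float:
    n = n_a + n_b
    observed = n_a / n
    se = sqrt(expected_a * (1 - expected_a) / n)
    z = (observed - expected_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

print(srm_p_value(5_000, 5_000))  # balanced: 1.0
print(srm_p_value(5_200, 4_800))  # imbalanced: well below 0.001 -- investigate
```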
The right question isn't "is this technology impressive?" It's "does the problem complexity justify the solution complexity?" For teams with $5M+ annual e-commerce revenue, 50K+ MAU, and five or more experiments running at any time, the answer is usually yes. Below those thresholds, a well-run manual process and a commercial A/B testing tool produce better outcomes at a fraction of the cost.
Key Takeaways
- Real cost at 10K events/day: ~$625/month. The majority is LLM tokens, not infrastructure. Switching stage classification to GPT-4o mini cuts the total by roughly half without meaningful quality loss.
- Experiment conflict detection is the highest-value capability. It's the one thing no commercial A/B testing tool does today, and it's directly measurable: compare contaminated cohort rates before and after.
- Structured logging from day one, not day 100. The correlation ID pattern is cheap to add at build time and extremely expensive to retrofit when you're debugging a bad assignment in production at 2am.
- The hybrid Python + C# pattern is underrated. Use C# for high-throughput event processing and Python for statistical analysis. They share Cosmos DB. You don't have to pick one language for everything.
- Under 10K MAU, this is overkill. Rules + a commercial feature flag tool + a weekly experiment review meeting solve 90% of the problem. Don't build distributed AI systems to solve problems that don't require them.
Read Part 1
Haven't read the architecture and implementation article yet? Part 1 covers the full system design, the LangGraph and Semantic Kernel implementations, and the semantic conflict detection approach.
Cost figures are based on Azure public pricing as of March 2026. Token costs use GPT-4o pay-as-you-go rates; provisioned throughput discounts apply at sustained high volume.
Want More Practical AI Tutorials?
I write about building production AI systems with Azure, Python, and C#. Subscribe for practical tutorials delivered twice a month.
Subscribe to Newsletter →