SaaS Revenue & Churn Strategy Copilot: Part 2 — Production Considerations

March 12, 2026 · 10–12 min read · By Jaffar Kazi

In Part 1, I built a revenue copilot that connects to Stripe, product usage data, and support systems, then uses a four-agent architecture to score churn risk and generate retention recommendations. The implementation works. Now the real questions start.

What does this actually cost to run at 200 sessions per month? How do you build observability that goes beyond "did the LLM call succeed" to "did the recommendation actually lead to a retained account"? What framework should you choose — Python with LangGraph or C# with Semantic Kernel — for a SaaS analytics workload? And, critically, when should you not build this at all?

This article covers all of it with real numbers and honest trade-offs.

Cost Analysis

Let me break down the actual token consumption per session type. All figures use GPT-4o pay-as-you-go pricing: $2.50 per million input tokens, $10.00 per million output tokens.

| Agent Call | Input Tokens | Output Tokens | Cost |
|---|---|---|---|
| Intent classification | ~150 | ~15 | ~$0.0005 |
| Revenue analysis (waterfall + accounts) | ~3,500 | ~600 | ~$0.015 |
| Churn scoring — 150 accounts (8 batches × 20) | ~24,000 | ~6,000 | ~$0.12 |
| Strategy recommendations (top 15 accounts) | ~3,000 | ~1,500 | ~$0.023 |
| Response synthesis | ~2,000 | ~500 | ~$0.010 |
| **Full churn + strategy session** (500 accounts, pre-filtered to 150) | | | ~$0.17 |
| **Revenue overview only** (no churn scoring) | | | ~$0.03 |
| **Average across all session types** | | | ~$0.07–0.12 |

The most expensive component by far is churn scoring, which is exactly why the pre-filtering pattern from Part 1 matters. Without it, scoring all 500 accounts directly would cost ~$0.48 per full session — before the strategy and response steps.
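To make the scaling behavior concrete, here is a minimal sketch of the per-session cost arithmetic. The prices are the pay-as-you-go figures quoted above; the per-batch token counts are assumptions chosen to approximate the table (3,000 input and 750 output tokens per 20-account batch), not measured values.

```python
# GPT-4o pay-as-you-go prices quoted in the article (USD per 1M tokens).
GPT4O_INPUT_PER_M = 2.50
GPT4O_OUTPUT_PER_M = 10.00

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single LLM call in USD."""
    return (input_tokens * GPT4O_INPUT_PER_M
            + output_tokens * GPT4O_OUTPUT_PER_M) / 1_000_000

def churn_scoring_cost(accounts: int, batch_size: int = 20,
                       in_per_batch: int = 3_000,
                       out_per_batch: int = 750) -> float:
    """Scoring cost scales with the number of batches, which is why
    pre-filtering healthy accounts is the biggest lever on session cost."""
    batches = -(-accounts // batch_size)  # ceiling division
    return batches * call_cost(in_per_batch, out_per_batch)
```

With these assumptions, 150 pre-filtered accounts cost ~$0.12 to score (8 batches), while scoring all 500 directly runs 25 batches — the cost grows linearly with portfolio size, but the pre-filter keeps the batch count roughly constant.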

Monthly Infrastructure

| Service | Purpose | Est. Monthly Cost |
|---|---|---|
| Azure OpenAI (pay-as-you-go) | GPT-4o API calls | $15–35 (at 200 sessions) |
| Azure Cosmos DB | Session state + account data cache | $25–50 |
| Azure Container Apps | Copilot API hosting | $30–80 |
| Azure Monitor + App Insights | Observability and alerting | $15–30 |
| Azure Key Vault | API keys and secrets | ~$5 |
| **Total at 200 sessions/month** | | ~$90–200/month |

The Return Calculation

At $200/month all-in, a single retained $15K ARR account per month (a conservative result for a 300+ account SaaS) returns 75× the monthly infrastructure cost: $15,000 of preserved ARR against $200 of spend. The AI token costs are almost irrelevant at this scale; hosting and data storage dominate the bill.

Observability & Debugging

Standard AI observability — logging each LLM call, tracking latency, alerting on errors — is the easy part. The hard problem is outcome tracking: did the recommendations the copilot generated actually lead to retained accounts?

Without outcome tracking, you have a copilot that produces recommendations and no way to know if they're any good. Here's the pattern I use.

Every recommendation is stored in Cosmos DB with a unique ID, the account_id it targets, the session that generated it, and a status field that starts as surfaced. When a CS rep acts on it (logs a call, creates a task, sends an email), the status moves to actioned. At the next renewal date, the status becomes either retained or churned. This gives you a feedback loop that tells you which signal types and recommendation strategies actually work.

services/recommendation_tracker.py
import uuid
from datetime import datetime

async def store_recommendations(
    session_id: str,
    at_risk_accounts: list,
    cosmos_container
) -> list:
    """Persist recommendations so we can track outcomes later."""
    stored = []
    for account in at_risk_accounts[:20]:   # top 20 only
        rec = {
            "id":                 str(uuid.uuid4()),
            "session_id":         session_id,
            "account_id":         account["account_id"],
            "risk_score":         account["risk_score"],
            "risk_tier":          account["risk_tier"],
            "primary_signal":     account["primary_signal"],
            "recommended_action": account["recommended_action"],
            "urgency_days":       account["urgency_days"],
            "status":             "surfaced",   # → actioned → retained | churned
            "created_at":         datetime.utcnow().isoformat(),
            "actioned_at":        None,
            "outcome_at":         None,
            "outcome":            None,
        }
        await cosmos_container.upsert_item(rec)
        stored.append(rec)
    return stored

async def record_outcome(recommendation_id: str, outcome: str, cosmos_container):
    """outcome: 'retained' | 'churned' — called at renewal date."""
    item = await cosmos_container.read_item(recommendation_id, recommendation_id)
    item["outcome"]    = outcome
    item["outcome_at"] = datetime.utcnow().isoformat()
    item["status"]     = outcome
    await cosmos_container.replace_item(recommendation_id, item)

Services/RecommendationTracker.cs
public class RecommendationTracker(CosmosClient cosmosClient, string databaseId)
{
    private Container Container =>
        cosmosClient.GetContainer(databaseId, "recommendations");

    public async Task StoreRecommendationsAsync(
        string sessionId,
        IEnumerable<ChurnSignal> atRiskAccounts)
    {
        // Store top 20 at-risk accounts for outcome tracking
        foreach (var account in atRiskAccounts.Take(20))
        {
            var rec = new RecommendationRecord
            {
                Id                = Guid.NewGuid().ToString(),
                SessionId         = sessionId,
                AccountId         = account.AccountId,
                RiskScore         = account.RiskScore,
                RiskTier          = account.RiskTier,
                PrimarySignal     = account.PrimarySignal,
                RecommendedAction = account.RecommendedAction,
                UrgencyDays       = account.UrgencyDays,
                Status            = "surfaced",   // → actioned → retained | churned
                CreatedAt         = DateTime.UtcNow
            };
            await Container.UpsertItemAsync(rec, new PartitionKey(rec.Id));
        }
    }

    public async Task RecordOutcomeAsync(string recommendationId, string outcome)
    {
        // outcome: "retained" | "churned" — called at account renewal date
        var response = await Container.ReadItemAsync<RecommendationRecord>(
            recommendationId, new PartitionKey(recommendationId));

        var rec        = response.Resource;
        rec.Outcome    = outcome;
        rec.OutcomeAt  = DateTime.UtcNow;
        rec.Status     = outcome;

        await Container.ReplaceItemAsync(rec, recommendationId);
    }
}

Beyond outcome tracking, the metrics that actually matter for day-to-day debugging are: agent call latency (churn scoring batches should complete in under 8 seconds each), cache hit rate on the DataConnector (should be above 70% during business hours), and the pre-filter ratio (what percentage of accounts are being filtered — if it drops below 40%, your signal thresholds may need tuning).
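The three metrics above can be rolled up from per-session telemetry. A minimal sketch follows; the record field names (`batch_latencies_s`, `cache_hits`, and so on) are assumptions for illustration, not fields the copilot actually emits.

```python
def health_metrics(sessions: list[dict]) -> dict:
    """Aggregate the three day-to-day health metrics from session logs."""
    latencies = [t for s in sessions for t in s["batch_latencies_s"]]
    cache_hits = sum(s["cache_hits"] for s in sessions)
    cache_lookups = sum(s["cache_lookups"] for s in sessions)
    filtered = sum(s["accounts_filtered"] for s in sessions)
    total = sum(s["accounts_total"] for s in sessions)
    return {
        "max_batch_latency_s": max(latencies),       # want < 8s per batch
        "cache_hit_rate": cache_hits / cache_lookups,  # want > 0.70 in business hours
        "prefilter_ratio": filtered / total,         # want > 0.40, else tune thresholds
    }
```

Computing these from raw logs rather than relying on dashboard averages makes the thresholds (8 seconds, 70%, 40%) directly alertable.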

What to Alert On

Alert on three things: any LLM call that returns non-JSON when JSON is expected (structured output failures), data freshness failures when the DataConnector can't reach a source within 30 seconds, and risk score distribution anomalies (if more than 40% of your portfolio suddenly scores as high/critical, either the model is misfiring or something genuinely bad just happened in your business).
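The three alert conditions can be expressed as simple predicates. This is a sketch under stated assumptions — the thresholds come from the paragraph above, while the input shapes (a raw response string, a fetch duration, a list of tier labels) are simplifications of what the real telemetry pipeline would carry.

```python
import json

def structured_output_failed(raw_llm_response: str) -> bool:
    """Alert: a call that should return JSON didn't parse as JSON."""
    try:
        json.loads(raw_llm_response)
        return False
    except json.JSONDecodeError:
        return True

def data_freshness_failed(fetch_seconds: float, timeout_s: float = 30.0) -> bool:
    """Alert: the DataConnector couldn't reach a source within 30 seconds."""
    return fetch_seconds > timeout_s

def risk_distribution_anomaly(risk_tiers: list[str],
                              threshold: float = 0.40) -> bool:
    """Alert: more than 40% of the portfolio scores high/critical."""
    elevated = sum(1 for tier in risk_tiers if tier in ("high", "critical"))
    return elevated / len(risk_tiers) > threshold
```

The distribution check is deliberately ambiguous by design: it fires whether the model is misfiring or the business genuinely deteriorated, and a human decides which.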

Why Choose Python or C# for This

Both implementations work. The choice depends less on technical capability and more on your team's existing context.

| Factor | Python (LangGraph) | C# (Semantic Kernel) |
|---|---|---|
| Ecosystem maturity for this use case | Richer — LangGraph, LangSmith tracing, pandas for data prep | Strong — SK is production-ready; fewer analytics libraries |
| Data source integration | Excellent — Stripe SDK, SQLAlchemy, most SaaS APIs have Python clients | Good — official SDKs available; fewer community wrappers |
| Type safety on financial data | Requires discipline (TypedDict, dataclasses, mypy) | Strong by default — compiler catches decimal vs float errors |
| Enterprise compliance environments | Fine, but .NET is sometimes preferred by infosec teams | Preferred in many enterprise .NET shops |
| Async concurrency for batch scoring | asyncio.gather() — excellent | Task.WhenAll() — excellent |
| Existing team skills | Better if your backend uses Django/FastAPI | Better if your backend is ASP.NET Core |
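The concurrency row deserves a concrete illustration, since batch scoring is where fan-out matters most. Here is a minimal Python sketch of the pattern; `score_batch` is a stand-in for the real LLM scoring call, and the C# equivalent replaces `asyncio.gather` with `Task.WhenAll`.

```python
import asyncio

async def score_batch(batch: list[str]) -> list[tuple[str, float]]:
    # Stand-in for an LLM scoring call; here every account scores 0.5.
    await asyncio.sleep(0)
    return [(account_id, 0.5) for account_id in batch]

async def score_all(account_ids: list[str],
                    batch_size: int = 20) -> list[tuple[str, float]]:
    """Split accounts into batches and score them concurrently."""
    batches = [account_ids[i:i + batch_size]
               for i in range(0, len(account_ids), batch_size)]
    results = await asyncio.gather(*(score_batch(b) for b in batches))
    # gather preserves batch order, so results flatten deterministically
    return [scored for batch in results for scored in batch]
```

Because the batches run concurrently, wall-clock time for a full scoring pass approaches the latency of the slowest batch rather than the sum of all batches.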

My recommendation: if you have a SaaS backend that's already Python, use Python — the data pipeline integration will be smoother, and LangGraph's graph model maps naturally to the orchestration pattern in Part 1. If your backend is .NET, use C# with Semantic Kernel — the type system will catch financial data errors at compile time that Python only catches at runtime, which matters when you're dealing with MRR figures.

The LangSmith Argument for Python

LangSmith (LangChain's observability platform) integrates directly with LangGraph and gives you free session tracing that visualises the entire agent graph execution. For a team that's still tuning the churn scoring prompts, this visibility is genuinely useful and not trivially replicated in the C# ecosystem today.

Azure Infrastructure

The minimal production setup for this copilot uses five Azure services:

  • Azure OpenAI — GPT-4o deployment. Use pay-as-you-go for low volumes (<500 sessions/month). Move to provisioned throughput (PTU) if you're running 2,000+ sessions/month — the economics flip around that point.
  • Azure Cosmos DB (NoSQL API) — Session state, the 4-hour account data cache, and recommendation outcome tracking. Serverless mode works well here; the access pattern is bursty during business hours and quiet overnight.
  • Azure Container Apps — Host the copilot API. Scale-to-zero handles the overnight quiet period; scale-up handles concurrent CS team usage during the day. No always-on VMs needed at this volume.
  • Azure Monitor + Application Insights — Standard telemetry. Add a custom metric for recommendation outcome rates — it's the most important business metric to track.
  • Azure Key Vault — Store Stripe API keys, database connection strings, and the OpenAI API key. Don't put these in environment variables in production.
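The pay-as-you-go versus PTU decision in the first bullet is just a break-even calculation. A hedged sketch, with the average per-session cost taken from the earlier table and the PTU monthly figure left as a placeholder — check current Azure pricing before relying on the crossover point:

```python
def payg_monthly_cost(sessions: int,
                      avg_cost_per_session: float = 0.12) -> float:
    """Pay-as-you-go spend scales linearly with session volume."""
    return sessions * avg_cost_per_session

def cheaper_option(sessions: int, ptu_monthly_cost: float,
                   avg_cost_per_session: float = 0.12) -> str:
    """PTU is a flat monthly commitment; it wins once volume crosses
    the point where linear pay-as-you-go spend exceeds it."""
    payg = payg_monthly_cost(sessions, avg_cost_per_session)
    return "payg" if payg <= ptu_monthly_cost else "ptu"
```

At 200 sessions/month the pay-as-you-go bill is around $24, nowhere near any plausible PTU commitment, which is why the article recommends pay-as-you-go at this volume.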

Azure AI Foundry Agent Service

Azure AI Foundry Agent Service is worth considering as an alternative to self-managing the LangGraph orchestration. It provides managed agent execution, built-in thread management, and native Azure OpenAI integration. The trade-off is less control over the graph structure — but for teams who want operational simplicity over customisation, it's a reasonable choice for this use case.

When NOT to Build This

Five specific scenarios where you shouldn't build a revenue copilot:

1. Under $500K ARR. At this stage, you know all your customers personally. A Stripe dashboard and a weekly Notion doc with your top 20 accounts is more than sufficient. Adding AI at this point is complexity without leverage.

2. You already have Gainsight or ChurnZero. These platforms have dedicated CS workflow integrations, purpose-built churn scoring models trained on millions of SaaS accounts, and years of UX refinement. Don't rebuild a worse version. Use the existing tool properly instead.

3. Your churn is a product problem. If customers are leaving because the product doesn't solve their core problem, no amount of proactive outreach will sustainably improve retention. The copilot surfaces at-risk accounts — it can't fix missing features, poor onboarding, or product-market fit issues. Diagnose the cause of churn before building tooling to manage its symptoms.

4. Your data is incomplete or inconsistently collected. Churn risk scores generated from partial usage data, missing NPS records, and untagged support tickets are worse than no scores — they give false confidence in assessments that aren't grounded in reality.

5. Your CS team is already at capacity. If your CS team doesn't have bandwidth to act on additional at-risk flags, generating more of them creates anxiety without improvement. Fix capacity first.

The Most Common Mistake

Building the AI layer before cleaning the data layer. Churn risk scores generated from inconsistent data don't just have low accuracy — they actively mislead, because they appear authoritative. A system that says "Acme Corp: 87% churn risk" based on stale NPS data and a wrongly-tagged support ticket is worse than no system at all.

Key Takeaways

Across both articles, the lessons that actually matter in production:

  • Pre-compute all financial figures. Never ask the LLM to do arithmetic on revenue data. It will hallucinate, and the errors will look plausible.
  • Pre-filter before scoring. The 65–73% reduction in token costs from filtering healthy accounts before LLM scoring is the single most impactful optimisation in this system.
  • Track outcomes, not just outputs. A recommendation log that tells you whether CS acted and whether the account retained is the only feedback loop that matters for improving the copilot over time.
  • Cost is not the barrier. At $90–200/month infrastructure and $0.07–0.17 per session in AI costs, the question isn't whether you can afford to run this — it's whether your data and team are ready for it.
  • Know when to stop. The five "when not to build this" scenarios are not caveats. If any of them apply, fix the underlying problem first.

Start With One Question

If you're evaluating whether to build this, start with a single copilot query: "Which accounts with renewal in the next 60 days show the most churn signals?" Run that against your real data in a prototype. If the results surprise your CS team in a good way (surfacing things they didn't know), you have a use case. If they shrug, you may not.

You can read the architecture and core implementation in Part 1.


Cost figures based on Azure OpenAI GPT-4o pay-as-you-go pricing and Azure service list prices as of March 2026. Infrastructure costs will vary by region and usage patterns.


Written by Jaffar Kazi, a software engineer in Sydney building AI-powered applications. Connect on LinkedIn.