Financial Advice Agent: Personal Portfolio & Scenario Planner — Part 2: Production Considerations

April 16, 2026 11 min read By Jaffar Kazi
AI Development Production AI Cost Analysis
Python C# Azure

This is Part 2

Part 1 covered the architecture, state model, LangGraph orchestration, and how to keep financial calculations deterministic. Read Part 1 first if you haven't — this part assumes you've seen the implementation.

What Nobody Tells You About AI Finance Tools

The demo works. You've got GPT-4o routing between four tools, fetching portfolio data from Cosmos DB, running compound growth calculations, and explaining the results in plain English. It's genuinely impressive.

Then someone asks: "What does this cost to run per user per month?" And you realise you haven't thought about it.

This part is about the questions that come after the demo: token costs, debugging a system that touched three tools to answer one question, choosing between Python and C#, and — most importantly — the situations where you should step back and not use AI at all.

Cost Analysis: Real Numbers

I tracked actual token usage across 50 test sessions. Here's what a typical session looks like:

Operation                      Input Tokens   Output Tokens   Est. Cost (GPT-4o)
Intent classification          ~350           ~20             ~$0.0012
Portfolio analysis response    ~1,200         ~400            ~$0.008
Scenario narration             ~1,400         ~450            ~$0.009
Risk assessment response       ~1,600         ~500            ~$0.011
Typical 5-turn session total   ~8,000         ~2,000          ~$0.06

At $0.06 per session, 1,000 active sessions/month costs roughly $60. That's reasonable for an internal tool. It becomes notable if you're scaling to tens of thousands of users.
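A quick sketch of the arithmetic behind that table. The per-1K-token prices below are illustrative placeholders I've chosen so the numbers line up with the table — substitute current Azure OpenAI rates before relying on them:

```python
# Assumed prices for illustration only -- check current Azure OpenAI
# pricing before using these numbers for real budgeting.
INPUT_PRICE_PER_1K = 0.005   # USD per 1K input tokens (assumption)
OUTPUT_PRICE_PER_1K = 0.010  # USD per 1K output tokens (assumption)

def session_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one session from its token counts."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K \
         + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

def monthly_cost(sessions_per_month: int, avg_input: int, avg_output: int) -> float:
    """Scale the per-session estimate to a monthly figure."""
    return sessions_per_month * session_cost(avg_input, avg_output)
```

Keeping the estimate in code rather than a spreadsheet means you can re-run it against logged token counts whenever pricing changes.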

Watch Your Context Window

The cost spike I didn't anticipate was from long sessions. A user who asks 15 questions in one sitting accumulates scenario history that gets passed into every subsequent call. By turn 15, you might be passing 4,000 tokens of context. Implement a sliding window cap — I use the 6 most recent turns plus the current holdings summary. Beyond that, archive to Cosmos DB and let the user explicitly "recall" old scenarios.
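A minimal sketch of that sliding window, assuming turns are stored as chat-style message dicts (the helper name and shape are mine, not from Part 1):

```python
MAX_TURNS = 6  # sliding window: only the most recent turns enter the prompt

def build_context(holdings_summary: str, turns: list[dict]) -> list[dict]:
    """Assemble the message list passed to the model: the current
    holdings summary plus the most recent MAX_TURNS conversation turns.
    Older turns are assumed to be archived to Cosmos DB and recalled
    only when the user explicitly asks for an old scenario."""
    recent = turns[-MAX_TURNS:]
    system = {"role": "system", "content": f"Current holdings: {holdings_summary}"}
    return [system] + recent
```

The cap is deliberate: it bounds per-call input tokens regardless of how long the session runs.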

Observability: Debugging a Three-Tool Chain

The hardest debugging problem with multi-tool agents isn't finding bugs in individual tools — it's understanding why the orchestrator made the routing decision it did, and which tool produced the data the LLM narrated.

In practice, you'll find that an unexpectedly bad response is almost always one of three things: wrong intent classification, stale market data, or a scenario calculation with parameters the user didn't intend to set.

observability/tracer.py / Observability/AgentTracer.cs
import logging
import os
from contextlib import contextmanager

from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

configure_azure_monitor(connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"])
tracer = trace.get_tracer("portfolio-agent")
logger = logging.getLogger("portfolio-agent")

@contextmanager
def traced_tool_call(tool_name: str, inputs: dict):
    """Context manager for tracing tool invocations."""
    with tracer.start_as_current_span(f"tool.{tool_name}") as span:
        span.set_attribute("tool.name", tool_name)
        span.set_attribute("tool.inputs", str(inputs)[:500])  # truncate for safety
        logger.info(
            "Tool invoked",
            extra={
                "tool": tool_name,
                "inputs_summary": {k: type(v).__name__ for k, v in inputs.items()},
                "session_id": inputs.get("session_id", "unknown"),
            }
        )
        yield span  # caller fills in outputs after execution

# Usage
async def run_portfolio_analyser(state: PortfolioState) -> PortfolioState:
    inputs = {"user_id": state["user_id"], "holdings_count": len(state["holdings"])}
    with traced_tool_call("portfolio_analyser", inputs) as span:
        result = await _do_portfolio_analysis(state)
        span.set_attribute("tool.result_keys", str(list(result.keys())))
        return result

Observability/AgentTracer.cs
using System.Diagnostics;
using Azure.Monitor.OpenTelemetry.Exporter;
using OpenTelemetry;
using OpenTelemetry.Trace;
using Microsoft.Extensions.Logging;

public class AgentTracer
{
    private static readonly ActivitySource Source = new("portfolio-agent");
    private readonly ILogger<AgentTracer> _logger;

    public AgentTracer(ILogger<AgentTracer> logger) => _logger = logger;

    public IDisposable TraceToolCall(string toolName, Dictionary<string, object> inputs)
    {
        var activity = Source.StartActivity($"tool.{toolName}");
        activity?.SetTag("tool.name", toolName);
        activity?.SetTag("tool.session_id", inputs.GetValueOrDefault("sessionId", "unknown"));

        _logger.LogInformation(
            "Tool invoked: {ToolName} with {InputCount} inputs",
            toolName,
            inputs.Count);

        // Activity implements IDisposable; fall back to a no-op when
        // no listener sampled this activity and StartActivity returned null.
        return (IDisposable?)activity ?? NoopDisposable.Instance;
    }

    private sealed class NoopDisposable : IDisposable
    {
        public static readonly NoopDisposable Instance = new();
        public void Dispose() { }
    }
}

// Registration in Program.cs
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .AddSource("portfolio-agent")
        .AddAzureMonitorTraceExporter(o =>
            o.ConnectionString = config["ApplicationInsights:ConnectionString"]));

The most useful signal I added was logging the intent classification decision alongside the raw user message. When the system routes incorrectly — say, classifying "what's my risk exposure?" as a portfolio analysis rather than a risk assessment — you can see exactly what the classifier received and what it returned. That makes tuning the system prompt surgical rather than guesswork.

Financial Audit Trail

Beyond debugging, there's a compliance angle: if your tool is used in a regulated context, you need an audit trail of what data was shown to users and when. Log every tool result (not just the LLM narration) to a separate immutable store. Cosmos DB's change feed is useful here — enable it on the audit container and stream records to Azure Event Hub for downstream processing.
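One way to shape those audit records — a sketch only; the actual write via `create_item()` on a dedicated Cosmos DB audit container (with change feed enabled) is assumed, not shown:

```python
import hashlib
import json
from datetime import datetime, timezone

def build_audit_record(session_id: str, tool_name: str, tool_result: dict) -> dict:
    """Build an append-only audit record for one tool result. The
    SHA-256 content hash makes later tampering detectable; the record
    is intended for a write-once Cosmos DB container whose change feed
    streams to Event Hub for downstream processing."""
    payload = json.dumps(tool_result, sort_keys=True, default=str)
    now = datetime.now(timezone.utc).isoformat()
    return {
        "id": f"{session_id}:{tool_name}:{now}",
        "sessionId": session_id,
        "tool": tool_name,
        "result": tool_result,          # the data actually shown to the user
        "resultSha256": hashlib.sha256(payload.encode()).hexdigest(),
        "recordedAt": now,
    }
```

Logging the tool result rather than the LLM narration matters: the narration is non-deterministic, but the underlying numbers are what a regulator will ask about.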

Python vs C#: Which One to Use

Python Implementation

Why choose Python: If your team writes Python, you get access to the richest AI/ML ecosystem available right now.

  • LangGraph — mature graph-based agent framework with built-in checkpointing and state persistence
  • Rapid iteration — Jupyter notebooks let you test individual tools interactively before wiring them into the graph
  • Financial libraries — numpy, pandas, and scipy are readily available if your scenario calculations need statistical analysis
  • Community — most new Azure OpenAI patterns appear in Python first

C#/.NET Implementation

Why choose C#: If you're building this as part of a .NET backend — or if your team is primarily a C# shop — Semantic Kernel gives you first-party Microsoft support and familiar patterns.

  • Native Azure integration — first-party SDKs, Microsoft-maintained, strong typing throughout
  • Enterprise patterns — dependency injection, middleware pipeline, familiar to .NET teams
  • decimal precision — .NET's decimal type is built for financial calculations; no need to import a separate library
  • ASP.NET Core integration — drop the agent into an existing .NET API with minimal ceremony

The Bottom Line

Python team? Use Python. C#/.NET team? Use C#. For financial tools specifically, C#'s native decimal precision is genuinely useful — but not so useful that it's worth fighting your stack.

Azure Infrastructure

The minimum viable setup for running this in Azure:

  • Azure OpenAI (GPT-4o deployment) — orchestration and narration
  • Azure Cosmos DB (NoSQL API) — portfolio data and scenario history persistence
  • Azure Cache for Redis — market data caching with 15-minute TTL
  • Azure App Service or Container Apps — hosting the API layer
  • Azure Monitor + Application Insights — tracing and cost visibility
  • Azure Key Vault — connection strings, API keys

Cosmos DB Cost Is the Surprise

Azure OpenAI is the obvious cost line, but Cosmos DB can surprise you. If you're storing scenario history per session and sessions are long, you accumulate a lot of small documents. Use serverless capacity mode for development and low-traffic production, and switch to provisioned throughput once load is predictable — at around 1,000 active users, provisioned with autoscale works out cheaper than serverless.

Azure AI Foundry Agent Service

Azure AI Foundry Agent Service is now generally available, providing managed orchestration for AI systems.

  • Built-in routing and workflows — you could use this instead of LangGraph/Semantic Kernel for the orchestration layer
  • Managed state persistence — replaces the MemorySaver / session dictionary pattern
  • Native Azure OpenAI integration
  • Observability through Azure Monitor out of the box

Check Azure AI Foundry Agent Service for current pricing.

When NOT to Use AI for Financial Planning

This is the section most tutorials skip. The architecture works. The costs are manageable. But there are situations where you should not deploy this kind of tool.

Regulated Advice Contexts

If your tool will be used by members of the public to make actual investment decisions, you are likely operating in regulated territory — ASIC in Australia, FCA in the UK, SEC in the US. "AI-generated projections" does not exempt you from financial advice regulation. If this is your context, engage a compliance lawyer before you ship, not after.

When Accuracy Tolerance Is Zero

This system is safe for projections because the LLM narrates deterministic calculations. But if your use case requires the LLM to interpret ambiguous regulatory text, classify transactions against compliance rules, or make any high-stakes binary decision — step back. The LLM's interpretation may be wrong in ways that are hard to detect and have real consequences.

Simpler alternatives to consider:

  • Static calculator — if users just need a projection with fixed inputs, a JavaScript calculator with no AI is faster, cheaper, and auditable
  • Rule-based system — if your "risk assessment" is actually a scoring model with fixed rules, implement it as a rules engine, not an LLM
  • Human adviser + AI summary — for regulated contexts, use AI to prepare a summary for a human adviser to review, not to replace the adviser
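To make the rules-engine point concrete, here's a toy risk scorer with hypothetical thresholds — the kind of fixed, auditable logic that doesn't need an LLM at all:

```python
# Hypothetical scoring rules for illustration -- real thresholds would
# come from your risk framework, not from this sketch.
RULES = [
    (lambda p: p["equity_pct"] > 80, 3),            # heavily equity-weighted
    (lambda p: p["single_stock_max_pct"] > 25, 2),  # concentration risk
    (lambda p: p["cash_pct"] < 5, 1),               # thin cash buffer
]

def risk_score(portfolio: dict) -> int:
    """Sum the points of every rule the portfolio trips."""
    return sum(points for rule, points in RULES if rule(portfolio))

def risk_band(score: int) -> str:
    """Map a numeric score to a band with fixed, explainable cutoffs."""
    return "high" if score >= 4 else "medium" if score >= 2 else "low"
```

Every output of this scorer can be traced to a specific rule and threshold — exactly the property an LLM classifier can't give you.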

Key Takeaways

If I were starting this project again:

  • Separate calculation from narration from day one. The LLM's job is language, not maths. Every dollar figure comes from deterministic code.
  • Cap your context window deliberately. Sliding window of 6 turns is not a hack — it's a cost control mechanism that also keeps responses focused.
  • Trace intent classification separately. Routing errors are your biggest source of bad responses. Knowing what the classifier decided (and why) makes debugging 10x faster.
  • Start with serverless Cosmos DB. Switch to provisioned when you can predict load. Paying for RUs you don't use in development adds up.
  • Know your regulatory context. Personal tool? Fine. Public product? Get legal involved early.

Read Part 1

Part 1 covers the full architecture, LangGraph orchestration, the Scenario Planner tool, and how to handle multi-turn context.

Read Part 1 →

Want More Practical AI Tutorials?

I write about building production AI systems with Azure, Python, and C#. Subscribe for practical tutorials delivered twice a month.

Subscribe to Newsletter →

Written by Jaffar Kazi, a software engineer in Sydney building AI-powered applications. Connect on LinkedIn.