Everyone focuses on the transcript. The transcript is the easy part.
In Part 1, I built the full meeting summary pipeline: Azure AI Speech for transcription, speaker map resolution, and three specialized extraction agents for action items, decisions, and open questions. The pipeline works. But before you run it in production across a 50-person organisation, there are numbers you need to see and architectural decisions you need to make.
This part covers what happens after the demo: real cost per meeting, what to do when an agent silently drops action items, the honest comparison between LangGraph and Semantic Kernel for this specific use case, and five scenarios where you shouldn't use AI summarization at all.
What This Article Covers
- Full cost breakdown per meeting (tokens + Speech + infrastructure)
- Observability patterns for tracing extraction failures
- Python vs C# decision framework for this use case
- Azure infrastructure requirements
- When AI meeting summarization is the wrong tool
Reading time: 11 minutes | Read Part 1 first for implementation context
Cost Analysis: What a Meeting Actually Costs
The cost has two distinct components that most estimates overlook: the LLM extraction cost and the Azure AI Speech transcription cost. For a 45-minute meeting with 4 participants, here's how the numbers break down.
Azure AI Speech (Transcription + Diarization)
Azure AI Speech charges by audio minutes. As of early 2026, the standard rate with speaker diarization is approximately $0.006/minute for batch transcription. A 45-minute meeting costs about $0.27.
Diarization adds cost
Speaker diarization is billed as an add-on and adds roughly 50% to the base transcription cost. If you don't need speaker attribution (e.g., one-on-one meetings where attribution is obvious), you can skip diarization and cut Speech costs significantly.
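The arithmetic above is simple enough to fold into a cost model. A minimal sketch, using the illustrative rates from this article (base ~$0.004/min, diarization +50%) rather than authoritative pricing:

```python
# Rates taken from the figures in this article -- they are estimates,
# so verify against current Azure AI Speech pricing before budgeting.
BASE_RATE_PER_MIN = 0.004
DIARIZATION_MULTIPLIER = 1.5

def speech_cost(minutes: float, diarization: bool = True) -> float:
    """Estimated Azure AI Speech cost for one meeting, in USD."""
    rate = BASE_RATE_PER_MIN * (DIARIZATION_MULTIPLIER if diarization else 1.0)
    return minutes * rate

print(f"45-min with diarization:    ${speech_cost(45):.2f}")                     # $0.27
print(f"45-min without diarization: ${speech_cost(45, diarization=False):.2f}")  # $0.18
```

The without-diarization number makes the trade-off concrete: skipping the add-on saves about a third of the Speech cost per meeting.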
Azure OpenAI (GPT-4o) Token Costs
A 45-minute meeting generates roughly 6,000–8,000 words of transcript. Each extraction agent receives the full transcript plus a system prompt. The synthesizer receives the extraction results. Here's the token breakdown:
| Agent | Input Tokens | Output Tokens | Cost (GPT-4o) |
|---|---|---|---|
| Speaker Map Builder | ~800 | ~100 | $0.003 |
| Action Item Extractor | ~9,500 | ~400 | $0.051 |
| Decision Extractor | ~9,500 | ~300 | $0.048 |
| Question Tracker | ~9,500 | ~200 | $0.046 |
| Summary Synthesizer | ~1,200 | ~500 | $0.009 |
| Total LLM | ~30,500 | ~1,500 | $0.157 |
The biggest optimization opportunity is that each extractor currently receives the full transcript. If you chunk the transcript and process sections in parallel, you reduce per-agent token count at the cost of increased complexity. For most teams, the full-transcript approach is simpler and the cost difference is negligible.
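To sanity-check the table for your own transcript lengths, per-call cost is a one-liner. The per-million-token rates below are placeholders that approximately reproduce the table's figures; substitute the current Azure OpenAI rates for your model deployment:

```python
# Placeholder per-million-token rates -- substitute the current
# Azure OpenAI prices for your GPT-4o deployment before relying on this.
INPUT_RATE = 5.00    # USD per 1M input tokens
OUTPUT_RATE = 10.00  # USD per 1M output tokens

def llm_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single LLM call at the rates above."""
    return (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE) / 1_000_000

# Action Item Extractor from the table: ~9,500 input / ~400 output tokens
print(f"${llm_cost(9_500, 400):.3f}")
```

Because the three extractors dominate input tokens, this function also makes the chunking trade-off easy to model: halving per-agent input roughly halves the per-agent cost.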
Total cost per meeting
| Component | 45-min Meeting | Monthly (20 meetings/week) |
|---|---|---|
| Azure AI Speech | $0.27 | $21.60 |
| Azure OpenAI (GPT-4o) | $0.16 | $12.80 |
| Infrastructure (Cosmos DB, Blob, Functions) | ~$0.02 | ~$6.50 |
| Total | ~$0.45 | ~$40.90/month |
$41/month to automatically process 80 meetings. If your team currently spends even 5 minutes per meeting on manual notes, you're recovering 6+ hours a month for less than the cost of a lunch.
Observability and Debugging
The hardest problems to debug in extraction pipelines aren't crashes — they're silent drops. An agent receives the transcript, processes it, and returns an empty array when it should have found three action items. No exception, no error, just missing data. Without structured logging, you have no idea which agent dropped what, or why.
Structured extraction logging
I log a structured event for every extraction result, including the agent name, item count, any items that fell below the confidence threshold, and the raw LLM response. This gives me a complete audit trail for any meeting.
```python
import json

import structlog
from opentelemetry import trace

# llm, format_segments, ACTION_EXTRACTION_PROMPT, and MeetingState
# come from the Part 1 implementation.
logger = structlog.get_logger()
tracer = trace.get_tracer("meeting-summary-agent")

async def extract_action_items(state: MeetingState) -> MeetingState:
    with tracer.start_as_current_span("extract_action_items") as span:
        span.set_attribute("meeting.id", state["meeting_id"])
        span.set_attribute("transcript.segments", len(state["segments"]))

        transcript = format_segments(state["segments"])
        response = await llm.ainvoke(
            ACTION_EXTRACTION_PROMPT.format(transcript=transcript),
            response_format={"type": "json_object"}
        )

        try:
            parsed = json.loads(response.content)
            # JSON mode guarantees a top-level object; the Part 1 prompt
            # nests the list under "action_items"
            raw_items = parsed.get("action_items", []) if isinstance(parsed, dict) else parsed
            high_confidence = [i for i in raw_items if i.get("confidence", 0) >= 0.7]
            low_confidence = [i for i in raw_items if i.get("confidence", 0) < 0.7]

            span.set_attribute("action_items.found", len(raw_items))
            span.set_attribute("action_items.kept", len(high_confidence))
            span.set_attribute("action_items.filtered", len(low_confidence))

            logger.info(
                "action_item_extraction_complete",
                meeting_id=state["meeting_id"],
                total_found=len(raw_items),
                kept=len(high_confidence),
                filtered_low_confidence=len(low_confidence),
                filtered_items=[i["task"] for i in low_confidence],  # log what was dropped
            )
        except (json.JSONDecodeError, KeyError, AttributeError) as e:
            span.set_attribute("error", str(e))
            logger.error(
                "action_item_extraction_failed",
                meeting_id=state["meeting_id"],
                error=str(e),
                raw_response=response.content[:500],  # first 500 chars for debugging
            )
            high_confidence = []

        return {**state, "action_items": high_confidence}
```
The Semantic Kernel version follows the same pattern, using an `ActivitySource` for tracing:

```csharp
using System.Diagnostics;
using System.Text.Json;
using Microsoft.Extensions.Logging;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Connectors.OpenAI;

// MeetingState, ActionItemDto, ActionItem, FormatSegments, and the
// extraction prompt come from the Part 1 implementation.
public class ActionItemPlugin
{
    private readonly Kernel _kernel;
    private readonly ILogger<ActionItemPlugin> _logger;
    private static readonly ActivitySource ActivitySource =
        new("MeetingSummaryAgent");

    public ActionItemPlugin(Kernel kernel, ILogger<ActionItemPlugin> logger)
    {
        _kernel = kernel;
        _logger = logger;
    }

    public async Task<MeetingState> ExtractAsync(MeetingState state)
    {
        using var activity = ActivitySource.StartActivity("ExtractActionItems");
        activity?.SetTag("meeting.id", state.MeetingId);
        activity?.SetTag("transcript.segments", state.Segments.Count);

        var transcript = FormatSegments(state.Segments);
        var settings = new OpenAIPromptExecutionSettings
        {
            ResponseFormat = "json_object",
            Temperature = 0.1
        };

        string result;
        try
        {
            result = await _kernel.InvokePromptAsync<string>(
                ActionExtractionPrompt,
                new KernelArguments(settings) { ["transcript"] = transcript })
                ?? "[]"; // Treat a null completion as an empty result
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "LLM call failed for action extraction. MeetingId: {MeetingId}",
                state.MeetingId);
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            return state; // Return state unchanged, don't crash the pipeline
        }

        List<ActionItemDto> rawItems;
        try
        {
            rawItems = JsonSerializer.Deserialize<List<ActionItemDto>>(result) ?? [];
        }
        catch (JsonException ex)
        {
            _logger.LogError(ex,
                "JSON parse failed. MeetingId: {MeetingId}, RawResponse: {Response}",
                state.MeetingId, result[..Math.Min(500, result.Length)]);
            rawItems = [];
        }

        var kept = rawItems.Where(i => i.Confidence >= 0.7).ToList();
        var filtered = rawItems.Where(i => i.Confidence < 0.7).ToList();

        activity?.SetTag("action_items.found", rawItems.Count);
        activity?.SetTag("action_items.kept", kept.Count);
        activity?.SetTag("action_items.filtered", filtered.Count);

        _logger.LogInformation(
            "Action item extraction complete. MeetingId: {MeetingId}, Found: {Found}, Kept: {Kept}, Filtered: {Filtered}",
            state.MeetingId, rawItems.Count, kept.Count, filtered.Count);

        if (filtered.Count > 0)
        {
            _logger.LogDebug("Filtered items: {Items}",
                string.Join(", ", filtered.Select(i => i.Task)));
        }

        state.ActionItems = kept
            .Select(i => new ActionItem(i.Owner, i.Task, i.Deadline, i.Confidence))
            .ToList();
        return state;
    }
}
```
Key Observability Metrics to Track
- Items found vs kept per agent — A high filter rate indicates the confidence threshold may be too strict, or the meeting genuinely had no clear commitments
- LLM call latency per agent — Each agent adds 2–5 seconds; pipeline total should stay under 30 seconds for user experience
- JSON parse failure rate — Even with JSON mode enabled, parsing occasionally fails; track this to know when to add retry logic
- Cost per meeting — Log token counts from each LLM call and aggregate to track actual spend vs estimate
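The parse-failure metric usually leads to a retry wrapper. A minimal sketch, assuming the same call shape used in the extraction code above (a zero-argument async function returning an object with a `.content` string):

```python
import asyncio
import json

async def invoke_json_with_retry(call, max_attempts: int = 2):
    """Call an async LLM function, retrying when the reply isn't valid JSON.

    `call` is any zero-argument coroutine function whose result exposes a
    `.content` string (a hypothetical helper shape, matching the responses
    used above). Returns the parsed JSON, or raises the last decode error.
    """
    last_err = None
    for attempt in range(max_attempts):
        response = await call()
        try:
            return json.loads(response.content)
        except json.JSONDecodeError as err:
            last_err = err
            await asyncio.sleep(0.5 * (attempt + 1))  # brief backoff between attempts
    raise last_err
```

Track how often the second attempt is actually needed: a rising retry rate usually signals a prompt or model regression rather than transient noise.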
Python vs C#: A Decision Framework for This Use Case
I've implemented this pipeline in both languages. The choice isn't about capability — both can do everything described in Part 1. It's about which trade-offs match your situation.
Why Choose Python (LangGraph)
LangGraph's strength is in its graph visualization and debugging tools. When you're iterating on the extraction pipeline — tuning prompts, adjusting confidence thresholds, changing the agent sequence — being able to visualize the state machine and step through it with LangSmith's tracing is genuinely valuable. Python also offers the faster path to prototyping and a larger ecosystem of audio processing libraries if you need pre-processing beyond what Azure Speech provides. Choose Python when:
- You're in a data/ML-heavy environment where Python is the primary language
- You want LangSmith for visual trace debugging
- You're still iterating on the extraction pipeline design
- Your team is more comfortable with Python async patterns
Why Choose C# (Semantic Kernel)
Semantic Kernel integrates cleanly with .NET enterprise environments. If your organisation runs on .NET — Azure Functions, ASP.NET APIs, Azure Service Bus consumers — the C# version fits naturally into that infrastructure. The strongly-typed models also make refactoring safer as the pipeline evolves. Choose C# when:
- Your team is .NET-primary and already running Azure Functions or ASP.NET services
- You need to integrate directly with existing C# business logic
- Strong typing matters for your team's workflow
- You're deploying to a production Azure environment already using .NET
| Consideration | Python (LangGraph) | C# (Semantic Kernel) |
|---|---|---|
| Iteration speed | Faster | Slower (compile cycle) |
| Debugging tools | LangSmith (excellent) | OpenTelemetry (good) |
| Enterprise integration | Requires bridging | Native .NET ecosystem |
| Type safety | TypedDict (partial) | Full compile-time types |
| Azure Functions deployment | Works, more setup | Native, minimal config |
| State machine visualization | Built-in (Mermaid) | Manual with Activity tracing |
Azure Infrastructure
The minimal production deployment uses five Azure services. Here's what each one does and why it's necessary rather than optional.
| Service | Role | Monthly Cost (20 mtgs/wk) |
|---|---|---|
| Azure AI Speech | Transcription + diarization | ~$21.60 |
| Azure OpenAI (GPT-4o) | Extraction + synthesis agents | ~$12.80 |
| Azure Blob Storage | Audio files + transcripts | ~$1.50 |
| Azure Cosmos DB (Serverless) | Meeting state + results | ~$3.00 |
| Azure Functions (Consumption) | Pipeline orchestration trigger | ~$2.00 |
| Total | | ~$40.90/month |
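As a rough sketch, the five services can be stood up with the Azure CLI. Resource names below are placeholders, and Azure OpenAI additionally requires a model deployment step after account creation (and, depending on your subscription, access approval) — treat this as a starting point, not a complete deployment script:

```shell
RG=meeting-summary-rg   # placeholder names -- adjust to your conventions
LOC=eastus

az group create --name "$RG" --location "$LOC"

# Azure AI Speech (batch transcription + diarization)
az cognitiveservices account create --name msum-speech --resource-group "$RG" \
  --kind SpeechServices --sku S0 --location "$LOC"

# Azure OpenAI (deploy a GPT-4o model separately after creation)
az cognitiveservices account create --name msum-openai --resource-group "$RG" \
  --kind OpenAI --sku S0 --location "$LOC"

# Blob Storage for audio files and transcripts
az storage account create --name msumstorage --resource-group "$RG" \
  --location "$LOC" --sku Standard_LRS

# Serverless Cosmos DB for meeting state and results
az cosmosdb create --name msum-cosmos --resource-group "$RG" \
  --capabilities EnableServerless

# Consumption-plan Function App for pipeline orchestration
az functionapp create --name msum-func --resource-group "$RG" \
  --storage-account msumstorage --consumption-plan-location "$LOC" \
  --os-type Linux --runtime python --functions-version 4
```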
Azure AI Foundry Agent Service
For teams that don't want to manage the LangGraph or Semantic Kernel orchestration themselves, Azure AI Foundry Agent Service (formerly Azure AI Agent Service) provides a managed runtime for agent pipelines. It handles state persistence, retry logic, and parallelism without custom infrastructure. The trade-off is less control over the execution model — you configure agents declaratively rather than writing the graph code directly.
For a meeting summary pipeline specifically, Foundry Agent Service is worth evaluating if:
- You don't want to manage async task queues for long-running audio transcriptions
- You need built-in retry and failure handling without writing it yourself
- You're already using other Azure AI services and want a unified management plane
When NOT to Use AI Summarization
This is the section most AI implementation articles skip. Here are the situations where building this pipeline is the wrong answer.
1. Your meetings don't produce action items
Brainstorming sessions, all-hands presentations, and pure status updates don't generate action items — or generate so few that manual capture is faster. An AI extraction agent adds latency and cost for meetings where the output is "no action items found."
2. Your team doesn't review AI outputs
AI extraction is not infallible. The confidence scoring catches some errors, but not all. If your team plans to treat the AI summary as the authoritative record without human review, you will eventually miss action items or capture incorrect commitments. This pipeline requires a review step — typically 2–3 minutes of someone scanning the output. If that won't happen, don't ship it.
3. Your meetings are under 15 minutes
Short meetings (daily standups, quick syncs) are often fully captured in someone's head by the meeting end. The Azure Speech transcription alone takes 1–2 minutes for a 15-minute meeting. The pipeline processing adds another 30–60 seconds. At that point, typing three bullet points is faster.
4. High sensitivity / confidential content
Legal discussions, HR matters, M&A conversations, and other highly sensitive content require careful consideration of where the audio and transcript are processed. Even with Azure's data privacy commitments, you need explicit organisational approval before routing sensitive meeting content through cloud AI services. Verify with your legal and compliance team before deployment.
5. You already have a working manual process
If your team has a designated note-taker who is good at it, and follow-through rates are already high, this pipeline adds complexity for marginal gain. Build it when the current approach is genuinely failing — not as a default improvement.
Key Takeaways
The meeting summary agent works well in production for the right use case. Here's what I'd carry forward:
- Cost is low but not zero. $0.45 per 45-minute meeting is negligible for most teams, but model it before pitching the project. The Azure Speech cost (60% of per-meeting cost) is the one that surprises people.
- Confidence scoring is not optional. Without a threshold filter, you'll ship a lot of "we should probably..." as action items. The 0.7 threshold is a starting point; tune it against your meeting types.
- Log everything at extraction time. Silent drops are the dominant failure mode. Structured logging with item-level detail is the only way to debug why a meeting summary missed something.
- The review step matters. Ship the summary with a clear "review this" framing, not as an authoritative record. Teams that treat AI output as a draft to verify get better results than teams that treat it as ground truth.
- Python if you're iterating, C# if you're integrating. Both work. Choose based on your team's stack, not the framework's capabilities.
Read Part 1
This article covers production deployment. For the full pipeline implementation — speaker attribution, extraction agents, and confidence scoring — start with Part 1: Architecture and Core Implementation.
Cost figures are based on Azure pricing as of February 2026. LLM pricing changes frequently — verify current rates before building your cost model.
Want More Practical AI Tutorials?
I write about building production AI systems with Azure, Python, and C#. Subscribe for practical tutorials delivered twice a month.
Subscribe to Newsletter →