Every meeting ends the same way. Someone opens a doc, types a few bullet points, and 30% of the actual decisions never get written down.
The problem isn't attention. It's structure. "I'll take that" and "we should probably look into that" are both captured the same way in most meeting notes — as bullet points under "Next Steps." But one is a concrete commitment with an owner, and one is a vague suggestion that nobody owns. Your follow-up tracker can't tell them apart. Neither can a generic AI summary.
Knowledge workers spend an average of 4 hours per week in meetings and another 1.5 hours processing notes and follow-ups afterward. That's not a productivity problem — it's a structural one. The meeting happens, the notes are incomplete, and action items without owners quietly die. I built a meeting summary agent specifically to solve this. Not to transcribe audio (Azure handles that), but to apply structure: extract action items with owners and deadlines, track the decisions that were actually made, and surface the open questions that still need answers.
In this article, I'll walk through the full pipeline — from audio intake through speaker-attributed structured output — using Python with LangGraph and C# with Semantic Kernel.
What You'll Learn
- Why generic AI summaries fail at action item capture — and what to do instead
- How to build a speaker-aware multi-agent extraction pipeline
- The commitment quality scoring approach that filters noise from real action items
- Core implementation with LangGraph (Python) and Semantic Kernel (C#)
- ROI framework: when this is worth building vs. buying
Reading time: 13 minutes | Complexity: Intermediate — assumes familiarity with LLM APIs and async Python/C#
The Current Approach and Why It Falls Short
Most teams handle meeting notes one of three ways: someone takes manual notes during the call, someone pastes the auto-generated transcript into ChatGPT afterward, or they rely on their video conferencing tool's built-in AI summary feature.
All three have the same core limitation: they produce prose. Prose is useful for understanding what happened, but it's terrible for follow-up. What you actually need after a meeting is a structured list of who committed to what, by when, and what was left unresolved. Prose doesn't give you that without another pass of reading and extraction.
The specific problems
- No speaker attribution in action items. "Send the updated roadmap to stakeholders" is an action item. But who's sending it? If the transcript doesn't clearly attribute ownership, generic summaries just list the task without the owner.
- Commitments get diluted with discussion. In a 45-minute meeting, maybe 8 minutes involves people actually committing to things. The rest is discussion, context, and clarification. A general summary weights everything equally.
- Deadlines disappear. "By end of week" sounds clear in a meeting. In a generic summary, it often becomes "soon" or gets dropped entirely.
- No distinction between decision and discussion. "We decided to use Postgres for the primary database" is a decision. "We talked about whether to use Postgres or MySQL" is discussion. Both show up identically in a generic summary.
The real cost
Harvard Business Review research puts meeting follow-through at around 60% when action items come from manual notes. The items that get missed aren't random — they're disproportionately the ones that weren't clearly attributed to a specific person.
The Solution: Specialized Extraction Agents
The core insight here is that "summarize this meeting" is too broad a task for a single prompt. Instead of asking one LLM to do everything, I split the extraction work into three specialized agents running sequentially, each focused on a specific type of information:
- Action Item Extractor — Finds concrete commitments with owner attribution and confidence scoring
- Decision Extractor — Identifies conclusions that were reached, with context
- Question Tracker — Surfaces open questions that were raised but not resolved
After extraction, a fourth agent — the Summary Synthesizer — combines all three outputs into a structured document. Each extraction agent has a narrow, focused system prompt that makes it much better at its specific task than a general-purpose summarizer.
Speaker attribution runs as a pre-processing step using Azure AI Speech's speaker diarization feature, which gives us labeled segments before the LLM agents even start. This means the agents work with transcript segments already tagged with speaker IDs, which they can then resolve to actual names.
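Diarization output typically arrives as one event per recognized phrase, so it helps to merge consecutive phrases from the same speaker into single segments before the agents see them. A hedged sketch, assuming simplified event dicts rather than the raw Azure Speech SDK result objects:

```python
def merge_segments(events: list[dict]) -> list[dict]:
    """Merge consecutive same-speaker phrases into one transcript segment.

    `events` are assumed to be simplified records pulled from diarized
    transcription results: {"speaker_id", "text", "start_time", "end_time"}.
    """
    merged: list[dict] = []
    for ev in events:
        if merged and merged[-1]["speaker_id"] == ev["speaker_id"]:
            # Same speaker kept talking: extend the previous segment
            merged[-1]["text"] += " " + ev["text"]
            merged[-1]["end_time"] = ev["end_time"]
        else:
            merged.append(dict(ev))
    return merged
```

This keeps the transcript compact, which matters later when the full transcript has to fit in an extraction prompt.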
Tech Stack
- Azure AI Speech Services — Audio transcription with speaker diarization
- Azure OpenAI (GPT-4o) — Extraction and synthesis agents
- Python: LangGraph for orchestration, LangChain for LLM calls
- C#: Semantic Kernel with kernel functions
- Azure Cosmos DB — Meeting state persistence
- Azure Blob Storage — Audio and transcript storage
Architecture Overview
The pipeline has two distinct phases: a transcription phase that runs once per meeting recording, and an extraction phase that processes the resulting transcript through the multi-agent pipeline.
The speaker map is built before extraction agents run, so all agents work with resolved speaker names rather than anonymous IDs.
The three extraction agents run sequentially in this implementation, each reading from the same transcript but writing to different fields of the meeting state. The state flows through the graph and accumulates results, which the synthesizer then formats into the final output document.
Core Implementation
Meeting State
Both implementations start with the same data model. In Python, I use a TypedDict to define the state that flows through the LangGraph. In C#, it's a class with properties that plugins read and write.
```python
from typing import TypedDict, List, Optional


class TranscriptSegment(TypedDict):
    speaker_id: str        # "Guest-1", "Guest-2" from Azure Speech
    speaker_name: str      # Resolved: "Sarah", "Marcus"
    text: str
    start_time: float      # seconds
    end_time: float


class ActionItem(TypedDict):
    owner: str
    task: str
    deadline: Optional[str]
    confidence: float      # 0.0 - 1.0


class Decision(TypedDict):
    statement: str
    context: str           # What led to this decision


class OpenQuestion(TypedDict):
    question: str
    raised_by: str
    context: str


class MeetingState(TypedDict):
    meeting_id: str
    segments: List[TranscriptSegment]
    speaker_map: dict[str, str]    # {"Guest-1": "Sarah"}
    action_items: List[ActionItem]
    decisions: List[Decision]
    open_questions: List[OpenQuestion]
    summary: str
    error: Optional[str]
```
```csharp
public record TranscriptSegment(
    string SpeakerId,      // "Guest-1", "Guest-2" from Azure Speech
    string SpeakerName,    // Resolved: "Sarah", "Marcus"
    string Text,
    double StartTime,
    double EndTime
);

public record ActionItem(
    string Owner,
    string Task,
    string? Deadline,
    double Confidence      // 0.0 - 1.0
);

public record Decision(string Statement, string Context);

public record OpenQuestion(string Question, string RaisedBy, string Context);

public class MeetingState
{
    public string MeetingId { get; init; } = string.Empty;
    public List<TranscriptSegment> Segments { get; set; } = [];
    public Dictionary<string, string> SpeakerMap { get; set; } = [];
    public List<ActionItem> ActionItems { get; set; } = [];
    public List<Decision> Decisions { get; set; } = [];
    public List<OpenQuestion> OpenQuestions { get; set; } = [];
    public string Summary { get; set; } = string.Empty;
    public string? Error { get; set; }
}
```
Orchestration Graph
The LangGraph version builds a state machine where each node is one of the extraction agents. The Semantic Kernel version uses kernel functions called in sequence by the orchestrator class.
```python
from langgraph.graph import StateGraph, END

from agents import (
    build_speaker_map,
    extract_action_items,
    extract_decisions,
    extract_questions,
    synthesize_summary,
)


def build_summary_graph():
    """Build and compile the sequential extraction pipeline."""
    workflow = StateGraph(MeetingState)

    # Each node is one agent function
    workflow.add_node("build_speaker_map", build_speaker_map)
    workflow.add_node("extract_action_items", extract_action_items)
    workflow.add_node("extract_decisions", extract_decisions)
    workflow.add_node("extract_questions", extract_questions)
    workflow.add_node("synthesize_summary", synthesize_summary)

    # Sequential pipeline
    workflow.set_entry_point("build_speaker_map")
    workflow.add_edge("build_speaker_map", "extract_action_items")
    workflow.add_edge("extract_action_items", "extract_decisions")
    workflow.add_edge("extract_decisions", "extract_questions")
    workflow.add_edge("extract_questions", "synthesize_summary")
    workflow.add_edge("synthesize_summary", END)

    return workflow.compile()


async def process_meeting(meeting_id: str, audio_url: str) -> MeetingState:
    # 1. Transcribe with Azure Speech
    #    (transcribe_audio wraps the Azure AI Speech diarized
    #    transcription call; not shown here)
    segments = await transcribe_audio(audio_url)

    # 2. Initialize state
    initial_state: MeetingState = {
        "meeting_id": meeting_id,
        "segments": segments,
        "speaker_map": {},
        "action_items": [],
        "decisions": [],
        "open_questions": [],
        "summary": "",
        "error": None,
    }

    # 3. Run the graph
    graph = build_summary_graph()
    result = await graph.ainvoke(initial_state)
    return result
```
```csharp
public class MeetingOrchestrator
{
    private readonly Kernel _kernel;
    private readonly SpeakerMapPlugin _speakerPlugin;
    private readonly ActionItemPlugin _actionPlugin;
    private readonly DecisionPlugin _decisionPlugin;
    private readonly QuestionPlugin _questionPlugin;
    private readonly SynthesisPlugin _synthesisPlugin;

    public MeetingOrchestrator(Kernel kernel)
    {
        _kernel = kernel;

        // Register plugins
        _speakerPlugin = new SpeakerMapPlugin(kernel);
        _actionPlugin = new ActionItemPlugin(kernel);
        _decisionPlugin = new DecisionPlugin(kernel);
        _questionPlugin = new QuestionPlugin(kernel);
        _synthesisPlugin = new SynthesisPlugin(kernel);
    }

    public async Task<MeetingState> ProcessMeetingAsync(
        string meetingId, string audioUrl)
    {
        // 1. Transcribe with Azure Speech
        var segments = await TranscribeAudioAsync(audioUrl);

        var state = new MeetingState
        {
            MeetingId = meetingId,
            Segments = segments
        };

        // 2. Sequential pipeline
        state = await _speakerPlugin.BuildSpeakerMapAsync(state);
        state = await _actionPlugin.ExtractAsync(state);
        state = await _decisionPlugin.ExtractAsync(state);
        state = await _questionPlugin.ExtractAsync(state);
        state = await _synthesisPlugin.SynthesizeAsync(state);

        return state;
    }
}
```
Key Challenge #1: Speaker Attribution
Azure AI Speech's speaker diarization feature labels each segment with an anonymous ID like `Guest-1` or `Guest-2`. That's useful for separating who's speaking, but not useful when an action item says "Guest-3 will prepare the demo." We need real names.
The speaker map approach
I resolve speaker IDs to names by extracting introductions from the transcript itself. Most meetings start with introductions, or at least have moments where people refer to each other by name. The Speaker Map Builder agent looks for these patterns and constructs a mapping that all subsequent agents use.
```python
import json

# `llm` is an Azure OpenAI chat client (e.g. AzureChatOpenAI),
# configured elsewhere in the project.

SPEAKER_MAP_PROMPT = """
You are analyzing a meeting transcript to identify who each speaker is.
The transcript uses speaker IDs like "Guest-1", "Guest-2" etc.
Extract the real name for each speaker ID where it can be determined.
Look for:
- Direct introductions: "Hi, I'm Sarah" or "This is Marcus from the product team"
- Name references: "[Guest-2]: Thanks John" (implies the previous speaker may be John)
- Email or scheduling references that include names
Return ONLY valid JSON: {{"Guest-1": "Sarah Chen", "Guest-2": "Marcus Webb"}}
If a speaker cannot be identified, omit them from the map.
Transcript:
{transcript}
"""


async def build_speaker_map(state: MeetingState) -> MeetingState:
    """First pass: build a speaker ID to name mapping."""
    # Use only the first 10 segments for efficiency
    # (introductions happen early)
    early_segments = state["segments"][:10]
    transcript = format_segments(early_segments)

    response = await llm.ainvoke(
        SPEAKER_MAP_PROMPT.format(transcript=transcript),
        response_format={"type": "json_object"},
    )

    try:
        speaker_map = json.loads(response.content)
    except json.JSONDecodeError:
        speaker_map = {}

    # Apply names to all segments, falling back to the anonymous ID
    resolved_segments = [
        {
            **seg,
            "speaker_name": speaker_map.get(seg["speaker_id"], seg["speaker_id"]),
        }
        for seg in state["segments"]
    ]

    return {**state, "speaker_map": speaker_map, "segments": resolved_segments}


def format_segments(segments: list[TranscriptSegment]) -> str:
    """Format segments as a readable transcript, preferring resolved names."""
    return "\n".join(
        f"[{seg.get('speaker_name') or seg['speaker_id']}]: {seg['text']}"
        for seg in segments
    )
```
```csharp
public class SpeakerMapPlugin
{
    private readonly Kernel _kernel;

    public SpeakerMapPlugin(Kernel kernel) => _kernel = kernel;

    private const string SpeakerMapPrompt = """
        You are analyzing a meeting transcript to identify who each speaker is.
        The transcript uses speaker IDs like "Guest-1", "Guest-2" etc.
        Extract the real name for each speaker ID where it can be determined.
        Look for:
        - Direct introductions: "Hi, I'm Sarah"
        - Name references: "[Guest-2]: Thanks John"
        Return ONLY valid JSON: {"Guest-1": "Sarah Chen", "Guest-2": "Marcus Webb"}
        If a speaker cannot be identified, omit them from the map.
        Transcript:
        {{$transcript}}
        """;

    public async Task<MeetingState> BuildSpeakerMapAsync(MeetingState state)
    {
        // Use first 10 segments — introductions happen early
        var earlySegments = state.Segments.Take(10);
        var transcript = FormatSegments(earlySegments);

        var settings = new OpenAIPromptExecutionSettings
        {
            ResponseFormat = "json_object",
            Temperature = 0
        };

        var result = await _kernel.InvokePromptAsync<string>(
            SpeakerMapPrompt,
            new KernelArguments(settings) { ["transcript"] = transcript });

        Dictionary<string, string> speakerMap;
        try
        {
            speakerMap = JsonSerializer.Deserialize<Dictionary<string, string>>(result ?? "{}")
                ?? new Dictionary<string, string>();
        }
        catch (JsonException)
        {
            speakerMap = new Dictionary<string, string>();
        }

        // Apply resolved names to all segments
        state.SpeakerMap = speakerMap;
        state.Segments = state.Segments
            .Select(s => s with
            {
                SpeakerName = speakerMap.GetValueOrDefault(s.SpeakerId, s.SpeakerId)
            })
            .ToList();

        return state;
    }

    private static string FormatSegments(IEnumerable<TranscriptSegment> segments) =>
        string.Join("\n", segments.Select(s => $"[{s.SpeakerId}]: {s.Text}"));
}
```
Fallback for Anonymous Speakers
If a speaker can't be identified (they joined without introducing themselves), the agent falls back to the speaker ID. This is surfaced in the output as "Guest-3" rather than silently losing the attribution. The user can then relabel manually if needed.
Key Challenge #2: Distinguishing Real Commitments from Noise
This is the core of the action item extraction problem. A meeting transcript is full of statements that sound like action items but aren't:
- "We should probably look at the infrastructure costs" — suggestion, no owner
- "Someone needs to follow up on that" — no specific owner
- "That's definitely something worth exploring" — vague interest, not a commitment
- "I'll handle it" — real commitment, but to what exactly?
The solution is a focused extraction prompt that defines exactly what constitutes a real action item, combined with a confidence score that lets the agent flag borderline cases rather than silently dropping them or silently including noise.
```python
ACTION_EXTRACTION_PROMPT = """
You are an action item extractor. Your job is to find concrete commitments.
INCLUDE items where:
- A specific person commits to doing something
- The task is concrete and specific
- Examples: "I'll send the report by Friday", "Marcus will set up the staging env"
EXCLUDE:
- Vague suggestions: "we should look into X", "someone needs to fix Y"
- Discussion of possibilities: "we could consider Z"
- Statements with no clear owner
For each action item, assign a confidence score:
- 1.0: Clear owner + clear task + explicit agreement
- 0.8: Clear owner + clear task, commitment implied by context
- 0.7: Owner identified but task somewhat vague
- Below 0.7: Do not include
Return ONLY a JSON object of the form:
{{"items": [{{"owner": "name", "task": "description", "deadline": "date or null", "confidence": 0.9}}]}}
Meeting transcript (speaker-attributed):
{transcript}
"""


async def extract_action_items(state: MeetingState) -> MeetingState:
    transcript = format_segments(state["segments"])

    response = await llm.ainvoke(
        ACTION_EXTRACTION_PROMPT.format(transcript=transcript),
        response_format={"type": "json_object"},
    )

    try:
        # JSON mode guarantees a top-level object, so the array is wrapped
        raw_items = json.loads(response.content).get("items", [])
        # Only keep high-confidence items
        action_items = [
            item for item in raw_items
            if item.get("confidence", 0) >= 0.7
        ]
    except (json.JSONDecodeError, AttributeError):
        action_items = []

    return {**state, "action_items": action_items}
```
```csharp
public class ActionItemPlugin
{
    private readonly Kernel _kernel;

    public ActionItemPlugin(Kernel kernel) => _kernel = kernel;

    // DTOs for deserializing the model's JSON response
    private sealed record ActionItemDto(string Owner, string Task, string? Deadline, double Confidence);
    private sealed record ActionItemResponse(List<ActionItemDto> Items);

    // Web defaults give case-insensitive property matching for the
    // lowercase keys the model returns
    private static readonly JsonSerializerOptions JsonOpts = new(JsonSerializerDefaults.Web);

    private const string ActionExtractionPrompt = """
        You are an action item extractor. Your job is to find concrete commitments.
        INCLUDE items where:
        - A specific person commits to doing something
        - The task is concrete and specific
        EXCLUDE:
        - Vague suggestions: "we should look into X"
        - Discussion of possibilities: "we could consider Z"
        - Statements with no clear owner
        Confidence scores:
        - 1.0: Clear owner + clear task + explicit agreement
        - 0.8: Clear owner + clear task, commitment implied
        - 0.7: Owner identified but task somewhat vague
        - Below 0.7: Do not include
        Return ONLY a JSON object of the form:
        {"items": [{"owner": "name", "task": "desc", "deadline": "date or null", "confidence": 0.9}]}
        Transcript:
        {{$transcript}}
        """;

    public async Task<MeetingState> ExtractAsync(MeetingState state)
    {
        var transcript = FormatSegments(state.Segments);

        var settings = new OpenAIPromptExecutionSettings
        {
            ResponseFormat = "json_object",
            Temperature = 0.1
        };

        var result = await _kernel.InvokePromptAsync<string>(
            ActionExtractionPrompt,
            new KernelArguments(settings) { ["transcript"] = transcript });

        List<ActionItemDto> rawItems;
        try
        {
            rawItems = JsonSerializer.Deserialize<ActionItemResponse>(result ?? "{}", JsonOpts)?.Items ?? [];
        }
        catch (JsonException)
        {
            rawItems = [];
        }

        // Filter by confidence threshold
        state.ActionItems = rawItems
            .Where(i => i.Confidence >= 0.7)
            .Select(i => new ActionItem(i.Owner, i.Task, i.Deadline, i.Confidence))
            .ToList();

        return state;
    }

    private static string FormatSegments(IEnumerable<TranscriptSegment> segments) =>
        string.Join("\n", segments.Select(s => $"[{s.SpeakerName}]: {s.Text}"));
}
```
The confidence threshold of 0.7 is a tunable parameter. In practice, I've found that 0.7 captures real commitments while filtering most of the noise. Items between 0.5 and 0.7 get surfaced in a separate "review needed" section so they're not silently dropped.
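That split can be expressed as a small helper. A sketch, where `partition_by_confidence` and both default thresholds are illustrative names and values:

```python
def partition_by_confidence(items, confirm_at=0.7, review_at=0.5):
    """Split extracted items into confirmed vs. review-needed buckets.

    Items below `review_at` are dropped as noise; items between the
    two thresholds are surfaced for human review rather than lost.
    """
    confirmed = [i for i in items if i["confidence"] >= confirm_at]
    review = [i for i in items if review_at <= i["confidence"] < confirm_at]
    return confirmed, review
```

Keeping the thresholds as parameters makes them easy to tune per team once you've seen a few weeks of real output.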
Why JSON Mode Matters Here
Both the OpenAI API's `response_format: {"type": "json_object"}` and Semantic Kernel's equivalent setting force the model to produce parseable JSON. Without this, the model occasionally wraps the JSON in markdown code fences or adds preamble text, both of which break `json.loads()`. One caveat: JSON mode expects a top-level JSON object rather than a bare array, so prompts should ask for a wrapper object like `{"items": [...]}`. Always use JSON mode for structured extraction.
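When JSON mode isn't available (some models and providers don't support it), a defensive parser that tolerates code fences is a reasonable fallback. A sketch; `parse_model_json` is a hypothetical helper, not part of the pipeline above:

```python
import json
import re


def parse_model_json(raw: str):
    """Parse model output as JSON, tolerating markdown code fences.

    With JSON mode enabled the first branch is all you need; the
    regex fallback handles fenced or preambled output.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Strip ```json ... ``` fences (and anything before/after them)
        match = re.search(r"```(?:json)?\s*(.*?)\s*```", raw, re.DOTALL)
        if match:
            return json.loads(match.group(1))
        raise


print(parse_model_json('```json\n{"items": []}\n```'))  # → {'items': []}
```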
ROI and Business Value
Before committing to building this pipeline, I ran through the actual numbers. Here's how the calculation looks for a typical engineering team.
| Metric | Manual Notes | AI Pipeline |
|---|---|---|
| Summary time per meeting | 10–20 min | 2 min (review) |
| Action item capture rate | ~60% | ~88% |
| Owner attribution rate | ~70% | ~95% |
| Deadline captured | ~50% | ~85% |
| Cost per meeting (AI) | — | ~$0.39 |
For a team of 6 running 15 meetings per week, the time savings alone amount to roughly 15–20 hours/week across the team. At even $50/hr fully loaded cost, that's $750–$1,000 in recovered time per week, against a weekly AI cost of about $6.
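Those figures can be sanity-checked with a quick calculation. All inputs here are assumptions taken from the table and paragraph above; substitute your own team's numbers:

```python
# Back-of-envelope ROI using the figures above
meetings_per_week = 15
ai_cost_per_meeting = 0.39     # ~$0.39 per meeting (LLM + Speech)
hourly_rate = 50               # fully loaded cost, $/hr
hours_saved_per_week = (15, 20)  # low / high estimate across the team

weekly_ai_cost = meetings_per_week * ai_cost_per_meeting
weekly_savings = tuple(h * hourly_rate for h in hours_saved_per_week)

print(f"AI cost: ${weekly_ai_cost:.2f}/week")  # → AI cost: $5.85/week
print(f"Recovered time: ${weekly_savings[0]}-${weekly_savings[1]}/week")
```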
When This Pays Off
- Teams with 3+ recurring meetings per week
- Meetings that regularly produce action items (project standups, client calls, sprint planning)
- Organizations where missed follow-through has real business cost
- Teams that currently spend 10+ minutes manually writing up meeting notes
When It Doesn't Pay Off
Pure discussion meetings (brainstorming sessions, all-hands with no decisions) produce little return because there are no action items to extract. If your meetings consistently end without commitments, an AI summary agent won't help — the problem is the meeting structure, not the notes.
What's Next
Part 1 covered the core pipeline: how to structure the multi-agent extraction flow, resolve speakers to names, and score action item confidence. The implementation is solid for internal testing and controlled environments.
Part 2 covers the production reality:
- Cost breakdown — Real numbers for LLM tokens, Azure Speech, and infrastructure. A 45-minute meeting has a specific cost profile you need to understand before scaling.
- Observability — How to trace which agent dropped an action item, and why. Silent failures in extraction pipelines are hard to debug without structured logging.
- Python vs C# decision framework — LangGraph and Semantic Kernel make different trade-offs, and this particular use case has a clear winner depending on your team's stack.
- When NOT to use AI summarization — There are several meeting patterns where this pipeline is overkill and a simpler approach works better.
This article demonstrates the core extraction pipeline patterns. Production deployments need error handling, retry logic for Azure Speech API calls, and proper state persistence for long-running meetings.
Want More Practical AI Tutorials?
I write about building production AI systems with Azure, Python, and C#. Subscribe for practical tutorials delivered twice a month.
Subscribe to Newsletter →