Building a Meeting Summary Agent: Part 1 — Architecture and Core Implementation

March 23, 2026 | 13 min read | By Jaffar Kazi

Every meeting ends the same way. Someone opens a doc, types a few bullet points, and 30% of the actual decisions never get written down.

The problem isn't attention. It's structure. "I'll take that" and "we should probably look into that" are both captured the same way in most meeting notes — as bullet points under "Next Steps." But one is a concrete commitment with an owner, and one is a vague suggestion that nobody owns. Your follow-up tracker can't tell them apart. Neither can a generic AI summary.

Knowledge workers spend an average of 4 hours per week in meetings and another 1.5 hours processing notes and follow-ups afterward. That's not a productivity problem — it's a structural one. The meeting happens, the notes are incomplete, and action items without owners quietly die. I built a meeting summary agent specifically to solve this. Not to transcribe audio (Azure handles that), but to apply structure: extract action items with owners and deadlines, track the decisions that were actually made, and surface the open questions that still need answers.

In this article, I'll walk through the full pipeline — from audio intake through speaker-attributed structured output — using Python with LangGraph and C# with Semantic Kernel.

What You'll Learn

  • Why generic AI summaries fail at action item capture — and what to do instead
  • How to build a speaker-aware multi-agent extraction pipeline
  • The commitment quality scoring approach that filters noise from real action items
  • Core implementation with LangGraph (Python) and Semantic Kernel (C#)
  • ROI framework: when this is worth building vs. buying

Reading time: 13 minutes | Complexity: Intermediate — assumes familiarity with LLM APIs and async Python/C#

The Current Approach and Why It Falls Short

Most teams handle meeting notes one of three ways: someone takes manual notes during the call, someone pastes the auto-generated transcript into ChatGPT afterward, or they rely on their video conferencing tool's built-in AI summary feature.

All three have the same core limitation: they produce prose. Prose is useful for understanding what happened, but it's terrible for follow-up. What you actually need after a meeting is a structured list of who committed to what, by when, and what was left unresolved. Prose doesn't give you that without another pass of reading and extraction.

The specific problems

  • No speaker attribution in action items. "Send the updated roadmap to stakeholders" is an action item. But who's sending it? If the transcript doesn't clearly attribute ownership, generic summaries just list the task without the owner.
  • Commitments get diluted with discussion. In a 45-minute meeting, maybe 8 minutes involves people actually committing to things. The rest is discussion, context, and clarification. A general summary weights everything equally.
  • Deadlines disappear. "By end of week" sounds clear in a meeting. In a generic summary, it often becomes "soon" or gets dropped entirely.
  • No distinction between decision and discussion. "We decided to use Postgres for the primary database" is a decision. "We talked about whether to use Postgres or MySQL" is discussion. Both show up identically in a generic summary.

The real cost

Harvard Business Review research puts meeting follow-through at around 60% when action items come from manual notes. The items that get missed aren't random — they're disproportionately the ones that weren't clearly attributed to a specific person.

The Solution: Specialized Extraction Agents

The core insight here is that "summarize this meeting" is too broad a task for a single prompt. Instead of asking one LLM to do everything, I split the extraction work into three specialized agents running sequentially, each focused on a specific type of information:

  • Action Item Extractor — Finds concrete commitments with owner attribution and confidence scoring
  • Decision Extractor — Identifies conclusions that were reached, with context
  • Question Tracker — Surfaces open questions that were raised but not resolved

After extraction, a fourth agent — the Summary Synthesizer — combines all three outputs into a structured document. Each extraction agent has a narrow, focused system prompt that makes it much better at its specific task than a general-purpose summarizer.

Speaker attribution runs as a pre-processing step using Azure AI Speech's speaker diarization feature, which gives us labeled segments before the LLM agents even start. This means the agents work with transcript segments already tagged with speaker IDs, which they can then resolve to actual names.

Tech Stack

  • Azure AI Speech Services — Audio transcription with speaker diarization
  • Azure OpenAI (GPT-4o) — Extraction and synthesis agents
  • Python: LangGraph for orchestration, LangChain for LLM calls
  • C#: Semantic Kernel with kernel functions
  • Azure Cosmos DB — Meeting state persistence
  • Azure Blob Storage — Audio and transcript storage

Architecture Overview

The pipeline has two distinct phases: a transcription phase that runs once per meeting recording, and an extraction phase that processes the resulting transcript through the multi-agent pipeline.

[Diagram: Meeting Summary Agent Architecture]

The speaker map is built before extraction agents run, so all agents work with resolved speaker names rather than anonymous IDs.

The three extraction agents run sequentially in this implementation, each reading from the same transcript but writing to different fields of the meeting state. The state flows through the graph and accumulates results, which the synthesizer then formats into the final output document.

Core Implementation

Meeting State

Both implementations start with the same data model. In Python, I use a TypedDict to define the state that flows through the LangGraph. In C#, it's a class with properties that plugins read and write.

models.py
from typing import TypedDict, List, Optional

class TranscriptSegment(TypedDict):
    speaker_id: str       # "Guest-1", "Guest-2" from Azure Speech
    speaker_name: str     # Resolved: "Sarah", "Marcus"
    text: str
    start_time: float     # seconds
    end_time: float

class ActionItem(TypedDict):
    owner: str
    task: str
    deadline: Optional[str]
    confidence: float     # 0.0 - 1.0

class Decision(TypedDict):
    statement: str
    context: str          # What led to this decision

class OpenQuestion(TypedDict):
    question: str
    raised_by: str
    context: str

class MeetingState(TypedDict):
    meeting_id: str
    segments: List[TranscriptSegment]
    speaker_map: dict[str, str]       # {"Guest-1": "Sarah"}
    action_items: List[ActionItem]
    decisions: List[Decision]
    open_questions: List[OpenQuestion]
    summary: str
    error: Optional[str]
Models.cs
public record TranscriptSegment(
    string SpeakerId,     // "Guest-1", "Guest-2" from Azure Speech
    string SpeakerName,   // Resolved: "Sarah", "Marcus"
    string Text,
    double StartTime,
    double EndTime
);

public record ActionItem(
    string Owner,
    string Task,
    string? Deadline,
    double Confidence     // 0.0 - 1.0
);

public record Decision(string Statement, string Context);

public record OpenQuestion(string Question, string RaisedBy, string Context);

public class MeetingState
{
    public string MeetingId { get; init; } = string.Empty;
    public List<TranscriptSegment> Segments { get; set; } = [];
    public Dictionary<string, string> SpeakerMap { get; set; } = [];
    public List<ActionItem> ActionItems { get; set; } = [];
    public List<Decision> Decisions { get; set; } = [];
    public List<OpenQuestion> OpenQuestions { get; set; } = [];
    public string Summary { get; set; } = string.Empty;
    public string? Error { get; set; }
}

Orchestration Graph

The LangGraph version builds a state machine where each node is one of the extraction agents. The Semantic Kernel version uses kernel functions called in sequence by the orchestrator class.

orchestrator.py
from langgraph.graph import StateGraph, END
from langgraph.graph.state import CompiledStateGraph

from models import MeetingState
from agents import (
    build_speaker_map,
    extract_action_items,
    extract_decisions,
    extract_questions,
    synthesize_summary
)

def build_summary_graph() -> CompiledStateGraph:
    workflow = StateGraph(MeetingState)

    # Each node is one agent function
    workflow.add_node("build_speaker_map", build_speaker_map)
    workflow.add_node("extract_action_items", extract_action_items)
    workflow.add_node("extract_decisions", extract_decisions)
    workflow.add_node("extract_questions", extract_questions)
    workflow.add_node("synthesize_summary", synthesize_summary)

    # Sequential pipeline
    workflow.set_entry_point("build_speaker_map")
    workflow.add_edge("build_speaker_map", "extract_action_items")
    workflow.add_edge("extract_action_items", "extract_decisions")
    workflow.add_edge("extract_decisions", "extract_questions")
    workflow.add_edge("extract_questions", "synthesize_summary")
    workflow.add_edge("synthesize_summary", END)

    return workflow.compile()


async def process_meeting(meeting_id: str, audio_url: str) -> MeetingState:
    # 1. Transcribe with Azure Speech (transcribe_audio wraps the Speech SDK;
    #    the wrapper itself is not shown in this article)
    segments = await transcribe_audio(audio_url)

    # 2. Initialize state
    initial_state: MeetingState = {
        "meeting_id": meeting_id,
        "segments": segments,
        "speaker_map": {},
        "action_items": [],
        "decisions": [],
        "open_questions": [],
        "summary": "",
        "error": None
    }

    # 3. Run the graph
    graph = build_summary_graph()
    result = await graph.ainvoke(initial_state)
    return result
MeetingOrchestrator.cs
public class MeetingOrchestrator
{
    private readonly Kernel _kernel;
    private readonly SpeakerMapPlugin _speakerPlugin;
    private readonly ActionItemPlugin _actionPlugin;
    private readonly DecisionPlugin _decisionPlugin;
    private readonly QuestionPlugin _questionPlugin;
    private readonly SynthesisPlugin _synthesisPlugin;

    public MeetingOrchestrator(Kernel kernel)
    {
        _kernel = kernel;
        // Register plugins
        _speakerPlugin = new SpeakerMapPlugin(kernel);
        _actionPlugin = new ActionItemPlugin(kernel);
        _decisionPlugin = new DecisionPlugin(kernel);
        _questionPlugin = new QuestionPlugin(kernel);
        _synthesisPlugin = new SynthesisPlugin(kernel);
    }

    public async Task<MeetingState> ProcessMeetingAsync(
        string meetingId, string audioUrl)
    {
        // 1. Transcribe with Azure Speech
        var segments = await TranscribeAudioAsync(audioUrl);

        var state = new MeetingState
        {
            MeetingId = meetingId,
            Segments = segments
        };

        // 2. Sequential pipeline
        state = await _speakerPlugin.BuildSpeakerMapAsync(state);
        state = await _actionPlugin.ExtractAsync(state);
        state = await _decisionPlugin.ExtractAsync(state);
        state = await _questionPlugin.ExtractAsync(state);
        state = await _synthesisPlugin.SynthesizeAsync(state);

        return state;
    }
}

Key Challenge #1: Speaker Attribution

Azure AI Speech's speaker diarization feature labels each segment with an anonymous ID like Guest-1 or Guest-2. That's useful for separating who's speaking, but not useful when an action item says "Guest-3 will prepare the demo." We need real names.
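For reference, the Speech SDK reports each recognized utterance with an anonymous speaker ID, the text, and timing in 100-nanosecond ticks. A hedged sketch of mapping one such result into the TranscriptSegment shape from the data model above (the helper name is mine; only the tick unit is the SDK's):

```python
TICKS_PER_SECOND = 10_000_000  # Azure Speech reports offsets in 100-ns ticks

def to_segment(speaker_id: str, text: str,
               offset_ticks: int, duration_ticks: int) -> dict:
    """Convert one diarized utterance into a TranscriptSegment dict."""
    start = offset_ticks / TICKS_PER_SECOND
    return {
        "speaker_id": speaker_id,
        "speaker_name": speaker_id,  # resolved later by the speaker map agent
        "text": text,
        "start_time": start,
        "end_time": start + duration_ticks / TICKS_PER_SECOND,
    }

seg = to_segment("Guest-1", "Hi, I'm Sarah.", 25_000_000, 18_000_000)
print(seg["start_time"], seg["end_time"])  # 2.5 4.3
```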

The speaker map approach

I resolve speaker IDs to names by extracting introductions from the transcript itself. Most meetings start with introductions, or at least have moments where people refer to each other by name. The Speaker Map Builder agent looks for these patterns and constructs a mapping that all subsequent agents use.

agents/speaker_map.py
import json

# `llm` is the shared Azure OpenAI chat client, configured elsewhere in the project

SPEAKER_MAP_PROMPT = """
You are analyzing a meeting transcript to identify who each speaker is.

The transcript uses speaker IDs like "Guest-1", "Guest-2" etc.
Extract the real name for each speaker ID where it can be determined.

Look for:
- Direct introductions: "Hi, I'm Sarah" or "This is Marcus from the product team"
- Name references: "[Guest-2]: Thanks John" (implies Guest-1 might be John)
- Email or scheduling references that include names

Return ONLY valid JSON: {{"Guest-1": "Sarah Chen", "Guest-2": "Marcus Webb"}}
If a speaker cannot be identified, omit them from the map.

Transcript:
{transcript}
"""

async def build_speaker_map(state: MeetingState) -> MeetingState:
    """First pass: build a speaker ID to name mapping."""
    # Use only the first 10 segments for efficiency
    # (introductions happen early)
    early_segments = state["segments"][:10]
    transcript = format_segments(early_segments)

    response = await llm.ainvoke(
        SPEAKER_MAP_PROMPT.format(transcript=transcript),
        response_format={"type": "json_object"}
    )

    try:
        speaker_map = json.loads(response.content)
    except json.JSONDecodeError:
        speaker_map = {}

    # Apply names to all segments
    resolved_segments = [
        {
            **seg,
            "speaker_name": speaker_map.get(
                seg["speaker_id"], seg["speaker_id"]
            )
        }
        for seg in state["segments"]
    ]

    return {**state, "speaker_map": speaker_map, "segments": resolved_segments}


def format_segments(segments: list[TranscriptSegment]) -> str:
    """Format segments as readable transcript."""
    return "\n".join(
        f"[{seg['speaker_id']}]: {seg['text']}"
        for seg in segments
    )
Plugins/SpeakerMapPlugin.cs
using System.Text.Json;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Connectors.OpenAI;

public class SpeakerMapPlugin
{
    private readonly Kernel _kernel;

    public SpeakerMapPlugin(Kernel kernel) => _kernel = kernel;

    private const string SpeakerMapPrompt = """
        You are analyzing a meeting transcript to identify who each speaker is.

        The transcript uses speaker IDs like "Guest-1", "Guest-2" etc.
        Extract the real name for each speaker ID where it can be determined.

        Look for:
        - Direct introductions: "Hi, I'm Sarah"
        - Name references: "[Guest-2]: Thanks John"

        Return ONLY valid JSON: {"Guest-1": "Sarah Chen", "Guest-2": "Marcus Webb"}
        If a speaker cannot be identified, omit them from the map.

        Transcript:
        {{transcript}}
        """;

    public async Task<MeetingState> BuildSpeakerMapAsync(MeetingState state)
    {
        // Use first 10 segments — introductions happen early
        var earlySegments = state.Segments.Take(10);
        var transcript = FormatSegments(earlySegments);

        var settings = new OpenAIPromptExecutionSettings
        {
            ResponseFormat = "json_object",
            Temperature = 0
        };

        var result = await _kernel.InvokePromptAsync<string>(
            SpeakerMapPrompt,
            new KernelArguments(settings) { ["transcript"] = transcript });

        Dictionary<string, string> speakerMap;
        try
        {
            speakerMap = JsonSerializer.Deserialize<Dictionary<string, string>>(result)
                         ?? new Dictionary<string, string>();
        }
        catch (JsonException)
        {
            speakerMap = new Dictionary<string, string>();
        }

        // Apply resolved names to all segments
        state.SpeakerMap = speakerMap;
        state.Segments = state.Segments
            .Select(s => s with
            {
                SpeakerName = speakerMap.GetValueOrDefault(s.SpeakerId, s.SpeakerId)
            })
            .ToList();

        return state;
    }

    private static string FormatSegments(IEnumerable<TranscriptSegment> segments) =>
        string.Join("\n", segments.Select(s => $"[{s.SpeakerId}]: {s.Text}"));
}

Fallback for Anonymous Speakers

If a speaker can't be identified (they joined without introducing themselves), the agent falls back to the speaker ID. This is surfaced in the output as "Guest-3" rather than silently losing the attribution. The user can then relabel manually if needed.

Key Challenge #2: Distinguishing Real Commitments from Noise

This is the core of the action item extraction problem. A meeting transcript is full of statements that sound like action items but aren't:

  • "We should probably look at the infrastructure costs" — suggestion, no owner
  • "Someone needs to follow up on that" — no specific owner
  • "That's definitely something worth exploring" — vague interest, not a commitment
  • "I'll handle it" — real commitment, but to what exactly?

The solution is a focused extraction prompt that defines exactly what constitutes a real action item, combined with a confidence score that lets the agent flag borderline cases rather than silently dropping them or silently including noise.

agents/action_extractor.py
import json

# `llm` and `format_segments` are shared with the speaker map agent

ACTION_EXTRACTION_PROMPT = """
You are an action item extractor. Your job is to find concrete commitments.

INCLUDE items where:
- A specific person commits to doing something
- The task is concrete and specific
- Examples: "I'll send the report by Friday", "Marcus will set up the staging env"

EXCLUDE:
- Vague suggestions: "we should look into X", "someone needs to fix Y"
- Discussion of possibilities: "we could consider Z"
- Statements with no clear owner

For each action item, assign a confidence score:
- 1.0: Clear owner + clear task + explicit agreement
- 0.8: Clear owner + clear task, commitment implied by context
- 0.7: Owner identified but task somewhat vague
- Below 0.7: Do not include

Return ONLY valid JSON in this shape (JSON mode requires a top-level object,
not a bare array):
{{"action_items": [{{"owner": "name", "task": "description", "deadline": "date or null", "confidence": 0.9}}]}}

Meeting transcript (speaker-attributed):
{transcript}
"""

async def extract_action_items(state: MeetingState) -> MeetingState:
    transcript = format_segments(state["segments"])

    response = await llm.ainvoke(
        ACTION_EXTRACTION_PROMPT.format(transcript=transcript),
        response_format={"type": "json_object"}
    )

    try:
        # JSON mode returns an object; the items sit under "action_items"
        raw_items = json.loads(response.content).get("action_items", [])
        # Only keep high-confidence items
        action_items = [
            item for item in raw_items
            if item.get("confidence", 0) >= 0.7
        ]
    except (json.JSONDecodeError, AttributeError):
        action_items = []

    return {**state, "action_items": action_items}
Plugins/ActionItemPlugin.cs
using System.Text.Json;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Connectors.OpenAI;

public class ActionItemPlugin
{
    private readonly Kernel _kernel;

    public ActionItemPlugin(Kernel kernel) => _kernel = kernel;

    private const string ActionExtractionPrompt = """
        You are an action item extractor. Your job is to find concrete commitments.

        INCLUDE items where:
        - A specific person commits to doing something
        - The task is concrete and specific

        EXCLUDE:
        - Vague suggestions: "we should look into X"
        - Discussion of possibilities: "we could consider Z"
        - Statements with no clear owner

        Confidence scores:
        - 1.0: Clear owner + clear task + explicit agreement
        - 0.8: Clear owner + clear task, commitment implied
        - 0.7: Owner identified but task somewhat vague
        - Below 0.7: Do not include

        Return ONLY valid JSON in this shape (JSON mode requires a top-level
        object, not a bare array):
        {"action_items": [{"owner": "name", "task": "desc", "deadline": "date or null", "confidence": 0.9}]}

        Transcript:
        {{transcript}}
        """;

    private static readonly JsonSerializerOptions JsonOptions =
        new() { PropertyNameCaseInsensitive = true };

    public async Task<MeetingState> ExtractAsync(MeetingState state)
    {
        var transcript = FormatSegments(state.Segments);

        var settings = new OpenAIPromptExecutionSettings
        {
            ResponseFormat = "json_object",
            Temperature = 0.1
        };

        var result = await _kernel.InvokePromptAsync<string>(
            ActionExtractionPrompt,
            new KernelArguments(settings) { ["transcript"] = transcript });

        List<ActionItemDto> rawItems;
        try
        {
            // JSON mode returns an object; unwrap the "action_items" property
            using var doc = JsonDocument.Parse(result ?? "{}");
            rawItems = doc.RootElement.TryGetProperty("action_items", out var items)
                ? JsonSerializer.Deserialize<List<ActionItemDto>>(items.GetRawText(), JsonOptions) ?? []
                : [];
        }
        catch (JsonException)
        {
            rawItems = [];
        }

        // Filter by confidence threshold
        state.ActionItems = rawItems
            .Where(i => i.Confidence >= 0.7)
            .Select(i => new ActionItem(i.Owner, i.Task, i.Deadline, i.Confidence))
            .ToList();

        return state;
    }

    private static string FormatSegments(IEnumerable<TranscriptSegment> segments) =>
        string.Join("\n", segments.Select(s => $"[{s.SpeakerName}]: {s.Text}"));
}

// DTO mirroring the JSON shape the extraction prompt requests
public record ActionItemDto(string Owner, string Task, string? Deadline, double Confidence);

The confidence threshold of 0.7 is a tunable parameter. In practice, I've found that 0.7 captures real commitments while filtering most of the noise. Items between 0.5 and 0.7 get surfaced in a separate "review needed" section so they're not silently dropped.
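The "review needed" bucket isn't shown in the extractor code above; one way to implement it is to have the extractor return everything at or above the review floor and partition afterward. A sketch under that assumption:

```python
def partition_by_confidence(items: list[dict],
                            accept: float = 0.7,
                            review: float = 0.5) -> tuple[list[dict], list[dict]]:
    """Split extracted items into confirmed vs. needs-review buckets.

    Items below the review floor are dropped as noise.
    """
    confirmed = [i for i in items if i.get("confidence", 0) >= accept]
    needs_review = [i for i in items
                    if review <= i.get("confidence", 0) < accept]
    return confirmed, needs_review

items = [
    {"owner": "Sarah", "task": "Send roadmap", "confidence": 0.9},
    {"owner": "Marcus", "task": "Look at costs?", "confidence": 0.6},
    {"owner": "?", "task": "Something vague", "confidence": 0.3},
]
confirmed, review = partition_by_confidence(items)
print(len(confirmed), len(review))  # 1 1
```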

Why JSON Mode Matters Here

Both the OpenAI API's response_format: json_object and Semantic Kernel's equivalent force the model to produce parseable JSON. Without this, the model occasionally wraps the JSON in markdown code fences or adds preamble text — both of which break json.loads(). Always use JSON mode for structured extraction.
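JSON mode is the first line of defense; a lenient parser is a cheap second one for any path where it isn't honored. A sketch that retries after stripping a markdown fence (the helper name is mine):

```python
import json
import re

def parse_json_lenient(raw: str):
    """Parse model output as JSON, tolerating markdown code fences.

    Tries the string as-is first; on failure, strips a leading
    ```json ... ``` wrapper and retries. Returns None if both fail.
    """
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            return None
    return None

print(parse_json_lenient('```json\n{"Guest-1": "Sarah"}\n```'))  # {'Guest-1': 'Sarah'}
```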

ROI and Business Value

Before committing to building this pipeline, I ran through the actual numbers. Here's how the calculation looks for a typical engineering team.

Metric                      Manual Notes    AI Pipeline
Summary time per meeting    10–20 min       2 min (review)
Action item capture rate    ~60%            ~88%
Owner attribution rate      ~70%            ~95%
Deadline captured           ~50%            ~85%
Cost per meeting (AI)       —               ~$0.39

For a team of 6 running 15 meetings per week, the write-up savings alone come to roughly 4 hours/week (15 meetings × ~15 minutes of manual summarizing). Add the per-attendee time spent processing notes and chasing unowned follow-ups, and the recovered time lands in the range of 15–20 hours/week across the team. At even a $50/hr fully loaded cost, that's $750–$1,000 in recovered time per week, against a weekly AI cost of about $6 (15 meetings × ~$0.39).

When This Pays Off

  • Teams with 3+ recurring meetings per week
  • Meetings that regularly produce action items (project standups, client calls, sprint planning)
  • Organizations where missed follow-through has real business cost
  • Teams that currently spend 10+ minutes manually writing up meeting notes

When It Doesn't Pay Off

Pure discussion meetings (brainstorming sessions, all-hands with no decisions) produce little return because there are no action items to extract. If your meetings consistently end without commitments, an AI summary agent won't help — the problem is the meeting structure, not the notes.

What's Next

Part 1 covered the core pipeline: how to structure the multi-agent extraction flow, resolve speakers to names, and score action item confidence. The implementation is solid for internal testing and controlled environments.

Part 2 covers the production reality:

  • Cost breakdown — Real numbers for LLM tokens, Azure Speech, and infrastructure. A 45-minute meeting has a specific cost profile you need to understand before scaling.
  • Observability — How to trace which agent dropped an action item, and why. Silent failures in extraction pipelines are hard to debug without structured logging.
  • Python vs C# decision framework — LangGraph and Semantic Kernel make different trade-offs, and this particular use case has a clear winner depending on your team's stack.
  • When NOT to use AI summarization — There are several meeting patterns where this pipeline is overkill and a simpler approach works better.


This article demonstrates the core extraction pipeline patterns. Production deployments need error handling, retry logic for Azure Speech API calls, and proper state persistence for long-running meetings.

Want More Practical AI Tutorials?

I write about building production AI systems with Azure, Python, and C#. Subscribe for practical tutorials delivered twice a month.


Written by Jaffar Kazi, a software engineer in Sydney building AI-powered applications. Connect on LinkedIn.