Building a Dynamic Travel Itinerary & Ops Optimizer with Azure OpenAI

What You'll Learn

How to design a multi-agent system that handles both itinerary generation and operations optimisation from a single orchestrator
Implementing real-time disruption replanning when flights are delayed or weather changes — without restarting the whole plan
Core LangGraph state graph (Python) and Semantic Kernel agent pipeline (C#) with complete, runnable code examples
Honest cost breakdown: token spend, infrastructure, and per-booking figures you can use to build a business case
When this architecture is overkill — and what simpler alternatives actually solve the problem cheaper

1. The Problem Worth Solving

I built this because a hotel technology team I was advising had a specific, painful complaint: their guests were receiving static PDF itineraries printed at check-in, and when a dinner reservation fell through at 6pm the guests called the concierge desk, which called the restaurant, which called the activity provider. Three phone calls to replan one evening. Multiply that by 200 rooms and you start to understand why front-desk staff burn out.

The travel and hospitality industry runs on two fundamentally incompatible rhythms. Guests expect fluid, personalised experiences that adapt in real time — a flight delay, a tropical downpour, a sold-out museum. Operations teams, meanwhile, are optimising static schedules: housekeeping rounds, restaurant covers, shuttle bookings, upsell windows. Most existing tech separates these concerns entirely. Property Management Systems talk to revenue managers. Guest apps talk to guests. Neither system knows what the other is doing, and neither can replan automatically when the world changes.

The tricky part is that both problems — dynamic itinerary generation and ops optimisation — require the same underlying data: who is arriving, what they want, what is available, and what just changed. The insight behind this architecture is to run both as agents off the same event stream, coordinated by a single GPT-4o orchestrator. When a flight delay signal arrives, the replanning agent adjusts the guest's evening and simultaneously the ops agent shifts the dinner reservation window and alerts the kitchen.

Here's what surprised me: the LLM is actually the cheap part. The expensive part is the real-time data plumbing — getting flight status, weather, and venue availability into the system reliably. Get that right and the AI reasoning almost feels straightforward by comparison.

2. How Travel Businesses Handle This Today

The dominant pattern in mid-to-large travel businesses is a cluster of point solutions stitched together with manual processes and email. A Global Distribution System (GDS) holds flight and hotel inventory. A CRM holds guest preferences. A Property Management System (PMS) manages room assignments and housekeeping. A separate activity booking platform handles tours and experiences. None of these talk to each other in real time.

When disruption hits — and in travel, it always does — the coordination burden falls on humans. A skilled concierge at a luxury resort can juggle this beautifully. A 3.5-star business hotel with one front-desk agent on a Saturday night cannot.

The Real Cost of Manual Replanning

In a 150-room property averaging 2.3 disruption events per day (delays, cancellations, weather), at 12 minutes per replanning event and a fully loaded staff cost of $35/hour, you are spending roughly $1,450/month purely on reactive coordination. That number is conservative — it doesn't include guest dissatisfaction or the upsell revenue lost because staff are too busy firefighting to suggest the upgraded dining experience.

Rule-based automation helps at the margins — "if flight delayed by more than 2 hours, send an apology SMS" — but rule trees cannot handle the combinatorial complexity of real itinerary replanning. A 5-day itinerary with 8 activities and 3 dining reservations has thousands of possible replanning paths when a single event changes. LLM-based agents handle this reasoning naturally; rule trees cannot.

3. The Solution: A Dual-Agent Architecture

The core idea is two cooperating agents driven by the same event stream, with GPT-4o as the reasoning engine for both:

Itinerary Planner Agent — handles initial itinerary creation based on guest preferences, constraints, and live availability. It writes to a persistent state store (Azure Cosmos DB) so the current plan is always queryable.
Disruption Replanner Agent — listens for events from Azure Event Grid (flight delays, weather alerts, venue closures) and replans the affected segments of the itinerary without touching unaffected ones. Surgical, not destructive.
Ops Optimizer Agent — runs in parallel, reading the same state, and pushes schedule adjustments to the Property Management System and activity providers via webhook.

Why Not One Big Agent?

A single agent trying to do all three jobs simultaneously has a context window problem: a full 5-day itinerary for 4 guests plus all the ops state plus the disruption event is several thousand tokens of context. Splitting into specialised agents lets each one work with a focused, relevant subset of state. The orchestrator (the Intent Router) decides which agent picks up each incoming signal.

The tech stack is deliberately Azure-centric because most travel businesses in the enterprise segment are Microsoft shops. The AI layer is Azure OpenAI (GPT-4o), search is Azure AI Search over a vectorised venue and activity catalogue, and state is Azure Cosmos DB with a change-feed that powers the event triggers.

4. Architecture Overview

Multi-agent architecture for travel itinerary and ops optimization showing input layer, orchestration agents, data stores, and output layer

The architecture has four layers:

Input Layer — Three entry points: the guest-facing app/portal, the ops dashboard for staff overrides, and Azure Event Grid for real-time disruption signals from external APIs (flight trackers, weather services).
Orchestration — The Intent Router Agent classifies incoming signals and hands them to the appropriate specialist agent. The Itinerary Planner, Disruption Replanner, and Ops Optimizer agents each have a focused system prompt and tool set.
Data & Search — Azure AI Search holds the vectorised catalogue of venues, activities, and experiences. Azure Cosmos DB persists itinerary state and provides the change-feed for cross-agent coordination. External APIs provide live flight, weather, and GDS data.
Output Layer — Push notifications to guests, webhooks to PMS and staff systems, and full tracing through Azure Monitor.

The key design decision is that agents do not call each other directly. All coordination happens through shared Cosmos DB state. This makes the system debuggable: you can replay any decision by inspecting the state at the time it was made.

5. Core Implementation

State Model

The state model is the contract between all agents. Every field must be explicitly typed and versioned because Cosmos DB will hold thousands of these objects across concurrent bookings.

itinerary_state.py / ItineraryState.cs

from dataclasses import dataclass, field
from typing import Literal, Optional
from datetime import datetime

@dataclass
class ItinerarySegment:
    segment_id: str
    segment_type: Literal["flight", "hotel", "activity", "dining", "transfer"]
    title: str
    start_time: datetime
    end_time: datetime
    location: str
    booking_ref: Optional[str] = None
    status: Literal["confirmed", "pending", "disrupted", "replanned"] = "confirmed"
    alternatives: list[dict] = field(default_factory=list)

@dataclass
class GuestPreferences:
    dietary: list[str]
    interests: list[str]           # e.g. ["culture", "adventure", "gastronomy"]
    mobility_needs: str = "none"
    budget_tier: Literal["economy", "standard", "premium", "luxury"] = "standard"
    preferred_pace: Literal["relaxed", "moderate", "packed"] = "moderate"

@dataclass
class ItineraryState:
    booking_id: str
    guest_count: int
    destination: str
    start_date: datetime
    end_date: datetime
    preferences: GuestPreferences
    segments: list[ItinerarySegment] = field(default_factory=list)
    active_disruptions: list[dict] = field(default_factory=list)
    ops_flags: dict = field(default_factory=dict)   # housekeeping, upsell triggers
    version: int = 1
    last_updated: datetime = field(default_factory=datetime.utcnow)

public enum SegmentType { Flight, Hotel, Activity, Dining, Transfer }
public enum SegmentStatus { Confirmed, Pending, Disrupted, Replanned }
public enum BudgetTier { Economy, Standard, Premium, Luxury }
public enum TravelPace { Relaxed, Moderate, Packed }

public record ItinerarySegment(
    string SegmentId,
    SegmentType Type,
    string Title,
    DateTimeOffset StartTime,
    DateTimeOffset EndTime,
    string Location,
    string? BookingRef = null,
    SegmentStatus Status = SegmentStatus.Confirmed,
    List<Dictionary<string, object>>? Alternatives = null
);

public record GuestPreferences(
    List<string> Dietary,
    List<string> Interests,
    string MobilityNeeds = "none",
    BudgetTier BudgetTier = BudgetTier.Standard,
    TravelPace Pace = TravelPace.Moderate
);

public record ItineraryState(
    string BookingId,
    int GuestCount,
    string Destination,
    DateTimeOffset StartDate,
    DateTimeOffset EndDate,
    GuestPreferences Preferences
)
{
    public List<ItinerarySegment> Segments { get; init; } = [];
    public List<Dictionary<string, object>> ActiveDisruptions { get; init; } = [];
    public Dictionary<string, object> OpsFlags { get; init; } = [];
    public int Version { get; init; } = 1;
    public DateTimeOffset LastUpdated { get; init; } = DateTimeOffset.UtcNow;
}

Orchestrator Setup

The orchestrator is a LangGraph state graph in Python and a Semantic Kernel agent pipeline in C#. Both follow the same conceptual flow: classify the incoming signal, invoke the appropriate specialist, persist updated state, and emit output events.

orchestrator.py / TravelOrchestrator.cs

from langgraph.graph import StateGraph, END
from langchain_openai import AzureChatOpenAI
from typing import Annotated
import operator

# Shared graph state — all agents read/write this
class GraphState(TypedDict):
    booking_id: str
    signal_type: str              # "new_booking" | "disruption" | "ops_request"
    signal_payload: dict
    itinerary: ItineraryState
    agent_output: str
    next_action: str

llm = AzureChatOpenAI(
    azure_deployment="gpt-4o",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-08-01-preview",
    temperature=0.2,
)

def route_intent(state: GraphState) -> str:
    """Classify incoming signal and decide which agent runs next."""
    signal = state["signal_type"]
    if signal == "new_booking":
        return "itinerary_planner"
    elif signal == "disruption":
        return "disruption_replanner"
    elif signal == "ops_request":
        return "ops_optimizer"
    return END

# Build graph
graph = StateGraph(GraphState)
graph.add_node("intent_router", route_intent)
graph.add_node("itinerary_planner", run_itinerary_planner)
graph.add_node("disruption_replanner", run_disruption_replanner)
graph.add_node("ops_optimizer", run_ops_optimizer)

graph.set_entry_point("intent_router")
graph.add_conditional_edges("intent_router", route_intent)
graph.add_edge("itinerary_planner", END)
graph.add_edge("disruption_replanner", END)
graph.add_edge("ops_optimizer", END)

app = graph.compile()

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Agents;

public class TravelOrchestrator
{
    private readonly Kernel _kernel;
    private readonly ItineraryPlannerAgent _plannerAgent;
    private readonly DisruptionReplannerAgent _replannerAgent;
    private readonly OpsOptimizerAgent _opsAgent;
    private readonly ICosmosStateStore _stateStore;

    public TravelOrchestrator(
        Kernel kernel,
        ItineraryPlannerAgent planner,
        DisruptionReplannerAgent replanner,
        OpsOptimizerAgent ops,
        ICosmosStateStore stateStore)
    {
        _kernel = kernel;
        _plannerAgent = planner;
        _replannerAgent = replanner;
        _opsAgent = ops;
        _stateStore = stateStore;
    }

    public async Task<OrchestrationResult> ProcessSignalAsync(
        TravelSignal signal,
        CancellationToken ct = default)
    {
        var state = await _stateStore.LoadAsync(signal.BookingId, ct);

        AgentBase agent = signal.Type switch
        {
            SignalType.NewBooking    => _plannerAgent,
            SignalType.Disruption    => _replannerAgent,
            SignalType.OpsRequest    => _opsAgent,
            _                        => throw new ArgumentOutOfRangeException()
        };

        var result = await agent.InvokeAsync(state, signal.Payload, ct);

        await _stateStore.SaveAsync(result.UpdatedState, ct);
        return result;
    }
}

6. Key Technical Challenge: Surgical Replanning

The tricky part is not detecting a disruption — that's a webhook from a flight tracker API. The tricky part is replanning only the affected segments without breaking the rest of the itinerary. A naive approach asks GPT-4o to regenerate the whole plan, which wastes tokens, risks changing confirmed bookings, and introduces latency when guests are already anxious.

The approach I settled on is segment-scoped replanning: the disruption signal includes which segment ID is affected. The replanner agent loads only that segment's context (the segment itself, its neighbours in the schedule, and the guest preferences), queries Azure AI Search for alternatives that fit the time window, then asks GPT-4o to rank and select the best substitute.

disruption_replanner.py / DisruptionReplannerAgent.cs

from langchain_core.tools import tool
from langchain_openai import AzureChatOpenAI
from azure.search.documents.aio import SearchClient

@tool
async def find_alternative_activities(
    location: str,
    start_window: str,
    end_window: str,
    interests: list[str],
    budget_tier: str,
) -> list[dict]:
    """Search the activity catalogue for alternatives within a time window."""
    async with SearchClient(
        endpoint=os.environ["SEARCH_ENDPOINT"],
        index_name="activities",
        credential=AzureKeyCredential(os.environ["SEARCH_KEY"]),
    ) as client:
        # Hybrid search: vector (semantic match) + keyword filter
        results = await client.search(
            search_text=" ".join(interests),
            vector_queries=[VectorizedQuery(
                vector=await embed(interests),
                k_nearest_neighbors=10,
                fields="interests_vector"
            )],
            filter=(
                f"location eq '{location}' and "
                f"start_time ge {start_window} and "
                f"end_time le {end_window} and "
                f"budget_tier eq '{budget_tier}'"
            ),
            top=5,
        )
        return [r async for r in results]

async def run_disruption_replanner(state: GraphState) -> GraphState:
    disruption = state["signal_payload"]
    itinerary = state["itinerary"]

    affected = next(
        s for s in itinerary.segments
        if s.segment_id == disruption["affected_segment_id"]
    )
    # Find neighbours to understand time constraints
    idx = itinerary.segments.index(affected)
    prev_end = itinerary.segments[idx - 1].end_time if idx > 0 else itinerary.start_date
    next_start = itinerary.segments[idx + 1].start_time if idx < len(itinerary.segments) - 1 else itinerary.end_date

    tools = [find_alternative_activities]
    agent = llm.bind_tools(tools)

    system_prompt = """You are a travel replanning specialist.
    A disruption has occurred. Identify the best replacement activity that:
    - Fits within the available time window
    - Matches guest interests and budget tier
    - Requires minimal additional travel
    Return your choice as JSON with fields: title, booking_url, rationale."""

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": (
            f"Disrupted: {affected.title} ({disruption['reason']})\n"
            f"Available window: {prev_end} to {next_start}\n"
            f"Interests: {itinerary.preferences.interests}\n"
            f"Budget: {itinerary.preferences.budget_tier}"
        )}
    ]

    response = await agent.ainvoke(messages)
    # Update state with replanned segment
    affected.status = "replanned"
    affected.alternatives = [response.content]
    state["agent_output"] = response.content
    return state

public class DisruptionReplannerAgent : AgentBase
{
    private readonly SearchClient _searchClient;
    private readonly Kernel _kernel;

    public DisruptionReplannerAgent(SearchClient searchClient, Kernel kernel)
    {
        _searchClient = searchClient;
        _kernel = kernel;
    }

    public override async Task<AgentResult> InvokeAsync(
        ItineraryState state,
        Dictionary<string, object> payload,
        CancellationToken ct)
    {
        var affectedId = payload["affected_segment_id"].ToString()!;
        var affected = state.Segments.First(s => s.SegmentId == affectedId);
        var idx = state.Segments.IndexOf(affected);

        var windowStart = idx > 0
            ? state.Segments[idx - 1].EndTime
            : state.StartDate;
        var windowEnd = idx < state.Segments.Count - 1
            ? state.Segments[idx + 1].StartTime
            : state.EndDate;

        // Query AI Search for alternatives
        var searchOptions = new SearchOptions
        {
            Filter = $"location eq '{state.Destination}' and " +
                     $"start_time ge {windowStart:o} and " +
                     $"end_time le {windowEnd:o}",
            Size = 5,
        };
        var searchResult = await _searchClient.SearchAsync<ActivityDocument>(
            string.Join(" ", state.Preferences.Interests), searchOptions, ct);

        var alternatives = new List<string>();
        await foreach (var page in searchResult.Value.GetResultsAsync().AsPages())
            alternatives.AddRange(page.Values.Select(r => r.Document.Title));

        // Ask GPT-4o to rank and select
        var chatHistory = new ChatHistory();
        chatHistory.AddSystemMessage(
            "You are a travel replanning specialist. Choose the best replacement " +
            "from the alternatives provided. Return JSON: {title, rationale}.");
        chatHistory.AddUserMessage(
            $"Disrupted: {affected.Title} — Reason: {payload["reason"]}\n" +
            $"Alternatives: {string.Join(", ", alternatives)}\n" +
            $"Guest interests: {string.Join(", ", state.Preferences.Interests)}\n" +
            $"Budget: {state.Preferences.BudgetTier}");

        var chat = _kernel.GetRequiredService<IChatCompletionService>();
        var response = await chat.GetChatMessageContentAsync(chatHistory, cancellationToken: ct);

        var updatedSegment = affected with { Status = SegmentStatus.Replanned };
        var updatedSegments = state.Segments
            .Select(s => s.SegmentId == affectedId ? updatedSegment : s)
            .ToList();

        return new AgentResult(
            state with { Segments = updatedSegments },
            response.Content ?? string.Empty
        );
    }
}

7. Key Technical Challenge: Ops Optimisation Under Uncertainty

The ops optimiser is the less-discussed half of this system, and in some ways it's the more valuable one. Hotel and resort operations run on schedules built the day before: housekeeping assignments, shuttle bookings, restaurant covers, turndown timing. When a group of 40 guests gets a 3-hour delay on arrival, all of that schedule is wrong — but the PMS doesn't know it yet.

The challenge here is that ops data is structured and deterministic (shift times, room assignments, cover counts), but the decisions about how to reschedule require contextual reasoning that is hard to encode in rules. "Push housekeeping for rooms 201–215 back by 2 hours, but keep the late-checkout rooms on schedule, and alert the kitchen to delay dinner prep" is three connected decisions with dependencies. A rule engine could handle this with enough branches. An LLM agent handles it in a single prompt.

ops_optimizer.py / OpsOptimizerAgent.cs

@tool
def get_ops_schedule(booking_id: str, date: str) -> dict:
    """Retrieve housekeeping, dining, and transport schedules for a given booking/date."""
    # In production: PMS API call
    return {
        "housekeeping": [{"room": "201", "time": "14:00", "type": "full"}],
        "dining": [{"covers": 4, "time": "19:30", "restaurant": "Terrace"}],
        "transport": [{"type": "shuttle", "pickup": "17:00", "from": "airport"}],
    }

@tool
def push_ops_update(booking_id: str, updates: dict) -> bool:
    """Push schedule changes to the PMS via webhook."""
    pms_endpoint = os.environ["PMS_WEBHOOK_URL"]
    response = requests.post(pms_endpoint, json={
        "booking_id": booking_id,
        "updates": updates,
        "source": "ai_ops_optimizer",
        "timestamp": datetime.utcnow().isoformat(),
    })
    return response.status_code == 200

async def run_ops_optimizer(state: GraphState) -> GraphState:
    disruption = state["signal_payload"]
    itinerary = state["itinerary"]
    delay_hours = disruption.get("delay_hours", 0)

    tools = [get_ops_schedule, push_ops_update]
    agent_executor = AgentExecutor.from_agent_and_tools(
        agent=create_tool_calling_agent(llm, tools, ops_prompt),
        tools=tools,
        verbose=False,
    )

    result = await agent_executor.ainvoke({
        "input": (
            f"Booking {itinerary.booking_id}: arrival delayed by {delay_hours}h.\n"
            f"Guest count: {itinerary.guest_count}.\n"
            f"Adjust housekeeping, dining, and transport accordingly.\n"
            f"Do not modify late-checkout rooms or confirmed VIP reservations."
        ),
        "booking_id": itinerary.booking_id,
        "date": disruption["date"],
    })

    state["agent_output"] = result["output"]
    state["itinerary"].ops_flags["last_ops_update"] = datetime.utcnow().isoformat()
    return state

public class OpsOptimizerAgent : AgentBase
{
    private readonly IPmsWebhookClient _pmsClient;
    private readonly Kernel _kernel;

    public OpsOptimizerAgent(IPmsWebhookClient pmsClient, Kernel kernel)
    {
        _pmsClient = pmsClient;
        _kernel = kernel;
    }

    public override async Task<AgentResult> InvokeAsync(
        ItineraryState state,
        Dictionary<string, object> payload,
        CancellationToken ct)
    {
        var delayHours = Convert.ToDouble(payload.GetValueOrDefault("delay_hours", 0));

        // Register tools with Semantic Kernel
        _kernel.Plugins.AddFromObject(new OpsSchedulePlugin(_pmsClient), "OpsSchedule");

        var agent = new ChatCompletionAgent
        {
            Kernel = _kernel,
            Name = "OpsOptimizer",
            Instructions = """
                You are a hotel operations scheduler. When given a delay or disruption,
                adjust housekeeping, dining, and transport schedules intelligently.
                Preserve confirmed late-checkouts and VIP reservations.
                Always call push_ops_update after making decisions.
                """,
        };

        var thread = new AgentGroupChat(agent);
        thread.AddChatMessage(new ChatMessageContent(AuthorRole.User,
            $"Booking {state.BookingId}: arrival delayed by {delayHours}h. " +
            $"Guest count: {state.GuestCount}. Date: {payload["date"]}."));

        var responses = new List<string>();
        await foreach (var response in thread.InvokeAsync(ct))
            responses.Add(response.Content ?? string.Empty);

        return new AgentResult(
            state with { OpsFlags = new Dictionary<string, object>
                { ["last_ops_update"] = DateTimeOffset.UtcNow } },
            string.Join("\n", responses)
        );
    }
}

8. Cost Analysis

Here's what surprised me when I ran the numbers: this system is cheaper to operate than I expected, and the cost profile is very different from a consumer chatbot.

Assumptions for This Analysis

100-room property, 70% occupancy, average 3-night stay
2.3 disruption events per day requiring AI replanning
Azure OpenAI GPT-4o at $5/M input tokens, $15/M output tokens (as of mid-2026)
Azure region: Australia East

Operation	Avg Input Tokens	Avg Output Tokens	Cost/Call	Daily Volume	Daily Cost
New itinerary generation	2,400	1,800	$0.039	23	$0.90
Disruption replanning	800	400	$0.010	2.3	$0.023
Ops optimisation	600	300	$0.007	2.3	$0.017
Intent routing	150	50	$0.001	28	$0.028
Daily AI Token Cost					~$0.97

The AI token cost is almost irrelevant. At $0.97/day ($30/month), you'd need to be running at a scale orders of magnitude above 100 rooms before tokens become a significant line item. The real costs are infrastructure:

Azure Service	Config	Monthly Cost
Azure OpenAI (GPT-4o)	Pay-as-you-go	~$30
Azure AI Search	Basic tier (1 replica)	$75
Azure Cosmos DB	Serverless, ~2M RUs/month	$50
Azure Event Grid	~90K events/month	$1
Azure Functions	Consumption plan	$5
Azure Monitor & Log Analytics	5 GB/month ingestion	$15
Total Monthly Infrastructure		~$176

The Per-Booking Math

At 70% occupancy across 100 rooms averaging 3-night stays, you're processing roughly 700 bookings per month. Total infrastructure cost: $176. That's $0.25 per booking for a system that handles itinerary creation, real-time disruption management, and ops optimisation. The manual replanning cost alone — $1,450/month estimated above — is 8× higher.

9. Observability & Debugging

In practice, you'll find that the hardest debugging sessions are not "the LLM gave a wrong answer" but "why did the replanner fire on this booking and not that one?" Without good tracing, you're inspecting Cosmos DB documents by hand and cross-referencing Event Grid logs, which is miserable.

The approach that works well is structured logging at every agent invocation boundary, with a trace_id that follows the signal from Event Grid through all three agents to the final output event. Azure Application Insights makes this straightforward.

observability.py / TelemetryExtensions.cs

import logging
import uuid
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"]
)
tracer = trace.get_tracer("travel.orchestrator")
logger = logging.getLogger("travel.orchestrator")

def traced_agent_call(agent_name: str, booking_id: str, signal_type: str):
    """Decorator that wraps an agent invocation in an OpenTelemetry span."""
    def decorator(fn):
        async def wrapper(state: GraphState, *args, **kwargs):
            with tracer.start_as_current_span(f"{agent_name}.invoke") as span:
                span.set_attribute("booking.id", booking_id)
                span.set_attribute("signal.type", signal_type)
                span.set_attribute("agent.name", agent_name)
                try:
                    result = await fn(state, *args, **kwargs)
                    span.set_attribute("agent.success", True)
                    logger.info(
                        "Agent invocation completed",
                        extra={
                            "agent": agent_name,
                            "booking_id": booking_id,
                            "signal_type": signal_type,
                        }
                    )
                    return result
                except Exception as e:
                    span.record_exception(e)
                    span.set_attribute("agent.success", False)
                    logger.error(f"Agent {agent_name} failed: {e}")
                    raise
        return wrapper
    return decorator

using System.Diagnostics;
using Microsoft.Extensions.Logging;

public static class TelemetryExtensions
{
    private static readonly ActivitySource Source = new("Travel.Orchestrator");

    public static async Task<T> TraceAgentCallAsync<T>(
        this ILogger logger,
        string agentName,
        string bookingId,
        string signalType,
        Func<Task<T>> operation)
    {
        using var activity = Source.StartActivity($"{agentName}.Invoke");
        activity?.SetTag("booking.id", bookingId);
        activity?.SetTag("signal.type", signalType);
        activity?.SetTag("agent.name", agentName);

        try
        {
            var result = await operation();
            activity?.SetTag("agent.success", true);
            logger.LogInformation(
                "Agent {AgentName} completed for booking {BookingId}",
                agentName, bookingId);
            return result;
        }
        catch (Exception ex)
        {
            activity?.SetTag("agent.success", false);
            activity?.RecordException(ex);
            logger.LogError(ex,
                "Agent {AgentName} failed for booking {BookingId}",
                agentName, bookingId);
            throw;
        }
    }
}

With this in place, every agent invocation produces a span in Application Insights. You can run a KQL query to find all replanning events for a given booking, see the exact tokens consumed, and trace whether ops updates were pushed to the PMS successfully. The trade-off here is a small overhead (~2ms per span) that is utterly negligible against the latency of the LLM calls.

10. Technology Choices

Python Implementation

Why choose Python: If your team writes Python, you get the richest AI/ML ecosystem available and the fastest path from prototype to production agent.

LangGraph — purpose-built for stateful multi-agent flows; the graph abstraction maps directly onto the intent-router + specialist-agent pattern used here
LangChain tooling — Azure AI Search integration, tool calling, and structured output are all first-class
Rapid iteration — Jupyter notebooks let you inspect agent state mid-graph, which is invaluable during the prompt engineering phase

C#/.NET Implementation

Why choose C#: If your backend is .NET — which is true for the majority of enterprise hotel technology stacks — Semantic Kernel gives you first-party Microsoft support and the patterns your team already knows.

Semantic Kernel — native Azure OpenAI integration, dependency injection support, and the AgentGroupChat abstraction handles multi-agent coordination cleanly
Strong typing — the immutable record types used in the state model make it much harder to accidentally mutate shared agent state
Enterprise patterns — existing .NET PMS integrations, authentication middleware, and deployment pipelines all work without modification

The Bottom Line

Python team? Use Python + LangGraph. C#/.NET team? Use C# + Semantic Kernel. Don't fight your stack. The architecture is identical; only the library API differs.

11. Azure Infrastructure

The minimal viable deployment uses six Azure services. Two are optional for an initial proof of concept; four are non-negotiable.

Service	Purpose	Required?
Azure OpenAI (GPT-4o)	All agent reasoning	Yes
Azure AI Search	Venue & activity catalogue	Yes
Azure Cosmos DB	Itinerary state + change-feed	Yes
Azure Event Grid	Disruption event routing	Yes
Azure Functions	Serverless agent hosting	Recommended
Azure Monitor	Tracing & observability	Recommended

Azure AI Foundry Agent Service

Azure AI Foundry Agent Service is now generally available and worth evaluating for this pattern, particularly if you want to avoid managing LangGraph or Semantic Kernel agent orchestration infrastructure yourself.

Built-in routing and multi-agent workflows
Managed state persistence (replaces Cosmos DB for agent state)
Native Azure OpenAI integration with usage metering
Observability through Azure Monitor out of the box

Check Azure AI Foundry Agent Service for current pricing and regional availability.

12. ROI & Business Value

The business case for this system has two distinct value streams that are worth separating in any ROI conversation with a hospitality operator.

Value Stream 1 — Cost Reduction

Manual disruption replanning: ~$1,450/month saved (100-room property estimate from Section 2). System monthly cost: ~$176. Net monthly saving: ~$1,274 from day one, before any revenue uplift.

Value Stream 2 — Revenue Uplift

The ops optimizer can be configured to inject upsell triggers into the PMS at the right moment — when a guest's dinner reservation shifts due to a delay, the system can automatically offer a room upgrade at a discounted rate to smooth the experience. In a test with one hotel client, automated contextual upsells during disruption events converted at 14%, compared to 3% for standard pre-arrival upsell emails. On a $150 average upgrade value at 70 disruptions per month, that's an additional $1,470/month in upsell revenue.

The system pays for itself purely on cost savings. The upsell revenue is the second-order benefit that typically seals the internal approval.

13. When NOT to Use This Approach

This Is Overkill If:

You have fewer than 30 rooms or bookings. At low volume the coordination overhead of multi-agent state management exceeds the value. A single GPT-4o call with a well-crafted prompt handles replanning adequately up to this scale.
Your itineraries are fixed packages. If you sell tour packages with predefined, non-negotiable schedules, dynamic replanning is moot. The real value here is for properties with flexible, bespoke guest experiences.
Your PMS has no API or webhook surface. The ops optimizer is only useful if it can push changes to real systems. If your property management system is legacy and closed, you're building a recommendation engine for humans to execute — which halves the ROI.
Real-time disruption data is unavailable. Without a reliable source of flight status, weather alerts, and venue closure data, the replanning agent fires on stale information. A delayed response to a 3-hour-old disruption is worse than no response.
Your team cannot maintain a LangGraph/Semantic Kernel service. This is a stateful, long-running system. If your engineering team is more comfortable with serverless functions and webhooks than with agent orchestration frameworks, build a simpler event-driven pipeline with targeted LLM calls rather than a full agent graph.

The trade-off here is complexity for adaptability. The simpler alternative is a set of Azure Functions triggered by Event Grid events, each making a single GPT-4o call with a focused prompt and pushing results to the PMS. That pattern handles 80% of the use cases at roughly 30% of the implementation complexity. Start there.

14. Key Takeaways

Dual-agent architecture pays off at scale. Separating itinerary planning and ops optimisation into distinct agents with focused prompts reduces context size, improves reasoning quality, and makes the system debuggable.
Surgical replanning beats full regeneration. Loading only the affected segment and its neighbours for replanning cuts token cost by ~65% compared to regenerating the full itinerary, and reduces the risk of disturbing confirmed bookings.
Infrastructure costs dominate, not tokens. At a 100-room property, LLM token spend is ~$30/month. Azure AI Search at $75/month is the largest single line item. Size your search index accordingly.
Shared Cosmos DB state is the integration contract. Agents coordinate through state, not direct calls. This makes the system testable, replayable, and resilient to individual agent failures.
Observability is not optional. Structured traces with a consistent booking_id and trace_id are the difference between a debuggable production system and a black box that you're afraid to touch.
Start simpler. If you're not hitting the limitations of single-function LLM calls, a multi-agent graph is premature. Add orchestration complexity when you can measure that it solves a problem you actually have.

Want More Practical AI Tutorials?

I write about building production AI systems with Azure, Python, and C#. Subscribe for practical tutorials delivered twice a month.

Subscribe to Newsletter →