Building a RAG Knowledge Bot - Part 2: Production Considerations

Building the RAG pipeline was the easy part. Keeping it running in production without burning through your Azure budget? That is where the real engineering begins.

In Part 1, we built a RAG-based internal knowledge bot from the ground up. We covered document chunking strategies, embedding generation with Azure OpenAI, hybrid retrieval using Azure AI Search, and the response generation pipeline. If you have not read Part 1 yet, I would recommend starting there since this article builds directly on those foundations.

But here is the thing I have learned shipping RAG systems to production: the architecture is maybe 30% of the work. The other 70% is answering hard questions that nobody talks about in tutorials. How much will this cost at scale? How do I know when retrieval quality degrades? Should I even be using RAG for this use case?

In this article, I will tackle those production realities head-on. We will look at real cost numbers, build observability into the pipeline, compare technology choices, and — perhaps most importantly — discuss when RAG is the wrong tool for the job.

What You'll Learn

Real cost breakdown for Azure OpenAI RAG systems at different scales
How to add tracing and observability to every stage of the RAG pipeline
A decision framework for choosing Python vs. C# for your implementation
When NOT to use RAG — and what to use instead

Cost Analysis: What RAG Actually Costs

Let me be honest: one of the first questions any stakeholder will ask is "how much does this cost?" And vague answers like "it depends" are not going to cut it. So here is a real breakdown based on Azure OpenAI pricing as of early 2026.

Per-Query Token Economics

Every RAG query involves two LLM calls: one for embedding the query, and one for generating the answer. Here is how the tokens break down for a typical knowledge bot query:

Query embedding: ~500 tokens (text-embedding-3-large at $0.00013/1K tokens) = ~$0.000065
Context injection: ~2,000 tokens of retrieved chunks sent as input to GPT-4o ($2.50/1M input tokens) = ~$0.005
Answer generation: ~500 tokens of output from GPT-4o ($10.00/1M output tokens) = ~$0.005

That puts a single query at roughly $0.008 to $0.01. Not bad for one query. But multiply that by thousands of employees asking questions daily, and the numbers add up fast.

Monthly Cost Estimates by Scale

Scale	Queries/Day	Azure OpenAI (Tokens)	Azure AI Search	Blob Storage (100 GB)	Estimated Monthly Total
Small team	100	~$24	$75 (Basic)	$2	~$101
Department	1,000	~$240	$250 (Standard S1)	$2	~$492
Enterprise	10,000	~$2,400	$500 (Standard S2)	$2	~$2,902

Watch Out for Long Documents

These estimates assume ~2,000 tokens of context per query. If your chunking strategy sends 5,000+ tokens of context (or users ask follow-up questions that carry conversation history), your per-query cost can triple. I have seen teams blow through their monthly budget in a week because they were injecting entire documents instead of relevant chunks. Always monitor your token usage in the first two weeks after launch.

Observability and Debugging

RAG pipelines have a unique debugging challenge: when the answer is wrong, you need to figure out whether the problem was in retrieval (wrong chunks), in the prompt (ambiguous instructions), or in the model (hallucination). Without structured observability, you are flying blind.

Tracing Every Pipeline Stage

I instrument every stage of the RAG pipeline with structured traces. Each query gets a correlation ID that follows it from the initial request through embedding, retrieval, and generation. This makes it possible to reconstruct exactly what happened for any given query after the fact.

import logging
import time
import uuid
from dataclasses import dataclass, field
from typing import List, Optional

logger = logging.getLogger("rag_pipeline")

@dataclass
class PipelineTrace:
    """Captures timing and metadata for each RAG pipeline stage."""
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    stages: List[dict] = field(default_factory=list)

    def start_stage(self, name: str) -> dict:
        stage = {
            "name": name,
            "start_time": time.perf_counter(),
            "end_time": None,
            "metadata": {},
        }
        self.stages.append(stage)
        return stage

    def end_stage(self, stage: dict, **metadata):
        stage["end_time"] = time.perf_counter()
        stage["duration_ms"] = (
            (stage["end_time"] - stage["start_time"]) * 1000
        )
        stage["metadata"].update(metadata)

    def log_summary(self):
        total_ms = sum(s.get("duration_ms", 0) for s in self.stages)
        logger.info(
            "Pipeline complete",
            extra={
                "correlation_id": self.correlation_id,
                "total_duration_ms": round(total_ms, 2),
                "stages": [
                    {
                        "name": s["name"],
                        "duration_ms": round(s.get("duration_ms", 0), 2),
                        **s["metadata"],
                    }
                    for s in self.stages
                ],
            },
        )


async def traced_rag_query(query: str, search_client, llm_client):
    """Execute a RAG query with full pipeline tracing."""
    trace = PipelineTrace()

    # Stage 1: Embed the query
    stage = trace.start_stage("embedding")
    query_vector = await llm_client.embed(query)
    trace.end_stage(stage, token_count=len(query.split()))

    # Stage 2: Retrieve relevant chunks
    stage = trace.start_stage("retrieval")
    results = await search_client.hybrid_search(
        query=query, vector=query_vector, top_k=5
    )
    trace.end_stage(
        stage,
        chunks_returned=len(results),
        top_score=results[0].score if results else 0,
    )

    # Stage 3: Generate answer
    stage = trace.start_stage("generation")
    context = "\n\n".join([r.content for r in results])
    answer = await llm_client.generate(query=query, context=context)
    trace.end_stage(stage, output_tokens=len(answer.split()))

    trace.log_summary()
    return answer

using System.Diagnostics;
using Microsoft.Extensions.Logging;

public class PipelineTrace
{
    public string CorrelationId { get; } = Guid.NewGuid().ToString();
    private readonly List<StageTrace> _stages = new();
    private readonly ILogger _logger;

    public PipelineTrace(ILogger logger) => _logger = logger;

    public StageTrace StartStage(string name)
    {
        var stage = new StageTrace(name);
        _stages.Add(stage);
        return stage;
    }

    public void LogSummary()
    {
        var totalMs = _stages.Sum(s => s.DurationMs);
        _logger.LogInformation(
            "Pipeline complete. CorrelationId={CorrelationId}, " +
            "TotalMs={TotalMs}, Stages={StageCount}",
            CorrelationId, Math.Round(totalMs, 2), _stages.Count);

        foreach (var stage in _stages)
        {
            _logger.LogInformation(
                "  Stage={Name}, DurationMs={Ms}, Metadata={Meta}",
                stage.Name, Math.Round(stage.DurationMs, 2),
                stage.Metadata);
        }
    }
}

public class StageTrace : IDisposable
{
    public string Name { get; }
    public double DurationMs { get; private set; }
    public Dictionary<string, object> Metadata { get; } = new();
    private readonly Stopwatch _sw = Stopwatch.StartNew();

    public StageTrace(string name) => Name = name;

    public void AddMetadata(string key, object value)
        => Metadata[key] = value;

    public void Dispose()
    {
        _sw.Stop();
        DurationMs = _sw.Elapsed.TotalMilliseconds;
    }
}

// Usage in a RAG query handler
public class TracedRagQueryHandler
{
    private readonly ISearchClient _searchClient;
    private readonly ILlmClient _llmClient;
    private readonly ILogger<TracedRagQueryHandler> _logger;

    public TracedRagQueryHandler(
        ISearchClient searchClient,
        ILlmClient llmClient,
        ILogger<TracedRagQueryHandler> logger)
    {
        _searchClient = searchClient;
        _llmClient = llmClient;
        _logger = logger;
    }

    public async Task<string> ExecuteAsync(string query)
    {
        var trace = new PipelineTrace(_logger);

        // Stage 1: Embed the query
        float[] queryVector;
        using (var stage = trace.StartStage("embedding"))
        {
            queryVector = await _llmClient.EmbedAsync(query);
            stage.AddMetadata("token_count", query.Split().Length);
        }

        // Stage 2: Retrieve relevant chunks
        List<SearchResult> results;
        using (var stage = trace.StartStage("retrieval"))
        {
            results = await _searchClient.HybridSearchAsync(
                query, queryVector, topK: 5);
            stage.AddMetadata("chunks_returned", results.Count);
            stage.AddMetadata("top_score",
                results.FirstOrDefault()?.Score ?? 0);
        }

        // Stage 3: Generate answer
        string answer;
        using (var stage = trace.StartStage("generation"))
        {
            var context = string.Join("\n\n",
                results.Select(r => r.Content));
            answer = await _llmClient.GenerateAsync(query, context);
            stage.AddMetadata("output_tokens", answer.Split().Length);
        }

        trace.LogSummary();
        return answer;
    }
}

Key Metrics to Track

Once you have tracing in place, here are the metrics I watch on a dashboard. These tell you both whether the system is healthy and whether answer quality is degrading over time:

Metric	What It Tells You	Healthy Range
Retrieval latency (p50 / p95)	How fast chunks are returned from the index	p50 < 100ms, p95 < 300ms
Top chunk relevance score	Whether the best match is actually relevant	> 0.75 for the top result
Chunk hit rate	Percentage of queries where at least one chunk scores above threshold	> 85%
Generation latency (p50 / p95)	Time-to-first-token and total generation time	p50 < 2s, p95 < 5s
Token usage per query	Cost efficiency and context window usage	< 3,000 tokens total

I strongly recommend setting up alerts when chunk hit rate drops below 80% or retrieval latency exceeds 500ms. These are early warning signs that either your index needs updating or your chunking strategy is not aligned with how users actually phrase their questions.

Technology Choices: Python vs. C#

Both Part 1 and this article show code in Python and C# because the reality is that most organizations are not greenfield. You are building RAG into an existing ecosystem, and the language choice matters more for operational reasons than technical ones.

Why Choose Python

Python is the default language for AI/ML development, and for good reason. The LangChain ecosystem gives you pre-built integrations for nearly every vector store, LLM provider, and document loader you can think of. If your team has data scientists or ML engineers, they are already writing Python. Rapid prototyping is significantly faster — I can get a working RAG proof-of-concept running in an afternoon with LangChain and FAISS.

The trade-off is that Python services can require more operational overhead in production. You will need to manage async properly (everything should be async/await), and deployment to Azure Functions or Container Apps requires a bit more configuration than a .NET equivalent.

Why Choose C#

If you are in an enterprise .NET shop, C# with Semantic Kernel is a compelling choice. Microsoft builds Semantic Kernel in-house, which means Azure integration is first-class — Azure AI Search connectors, Azure OpenAI clients, and managed identity support all work out of the box. The type system catches errors at compile time that Python would only surface at runtime, which matters when you are building a system that hundreds of employees depend on.

Semantic Kernel's plugin architecture also maps cleanly to the agent pattern. Each RAG capability (search, summarize, generate) becomes a plugin that you can compose and test independently.

Decision Framework

Factor	Lean Python	Lean C#
Team expertise	Data science / ML background	.NET / enterprise development
Ecosystem needs	LangChain, HuggingFace, FAISS	Semantic Kernel, Azure SDK
Performance profile	IO-bound (fine with async Python)	High-throughput, low-latency
Azure integration depth	Good (via SDKs)	Native (Semantic Kernel is Microsoft)
Prototyping speed	Fastest (notebooks, REPL)	Moderate (compile step, boilerplate)
Production reliability	Good (with discipline)	Excellent (type safety, tooling)

My recommendation: Prototype in Python, then evaluate whether to keep it or port to C# based on your team's operational comfort. The RAG logic itself is nearly identical in both languages — it is the deployment and maintenance story that differs.

Azure Infrastructure

A production RAG knowledge bot requires several Azure services working together. Here is the infrastructure I recommend, along with why each piece exists.

Flowchart diagram of RAG Knowledge Bot Azure architecture

Document ingestion pipeline feeds into the search index; user queries flow through retrieval and generation

Azure Blob Storage — Source of truth for all documents. Upload PDFs, Word docs, or Markdown files here. Azure Functions triggers on new blobs to kick off the ingestion pipeline.
Azure Functions — Runs the chunking and embedding pipeline. A blob trigger fires when new documents arrive, splits them into chunks, generates embeddings, and pushes them to the search index.
Azure OpenAI — Hosts both text-embedding-3-large (for embeddings) and GPT-4o (for answer generation). Use managed identity for authentication — no API keys in your code.
Azure AI Search — Stores the vector index and supports hybrid (vector + keyword) retrieval. This is the core of the retrieval pipeline we built in Part 1.

For teams that want to reduce operational complexity, Azure AI Foundry Agent Service offers a managed alternative that handles orchestration, memory, and tool use out of the box. It is worth evaluating if you want to avoid managing the infrastructure yourself, though you trade some customization for convenience.

Infrastructure as Code

I recommend defining all of this in Bicep or Terraform from day one. RAG systems have enough moving parts that manual portal setup becomes a liability quickly. Here is a quick example of provisioning the core resources.

# config.py - Environment configuration for the RAG pipeline
import os
from dataclasses import dataclass

@dataclass
class RagConfig:
    """Central configuration for all Azure services."""
    # Azure OpenAI
    openai_endpoint: str = os.getenv("AZURE_OPENAI_ENDPOINT", "")
    openai_deployment: str = os.getenv("AZURE_OPENAI_DEPLOYMENT", "gpt-4o")
    embedding_deployment: str = os.getenv(
        "AZURE_EMBEDDING_DEPLOYMENT", "text-embedding-3-large"
    )

    # Azure AI Search
    search_endpoint: str = os.getenv("AZURE_SEARCH_ENDPOINT", "")
    search_index: str = os.getenv("AZURE_SEARCH_INDEX", "knowledge-index")

    # Azure Blob Storage
    storage_connection: str = os.getenv("AZURE_STORAGE_CONNECTION", "")
    storage_container: str = os.getenv("AZURE_STORAGE_CONTAINER", "documents")

    # Pipeline settings
    chunk_size: int = int(os.getenv("CHUNK_SIZE", "512"))
    chunk_overlap: int = int(os.getenv("CHUNK_OVERLAP", "50"))
    retrieval_top_k: int = int(os.getenv("RETRIEVAL_TOP_K", "5"))

    # Scaling
    max_concurrent_embeddings: int = int(
        os.getenv("MAX_CONCURRENT_EMBEDDINGS", "10")
    )
    rate_limit_rpm: int = int(os.getenv("RATE_LIMIT_RPM", "60"))


# Usage
config = RagConfig()
print(f"Search index: {config.search_index}")
print(f"Chunk size: {config.chunk_size} tokens")

// RagConfig.cs - Environment configuration for the RAG pipeline
public class RagConfig
{
    // Azure OpenAI
    public string OpenAiEndpoint { get; set; } = "";
    public string OpenAiDeployment { get; set; } = "gpt-4o";
    public string EmbeddingDeployment { get; set; }
        = "text-embedding-3-large";

    // Azure AI Search
    public string SearchEndpoint { get; set; } = "";
    public string SearchIndex { get; set; } = "knowledge-index";

    // Azure Blob Storage
    public string StorageConnection { get; set; } = "";
    public string StorageContainer { get; set; } = "documents";

    // Pipeline settings
    public int ChunkSize { get; set; } = 512;
    public int ChunkOverlap { get; set; } = 50;
    public int RetrievalTopK { get; set; } = 5;

    // Scaling
    public int MaxConcurrentEmbeddings { get; set; } = 10;
    public int RateLimitRpm { get; set; } = 60;
}

// In Program.cs or Startup.cs
builder.Services.Configure<RagConfig>(
    builder.Configuration.GetSection("Rag"));

// appsettings.json
// {
//   "Rag": {
//     "OpenAiEndpoint": "https://your-instance.openai.azure.com/",
//     "SearchEndpoint": "https://your-search.search.windows.net",
//     "SearchIndex": "knowledge-index",
//     "ChunkSize": 512,
//     "RetrievalTopK": 5
//   }
// }

For scaling, keep in mind that Azure OpenAI has rate limits measured in tokens-per-minute (TPM). At the enterprise scale (10,000 queries/day), you will likely need a provisioned throughput deployment rather than pay-as-you-go to avoid throttling during peak hours.

When NOT to Use RAG

This might be the most important section of this article. I have seen teams spend months building RAG systems for problems that had simpler, cheaper, and more reliable solutions. Here is when I would steer you away from RAG.

RAG Is Probably the Wrong Choice When...

Small document sets (<50 documents): If your entire knowledge base is a few dozen pages, full-text search with a simple keyword index will give you faster, cheaper, and more deterministic results. You do not need embeddings and vector search for a 30-page FAQ.
Highly structured data: If the answers live in a database with well-defined schemas (product catalogs, employee directories, inventory systems), write a SQL query or build an API. RAG adds unnecessary complexity and introduces the risk of hallucinated answers from data that should be exact.
Real-time data requirements: RAG has an inherent ingestion lag. Documents need to be chunked, embedded, and indexed before they are searchable. If your users need answers from data that changes every few minutes (live dashboards, trading systems), RAG will always be behind.
Compliance-critical answers: When you need deterministic, auditable answers — like regulatory compliance questions or legal interpretations — RAG's probabilistic nature is a liability. The same question can produce slightly different answers depending on which chunks are retrieved and how the model synthesizes them. For these cases, curated FAQ systems or decision trees are safer.
Budget constraints under $50/month: At very low volumes with tight budgets, the Azure AI Search basic tier alone ($75/month) may exceed your entire budget. Consider a simpler embedding-only approach with FAISS stored in-memory if you are building for a small team.

None of this means RAG is bad. It means RAG is a tool, and like any tool, it has a sweet spot. That sweet spot is medium-to-large unstructured document collections where users ask natural language questions and expect synthesized answers. If your use case fits that description, RAG is excellent. If it does not, save yourself the engineering effort.

Quick Decision Checklist

def should_use_rag(
    doc_count: int,
    data_is_structured: bool,
    needs_real_time: bool,
    needs_deterministic: bool,
    monthly_budget: float,
) -> dict:
    """Simple decision helper for RAG vs. alternatives."""
    reasons_against = []

    if doc_count < 50:
        reasons_against.append(
            "Small doc set - consider full-text search instead"
        )
    if data_is_structured:
        reasons_against.append(
            "Structured data - use SQL/API queries instead"
        )
    if needs_real_time:
        reasons_against.append(
            "Real-time needs - RAG has ingestion lag"
        )
    if needs_deterministic:
        reasons_against.append(
            "Deterministic answers needed - use curated FAQ"
        )
    if monthly_budget < 100:
        reasons_against.append(
            "Budget too low for Azure AI Search + OpenAI"
        )

    use_rag = len(reasons_against) == 0
    return {
        "recommend_rag": use_rag,
        "reasons_against": reasons_against,
        "alternative": (
            "RAG is a good fit!" if use_rag
            else f"Consider alternatives: {reasons_against[0]}"
        ),
    }


# Example usage
result = should_use_rag(
    doc_count=500,
    data_is_structured=False,
    needs_real_time=False,
    needs_deterministic=False,
    monthly_budget=300,
)
print(result)
# {"recommend_rag": True, "alternative": "RAG is a good fit!"}

public record RagDecision(
    bool RecommendRag,
    List<string> ReasonsAgainst,
    string Alternative);

public static class RagAdvisor
{
    /// <summary>
    /// Simple decision helper for RAG vs. alternatives.
    /// </summary>
    public static RagDecision ShouldUseRag(
        int docCount,
        bool dataIsStructured,
        bool needsRealTime,
        bool needsDeterministic,
        decimal monthlyBudget)
    {
        var reasons = new List<string>();

        if (docCount < 50)
            reasons.Add(
                "Small doc set - consider full-text search");
        if (dataIsStructured)
            reasons.Add(
                "Structured data - use SQL/API queries");
        if (needsRealTime)
            reasons.Add(
                "Real-time needs - RAG has ingestion lag");
        if (needsDeterministic)
            reasons.Add(
                "Deterministic answers needed - use curated FAQ");
        if (monthlyBudget < 100)
            reasons.Add(
                "Budget too low for Azure AI Search + OpenAI");

        var useRag = reasons.Count == 0;
        var alt = useRag
            ? "RAG is a good fit!"
            : $"Consider alternatives: {reasons[0]}";

        return new RagDecision(useRag, reasons, alt);
    }
}

// Example usage
var result = RagAdvisor.ShouldUseRag(
    docCount: 500,
    dataIsStructured: false,
    needsRealTime: false,
    needsDeterministic: false,
    monthlyBudget: 300m);

Console.WriteLine(result.Alternative);
// "RAG is a good fit!"

Key Takeaways

Know your costs upfront: At ~$0.008 per query, a RAG knowledge bot is affordable for most teams. But token costs scale linearly, and long context windows can multiply your bill fast. Monitor from day one.
Observability is not optional: Instrument every stage of the pipeline with correlation IDs and structured logging. When answer quality degrades, you need to pinpoint whether the issue is in retrieval, context assembly, or generation.
Choose your language based on your team: Python for speed and ecosystem breadth, C# for enterprise integration and type safety. The RAG logic itself is nearly identical in both.
Right-size your infrastructure: Start with Azure AI Search Basic tier and pay-as-you-go OpenAI. Scale up to Standard tier and provisioned throughput only when query volume demands it.
Be honest about whether you need RAG: Small document sets, structured data, and compliance-critical use cases often have simpler solutions. RAG shines for large unstructured collections with natural language queries.

The Production Readiness Test

Before shipping your RAG knowledge bot to real users, make sure you can answer "yes" to all of these:

Can you trace any query from input to output with a single correlation ID?
Do you have alerts for retrieval quality degradation (chunk hit rate, relevance scores)?
Have you load-tested at 2x your expected peak query volume?
Is your infrastructure defined in code (Bicep, Terraform, or ARM templates)?
Do you have a document re-ingestion pipeline for when content is updated?

If you missed Part 1, that is where we built the core RAG pipeline — chunking, embedding, hybrid retrieval, and response generation. Together, these two articles give you a complete picture: the architecture to build it and the operational knowledge to run it.

This article covers production considerations for RAG systems. Actual costs may vary based on your Azure region, usage patterns, and pricing tier. Always validate with the Azure pricing calculator for your specific workload.

Want More Practical AI Tutorials?

I write about building production AI systems with Azure, Python, and C#. Subscribe for practical tutorials delivered twice a month.

Subscribe to Newsletter →