"What does this actually cost to run?"
That's the question every AI demo avoids. In Part 1, we built a multi-agent contract review system with document parsing, clause extraction, risk analysis, and obligation tracking. It works beautifully in development.
Now let's talk about what happens when you deploy it to production: real costs, observability challenges, technology decisions, and—critically—when you should NOT use this approach at all.
What You'll Learn
- Real token costs per contract (with examples)
- Monthly infrastructure costs at different scales
- Observability patterns for debugging agent pipelines
- Python vs C# decision framework
- When AI contract review is NOT the right solution
Reading time: 11 minutes
Real Cost Analysis
Let's break down the actual costs you'll see in production.
Token Costs Per Contract
Each contract goes through multiple LLM calls: clause classification, risk analysis, obligation extraction, and report generation. Here's what that looks like with GPT-4o pricing ($2.50/1M input, $10/1M output):
| Contract Type | Pages | Input Tokens | Output Tokens | LLM Cost |
|---|---|---|---|---|
| Simple NDA | 5 | ~8,000 | ~2,000 | $0.04 |
| SaaS Agreement | 20 | ~35,000 | ~8,000 | $0.17 |
| Enterprise MSA | 50 | ~90,000 | ~20,000 | $0.43 |
| Complex M&A | 100+ | ~200,000 | ~40,000 | $0.90 |
These numbers are for GPT-4o. If you use GPT-4o-mini for classification (keeping GPT-4o for risk analysis), you can reduce costs by 40-50%.
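To sanity-check these numbers for your own contracts, the per-contract math is simple enough to script. Here's a minimal estimator using the GPT-4o list prices quoted above; the 0.55 multiplier for the mixed-model setup is an assumption taken from the 40-50% savings estimate, not a measured figure:

```python
# Per-contract LLM cost estimator. Prices are the GPT-4o rates
# quoted in this article ($2.50/1M input, $10/1M output).
GPT4O_INPUT_PER_M = 2.50
GPT4O_OUTPUT_PER_M = 10.00

def llm_cost(input_tokens: int, output_tokens: int,
             mini_for_classification: bool = False) -> float:
    """Estimate the LLM cost in USD for one contract."""
    cost = (input_tokens / 1_000_000 * GPT4O_INPUT_PER_M
            + output_tokens / 1_000_000 * GPT4O_OUTPUT_PER_M)
    if mini_for_classification:
        # Midpoint of the article's 40-50% savings estimate (assumption)
        cost *= 0.55
    return cost

# Enterprise MSA from the table: ~90k input, ~20k output tokens
print(f"${llm_cost(90_000, 20_000):.2f}")
```

Swap in your own token counts from the API usage response to track real per-contract spend.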
Azure Document Intelligence Costs
Document parsing adds another layer:
- Prebuilt-contract model: $10 per 1,000 pages
- Average 30-page contract: $0.30
Monthly Infrastructure Costs
| Service | Small (100/mo) | Medium (500/mo) | Large (2000/mo) |
|---|---|---|---|
| Azure OpenAI | $50 | $200 | $800 |
| Document Intelligence | $30 | $150 | $600 |
| Container Apps | $50 | $100 | $200 |
| Cosmos DB | $25 | $50 | $100 |
| Blob Storage | $5 | $20 | $50 |
| Total Monthly | $160 | $520 | $1,750 |
| Cost Per Contract | $1.60 | $1.04 | $0.88 |
Cost Comparison: AI vs. Manual
At 500 contracts/month, assuming roughly $1,500 of attorney time per fully manual review and $150 of human spot-check time per AI-assisted contract:
- Manual review: 500 × $1,500 = $750,000/month
- AI-assisted: $520 infrastructure + 500 × $150 (human review) = $75,520/month
- Savings: $674,480/month (~90%)
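If you want to model this for your own volumes, the comparison is one line of arithmetic. The $1,500 and $150 figures below are the assumptions used in this article; plug in your own rates:

```python
# Monthly savings model using this article's assumed figures:
# $1,500 of attorney time per manual review, $150 of human
# review time per AI-assisted contract, plus tiered infra cost.
MANUAL_PER_CONTRACT = 1_500
HUMAN_REVIEW_PER_CONTRACT = 150

def monthly_savings(volume: int, infra_cost: int) -> int:
    manual = volume * MANUAL_PER_CONTRACT
    ai_assisted = infra_cost + volume * HUMAN_REVIEW_PER_CONTRACT
    return manual - ai_assisted

print(monthly_savings(500, 520))  # prints 674480
```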
Observability & Debugging
When a contract review fails or produces unexpected results, you need to trace exactly what happened. Multi-agent systems are notoriously hard to debug without proper observability.
What to Log
- Per-agent timing: How long did each agent take?
- Confidence scores: Flag low-confidence classifications for human review
- Token usage: Track consumption for cost allocation
- Classification distribution: Are certain clause types being over/under-detected?
Implementing Observability
Python:

```python
import time
import logging

from opentelemetry import trace
from opentelemetry.metrics import get_meter

tracer = trace.get_tracer(__name__)
meter = get_meter(__name__)

# Define metrics
clause_extraction_duration = meter.create_histogram(
    "contract_review.clause_extraction_seconds",
    description="Time to extract and classify clauses",
)
low_confidence_counter = meter.create_counter(
    "contract_review.low_confidence_classifications",
    description="Number of classifications below confidence threshold",
)


class ObservableClauseExtractor:
    """Wraps the Part 1 extractor agent with tracing, metrics, and logging."""

    def __init__(self, extractor, confidence_threshold: float = 0.7):
        self.extractor = extractor
        self.threshold = confidence_threshold
        self.logger = logging.getLogger(__name__)

    async def extract_with_telemetry(
        self,
        state: ContractState  # state model defined in Part 1
    ) -> ContractState:
        with tracer.start_as_current_span("clause_extraction") as span:
            span.set_attribute("document_id", state.document_id)
            span.set_attribute("section_count", len(state.sections))

            start_time = time.time()
            # Run extraction
            result = await self.extractor.extract(state)
            duration = time.time() - start_time

            clause_extraction_duration.record(
                duration,
                {"contract_type": state.contract_type},
            )

            # Track low confidence results
            for clause in result.extracted_clauses:
                if clause.confidence and clause.confidence < self.threshold:
                    low_confidence_counter.add(1, {
                        "clause_type": clause.clause_type.value
                    })
                    self.logger.warning(
                        "Low confidence classification",
                        extra={
                            "document_id": state.document_id,
                            "clause_type": clause.clause_type.value,
                            "confidence": clause.confidence,
                            "text_preview": clause.text[:200],
                        },
                    )

            span.set_attribute("clauses_extracted", len(result.extracted_clauses))
            span.set_attribute("duration_seconds", duration)
            return result
```
C#:

```csharp
using System.Diagnostics;
using System.Diagnostics.Metrics;
using Microsoft.Extensions.Logging;

public class ObservableClauseExtractor : IClauseExtractorAgent
{
    private static readonly ActivitySource ActivitySource =
        new("ContractReview.ClauseExtraction");
    private static readonly Meter Meter =
        new("ContractReview.Metrics");
    private static readonly Histogram<double> ExtractionDuration =
        Meter.CreateHistogram<double>(
            "contract_review.clause_extraction_seconds");
    private static readonly Counter<int> LowConfidenceCount =
        Meter.CreateCounter<int>(
            "contract_review.low_confidence_classifications");

    private readonly IClauseExtractorAgent _inner;
    private readonly ILogger<ObservableClauseExtractor> _logger;
    private readonly double _confidenceThreshold;

    public ObservableClauseExtractor(
        IClauseExtractorAgent inner,
        ILogger<ObservableClauseExtractor> logger,
        double confidenceThreshold = 0.7)
    {
        _inner = inner;
        _logger = logger;
        _confidenceThreshold = confidenceThreshold;
    }

    public async Task<ContractState> ExtractAsync(
        ContractState state,
        CancellationToken ct = default)
    {
        using var activity = ActivitySource.StartActivity("ClauseExtraction");
        activity?.SetTag("document_id", state.DocumentId);
        activity?.SetTag("section_count", state.Sections.Count);

        var stopwatch = Stopwatch.StartNew();
        var result = await _inner.ExtractAsync(state, ct);
        stopwatch.Stop();
        var duration = stopwatch.Elapsed.TotalSeconds;

        ExtractionDuration.Record(duration,
            new KeyValuePair<string, object?>(
                "contract_type", state.ContractType));

        // Track low confidence
        foreach (var clause in result.ExtractedClauses)
        {
            if (clause.Confidence < _confidenceThreshold)
            {
                LowConfidenceCount.Add(1,
                    new KeyValuePair<string, object?>(
                        "clause_type", clause.ClauseType.ToString()));

                _logger.LogWarning(
                    "Low confidence classification: {ClauseType} " +
                    "({Confidence:P0}) for document {DocumentId}",
                    clause.ClauseType,
                    clause.Confidence,
                    state.DocumentId);
            }
        }

        activity?.SetTag("clauses_extracted",
            result.ExtractedClauses.Count);
        activity?.SetTag("duration_seconds", duration);
        return result;
    }
}
```
Key Metrics to Dashboard
- P95 processing time: Alert if contracts take too long
- Low confidence rate: Spike indicates model degradation or new contract types
- Clause type distribution: Detect classification drift over time
- Token usage trend: Catch runaway costs early
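The first two checks above are easy to prototype offline before wiring them into your monitoring stack. Here's a minimal sketch over a window of extraction records; the alert thresholds (120 s P95, 15% low-confidence rate) are illustrative assumptions, not recommendations tied to any particular SLO:

```python
# Offline sketch of two dashboard checks: P95 processing time and
# low-confidence classification rate. Thresholds are illustrative.
from statistics import quantiles

def p95(durations: list[float]) -> float:
    """95th percentile of per-contract processing times (seconds)."""
    return quantiles(durations, n=100)[94]

def low_confidence_rate(confidences: list[float],
                        threshold: float = 0.7) -> float:
    """Fraction of classifications below the confidence threshold."""
    return sum(c < threshold for c in confidences) / len(confidences)

def should_alert(durations: list[float], confidences: list[float],
                 p95_limit: float = 120.0, rate_limit: float = 0.15) -> bool:
    return (p95(durations) > p95_limit
            or low_confidence_rate(confidences) > rate_limit)
```

In production you'd compute these from the histogram and counter emitted by the observability wrapper rather than raw lists, but the alert logic is the same.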
Technology Choices: Python vs C#
Both implementations work. Here's how to choose:
Choose Python When:
- Rapid prototyping: LangGraph's graph API is more concise
- Data science team: They're already comfortable with Python
- LangChain ecosystem: You want pre-built integrations
- Jupyter experimentation: Interactive development matters
Choose C# When:
- Enterprise .NET environment: Existing infrastructure, skills, CI/CD
- SharePoint/M365 integration: Native SDK support
- Strict type safety: Compile-time guarantees matter
- Azure Functions: Better cold start performance than Python
| Factor | Python + LangGraph | C# + Semantic Kernel |
|---|---|---|
| Learning curve | Lower (more examples) | Moderate |
| Type safety | Runtime (Pydantic) | Compile-time |
| Performance | Good (async) | Better (compiled) |
| Enterprise adoption | Growing | Strong |
| Azure integration | Good | Excellent |
| Community/examples | More | Growing |
Azure Infrastructure
Here's the recommended Azure architecture for production:
| Service | Purpose | Why This Choice |
|---|---|---|
| Azure OpenAI | LLM inference | Enterprise SLA, data residency, PTU pricing available |
| Document Intelligence | PDF/DOCX parsing | Prebuilt contract model, structure extraction |
| Container Apps | Agent orchestration | Serverless scale, KEDA autoscaling |
| Cosmos DB | State storage, audit log | Global distribution, JSON-native |
| Blob Storage | Document storage | Cost-effective, lifecycle policies |
| Application Insights | Monitoring | Distributed tracing, custom metrics |
| Key Vault | Secrets management | API keys, connection strings |
Azure AI Foundry Agent Service
For fully managed orchestration, consider Azure AI Foundry Agent Service. It handles agent lifecycle, state management, and scaling automatically—though with less customization than self-hosted LangGraph/Semantic Kernel.
When NOT to Use AI Contract Review
AI contract review isn't always the right answer. Here's when to skip it:
Never Use AI Alone For:
- High-stakes M&A (>$10M): Too much at risk; always full human review
- Regulatory filings: SEC, FDA submissions require certified review
- Litigation documents: Court filings, discovery responses
- Privileged communications: Attorney-client privilege concerns
Use With Heavy Human Oversight:
- First contract with major client: Relationship risk too high
- International contracts: Jurisdiction complexity, translation issues
- Heavily negotiated deals: Non-standard terms need human judgment
- Contracts with unusual structure: AI expects standard formats
When Simpler Solutions Work Better
- Standard NDAs with no modifications: Template matching is faster and cheaper
- Internal policies: Checklist approach works fine
- Contracts under 5 pages: Manual review may be faster than upload/process/review cycle
- Low volume (<10/month): ROI doesn't justify development cost
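These rules are concrete enough to encode as an intake triage step, so contracts get routed before any tokens are spent. The sketch below is one way to express them; the contract-type strings, field names, and routing categories are illustrative assumptions you'd adapt to your own intake form:

```python
# Triage sketch encoding the routing rules above. Field names and
# categories are illustrative, not a prescribed schema.
from enum import Enum

class ReviewRoute(Enum):
    HUMAN_ONLY = "human_only"                 # never AI alone
    AI_HEAVY_OVERSIGHT = "ai_heavy_oversight" # AI draft, close human review
    AI_ASSISTED = "ai_assisted"               # standard pipeline
    TEMPLATE_CHECK = "template_check"         # skip AI entirely

def route_contract(contract_type: str, value_usd: float,
                   is_regulatory: bool = False, is_litigation: bool = False,
                   is_privileged: bool = False,
                   is_standard_template: bool = False,
                   heavily_negotiated: bool = False) -> ReviewRoute:
    # High-stakes M&A, regulatory filings, litigation, privilege: human only
    if ((contract_type == "m&a" and value_usd > 10_000_000)
            or is_regulatory or is_litigation or is_privileged):
        return ReviewRoute.HUMAN_ONLY
    # Unmodified standard templates: matching is faster and cheaper
    if is_standard_template:
        return ReviewRoute.TEMPLATE_CHECK
    # Non-standard terms need human judgment
    if heavily_negotiated:
        return ReviewRoute.AI_HEAVY_OVERSIGHT
    return ReviewRoute.AI_ASSISTED
```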
The Honest Truth
AI contract review is a productivity tool, not a replacement for legal judgment. It handles the 80% of work that's routine so humans can focus on the 20% that matters. If you're hoping to eliminate lawyers entirely, this isn't the solution.
Key Takeaways
- Real costs are manageable: roughly $0.05-$0.90 per contract in LLM costs (plus ~$0.30 for document parsing), with infrastructure running $160-$1,750/month depending on volume
- Observability is critical: Log confidence scores, track classification distribution, alert on anomalies
- Choose your stack wisely: Python for rapid development, C# for enterprise integration
- Know the limits: High-stakes deals, regulatory filings, and privileged documents need full human review
- Start small: Begin with one contract type (e.g., NDAs), prove value, then expand
The multi-agent architecture we built in Part 1 provides a solid foundation. With the production considerations covered here, you're ready to deploy a system that genuinely improves legal ops efficiency—while staying honest about its limitations.
Missed Part 1?
Read the full implementation: document parsing, clause extraction, risk analysis, and obligation tracking.
← Read Part 1

Want More Practical AI Tutorials?
I write about building production AI systems with Azure, Python, and C#. Subscribe for practical tutorials delivered twice a month.
Subscribe to Newsletter →