"What does this actually cost to run?"
That's the question every AI demo avoids. In Part 1, we built a multi-agent contract review system with document parsing, clause extraction, risk analysis, and obligation tracking. It works beautifully in development.
Now let's talk about what happens when you deploy it to production: real costs, observability challenges, technology decisions, and—critically—when you should NOT use this approach at all.
What You'll Learn
- Real token costs per contract (with examples)
- Monthly infrastructure costs at different scales
- Observability patterns for debugging agent pipelines
- Python vs C# decision framework
- When AI contract review is NOT the right solution
Reading time: 11 minutes
Real Cost Analysis
Let's break down the actual costs you'll see in production.
Token Costs Per Contract
Each contract goes through multiple LLM calls: clause classification, risk analysis, obligation extraction, and report generation. Here's what that looks like with GPT-4o pricing ($2.50/1M input, $10/1M output):
| Contract Type | Pages | Input Tokens | Output Tokens | LLM Cost |
|---|---|---|---|---|
| Simple NDA | 5 | ~8,000 | ~2,000 | $0.04 |
| SaaS Agreement | 20 | ~35,000 | ~8,000 | $0.17 |
| Enterprise MSA | 50 | ~90,000 | ~20,000 | $0.43 |
| Complex M&A | 100+ | ~200,000 | ~40,000 | $0.90 |
These numbers are for GPT-4o. If you use GPT-4o-mini for classification (keeping GPT-4o for risk analysis), you can reduce costs by 40-50%.
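To sanity-check these numbers for your own contracts, the per-contract math is simple enough to script. Here's a minimal estimator using the GPT-4o list prices quoted above; the 0.55 multiplier for the mixed-model setup is an assumption taken from the 40-50% savings estimate, not a measured figure:

```python
# Per-contract LLM cost estimator. Prices are the GPT-4o rates
# quoted in this article ($2.50/1M input, $10/1M output).
GPT4O_INPUT_PER_M = 2.50
GPT4O_OUTPUT_PER_M = 10.00

def llm_cost(input_tokens: int, output_tokens: int,
             mini_for_classification: bool = False) -> float:
    """Estimate the LLM cost in USD for one contract."""
    cost = (input_tokens / 1_000_000 * GPT4O_INPUT_PER_M
            + output_tokens / 1_000_000 * GPT4O_OUTPUT_PER_M)
    if mini_for_classification:
        # Midpoint of the article's 40-50% savings estimate (assumption)
        cost *= 0.55
    return cost

# Enterprise MSA from the table: ~90k input, ~20k output tokens
print(f"${llm_cost(90_000, 20_000):.2f}")
```

Swap in your own token counts from the API usage response to track real per-contract spend.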
Azure Document Intelligence Costs
Document parsing adds another layer:
- Prebuilt-contract model: $10 per 1,000 pages
- Average 30-page contract: $0.30
Monthly Infrastructure Costs
| Service | Small (100/mo) | Medium (500/mo) | Large (2000/mo) |
|---|---|---|---|
| Azure OpenAI | $50 | $200 | $800 |
| Document Intelligence | $30 | $150 | $600 |
| Container Apps | $50 | $100 | $200 |
| Cosmos DB | $25 | $50 | $100 |
| Blob Storage | $5 | $20 | $50 |
| Total Monthly | $160 | $520 | $1,750 |
| Cost Per Contract | $1.60 | $1.04 | $0.88 |
Cost Comparison: AI vs. Manual
At 500 contracts/month, assuming roughly $1,500 of attorney time per fully manual review and $150 of human spot-check time per AI-assisted contract:
- Manual review: 500 × $1,500 = $750,000/month
- AI-assisted: $520 infrastructure + 500 × $150 (human review) = $75,520/month
- Savings: $674,480/month (~90%)
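If you want to model this for your own volumes, the comparison is one line of arithmetic. The $1,500 and $150 figures below are the assumptions used in this article; plug in your own rates:

```python
# Monthly savings model using this article's assumed figures:
# $1,500 of attorney time per manual review, $150 of human
# review time per AI-assisted contract, plus tiered infra cost.
MANUAL_PER_CONTRACT = 1_500
HUMAN_REVIEW_PER_CONTRACT = 150

def monthly_savings(volume: int, infra_cost: int) -> int:
    manual = volume * MANUAL_PER_CONTRACT
    ai_assisted = infra_cost + volume * HUMAN_REVIEW_PER_CONTRACT
    return manual - ai_assisted

print(monthly_savings(500, 520))  # prints 674480
```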
Observability & Debugging
When a contract review fails or produces unexpected results, you need to trace exactly what happened. Multi-agent systems are notoriously hard to debug without proper observability.
What to Log
- Per-agent timing: How long did each agent take?
- Confidence scores: Flag low-confidence classifications for human review
- Token usage: Track consumption for cost allocation
- Classification distribution: Are certain clause types being over/under-detected?
Implementing Observability
Python:

```python
import time
import logging

from opentelemetry import trace
from opentelemetry.metrics import get_meter

tracer = trace.get_tracer(__name__)
meter = get_meter(__name__)

# Define metrics
clause_extraction_duration = meter.create_histogram(
    "contract_review.clause_extraction_seconds",
    description="Time to extract and classify clauses",
)
low_confidence_counter = meter.create_counter(
    "contract_review.low_confidence_classifications",
    description="Number of classifications below confidence threshold",
)


class ObservableClauseExtractor:
    """Wraps the Part 1 extractor agent with tracing, metrics, and logging."""

    def __init__(self, extractor, confidence_threshold: float = 0.7):
        self.extractor = extractor
        self.threshold = confidence_threshold
        self.logger = logging.getLogger(__name__)

    async def extract_with_telemetry(
        self,
        state: ContractState  # state model defined in Part 1
    ) -> ContractState:
        with tracer.start_as_current_span("clause_extraction") as span:
            span.set_attribute("document_id", state.document_id)
            span.set_attribute("section_count", len(state.sections))

            start_time = time.time()
            # Run extraction
            result = await self.extractor.extract(state)
            duration = time.time() - start_time

            clause_extraction_duration.record(
                duration,
                {"contract_type": state.contract_type},
            )

            # Track low confidence results
            for clause in result.extracted_clauses:
                if clause.confidence and clause.confidence < self.threshold:
                    low_confidence_counter.add(1, {
                        "clause_type": clause.clause_type.value
                    })
                    self.logger.warning(
                        "Low confidence classification",
                        extra={
                            "document_id": state.document_id,
                            "clause_type": clause.clause_type.value,
                            "confidence": clause.confidence,
                            "text_preview": clause.text[:200],
                        },
                    )

            span.set_attribute("clauses_extracted", len(result.extracted_clauses))
            span.set_attribute("duration_seconds", duration)
            return result
```
C#:

```csharp
using System.Diagnostics;
using System.Diagnostics.Metrics;
using Microsoft.Extensions.Logging;

public class ObservableClauseExtractor : IClauseExtractorAgent
{
    private static readonly ActivitySource ActivitySource =
        new("ContractReview.ClauseExtraction");
    private static readonly Meter Meter =
        new("ContractReview.Metrics");
    private static readonly Histogram<double> ExtractionDuration =
        Meter.CreateHistogram<double>(
            "contract_review.clause_extraction_seconds");
    private static readonly Counter<int> LowConfidenceCount =
        Meter.CreateCounter<int>(
            "contract_review.low_confidence_classifications");

    private readonly IClauseExtractorAgent _inner;
    private readonly ILogger<ObservableClauseExtractor> _logger;
    private readonly double _confidenceThreshold;

    public ObservableClauseExtractor(
        IClauseExtractorAgent inner,
        ILogger<ObservableClauseExtractor> logger,
        double confidenceThreshold = 0.7)
    {
        _inner = inner;
        _logger = logger;
        _confidenceThreshold = confidenceThreshold;
    }

    public async Task<ContractState> ExtractAsync(
        ContractState state,
        CancellationToken ct = default)
    {
        using var activity = ActivitySource.StartActivity("ClauseExtraction");
        activity?.SetTag("document_id", state.DocumentId);
        activity?.SetTag("section_count", state.Sections.Count);

        var stopwatch = Stopwatch.StartNew();
        var result = await _inner.ExtractAsync(state, ct);
        stopwatch.Stop();
        var duration = stopwatch.Elapsed.TotalSeconds;

        ExtractionDuration.Record(duration,
            new KeyValuePair<string, object?>(
                "contract_type", state.ContractType));

        // Track low confidence
        foreach (var clause in result.ExtractedClauses)
        {
            if (clause.Confidence < _confidenceThreshold)
            {
                LowConfidenceCount.Add(1,
                    new KeyValuePair<string, object?>(
                        "clause_type", clause.ClauseType.ToString()));

                _logger.LogWarning(
                    "Low confidence classification: {ClauseType} " +
                    "({Confidence:P0}) for document {DocumentId}",
                    clause.ClauseType,
                    clause.Confidence,
                    state.DocumentId);
            }
        }

        activity?.SetTag("clauses_extracted",
            result.ExtractedClauses.Count);
        activity?.SetTag("duration_seconds", duration);
        return result;
    }
}
```
Key Metrics to Dashboard
- P95 processing time: Alert if contracts take too long
- Low confidence rate: Spike indicates model degradation or new contract types
- Clause type distribution: Detect classification drift over time
- Token usage trend: Catch runaway costs early
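The first two checks above are easy to prototype offline before wiring them into your monitoring stack. Here's a minimal sketch over a window of extraction records; the alert thresholds (120 s P95, 15% low-confidence rate) are illustrative assumptions, not recommendations tied to any particular SLO:

```python
# Offline sketch of two dashboard checks: P95 processing time and
# low-confidence classification rate. Thresholds are illustrative.
from statistics import quantiles

def p95(durations: list[float]) -> float:
    """95th percentile of per-contract processing times (seconds)."""
    return quantiles(durations, n=100)[94]

def low_confidence_rate(confidences: list[float],
                        threshold: float = 0.7) -> float:
    """Fraction of classifications below the confidence threshold."""
    return sum(c < threshold for c in confidences) / len(confidences)

def should_alert(durations: list[float], confidences: list[float],
                 p95_limit: float = 120.0, rate_limit: float = 0.15) -> bool:
    return (p95(durations) > p95_limit
            or low_confidence_rate(confidences) > rate_limit)
```

In production you'd compute these from the histogram and counter emitted by the observability wrapper rather than raw lists, but the alert logic is the same.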
Technology Choices: Python vs C#
Both implementations work. Here's how to choose:
Choose Python When:
- Rapid prototyping: LangGraph's graph API is more concise
- Data science team: They're already comfortable with Python
- LangChain ecosystem: You want pre-built integrations
- Jupyter experimentation: Interactive development matters
Choose C# When:
- Enterprise .NET environment: Existing infrastructure, skills, CI/CD
- SharePoint/M365 integration: Native SDK support
- Strict type safety: Compile-time guarantees matter
- Azure Functions: Better cold start performance than Python
| Factor | Python + LangGraph | C# + Semantic Kernel |
|---|---|---|
| Learning curve | Lower (more examples) | Moderate |
| Type safety | Runtime (Pydantic) | Compile-time |
| Performance | Good (async) | Better (compiled) |
| Enterprise adoption | Growing | Strong |
| Azure integration | Good | Excellent |
| Community/examples | More | Growing |
Azure Infrastructure
Here's the recommended Azure architecture for production:
| Service | Purpose | Why This Choice |
|---|---|---|
| Azure OpenAI | LLM inference | Enterprise SLA, data residency, PTU pricing available |
| Document Intelligence | PDF/DOCX parsing | Prebuilt contract model, structure extraction |
| Container Apps | Agent orchestration | Serverless scale, KEDA autoscaling |
| Cosmos DB | State storage, audit log | Global distribution, JSON-native |
| Blob Storage | Document storage | Cost-effective, lifecycle policies |
| Application Insights | Monitoring | Distributed tracing, custom metrics |
| Key Vault | Secrets management | API keys, connection strings |
Azure AI Foundry Agent Service
For fully managed orchestration, consider Azure AI Foundry Agent Service. It handles agent lifecycle, state management, and scaling automatically—though with less customization than self-hosted LangGraph/Semantic Kernel.
When NOT to Use AI Contract Review
AI contract review isn't always the right answer. Here's when to skip it:
Never Use AI Alone For:
- High-stakes M&A (>$10M): Too much at risk; always full human review
- Regulatory filings: SEC, FDA submissions require certified review
- Litigation documents: Court filings, discovery responses
- Privileged communications: Attorney-client privilege concerns
Use With Heavy Human Oversight:
- First contract with major client: Relationship risk too high
- International contracts: Jurisdiction complexity, translation issues
- Heavily negotiated deals: Non-standard terms need human judgment
- Contracts with unusual structure: AI expects standard formats
When Simpler Solutions Work Better
- Standard NDAs with no modifications: Template matching is faster and cheaper
- Internal policies: Checklist approach works fine
- Contracts under 5 pages: Manual review may be faster than upload/process/review cycle
- Low volume (<10/month): ROI doesn't justify development cost
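These rules are concrete enough to encode as an intake triage step, so contracts get routed before any tokens are spent. The sketch below is one way to express them; the contract-type strings, field names, and routing categories are illustrative assumptions you'd adapt to your own intake form:

```python
# Triage sketch encoding the routing rules above. Field names and
# categories are illustrative, not a prescribed schema.
from enum import Enum

class ReviewRoute(Enum):
    HUMAN_ONLY = "human_only"                 # never AI alone
    AI_HEAVY_OVERSIGHT = "ai_heavy_oversight" # AI draft, close human review
    AI_ASSISTED = "ai_assisted"               # standard pipeline
    TEMPLATE_CHECK = "template_check"         # skip AI entirely

def route_contract(contract_type: str, value_usd: float,
                   is_regulatory: bool = False, is_litigation: bool = False,
                   is_privileged: bool = False,
                   is_standard_template: bool = False,
                   heavily_negotiated: bool = False) -> ReviewRoute:
    # High-stakes M&A, regulatory filings, litigation, privilege: human only
    if ((contract_type == "m&a" and value_usd > 10_000_000)
            or is_regulatory or is_litigation or is_privileged):
        return ReviewRoute.HUMAN_ONLY
    # Unmodified standard templates: matching is faster and cheaper
    if is_standard_template:
        return ReviewRoute.TEMPLATE_CHECK
    # Non-standard terms need human judgment
    if heavily_negotiated:
        return ReviewRoute.AI_HEAVY_OVERSIGHT
    return ReviewRoute.AI_ASSISTED
```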
The Honest Truth
AI contract review is a productivity tool, not a replacement for legal judgment. It handles the 80% of work that's routine so humans can focus on the 20% that matters. If you're hoping to eliminate lawyers entirely, this isn't the solution.
Key Takeaways
- Real costs are manageable: roughly $0.05-$0.90 per contract in LLM costs (plus ~$0.30 for document parsing), with infrastructure running $160-$1,750/month depending on volume
- Observability is critical: Log confidence scores, track classification distribution, alert on anomalies
- Choose your stack wisely: Python for rapid development, C# for enterprise integration
- Know the limits: High-stakes deals, regulatory filings, and privileged documents need full human review
- Start small: Begin with one contract type (e.g., NDAs), prove value, then expand
The multi-agent architecture we built in Part 1 provides a solid foundation. With the production considerations covered here, you're ready to deploy a system that genuinely improves legal ops efficiency—while staying honest about its limitations.
Missed Part 1?
Read the full implementation: document parsing, clause extraction, risk analysis, and obligation tracking.
← Read Part 1

Want More Practical AI Tutorials?
I write about building production AI systems with Azure, Python, and C#. Subscribe for practical tutorials delivered twice a month.
Subscribe to Newsletter →