Welcome to Integrating LLMs Into Production. We're going to explore how to build safe, reliable, and cost-effective AI systems.
Moving LLMs to production creates real challenges. Safety means protecting against hallucinations, prompt injections, and data leakage. Reliability requires handling rate limits, timeouts, and variable latency across different providers. Cost becomes critical because token usage scales with traffic, not with your infrastructure. And Quality means maintaining consistent outputs at scale. These four pillars are interconnected and need to be addressed together.
list
Here's a high-level architecture to solve these problems. Requests from your application pass through a response cache first to avoid redundant calls. From there, a smart router sends requests through an LLM gateway that includes usage monitoring, distributing them across multiple providers: GPT-4, Claude, or a local model. There's also a fallback handler that automatically degrades to cheaper or faster models if the primary choice fails or times out. This layered approach gives you safety, redundancy, and cost control.
mermaid
Looking at this code, we start by validating inputs before they ever reach the LLM. First, we enforce token limits using tiktoken to count actual tokens, not just characters. Second, we detect prompt injection attempts by looking for suspicious patterns like 'ignore previous' or 'system:' instructions hidden in user input. Third, we scan for personally identifiable information and redact it automatically. This triple-layered approach stops attacks and data leakage at the gate.
code
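The three checks just described can be sketched in Python roughly as follows. This is a minimal sketch, not the code on the slide: `estimate_tokens` is a crude stand-in for a real tiktoken count, and the token budget, the regex patterns, and all function names are assumptions chosen for illustration.

```python
import re

MAX_TOKENS = 4000  # assumed budget; not a figure from the talk

# Illustrative patterns only; real injection detection needs broader coverage.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?previous", re.IGNORECASE),
    re.compile(r"^\s*system\s*:", re.IGNORECASE | re.MULTILINE),
]

# Two simple PII shapes: email addresses and US-style SSNs.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def estimate_tokens(text: str) -> int:
    """Rough stand-in for a tokenizer: about four characters per token."""
    return (len(text) + 3) // 4


def validate_input(text: str, max_tokens: int = MAX_TOKENS) -> str:
    """Return the redacted text, or raise ValueError if the input is rejected."""
    # 1. Enforce the token limit.
    if estimate_tokens(text) > max_tokens:
        raise ValueError("input exceeds token limit")
    # 2. Reject inputs that match a known injection pattern.
    if any(p.search(text) for p in INJECTION_PATTERNS):
        raise ValueError("possible prompt injection")
    # 3. Redact PII before the text reaches the LLM.
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

In a real system the estimate would be replaced by an actual tokenizer count for the target model, since character-based estimates drift for code and non-English text.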
For output validation, we have two main approaches. {{step}}First, Structured Outputs force responses into JSON schemas using function calling, validate them with Pydantic models, and retry parsing if it fails. {{step}}Second, Content Filtering checks every response for hallucinations, runs toxicity scoring, and grounds claims in actual data before returning to the user. Together these prevent bad outputs from reaching your customers.
cards
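The structured-output path (parse, validate, retry on failure) might look like the sketch below. To stay dependency-free it uses a dataclass with hand-written checks where the talk uses Pydantic models; the `TicketSummary` schema and the `call_llm` callable are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class TicketSummary:  # hypothetical schema, for illustration only
    category: str
    urgency: int


def parse_summary(raw: str) -> TicketSummary:
    """Parse and validate one response; raises ValueError on any schema problem."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    category, urgency = data.get("category"), data.get("urgency")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(category, str) or isinstance(urgency, bool) or not isinstance(urgency, int):
        raise ValueError("response does not match schema")
    return TicketSummary(category=category, urgency=urgency)


def call_with_parse_retry(call_llm: Callable[[], str], max_attempts: int = 3) -> TicketSummary:
    """Call the model again whenever its output fails validation."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return parse_summary(call_llm())
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no valid response after {max_attempts} attempts") from last_error
```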
The fallback pyramid is your safety net for production. If GPT-4 Turbo times out or hits rate limits, you automatically degrade to GPT-3.5 Turbo at lower cost. If that fails, fall back to Claude Haiku, which is faster. If Haiku times out, use a local model you control. And if all else fails, return a static fallback response. This pyramid keeps your system responding even when providers fail: it trades quality for availability instead of going down.
mermaid
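As a sketch, the pyramid reduces to trying an ordered list of providers and ending on a static response. The provider callables, tier names, and the wording of the static message are assumptions; each callable is expected to raise on timeout or rate limit.

```python
from typing import Callable, Sequence, Tuple

Provider = Tuple[str, Callable[[str], str]]  # (tier name, completion function)

STATIC_FALLBACK = "We're having trouble right now. Please try again shortly."


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try each tier in order and return (tier_name, response)."""
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception:
            # Timeout, rate limit, or provider error: drop to the next tier.
            continue
    # Every tier failed, so answer with the static response.
    return "static", STATIC_FALLBACK
```

Returning the tier name alongside the response makes it easy to count fallback triggers in monitoring.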
Notice this TypeScript circuit breaker implementation. It tracks failures and switches between three states: closed for normal operation, open when failures exceed a threshold to stop hammering a failing provider, and half-open to test recovery. Each call is wrapped in a Promise.race with a five-second timeout. On success, we reset the failure count. On failure, we increment it and throw the error, triggering the fallback. This pattern prevents cascade failures across your system.
code
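The same three-state machine can be sketched in Python; this is not the TypeScript from the slide. The threshold and cool-down values are assumptions, the clock is injectable so the behavior can be tested, and the per-call timeout is assumed to be enforced inside the callable that is passed in.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is refused because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold  # assumed value
        self.reset_after = reset_after              # assumed cool-down, in seconds
        self.clock = clock
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("provider circuit is open")
            self.state = "half-open"  # cool-down elapsed: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, opens the circuit.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise  # let the caller trigger its fallback
        # Success resets the failure count and closes the circuit.
        self.failures = 0
        self.state = "closed"
        return result
```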
We use three retry techniques. {{step}}Exponential Backoff doubles the wait between retries, one second, then two, then four, with a maximum of three retries. We also add jitter to prevent the thundering herd problem where all clients retry simultaneously. {{step}}Idempotency Keys let us safely retry requests by assigning unique request IDs so duplicate submissions are deduplicated by the provider. {{step}}Timeout Hierarchy ensures the overall request has a thirty-second budget, each LLM call gets ten seconds, and fallback options get just five seconds so we can still respond.
cards
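Exponential backoff with jitter can be sketched as below, assuming three retries after the initial attempt and delays that start at one second and double. The 25 percent jitter fraction and the function names are assumptions; the random source and the sleep function are injectable for testing.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(max_retries: int = 3, base: float = 1.0, jitter: float = 0.25,
                   rng: Callable[[], float] = random.random) -> List[float]:
    """Delays of base * 2**attempt, each stretched by up to `jitter` at random."""
    delays = []
    for attempt in range(max_retries):
        nominal = base * 2 ** attempt  # 1s, 2s, 4s with the defaults
        delays.append(nominal * (1 + jitter * rng()))
    return delays


def call_with_retries(fn: Callable[[], T], max_retries: int = 3,
                      sleep: Callable[[float], None] = time.sleep, **backoff_kwargs) -> T:
    """Run fn, waiting a jittered, doubling delay after each failure."""
    for delay in backoff_delays(max_retries, **backoff_kwargs):
        try:
            return fn()
        except Exception:
            sleep(delay)
    return fn()  # final attempt: let its exception propagate to the caller
```

Because each client draws its own jitter, clients that failed at the same moment do not all retry at the same moment.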
Cost optimization in production is critical because your token spend scales with usage, not infrastructure. The strategies we'll cover focus on reducing redundant API calls, choosing the right model for each task, and cutting unnecessary tokens from prompts.
image
We deploy two caching strategies. {{step}}Semantic Caching works by embedding incoming queries, finding similar cached queries using cosine similarity above zero-point-nine-five, and returning the cached response immediately. {{step}}Response Caching is traditional HTTP caching with Cache-Control headers, stored in Redis or Memcached. Combined, these strategies can push cache hit rates above ninety percent on highly repetitive workloads, though the rate you see depends on how often users ask similar questions. Either way, every hit is an LLM call you don't pay for.
cards
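A semantic cache lookup can be sketched as a nearest-neighbor search over stored embeddings. The `embed` function is supplied by the caller and is an assumption here, as is the linear scan; a production system would typically use a vector index instead.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Sequence[float]], threshold: float = 0.95):
        self.embed = embed          # hypothetical embedding function
        self.threshold = threshold  # similarity cut-off from the talk
        self.entries: List[Tuple[Sequence[float], str]] = []

    def get(self, query: str) -> Optional[str]:
        """Return the response cached for the most similar query, if close enough."""
        vec = self.embed(query)
        best_score, best_response = 0.0, None
        for cached_vec, response in self.entries:
            score = cosine_similarity(vec, cached_vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))
```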
The table shows model economics, priced per million tokens. GPT-4 Turbo costs ten dollars per million input tokens and thirty dollars per million output tokens, making it ideal for complex reasoning like financial analysis. GPT-3.5 Turbo is fifty cents for input and a dollar fifty for output, perfect for general tasks like customer support. Claude Haiku is the cheapest hosted option at twenty-five cents and a dollar twenty-five, ideal for speed-critical real-time chat. And local Llama three costs only compute, making it best for high-volume classification where latency is tolerable. Match the model to the task.
table
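Per-request cost follows directly from the table. The sketch below reads the narrated figures as dollars per million input and output tokens; the model keys and the function name are illustrative, and prices change often, so treat the numbers as a snapshot.

```python
# (input, output) price in US dollars per million tokens, as narrated.
PRICES_PER_MILLION = {
    "gpt-4-turbo": (10.00, 30.00),
    "gpt-3.5-turbo": (0.50, 1.50),
    "claude-haiku": (0.25, 1.25),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request, with input and output priced separately."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000
```

Output tokens cost several times more than input tokens on every hosted model listed, so capping response length matters as much as trimming the prompt.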
Before, our prompt used four hundred fifty tokens with verbose instructions. After, we restructured it to request only JSON output, eliminated unnecessary context, and dropped to just one hundred twenty tokens. That's a seventy-three percent reduction in prompt tokens, and a matching cut in input-token cost for each request. Because the saving applies to every request, it grows in direct proportion to traffic. Short, structured prompts are your best cost lever.
code
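The arithmetic behind the reduction can be checked with a small helper. The token counts come from the talk; the input price, daily request volume, and thirty-day month are inputs you supply, and the function name is hypothetical.

```python
from typing import Tuple


def prompt_savings(before_tokens: int, after_tokens: int,
                   price_per_million_input: float,
                   requests_per_day: int, days: int = 30) -> Tuple[float, float]:
    """Return (fractional prompt-token reduction, dollars saved over `days`)."""
    saved_tokens = before_tokens - after_tokens
    reduction = saved_tokens / before_tokens
    saved_per_request = saved_tokens * price_per_million_input / 1_000_000
    return reduction, saved_per_request * requests_per_day * days


# Example: the 450 -> 120 token prompt, at an assumed $10 per million input tokens.
reduction, monthly = prompt_savings(450, 120, 10.0, requests_per_day=5000)
```

This only covers input tokens for the prompt itself; output-token cost is unchanged unless the shorter prompt also produces shorter responses.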
These are realistic metrics from a production support system. Average cost per request is twelve cents, consuming twenty-four hundred tokens. With a sixty-five percent cache hit rate, the system saves one hundred eighty dollars daily just from caching. Setting budgets and alerts on daily spending prevents surprise bills when traffic spikes.
stats
You cannot manage what you cannot measure. Production LLM systems need comprehensive observability from input to output. Let's look at the key metrics and tooling.
image
We track four categories of metrics. {{step}}Performance means latency: measure P-50, P-95, and P-99 percentiles, track time to first token, and monitor end-to-end request duration. {{step}}Quality is output validation: track parse success rate, hallucination detection hits, and user satisfaction scores. {{step}}Cost is token usage: count input and output tokens separately, calculate cost per request, and watch the distribution across models. {{step}}Reliability is error tracking: monitor timeout rate, count fallback triggers, and alert when circuit breakers activate.
cards
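For the latency percentiles, a nearest-rank calculation over a window of samples is enough to illustrate the idea. This is a sketch with a hypothetical function name; metrics backends usually estimate percentiles from histograms instead of raw samples.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Multiply before dividing to avoid float error in p / 100.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = [120, 80, 300, 95, 2000]  # made-up sample window
p50, p95 = percentile(latencies_ms, 50), percentile(latencies_ms, 95)
```

The single slow request dominates P-95 here while barely moving P-50, which is why tail percentiles are tracked alongside the median.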
Your observability stack flows from the LLM gateway to three parallel systems. Logs go to Datadog for full request context. Metrics go to Prometheus for time-series analysis and cost tracking. Traces go to Jaeger to see the full request path across components. Prometheus feeds into PagerDuty for alerting when metrics breach thresholds. This gives you both historical analysis and real-time alerting.
mermaid
Here's a real traced request with a unique trace I-D. Request received, cache miss so we move to the LLM. Routing to GPT-4, three hundred forty tokens input. GPT-4 times out at five-point-two seconds, triggering the fallback. We reroute to GPT-3.5 which responds with one hundred twenty-five output tokens. Response validates successfully. Total request time six-point-eight seconds, cost eight cents. That trace tells the entire story and helps debug issues.
terminal
Production LLM systems must be tested rigorously before and after launch. Testing prevents regressions, catches safety issues, and validates cost assumptions.
image
The pyramid has three levels. {{step}}Unit Tests mock LLM calls and test input validation, output parsing, and error handling with eighty percent code coverage. {{step}}Integration Tests use real provider calls against golden datasets, run regression tests, and benchmark performance. {{step}}Production Tests include shadow mode deployments comparing old and new systems, A-B testing with real users, and feedback loops to catch edge cases.
cards
This code parameterizes tests across golden datasets. We have known-good inputs and expected outputs for positive, negative, and mixed feedback. The test runs each input through the LLM and verifies that the sentiment matches, the returned themes are a superset of the expected themes, and the number of suggested actions meets the minimum. Golden datasets catch regressions immediately and let new team members understand expected behavior.
code
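The three assertions described for each golden case can be sketched without a test framework, as below. The `Analysis` shape, the example cases, and the `analyze` callable are all hypothetical; with pytest, the same cases would typically feed a parametrized test.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class Analysis:  # hypothetical shape of the LLM's parsed output
    sentiment: str
    themes: Set[str] = field(default_factory=set)
    actions: List[str] = field(default_factory=list)


# Made-up golden cases: known-good inputs with expected properties.
GOLDEN_CASES = [
    ("The new dashboard is fantastic.",
     {"sentiment": "positive", "themes": {"dashboard"}, "min_actions": 0}),
    ("Checkout keeps crashing and support never replied.",
     {"sentiment": "negative", "themes": {"checkout", "support"}, "min_actions": 2}),
    ("Great features, but the app is slow.",
     {"sentiment": "mixed", "themes": {"performance"}, "min_actions": 1}),
]


def check_case(result: Analysis, expected: dict) -> List[str]:
    """Return a list of problems; an empty list means the case passes."""
    problems = []
    if result.sentiment != expected["sentiment"]:
        problems.append(f"sentiment {result.sentiment!r} != {expected['sentiment']!r}")
    missing = expected["themes"] - result.themes  # extra themes are allowed
    if missing:
        problems.append(f"missing themes: {sorted(missing)}")
    if len(result.actions) < expected["min_actions"]:
        problems.append(f"only {len(result.actions)} actions")
    return problems


def run_golden(analyze: Callable[[str], Analysis]) -> Dict[str, List[str]]:
    """Return {input_text: problems} for every failing golden case."""
    failures = {}
    for text, expected in GOLDEN_CASES:
        problems = check_case(analyze(text), expected)
        if problems:
            failures[text] = problems
    return failures
```

Checking for a superset of themes and a minimum action count, instead of exact equality, keeps the tests stable against harmless variation in model output.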
The table sets targets for production quality. Accuracy should exceed ninety-five percent, measured by human evaluation on samples of one hundred. Consistency should exceed ninety percent, verified by sending the same input multiple times. Latency P-95 should stay under three seconds from production monitoring. Cost per request should stay under fifteen cents from token tracking. Cache hit rate should exceed sixty percent from Redis analytics.
table
LLMs create unique security challenges because they process user data and generate outputs at scale. We need defense in depth across inputs and outputs.
image
Input security prevents attacks: detect prompt injections, sanitize inputs, rate limit per user, and moderate content before sending to the LLM. {{step}}Output security protects data: redact P-I-I from responses, filter for harmful content, watermark outputs for audit trails, and log all requests with user identifiers.
cards
Critical: never send user data to LLM providers without explicit consent. Anonymize personal data before sending to LLMs. Encrypt data in transit and at rest. Audit all LLM requests with user I-D-s for compliance. Provide an opt-out mechanism for users who don't want AI processing. Delete prompts and responses after thirty days. Ensure compliance with G-D-P-R, C-C-P-A, and H-I-P-A-A depending on your jurisdiction and industry. This checklist applies whether you use commercial APIs or self-hosted models.
callout list
This TypeScript rate limiter enforces two limits. Per user, allow twenty requests per minute to prevent abuse from a single account. Per IP address, allow one hundred requests per minute to catch abuse spread across many accounts behind one address. Each request consumes a token from both buckets. If either bucket is exhausted, reject with a rate limit error. This dual approach stops both single-user abuse and coordinated multi-account attacks from a single address while remaining fair to legitimate traffic.
code
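The dual token-bucket logic can be sketched in Python as follows; this is not the TypeScript from the slide. The limits come from the talk, while the in-memory dictionaries, class names, and injectable clock are assumptions. A multi-instance deployment would need shared storage such as Redis.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    def __init__(self, capacity: int, per_seconds: float, clock: Callable[[], float]):
        self.capacity = capacity
        self.refill_rate = capacity / per_seconds  # tokens added per second
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def available(self) -> bool:
        """Refill for the elapsed time, then report whether a token is left."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.refill_rate)
        self.updated = now
        return self.tokens >= 1

    def consume(self) -> None:
        self.tokens -= 1


class DualRateLimiter:
    def __init__(self, user_limit: int = 20, ip_limit: int = 100, window: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.user_limit, self.ip_limit = user_limit, ip_limit
        self.window, self.clock = window, clock
        self.user_buckets: Dict[str, TokenBucket] = {}
        self.ip_buckets: Dict[str, TokenBucket] = {}

    def _bucket(self, store: Dict[str, TokenBucket], key: str, limit: int) -> TokenBucket:
        if key not in store:
            store[key] = TokenBucket(limit, self.window, self.clock)
        return store[key]

    def allow(self, user_id: str, ip: str) -> bool:
        user = self._bucket(self.user_buckets, user_id, self.user_limit)
        addr = self._bucket(self.ip_buckets, ip, self.ip_limit)
        # Check both buckets first so a rejected request consumes from neither.
        if not (user.available() and addr.available()):
            return False
        user.consume()
        addr.consume()
        return True
```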
Getting LLM features to production safely requires careful staged rollouts. We use multiple deployment patterns to minimize blast radius.
image
Start in Development, where you run golden tests. Move to Staging after tests pass and run load tests. Then deploy to Canary at five percent traffic to catch issues at small scale. If metrics stay healthy, scale to blue-green at fifty percent traffic. If everything looks good, roll out to one hundred percent. At any stage, if errors appear, roll back immediately. This gradual approach catches issues before they affect all users.
mermaid
Feature flags let you control LLM availability per user without redeployment. This code checks LaunchDarkly for a flag like 'llm-summarization-enabled' for each user. If enabled, call the LLM path. If disabled, use traditional code. This lets you roll out to five percent of users, monitor metrics, then expand. It also lets you kill a feature instantly if problems arise without rebuilding.
code
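The flag check itself depends on the flag service's SDK, which is not reproduced here. As a stand-in, the sketch below shows the underlying idea of a deterministic percentage rollout using a hash of the flag key and user ID; it is not LaunchDarkly's API, and the function names and both summarizer callables are hypothetical.

```python
import hashlib
from typing import Callable


def in_rollout(flag_key: str, user_id: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets for this flag."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def summarize(text: str, user_id: str,
              llm_summarize: Callable[[str], str],
              legacy_summarize: Callable[[str], str],
              percent: int = 5) -> str:
    """Route to the LLM path only for users inside the rollout percentage."""
    if in_rollout("llm-summarization-enabled", user_id, percent):
        return llm_summarize(text)
    return legacy_summarize(text)
```

Because the bucket depends only on the flag key and user ID, a user stays on the same path across requests, and raising the percentage only ever adds users. Setting it to zero acts as the kill switch.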
Define success metrics before launch. P-95 latency under two seconds. Uptime exceeding ninety-nine-point-five percent. Cost per request under ten cents. Quality score above ninety-five percent. These numbers vary by use case, but having targets prevents scope creep and gives teams clear goals.
stats
Before LLM integration, customer support was manual. Agents spent fifteen minutes per ticket, handling fifty tickets daily, with high burnout and inconsistent quality. {{step}}After deploying LLM-assisted routing and auto-response, average handle time dropped to eight minutes, agents handled ninety tickets daily, satisfaction improved, and responses became consistent. {{step}}Result: eighty percent productivity gain and one hundred twenty thousand dollars annual savings.
cards
When a new ticket arrives, the LLM classifies it. We check the cache first. If we've seen this type of ticket before, return the cached category immediately. Cache miss goes to GPT-3.5 Turbo which returns structured JSON. We validate that JSON parses correctly, then store the result for future hits. Finally, route to the appropriate team: Billing, Technical, or General. This architecture keeps costs down through caching while maintaining accuracy.
mermaid
The auto-response flow first classifies urgency and category. High urgency tickets skip the LLM and go directly to a human. Lower urgency triggers a knowledge base search with a high similarity threshold of zero-point-eight-five. If a relevant article exists, the LLM generates a response grounded in that article. We validate the response before sending. This keeps agents available for complex issues while LLMs handle routine questions with factual accuracy.
code
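The decision flow just described can be sketched as a single function with its dependencies passed in. All four callables are hypothetical, and handing the ticket to a human when no article clears the threshold or when validation fails is an assumption; the talk only states the high-urgency case explicitly.

```python
from typing import Callable, Optional, Tuple

SIMILARITY_THRESHOLD = 0.85  # knowledge-base match cut-off from the talk


def auto_respond(
    ticket: str,
    classify: Callable[[str], Tuple[str, str]],                    # -> (urgency, category)
    search_kb: Callable[[str, str], Tuple[Optional[str], float]],  # -> (article, score)
    generate: Callable[[str, str], str],                           # -> draft response
    validate: Callable[[str, str], bool],                          # -> is draft acceptable
) -> Tuple[str, Optional[str]]:
    """Return ("auto", response) or ("human", None)."""
    urgency, category = classify(ticket)
    if urgency == "high":
        return "human", None  # high urgency skips the LLM entirely
    article, score = search_kb(ticket, category)
    if article is None or score < SIMILARITY_THRESHOLD:
        return "human", None  # assumed: no grounding article, so no auto-response
    draft = generate(ticket, article)
    if not validate(draft, article):
        return "human", None  # assumed: failed validation goes to an agent
    return "auto", draft
```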
Learning from others' mistakes accelerates your own success. Let's look at the most common production failures.
image
No Fallback Strategy is a common mistake: a single model failure crashes your system. Always implement backup models, circuit breakers, and cached fallbacks. {{step}}Ignoring Token Costs leads to bill shock. Set budget alerts, optimize prompts, and use smaller models when appropriate. {{step}}Skipping Validation creates security vulnerabilities. Validate all inputs, filter outputs, and monitor for anomalies continuously. {{step}}No Monitoring means you're flying blind. Track all key metrics, set up alerts on thresholds, and review logs weekly to catch trends.
cards
Safety First: implement input validation, output filtering, and P-I-I protection to prevent data leaks and attacks. {{step}}Build Reliability: use a fallback pyramid, circuit breakers, and smart retries so your system degrades gracefully instead of crashing. {{step}}Optimize Costs: deploy semantic caching, right-size models for each task, and compress prompts to reduce token usage. {{step}}Measure Everything: track latency, quality, cost, and error metrics; test continuously before release; and deploy gradually with feature flags.
cards
This week, implement input validation and basic caching. This month, add a fallback strategy and production monitoring. This quarter, complete your full production deployment with gradual rollout. Check github dot com slash llm-patterns for architecture templates, promptfoo dot dev for testing frameworks, and llm-cost-calculator dot com for pricing analysis.
list