AI Engineering

Prompt Engineering Best Practices for Production LLMs

TL;DR

Production prompt engineering requires structured formats, few-shot examples, explicit constraints, and continuous evaluation. Move beyond ad-hoc prompts to systematic prompt pipelines with version control and A/B testing.

January 25, 2026 · 7 min read
Prompt Engineering · LLM · GPT-4 · Claude · Production AI · NLP

Prompt engineering has evolved from a curiosity into a critical engineering discipline. In production systems, the gap between a good prompt and a bad one can be the gap between 95% and 60% accuracy. This guide shares practical patterns that work.

The Production Mindset

Most prompt engineering tutorials focus on interactive chat scenarios. Production is different. You need:

  • Consistency: The same input should produce similar outputs
  • Measurability: You must be able to evaluate quality at scale
  • Maintainability: Prompts evolve and need version control

Key Insight

Treat prompts as code. They need version control, testing, documentation, and review processes just like any other production artifact.

Structured Output Formats

Production systems need parseable outputs. Always specify the exact format you expect.

JSON Mode

SYSTEM_PROMPT = """You are a medical data extraction assistant.
 
Extract the following information from the clinical note and return ONLY valid JSON.
 
Required fields:
- patient_age: integer or null
- chief_complaint: string
- medications: array of strings
- allergies: array of strings
 
Example output:
{
  "patient_age": 45,
  "chief_complaint": "chest pain",
  "medications": ["aspirin", "lisinopril"],
  "allergies": ["penicillin"]
}
"""

Validation Layer

Never trust LLM outputs blindly:

from pydantic import BaseModel, field_validator
from typing import Optional

class ClinicalExtraction(BaseModel):
    patient_age: Optional[int]
    chief_complaint: str
    medications: list[str]
    allergies: list[str]

    # Pydantic v2 style; v1's @validator is deprecated alongside model_validate_json
    @field_validator('patient_age')
    @classmethod
    def age_must_be_reasonable(cls, v):
        if v is not None and (v < 0 or v > 150):
            raise ValueError('Age must be between 0 and 150')
        return v

def extract_clinical_data(note: str) -> ClinicalExtraction:
    response = llm.invoke(SYSTEM_PROMPT, note)
    return ClinicalExtraction.model_validate_json(response)
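Validation failures are common enough that production code usually pairs this with a retry loop that feeds the error back to the model. A minimal stdlib-only sketch, assuming a generic `call_llm` callable; the function name, the error-feedback wording, and the single `medications` check are illustrative:

```python
import json

def extract_with_retry(call_llm, note: str, max_attempts: int = 3) -> dict:
    """Re-prompt the model with the validation error until the output parses."""
    prompt = note
    for attempt in range(max_attempts):
        raw = call_llm(prompt)
        try:
            data = json.loads(raw)
            if not isinstance(data.get("medications"), list):
                raise ValueError("medications must be a list")
            return data
        except (json.JSONDecodeError, ValueError) as err:
            # Feed the error back so the model can correct itself
            prompt = f"{note}\n\nYour previous output was invalid ({err}). Return ONLY valid JSON."
    raise RuntimeError(f"No valid output after {max_attempts} attempts")
```

In practice you would run the parsed dict through the pydantic model as well; the loop structure is the point here.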

Few-Shot Learning Patterns

Research by Brown et al. (2020) demonstrated that few-shot prompting dramatically improves task performance. Here's how to implement it effectively:

Dynamic Example Selection

from sentence_transformers import SentenceTransformer
import numpy as np
 
class FewShotSelector:
    def __init__(self, examples: list[dict]):
        self.examples = examples
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.embeddings = self.encoder.encode([e['input'] for e in examples])
 
    def select_examples(self, query: str, k: int = 3) -> list[dict]:
        """Select the k most similar examples to the query."""
        query_embedding = self.encoder.encode(query)
        similarities = np.dot(self.embeddings, query_embedding)
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [self.examples[i] for i in top_indices]

Common Mistake

Static few-shot examples can mislead the model when the query is dissimilar. Always use semantic similarity to select relevant examples dynamically.
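Once `select_examples` returns the nearest neighbors, they still need to be assembled into the final prompt. A minimal sketch; the `Input:`/`Output:` labels are one common convention, not a requirement:

```python
def build_few_shot_prompt(instruction: str, examples: list[dict], query: str) -> str:
    """Assemble an instruction, the selected examples, and the query into one prompt."""
    parts = [instruction, ""]
    for ex in examples:
        parts += [f"Input: {ex['input']}", f"Output: {ex['output']}", ""]
    # End with a bare "Output:" so the model completes the answer
    parts += [f"Input: {query}", "Output:"]
    return "\n".join(parts)
```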

Chain-of-Thought Prompting

Wei et al. (2022) showed that asking models to show their reasoning significantly improves accuracy on complex tasks:

REASONING_PROMPT = """
Analyze the following customer support ticket and determine the appropriate action.
 
Think through this step by step:
1. First, identify the customer's core issue
2. Then, assess the urgency level (low, medium, high, critical)
3. Next, determine the appropriate department
4. Finally, suggest the initial response
 
Ticket: {ticket_text}
 
Let me work through this step by step:
"""

Structured Chain-of-Thought

For production, capture the reasoning in a structured format:

from typing import Literal

from pydantic import BaseModel

class TicketAnalysis(BaseModel):
    reasoning_steps: list[str]
    core_issue: str
    urgency: Literal["low", "medium", "high", "critical"]
    department: str
    suggested_response: str
 
COT_PROMPT = """
Analyze the ticket and return your analysis as JSON.
 
{
  "reasoning_steps": ["Step 1: ...", "Step 2: ...", ...],
  "core_issue": "...",
  "urgency": "low|medium|high|critical",
  "department": "...",
  "suggested_response": "..."
}
"""

Prompt Templates and Version Control

Maintain prompts in a structured, version-controlled format:

# prompts/clinical_extraction_v2.yaml
name: clinical_extraction
version: "2.1.0"
description: "Extract structured data from clinical notes"
model: gpt-4-turbo
temperature: 0
max_tokens: 1000
 
system_prompt: |
  You are a medical data extraction assistant specializing in...
 
user_template: |
  Extract information from the following clinical note:
 
  {note_content}
 
  Return only valid JSON matching the schema.
 
schema:
  type: object
  required: [patient_age, chief_complaint]
  properties:
    patient_age:
      type: integer
      minimum: 0
      maximum: 150
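Loading such a file and rendering the template at request time is straightforward. A sketch assuming the YAML has already been parsed into a dict (e.g. with PyYAML); the `PromptSpec` wrapper is illustrative, not a specific library:

```python
from dataclasses import dataclass

@dataclass
class PromptSpec:
    name: str
    version: str
    model: str
    temperature: float
    system_prompt: str
    user_template: str

    @classmethod
    def from_dict(cls, d: dict) -> "PromptSpec":
        # Pick only the fields the dataclass declares, ignoring extras like `schema`
        return cls(**{f: d[f] for f in cls.__dataclass_fields__})

    def render(self, **values) -> str:
        """Fill the {placeholders} in user_template."""
        return self.user_template.format(**values)
```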

Evaluation and Optimization

Building Test Suites

class PromptTestSuite:
    def __init__(self, prompt_template: str):
        self.prompt = prompt_template
        self.test_cases = []
 
    def add_test(self, input_data: dict, expected: dict, tags: list[str] | None = None):
        # Avoid a mutable default argument shared across calls
        self.test_cases.append({
            "input": input_data,
            "expected": expected,
            "tags": tags or []
        })
 
    def run_evaluation(self) -> dict:
        results = []
        for case in self.test_cases:
            response = self.invoke(case["input"])
            score = self.evaluate(response, case["expected"])
            results.append({"case": case, "response": response, "score": score})
 
        return {
            "total": len(results),
            "passed": sum(1 for r in results if r["score"] >= 0.9),
            "average_score": sum(r["score"] for r in results) / len(results),
            "by_tag": self._aggregate_by_tag(results)
        }

A/B Testing Prompts

import hashlib

class PromptABTest:
    def __init__(self, prompt_a: str, prompt_b: str, split: float = 0.5):
        self.prompts = {"A": prompt_a, "B": prompt_b}
        self.split = split
        self.results = {"A": [], "B": []}

    def get_prompt(self, request_id: str) -> tuple[str, str]:
        """Deterministic assignment based on request ID."""
        # Built-in hash() is salted per process, so it is not stable across
        # runs; use a cryptographic digest for a reproducible bucket
        bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
        variant = "A" if bucket < self.split * 100 else "B"
        return variant, self.prompts[variant]
 
    def record_result(self, variant: str, success: bool, latency: float):
        self.results[variant].append({"success": success, "latency": latency})

Production Guardrails

Input Validation

import re

def validate_input(text: str, max_tokens: int = 4000) -> str:
    """Sanitize and validate input before sending to the LLM."""
    # Strip special-token delimiters sometimes abused for prompt injection
    text = re.sub(r'<\|.*?\|>', '', text)

    # Truncate if too long (tokenizer is the target model's tokenizer,
    # e.g. a tiktoken encoding)
    tokens = tokenizer.encode(text)
    if len(tokens) > max_tokens:
        text = tokenizer.decode(tokens[:max_tokens])

    return text

Output Guardrails

import re

class OutputGuardrails:
    def __init__(self):
        self.sensitive_patterns = [
            r'\b\d{3}-\d{2}-\d{4}\b',  # SSN
            r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',  # Email
        ]
 
    def check(self, output: str) -> tuple[bool, list[str]]:
        issues = []
        for pattern in self.sensitive_patterns:
            if re.search(pattern, output):
                issues.append(f"Sensitive data pattern detected: {pattern}")
        return len(issues) == 0, issues
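Detection alone only lets you block a response; in many pipelines it is more useful to redact the match and continue. A minimal sketch using the same kinds of patterns:

```python
import re

SENSITIVE_PATTERNS = [
    r'\b\d{3}-\d{2}-\d{4}\b',                               # SSN
    r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',  # Email
]

def redact(output: str, patterns: list[str] = SENSITIVE_PATTERNS) -> str:
    """Replace any sensitive match with a placeholder."""
    for pattern in patterns:
        output = re.sub(pattern, "[REDACTED]", output)
    return output
```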

Cost Optimization

Production LLM costs can escalate quickly. Implement these strategies:

Caching Layer

import hashlib
import json
 
class PromptCache:
    def __init__(self, redis_client):
        self.redis = redis_client
        self.ttl = 3600  # 1 hour
 
    def get_cache_key(self, prompt: str, params: dict) -> str:
        content = f"{prompt}:{json.dumps(params, sort_keys=True)}"
        return hashlib.sha256(content.encode()).hexdigest()
 
    async def get_or_compute(self, prompt: str, params: dict, compute_fn):
        key = self.get_cache_key(prompt, params)
        cached = await self.redis.get(key)
        if cached:
            return json.loads(cached)
 
        result = await compute_fn(prompt, params)
        await self.redis.setex(key, self.ttl, json.dumps(result))
        return result

Token Optimization

import re

def optimize_prompt(prompt: str, target_reduction: float = 0.2) -> str:
    """Reduce prompt token count while preserving meaning."""
    # Collapse redundant whitespace (note: this also flattens newlines,
    # so skip it for prompts whose formatting carries meaning)
    prompt = re.sub(r'\s+', ' ', prompt)
 
    # Use abbreviations for common terms
    replacements = {
        "for example": "e.g.",
        "that is": "i.e.",
        "and so on": "etc.",
    }
    for full, abbrev in replacements.items():
        prompt = prompt.replace(full, abbrev)
 
    return prompt

Monitoring and Observability

Track these metrics for production prompts:

from datetime import datetime, timezone

class PromptMetrics:
    def record(self, prompt_id: str, response: dict, metadata: dict):
        metrics = {
            "prompt_id": prompt_id,
            "timestamp": datetime.now(timezone.utc),
            "latency_ms": metadata["latency_ms"],
            "input_tokens": metadata["input_tokens"],
            "output_tokens": metadata["output_tokens"],
            "cost_usd": self.calculate_cost(metadata),
            "success": metadata.get("success", True),
            "model": metadata["model"]
        }
        self.emit(metrics)
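The `calculate_cost` call above has to map token counts to a price table. A sketch with illustrative per-1K-token rates; real vendor pricing changes often, so load it from configuration rather than hard-coding it:

```python
# Illustrative rates in USD per 1K tokens -- not current vendor pricing
PRICING = {
    "gpt-4-turbo": {"input": 0.01, "output": 0.03},
}

def calculate_cost(metadata: dict) -> float:
    """Price a request from its token counts; unknown models cost 0.0."""
    rates = PRICING.get(metadata["model"], {"input": 0.0, "output": 0.0})
    return (metadata["input_tokens"] / 1000 * rates["input"]
            + metadata["output_tokens"] / 1000 * rates["output"])
```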

Conclusion

Production prompt engineering requires treating prompts as first-class engineering artifacts. Key takeaways:

  1. Structure your outputs - Use JSON schemas and validation
  2. Version control prompts - Track changes like code
  3. Build evaluation suites - Measure quality systematically
  4. Implement guardrails - Validate inputs and outputs
  5. Optimize costs - Cache and minimize tokens
  6. Monitor everything - Track latency, cost, and quality

The field continues to evolve rapidly. Stay current with research and continuously evaluate new techniques against your production baselines.


References

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., ... Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877-1901. https://arxiv.org/abs/2005.14165

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824-24837. https://arxiv.org/abs/2201.11903

OpenAI. (2024). GPT-4 technical report. https://openai.com/research/gpt-4

Anthropic. (2024). Claude model card and prompt engineering guide. https://docs.anthropic.com/claude/docs/prompt-engineering


Want to discuss prompt engineering strategies? Get in touch or explore my AI engineering projects.


Osvaldo Restrepo

Senior Full Stack AI & Software Engineer. Building production AI systems that solve real problems.