📊 Evaluating & Testing Prompts
Build test suites for your prompts and measure quality systematically
Why Testing Matters
Prompts are code. They execute logic, produce outputs, and affect downstream systems. Yet most teams treat prompts as informal text — they eyeball a few outputs and call it done.
This approach fails at scale because:
- Regressions are invisible. A small prompt change can break edge cases you tested months ago.
- Quality is subjective. Without defined criteria, "good enough" varies by person and mood.
- Iteration is blind. You cannot improve what you do not measure.
- Production failures are expensive. A bad prompt in production can generate incorrect data, offend users, or cause downstream system failures.
Professional prompt engineering requires the same rigor as software testing: defined inputs, expected outputs, automated checks, and continuous monitoring.
Defining Success Criteria
Before testing, you need to know what "good" looks like. Define clear, measurable criteria for your prompt outputs.
Types of Criteria
| Criterion | Description | Example |
|---|---|---|
| Accuracy | Factual correctness | "The output must contain the correct price" |
| Format | Structure compliance | "Output must be valid JSON with specific keys" |
| Completeness | All required info present | "Response must cover all 5 product features" |
| Tone | Voice and style | "Professional tone, no slang, third person" |
| Safety | No harmful content | "No PII, no medical advice, no hallucinations" |
| Length | Within bounds | "Between 100 and 300 words" |
| Latency | Speed of response | "Response generated in under 3 seconds" |
Writing Good Criteria
Bad criteria:
"The output should be good and helpful"
Good criteria:
1. Output is valid JSON matching the schema { name: string, price: number, description: string }
2. Price is accurate to within $0.01 of the source data
3. Description is 1-3 sentences, no marketing superlatives
4. No fields are null or empty
5. Response time is under 2 seconds
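Criteria this concrete can be checked in code. Below is a minimal sketch of the five checks above; the superlative blocklist and the sentence-counting heuristic are illustrative, not part of the original spec:

```python
import json

SUPERLATIVES = ("amazing", "incredible", "best")  # illustrative blocklist

def meets_criteria(response: str, source_price: float, latency_s: float) -> bool:
    try:
        data = json.loads(response)                        # 1. valid JSON
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or set(data) != {"name", "price", "description"}:
        return False                                       # 1. exact schema keys
    if not all(data.values()):                             # 4. no null or empty fields
        return False
    if not isinstance(data["price"], (int, float)):        # 2. price is a number...
        return False
    if abs(data["price"] - source_price) > 0.01:           # 2. ...within $0.01 of source
        return False
    desc = data["description"]
    if not isinstance(desc, str):
        return False
    sentences = [s for s in desc.split(".") if s.strip()]
    if not 1 <= len(sentences) <= 3:                       # 3. 1-3 sentences (rough)
        return False
    if any(w in desc.lower() for w in SUPERLATIVES):       # 3. no marketing superlatives
        return False
    return latency_s < 2.0                                 # 5. under 2 seconds
```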
Building Assessment Sets
An assessment set is a collection of test cases, each with:
- Input — The user message or context to send to the prompt
- Expected output — What a correct response looks like
- Assessment criteria — How to judge the response
Assessment Set Structure
{
"assessment_set": "product-description-generator",
"version": "1.2",
"test_cases": [
{
"id": "tc-001",
"input": "Generate a description for: Blue Widget, $9.99, waterproof",
"expected_output": {
"format": "json",
"required_fields": ["name", "price", "description"],
"price_value": 9.99,
"must_contain": ["waterproof"],
"must_not_contain": ["amazing", "incredible", "best"]
},
"tags": ["basic", "single-product"]
},
{
"id": "tc-002",
"input": "Generate a description for: Red Gadget, $149.00, bluetooth, rechargeable",
"expected_output": {
"format": "json",
"required_fields": ["name", "price", "description"],
"price_value": 149.00,
"must_contain": ["bluetooth", "rechargeable"],
"must_not_contain": ["amazing", "incredible", "best"]
},
"tags": ["basic", "multi-feature"]
}
]
}
How Many Test Cases?
| Use Case | Minimum Cases | Recommended |
|---|---|---|
| Prototype / POC | 5-10 | 20 |
| Internal tool | 20-50 | 50-100 |
| Customer-facing | 50-100 | 200+ |
| Safety-critical | 100+ | 500+ |
Sourcing Test Cases
- Real user queries — Sample from production logs (see the sketch after this list)
- Edge cases — Unusual inputs, empty fields, long text, special characters
- Adversarial inputs — Prompt injections, off-topic requests
- Boundary conditions — Maximum/minimum values, exact thresholds
- Failure modes — What the model gets wrong most often
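Sampling from logs is easy to script. A minimal sketch, assuming JSONL logs with an "input" field per line; note that expected_output still has to be labeled by a human:

```python
import json
import random

def sample_cases_from_logs(log_path: str, n: int = 50) -> list[dict]:
    """Draft test cases from production logs (assumes JSONL with an "input" field)."""
    with open(log_path) as f:
        inputs = [json.loads(line)["input"] for line in f if line.strip()]
    sampled = random.sample(inputs, min(n, len(inputs)))
    return [
        # expected_output is left empty on purpose: a human still has to label it
        {"id": f"tc-log-{i:03d}", "input": text, "expected_output": {}, "tags": ["from-logs"]}
        for i, text in enumerate(sampled, start=1)
    ]
```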
Automated Assessment with Claude
One of the most powerful methods is using Claude itself as a judge. This is called LLM-as-judge or model-graded assessment.
Basic Assessment Prompt
You are an assessment judge. Grade the following AI response against the criteria.
INPUT: {input}
AI RESPONSE: {response}
CRITERIA:
1. Is the output valid JSON? (yes/no)
2. Does it contain all required fields (name, price, description)? (yes/no)
3. Is the price accurate? (yes/no)
4. Is the description 1-3 sentences? (yes/no)
5. Does it avoid marketing superlatives? (yes/no)
Return your assessment as JSON:
{
"scores": {
"valid_json": true/false,
"has_required_fields": true/false,
"accurate_price": true/false,
"correct_length": true/false,
"no_superlatives": true/false
},
"overall_pass": true/false,
"reasoning": "Brief explanation"
}
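Wiring this judge prompt to the API takes only a few lines. A minimal sketch, assuming the template above is saved at the hypothetical path prompts/judge.txt with {input} and {response} placeholders:

```python
import json

from anthropic import Anthropic

client = Anthropic()
JUDGE_PROMPT = open("prompts/judge.txt").read()  # the template shown above

def judge(user_input: str, model_response: str) -> dict:
    # str.replace avoids having to escape the literal JSON braces in the template,
    # which str.format would trip over
    filled = (JUDGE_PROMPT
              .replace("{input}", user_input)
              .replace("{response}", model_response))
    result = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        messages=[{"role": "user", "content": filled}],
    )
    # Assumes the judge returns bare JSON, as the template instructs
    return json.loads(result.content[0].text)
```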
Structured Assessment Categories
| Method | Best For | Accuracy |
|---|---|---|
| Exact match | Deterministic outputs (codes, IDs) | Very high |
| Contains/regex | Required keywords or patterns | High |
| LLM-as-judge | Subjective quality, tone, helpfulness | Medium-high |
| Human review | Complex, nuanced assessment | Highest |
| Composite | Combining multiple methods | High |
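The composite row deserves a sketch: run the cheap deterministic checks first and reserve the LLM judge for subjective criteria. This assumes an llm_judge callable (for example, a wrapper like the judge sketch above, adapted to take only the response) that returns the judge's JSON:

```python
import json
import re

def composite_grade(response: str, expected: dict, llm_judge) -> dict:
    """Deterministic checks first; call the LLM judge only if they pass."""
    checks = {"valid_json": False, "price_exact": False}
    try:
        data = json.loads(response)
        checks["valid_json"] = isinstance(data, dict)
        if checks["valid_json"]:
            checks["price_exact"] = data.get("price") == expected["price_value"]
    except json.JSONDecodeError:
        pass
    checks["has_keywords"] = all(
        re.search(re.escape(kw), response, re.IGNORECASE)
        for kw in expected.get("must_contain", [])
    )
    # The judge is the slowest, costliest check, so run it last and only if needed
    if all(checks.values()):
        checks["subjective_pass"] = bool(llm_judge(response).get("overall_pass"))
    else:
        checks["subjective_pass"] = False
    checks["pass"] = all(checks.values())
    return checks
```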
Scoring Methods
Binary Scoring
Simple pass/fail for each criterion.
def binary_score(response, criteria):
    # is_valid_json, has_required_fields, and check_length are your own
    # deterministic helpers (json.loads, key checks, len bounds, etc.)
    results = {}
    results["valid_json"] = is_valid_json(response)
    results["has_fields"] = has_required_fields(response, criteria["fields"])
    results["correct_length"] = check_length(response, criteria["min"], criteria["max"])
    results["pass"] = all(results.values())  # overall pass requires every check to pass
    return results
Likert Scale Scoring
Rate each criterion on a 1-5 scale for more nuance.
def likert_score(response, criteria):
    # Use Claude as a judge (call_claude is an assumed wrapper that sends
    # the prompt and returns the parsed JSON scores)
assessment_prompt = f"""
Rate the following response on each criterion from 1-5:
1 = Very poor, 2 = Poor, 3 = Acceptable, 4 = Good, 5 = Excellent
Response: {response}
Criteria:
- Accuracy: How factually correct is the response?
- Completeness: Does it cover all required information?
- Clarity: How easy is it to understand?
- Tone: Does it match the desired voice?
Return JSON: {{ "accuracy": N, "completeness": N, "clarity": N, "tone": N }}
"""
    return call_claude(assessment_prompt)
Weighted Scoring
Assign weights to different criteria based on importance.
WEIGHTS = {
"accuracy": 0.40,
"completeness": 0.25,
"format": 0.15,
"tone": 0.10,
"length": 0.10,
}
def weighted_score(scores):
    """Collapse per-criterion 1-5 ratings into one weighted score.

    Example: weighted_score({"accuracy": 5, "completeness": 4,
                             "format": 5, "tone": 4, "length": 5}) == 4.65
    """
    total = sum(scores[k] * WEIGHTS[k] for k in WEIGHTS)
    return round(total, 2)
A/B Testing Prompts
When you have two prompt variants, run them against the same test set and compare. If an LLM judges the pairs, consider randomizing which variant appears as A, since judges can favor whichever response they read first.
A/B Test Framework
import json

from anthropic import Anthropic
client = Anthropic()
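# One possible judge_prompt template (illustrative; adapt the task description).
# run_ab_test below parses "winner" and "reasoning" from the judge's JSON, so
# whatever template you pass in must request exactly those fields. Literal
# braces are doubled because the template goes through str.format.
AB_JUDGE_PROMPT = """You are comparing two AI responses to the same input.
INPUT: {input}
RESPONSE A: {response_a}
RESPONSE B: {response_b}
Decide which response better satisfies the task. Return JSON only:
{{ "winner": "A" | "B" | "tie", "reasoning": "one sentence" }}"""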
def run_ab_test(prompt_a, prompt_b, test_cases, judge_prompt):
results = {"a_wins": 0, "b_wins": 0, "ties": 0, "details": []}
for case in test_cases:
# Run both prompts
response_a = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
messages=[{"role": "user", "content": prompt_a + "\n" + case["input"]}],
).content[0].text
response_b = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
messages=[{"role": "user", "content": prompt_b + "\n" + case["input"]}],
).content[0].text
# Judge both responses
judgment = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
messages=[{
"role": "user",
"content": judge_prompt.format(
input=case["input"],
response_a=response_a,
response_b=response_b,
),
}],
).content[0].text
verdict = json.loads(judgment)
if verdict["winner"] == "A":
results["a_wins"] += 1
elif verdict["winner"] == "B":
results["b_wins"] += 1
else:
results["ties"] += 1
results["details"].append({
"case_id": case["id"],
"winner": verdict["winner"],
"reasoning": verdict["reasoning"],
})
    return results
Tracking Performance Over Time
Testing is not a one-time event. Track metrics across prompt versions to spot regressions and measure improvement.
Metrics to Track
| Metric | Formula | Target |
|---|---|---|
| Pass rate | Passed / Total | > 95% for production |
| Average score | Sum of scores / Count | > 4.0 on 5-point scale |
| Regression rate | Cases that used to pass but now fail | 0% |
| Latency p50/p95 | Median and 95th percentile response time | < 2s / < 5s |
| Cost per run | Total tokens x price per token | Within budget |
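The regression metric is worth automating. A minimal sketch, assuming the results JSON written by the prompt_test.py script shown later in this article (per-case "details" entries with "case_id" and "passed"):

```python
import json

def find_regressions(old_path: str, new_path: str) -> list[str]:
    """Return IDs of cases that passed in the old run but fail in the new run."""
    def passed_by_id(path: str) -> dict[str, bool]:
        with open(path) as f:
            return {d["case_id"]: d["passed"] for d in json.load(f)["details"]}
    old, new = passed_by_id(old_path), passed_by_id(new_path)
    return [cid for cid in old if old[cid] and not new.get(cid, False)]
```

Regression rate is then len(find_regressions(...)) divided by the total case count; anything above zero should block the release.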
Version Tracking Table
| Version | Date | Pass Rate | Avg Score | Regressions | Notes |
|---------|------------|-----------|-----------|-------------|--------------------------|
| v1.0 | 2025-01-15 | 72% | 3.4 | - | Initial prompt |
| v1.1 | 2025-01-22 | 81% | 3.8 | 2 | Added few-shot examples |
| v1.2 | 2025-02-01 | 89% | 4.1 | 0 | Restructured format |
| v1.3 | 2025-02-10 | 94% | 4.4 | 1 | Added edge case handling |
| v2.0 | 2025-03-01 | 97% | 4.7 | 0 | Complete rewrite |
The Testing Loop
Testing prompts is an iterative cycle:
┌──────────────┐
│ 1. Write │
│ Prompt │
└──────┬───────┘
│
┌──────▼───────┐
│ 2. Run │
│ Test Suite │
└──────┬───────┘
│
┌──────▼───────┐
│ 3. Analyze │
│ Results │
└──────┬───────┘
│
┌──────▼───────┐ ┌──────────────┐
│ 4. Pass? │──No──>│ 5. Refine │
│ (>95%) │ │ Prompt │──┐
└──────┬───────┘ └──────────────┘ │
│ │
Yes │
│ ┌──────────────┐ │
┌──────▼───────┐ │ 6. Add New │ │
│ Deploy │ │ Test Cases │───┘
└──────────────┘ └──────────────┘
Key Principles
- Never deploy without passing tests. Treat failing tests like failing unit tests.
- Add test cases for every bug. When you find a bad output, add it to the test suite.
- Version your prompts. Use git or a similar system to track prompt changes.
- Automate the loop. Run tests in CI/CD pipelines automatically (see the sketch after this list).
- Review regressions immediately. A regression means your change broke something.
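A minimal sketch of such a CI gate, assuming the run_tests function from the prompt_test.py script in the next section; the file name ci_gate.py and the 95% threshold are illustrative:

```python
# ci_gate.py -- fail the pipeline when prompt quality drops below threshold
import sys

from prompt_test import run_tests  # the script shown in the next section

results = run_tests("tests/products.json", "prompts/v2.txt", "results/latest.json")
if results["summary"]["pass_rate"] < 95:
    sys.exit(1)  # a nonzero exit fails the CI job and blocks the change
```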
Tools for Prompt Testing
| Tool | Type | Description |
|---|---|---|
| Anthropic Console | Cloud | Built-in testing tools for Claude prompts |
| promptfoo | Open-source | CLI tool for testing and grading prompts |
| Braintrust | Platform | Logging, grading, and prompt management |
| LangSmith | Platform | Tracing and grading for LLM apps |
| Weights & Biases | Platform | Experiment tracking for ML and LLM |
| Custom scripts | DIY | Python/TypeScript scripts using the Claude API |
Practical Example: Python Testing Script
Here is a complete testing script you can adapt for your own projects:
"""
prompt_test.py — Automated prompt testing framework
Usage: python prompt_test.py --test-set tests/product_descriptions.json --prompt prompts/v2.txt
"""
import argparse
import json
import time
from datetime import datetime
from pathlib import Path
from anthropic import Anthropic
client = Anthropic()
MODEL = "claude-sonnet-4-20250514"
def load_test_set(path: str) -> dict:
"""Load a test set from a JSON file."""
with open(path) as f:
return json.load(f)
def load_prompt(path: str) -> str:
"""Load a prompt template from a text file."""
return Path(path).read_text()
def run_prompt(prompt: str, user_input: str) -> tuple[str, float]:
"""Run a prompt and return the response with latency."""
start = time.time()
response = client.messages.create(
model=MODEL,
max_tokens=1024,
messages=[{"role": "user", "content": f"{prompt}\n\nInput: {user_input}"}],
)
latency = time.time() - start
return response.content[0].text, latency
def assess_response(
user_input: str,
response: str,
expected: dict,
criteria: list[str],
) -> dict:
"""Use Claude as a judge to assess a response."""
judge_prompt = f"""You are a strict judge. Assess the AI response against
the expected output and criteria.
USER INPUT: {user_input}
AI RESPONSE: {response}
EXPECTED OUTPUT SPEC: {json.dumps(expected)}
CRITERIA:
{chr(10).join(f"- {c}" for c in criteria)}
Return your assessment as JSON:
{{
"scores": {{
<criterion_name>: {{ "pass": true/false, "score": 1-5, "reason": "..." }}
}},
"overall_pass": true/false,
"overall_score": <1-5 float>,
"summary": "Brief summary"
}}
Be strict. Only pass criteria that are clearly met."""
result = client.messages.create(
model=MODEL,
max_tokens=1024,
messages=[{"role": "user", "content": judge_prompt}],
)
return json.loads(result.content[0].text)
def run_tests(test_set_path: str, prompt_path: str, output_path: str | None = None):
"""Run a full test suite and generate a report."""
test_set = load_test_set(test_set_path)
prompt = load_prompt(prompt_path)
results = {
"test_set": test_set.get("test_set", "unknown"),
"prompt_file": prompt_path,
"timestamp": datetime.now().isoformat(),
"model": MODEL,
"summary": {},
"details": [],
}
total_pass = 0
total_score = 0.0
total_latency = 0.0
for case in test_set["test_cases"]:
print(f" Running: {case['id']}...", end=" ")
response, latency = run_prompt(prompt, case["input"])
assessment = assess_response(
case["input"],
response,
case["expected_output"],
test_set.get("criteria", ["accuracy", "format", "completeness"]),
)
passed = assessment.get("overall_pass", False)
score = assessment.get("overall_score", 0)
if passed:
total_pass += 1
print("PASS")
else:
print("FAIL")
total_score += score
total_latency += latency
results["details"].append({
"case_id": case["id"],
"input": case["input"],
"response": response,
"assessment": assessment,
"latency_seconds": round(latency, 2),
"passed": passed,
})
count = len(test_set["test_cases"])
results["summary"] = {
"total_cases": count,
"passed": total_pass,
"failed": count - total_pass,
"pass_rate": round(total_pass / count * 100, 1) if count else 0,
"avg_score": round(total_score / count, 2) if count else 0,
"avg_latency": round(total_latency / count, 2) if count else 0,
}
# Print summary
print("\n" + "=" * 50)
print(f" TEST RESULTS: {results['test_set']}")
print("=" * 50)
print(f" Total: {count}")
print(f" Passed: {total_pass}")
print(f" Failed: {count - total_pass}")
print(f" Pass Rate: {results['summary']['pass_rate']}%")
print(f" Avg Score: {results['summary']['avg_score']}/5")
print(f" Avg Latency: {results['summary']['avg_latency']}s")
print("=" * 50)
# Save results
if output_path:
with open(output_path, "w") as f:
json.dump(results, f, indent=2)
print(f"\n Results saved to: {output_path}")
return results
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Test prompts")
parser.add_argument("--test-set", required=True, help="Path to test set JSON")
parser.add_argument("--prompt", required=True, help="Path to prompt template")
parser.add_argument("--output", default=None, help="Path to save results JSON")
args = parser.parse_args()
    run_tests(args.test_set, args.prompt, args.output)
Test Set Template
Save this as your starting point for any new test suite:
{
"test_set": "my-feature-name",
"version": "1.0",
"criteria": [
"Output is valid JSON",
"All required fields are present and non-empty",
"Factual accuracy matches expected values",
"Tone is professional and neutral",
"Length is within specified bounds"
],
"test_cases": [
{
"id": "tc-001",
"input": "Your test input here",
"expected_output": {
"format": "json",
"required_fields": ["field1", "field2"],
"constraints": "Describe what correct looks like"
},
"tags": ["basic"]
},
{
"id": "tc-002",
"input": "Edge case input",
"expected_output": {
"format": "json",
"required_fields": ["field1", "field2"],
"constraints": "Expected behavior for edge case"
},
"tags": ["edge-case"]
}
]
}
Running the Tests
# Basic run
python prompt_test.py --test-set tests/products.json --prompt prompts/v2.txt
# Save results for comparison
python prompt_test.py --test-set tests/products.json --prompt prompts/v2.txt --output results/v2_results.json
# Compare two versions
python prompt_test.py --test-set tests/products.json --prompt prompts/v1.txt --output results/v1.json
python prompt_test.py --test-set tests/products.json --prompt prompts/v2.txt --output results/v2.json
Key Takeaways
- Treat prompts like code — test them rigorously before deploying
- Define clear, measurable success criteria before writing tests
- Build test suites with real data, edge cases, and adversarial inputs
- Use Claude as a judge for scalable automated grading
- Choose the right scoring method — binary, Likert, or weighted
- A/B test prompt variants against the same test set
- Track performance over time to catch regressions
- Follow the testing loop: write, test, analyze, refine, repeat
- Automate everything — manual testing does not scale