HomePrompt EngineeringEvaluating & Testing Prompts
intermediate12 min read· Module 7, Lesson 3

📊Evaluating & Testing Prompts

Build test suites for your prompts and measure quality systematically

Evaluating & Testing Prompts

Why Testing Matters

Prompts are code. They execute logic, produce outputs, and affect downstream systems. Yet most teams treat prompts as informal text — they eyeball a few outputs and call it done.

This approach fails at scale because:

  • Regressions are invisible. A small prompt change can break edge cases you tested months ago.
  • Quality is subjective. Without defined criteria, "good enough" varies by person and mood.
  • Iteration is blind. You cannot improve what you do not measure.
  • Production failures are expensive. A bad prompt in production can generate incorrect data, offend users, or cause downstream system failures.

Professional prompt engineering requires the same rigor as software testing: defined inputs, expected outputs, automated checks, and continuous monitoring.


Defining Success Criteria

Before testing, you need to know what "good" looks like. Define clear, measurable criteria for your prompt outputs.

Types of Criteria

CriterionDescriptionExample
AccuracyFactual correctness"The output must contain the correct price"
FormatStructure compliance"Output must be valid JSON with specific keys"
CompletenessAll required info present"Response must cover all 5 product features"
ToneVoice and style"Professional tone, no slang, third person"
SafetyNo harmful content"No PII, no medical advice, no hallucinations"
LengthWithin bounds"Between 100 and 300 words"
LatencySpeed of response"Response generated in under 3 seconds"

Writing Good Criteria

Bad criteria:

"The output should be good and helpful"

Good criteria:

1. Output is valid JSON matching the schema { name: string, price: number, description: string } 2. Price is accurate to within $0.01 of the source data 3. Description is 1-3 sentences, no marketing superlatives 4. No fields are null or empty 5. Response time is under 2 seconds

Building Assessment Sets

An assessment set is a collection of test cases, each with:

  1. Input — The user message or context to send to the prompt
  2. Expected output — What a correct response looks like
  3. Assessment criteria — How to judge the response

Assessment Set Structure

JSON
{ "assessment_set": "product-description-generator", "version": "1.2", "test_cases": [ { "id": "tc-001", "input": "Generate a description for: Blue Widget, $9.99, waterproof", "expected_output": { "format": "json", "required_fields": ["name", "price", "description"], "price_value": 9.99, "must_contain": ["waterproof"], "must_not_contain": ["amazing", "incredible", "best"] }, "tags": ["basic", "single-product"] }, { "id": "tc-002", "input": "Generate a description for: Red Gadget, $149.00, bluetooth, rechargeable", "expected_output": { "format": "json", "required_fields": ["name", "price", "description"], "price_value": 149.00, "must_contain": ["bluetooth", "rechargeable"], "must_not_contain": ["amazing", "incredible", "best"] }, "tags": ["basic", "multi-feature"] } ] }

How Many Test Cases?

Use CaseMinimum CasesRecommended
Prototype / POC5-1020
Internal tool20-5050-100
Customer-facing50-100200+
Safety-critical100+500+

Sourcing Test Cases

  • Real user queries — Sample from production logs
  • Edge cases — Unusual inputs, empty fields, long text, special characters
  • Adversarial inputs — Prompt injections, off-topic requests
  • Boundary conditions — Maximum/minimum values, exact thresholds
  • Failure modes — What the model gets wrong most often

Automated Assessment with Claude

One of the most powerful methods is using Claude itself as a judge. This is called LLM-as-judge or model-graded assessment.

Basic Assessment Prompt

You are an assessment judge. Grade the following AI response against the criteria. INPUT: {input} AI RESPONSE: {response} CRITERIA: 1. Is the output valid JSON? (yes/no) 2. Does it contain all required fields (name, price, description)? (yes/no) 3. Is the price accurate? (yes/no) 4. Is the description 1-3 sentences? (yes/no) 5. Does it avoid marketing superlatives? (yes/no) Return your assessment as JSON: { "scores": { "valid_json": true/false, "has_required_fields": true/false, "accurate_price": true/false, "correct_length": true/false, "no_superlatives": true/false }, "overall_pass": true/false, "reasoning": "Brief explanation" }

Structured Assessment Categories

MethodBest ForAccuracy
Exact matchDeterministic outputs (codes, IDs)Very high
Contains/regexRequired keywords or patternsHigh
LLM-as-judgeSubjective quality, tone, helpfulnessMedium-high
Human reviewComplex, nuanced assessmentHighest
CompositeCombining multiple methodsHigh

Scoring Methods

Binary Scoring

Simple pass/fail for each criterion.

Python
def binary_score(response, criteria): results = {} results["valid_json"] = is_valid_json(response) results["has_fields"] = has_required_fields(response, criteria["fields"]) results["correct_length"] = check_length(response, criteria["min"], criteria["max"]) results["pass"] = all(results.values()) return results

Likert Scale Scoring

Rate each criterion on a 1-5 scale for more nuance.

Python
def likert_score(response, criteria): # Use Claude as a judge assessment_prompt = f""" Rate the following response on each criterion from 1-5: 1 = Very poor, 2 = Poor, 3 = Acceptable, 4 = Good, 5 = Excellent Response: {response} Criteria: - Accuracy: How factually correct is the response? - Completeness: Does it cover all required information? - Clarity: How easy is it to understand? - Tone: Does it match the desired voice? Return JSON: {{ "accuracy": N, "completeness": N, "clarity": N, "tone": N }} """ return call_claude(assessment_prompt)

Weighted Scoring

Assign weights to different criteria based on importance.

Python
WEIGHTS = { "accuracy": 0.40, "completeness": 0.25, "format": 0.15, "tone": 0.10, "length": 0.10, } def weighted_score(scores): total = sum(scores[k] * WEIGHTS[k] for k in WEIGHTS) return round(total, 2)

A/B Testing Prompts

When you have two prompt variants, run them against the same test set and compare.

A/B Test Framework

Python
from anthropic import Anthropic client = Anthropic() def run_ab_test(prompt_a, prompt_b, test_cases, judge_prompt): results = {"a_wins": 0, "b_wins": 0, "ties": 0, "details": []} for case in test_cases: # Run both prompts response_a = client.messages.create( model="claude-sonnet-4-20250514", max_tokens=1024, messages=[{"role": "user", "content": prompt_a + "\n" + case["input"]}], ).content[0].text response_b = client.messages.create( model="claude-sonnet-4-20250514", max_tokens=1024, messages=[{"role": "user", "content": prompt_b + "\n" + case["input"]}], ).content[0].text # Judge both responses judgment = client.messages.create( model="claude-sonnet-4-20250514", max_tokens=1024, messages=[{ "role": "user", "content": judge_prompt.format( input=case["input"], response_a=response_a, response_b=response_b, ), }], ).content[0].text verdict = json.loads(judgment) if verdict["winner"] == "A": results["a_wins"] += 1 elif verdict["winner"] == "B": results["b_wins"] += 1 else: results["ties"] += 1 results["details"].append({ "case_id": case["id"], "winner": verdict["winner"], "reasoning": verdict["reasoning"], }) return results

Tracking Performance Over Time

Testing is not a one-time event. Track metrics across prompt versions to spot regressions and measure improvement.

Metrics to Track

MetricFormulaTarget
Pass ratePassed / Total> 95% for production
Average scoreSum of scores / Count> 4.0 on 5-point scale
Regression rateCases that used to pass but now fail0%
Latency p50/p95Median and 95th percentile response time< 2s / < 5s
Cost per runTotal tokens x price per tokenWithin budget

Version Tracking Table

| Version | Date | Pass Rate | Avg Score | Regressions | Notes | |---------|------------|-----------|-----------|-------------|--------------------------| | v1.0 | 2025-01-15 | 72% | 3.4 | - | Initial prompt | | v1.1 | 2025-01-22 | 81% | 3.8 | 2 | Added few-shot examples | | v1.2 | 2025-02-01 | 89% | 4.1 | 0 | Restructured format | | v1.3 | 2025-02-10 | 94% | 4.4 | 1 | Added edge case handling | | v2.0 | 2025-03-01 | 97% | 4.7 | 0 | Complete rewrite |

The Testing Loop

Testing prompts is an iterative cycle:

┌──────────────┐ │ 1. Write │ │ Prompt │ └──────┬───────┘ │ ┌──────▼───────┐ │ 2. Run │ │ Test Suite │ └──────┬───────┘ │ ┌──────▼───────┐ │ 3. Analyze │ │ Results │ └──────┬───────┘ │ ┌──────▼───────┐ ┌──────────────┐ │ 4. Pass? │──No──>│ 5. Refine │ │ (>95%) │ │ Prompt │──┐ └──────┬───────┘ └──────────────┘ │ │ │ Yes │ │ ┌──────────────┐ │ ┌──────▼───────┐ │ 6. Add New │ │ │ Deploy │ │ Test Cases │───┘ └──────────────┘ └──────────────┘

Key Principles

  1. Never deploy without passing tests. Treat failing tests like failing unit tests.
  2. Add test cases for every bug. When you find a bad output, add it to the test suite.
  3. Version your prompts. Use git or a similar system to track prompt changes.
  4. Automate the loop. Run tests in CI/CD pipelines automatically.
  5. Review regressions immediately. A regression means your change broke something.

Tools for Prompt Testing

ToolTypeDescription
Anthropic ConsoleCloudBuilt-in testing tools for Claude prompts
promptfooOpen-sourceCLI tool for testing and grading prompts
BraintrustPlatformLogging, grading, and prompt management
LangSmithPlatformTracing and grading for LLM apps
Weights & BiasesPlatformExperiment tracking for ML and LLM
Custom scriptsDIYPython/TypeScript scripts using the Claude API

Practical Example: Python Testing Script

Here is a complete, production-ready testing script you can adapt for your own projects:

Python
""" prompt_test.py — Automated prompt testing framework Usage: python prompt_test.py --test-set tests/product_descriptions.json --prompt prompts/v2.txt """ from datetime import datetime from pathlib import Path from anthropic import Anthropic client = Anthropic() MODEL = "claude-sonnet-4-20250514" def load_test_set(path: str) -> dict: """Load a test set from a JSON file.""" with open(path) as f: return json.load(f) def load_prompt(path: str) -> str: """Load a prompt template from a text file.""" return Path(path).read_text() def run_prompt(prompt: str, user_input: str) -> tuple[str, float]: """Run a prompt and return the response with latency.""" start = time.time() response = client.messages.create( model=MODEL, max_tokens=1024, messages=[{"role": "user", "content": f"{prompt}\n\nInput: {user_input}"}], ) latency = time.time() - start return response.content[0].text, latency def assess_response( user_input: str, response: str, expected: dict, criteria: list[str], ) -> dict: """Use Claude as a judge to assess a response.""" judge_prompt = f"""You are a strict judge. Assess the AI response against the expected output and criteria. USER INPUT: {user_input} AI RESPONSE: {response} EXPECTED OUTPUT SPEC: {json.dumps(expected)} CRITERIA: {chr(10).join(f"- {c}" for c in criteria)} Return your assessment as JSON: {{ "scores": {{ <criterion_name>: {{ "pass": true/false, "score": 1-5, "reason": "..." }} }}, "overall_pass": true/false, "overall_score": <1-5 float>, "summary": "Brief summary" }} Be strict. Only pass criteria that are clearly met.""" result = client.messages.create( model=MODEL, max_tokens=1024, messages=[{"role": "user", "content": judge_prompt}], ) return json.loads(result.content[0].text) def run_tests(test_set_path: str, prompt_path: str, output_path: str = None): """Run a full test suite and generate a report.""" test_set = load_test_set(test_set_path) prompt = load_prompt(prompt_path) results = { "test_set": test_set.get("test_set", "unknown"), "prompt_file": prompt_path, "timestamp": datetime.now().isoformat(), "model": MODEL, "summary": {}, "details": [], } total_pass = 0 total_score = 0.0 total_latency = 0.0 for case in test_set["test_cases"]: print(f" Running: {case['id']}...", end=" ") response, latency = run_prompt(prompt, case["input"]) assessment = assess_response( case["input"], response, case["expected_output"], test_set.get("criteria", ["accuracy", "format", "completeness"]), ) passed = assessment.get("overall_pass", False) score = assessment.get("overall_score", 0) if passed: total_pass += 1 print("PASS") else: print("FAIL") total_score += score total_latency += latency results["details"].append({ "case_id": case["id"], "input": case["input"], "response": response, "assessment": assessment, "latency_seconds": round(latency, 2), "passed": passed, }) count = len(test_set["test_cases"]) results["summary"] = { "total_cases": count, "passed": total_pass, "failed": count - total_pass, "pass_rate": round(total_pass / count * 100, 1) if count else 0, "avg_score": round(total_score / count, 2) if count else 0, "avg_latency": round(total_latency / count, 2) if count else 0, } # Print summary print("\n" + "=" * 50) print(f" TEST RESULTS: {results['test_set']}") print("=" * 50) print(f" Total: {count}") print(f" Passed: {total_pass}") print(f" Failed: {count - total_pass}") print(f" Pass Rate: {results['summary']['pass_rate']}%") print(f" Avg Score: {results['summary']['avg_score']}/5") print(f" Avg Latency: {results['summary']['avg_latency']}s") print("=" * 50) # Save results if output_path: with open(output_path, "w") as f: json.dump(results, f, indent=2) print(f"\n Results saved to: {output_path}") return results if __name__ == "__main__": parser = argparse.ArgumentParser(description="Test prompts") parser.add_argument("--test-set", required=True, help="Path to test set JSON") parser.add_argument("--prompt", required=True, help="Path to prompt template") parser.add_argument("--output", default=None, help="Path to save results JSON") args = parser.parse_args() run_tests(args.test_set, args.prompt, args.output)

Test Set Template

Save this as your starting point for any new test suite:

JSON
{ "test_set": "my-feature-name", "version": "1.0", "criteria": [ "Output is valid JSON", "All required fields are present and non-empty", "Factual accuracy matches expected values", "Tone is professional and neutral", "Length is within specified bounds" ], "test_cases": [ { "id": "tc-001", "input": "Your test input here", "expected_output": { "format": "json", "required_fields": ["field1", "field2"], "constraints": "Describe what correct looks like" }, "tags": ["basic"] }, { "id": "tc-002", "input": "Edge case input", "expected_output": { "format": "json", "required_fields": ["field1", "field2"], "constraints": "Expected behavior for edge case" }, "tags": ["edge-case"] } ] }

Running the Tests

Terminal
# Basic run python prompt_test.py --test-set tests/products.json --prompt prompts/v2.txt # Save results for comparison python prompt_test.py --test-set tests/products.json --prompt prompts/v2.txt --output results/v2_results.json # Compare two versions python prompt_test.py --test-set tests/products.json --prompt prompts/v1.txt --output results/v1.json python prompt_test.py --test-set tests/products.json --prompt prompts/v2.txt --output results/v2.json

Key Takeaways

  • Treat prompts like code — test them rigorously before deploying
  • Define clear, measurable success criteria before writing tests
  • Build test suites with real data, edge cases, and adversarial inputs
  • Use Claude as a judge for scalable automated grading
  • Choose the right scoring method — binary, Likert, or weighted
  • A/B test prompt variants against the same test set
  • Track performance over time to catch regressions
  • Follow the testing loop: write, test, analyze, refine, repeat
  • Automate everything — manual testing does not scale