📊 Evaluating & Testing Prompts
Build test suites for your prompts and measure quality systematically
Why Testing Matters
Prompts are code. They execute logic, produce outputs, and affect downstream systems. Yet most teams treat prompts as informal text — they eyeball a few outputs and call it done.
This approach fails at scale because:
- Regressions are invisible. A small prompt change can break edge cases you tested months ago.
- Quality is subjective. Without defined criteria, "good enough" varies by person and mood.
- Iteration is blind. You cannot improve what you do not measure.
- Production failures are expensive. A bad prompt in production can generate incorrect data, offend users, or cause downstream system failures.
Professional prompt engineering requires the same rigor as software testing: defined inputs, expected outputs, automated checks, and continuous monitoring.
Defining Success Criteria
Before testing, you need to know what "good" looks like. Define clear, measurable criteria for your prompt outputs.
Types of Criteria
| Criterion | Description | Example |
|---|---|---|
| Accuracy | Factual correctness | "The output must contain the correct price" |
| Format | Structure compliance | "Output must be valid JSON with specific keys" |
| Completeness | All required info present | "Response must cover all 5 product features" |
| Tone | Voice and style | "Professional tone, no slang, third person" |
| Safety | No harmful content | "No PII, no medical advice, no hallucinations" |
| Length | Within bounds | "Between 100 and 300 words" |
| Latency | Speed of response | "Response generated in under 3 seconds" |
Writing Good Criteria
Bad criteria:
"The output should be good and helpful"
Good criteria:
1. Output is valid JSON matching the schema { name: string, price: number, description: string }
2. Price is accurate to within $0.01 of the source data
3. Description is 1-3 sentences, no marketing superlatives
4. No fields are null or empty
5. Response time is under 2 seconds
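Criteria this concrete can be checked in code. Below is a minimal sketch of the five checks above; the superlative blocklist and the sentence-counting heuristic are illustrative, not part of the original spec:

```python
import json

SUPERLATIVES = ("amazing", "incredible", "best")  # illustrative blocklist

def meets_criteria(response: str, source_price: float, latency_s: float) -> bool:
    try:
        data = json.loads(response)                        # 1. valid JSON
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or set(data) != {"name", "price", "description"}:
        return False                                       # 1. exact schema keys
    if not all(data.values()):                             # 4. no null or empty fields
        return False
    if not isinstance(data["price"], (int, float)):        # 2. price is a number...
        return False
    if abs(data["price"] - source_price) > 0.01:           # 2. ...within $0.01 of source
        return False
    desc = data["description"]
    if not isinstance(desc, str):
        return False
    sentences = [s for s in desc.split(".") if s.strip()]
    if not 1 <= len(sentences) <= 3:                       # 3. 1-3 sentences (rough)
        return False
    if any(w in desc.lower() for w in SUPERLATIVES):       # 3. no marketing superlatives
        return False
    return latency_s < 2.0                                 # 5. under 2 seconds
```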
Building Assessment Sets
An assessment set is a collection of test cases, each with:
- Input — The user message or context to send to the prompt
- Expected output — What a correct response looks like
- Assessment criteria — How to judge the response
Assessment Set Structure
{
"assessment_set": "product-description-generator",
"version": "1.2",
"test_cases": [
{
"id": "tc-001",
"input": "Generate a description for: Blue Widget, $9.99, waterproof",
"expected_output": {
"format": "json",
"required_fields": ["name", "price", "description"],
"price_value": 9.99,
"must_contain": ["waterproof"],
"must_not_contain": ["amazing", "incredible", "best"]
},
"tags": ["basic", "single-product"]
},
{
"id": "tc-002",
"input": "Generate a description for: Red Gadget, $149.00, bluetooth, rechargeable",
"expected_output": {
"format": "json",
"required_fields": ["name", "price", "description"],
"price_value": 149.00,
"must_contain": ["bluetooth", "rechargeable"],
"must_not_contain": ["amazing", "incredible", "best"]
},
"tags": ["basic", "multi-feature"]
}
]
}
How Many Test Cases?
| Use Case | Minimum Cases | Recommended |
|---|---|---|
| Prototype / POC | 5-10 | 20 |
| Internal tool | 20-50 | 50-100 |
| Customer-facing | 50-100 | 200+ |
| Safety-critical | 100+ | 500+ |
Sourcing Test Cases
- Real user queries — Sample from production logs (see the sketch after this list)
- Edge cases — Unusual inputs, empty fields, long text, special characters
- Adversarial inputs — Prompt injections, off-topic requests
- Boundary conditions — Maximum/minimum values, exact thresholds
- Failure modes — What the model gets wrong most often
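Sampling from logs is easy to script. A minimal sketch, assuming JSONL logs with an "input" field per line; note that expected_output still has to be labeled by a human:

```python
import json
import random

def sample_cases_from_logs(log_path: str, n: int = 50) -> list[dict]:
    """Draft test cases from production logs (assumes JSONL with an "input" field)."""
    with open(log_path) as f:
        inputs = [json.loads(line)["input"] for line in f if line.strip()]
    sampled = random.sample(inputs, min(n, len(inputs)))
    return [
        # expected_output is left empty on purpose: a human still has to label it
        {"id": f"tc-log-{i:03d}", "input": text, "expected_output": {}, "tags": ["from-logs"]}
        for i, text in enumerate(sampled, start=1)
    ]
```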
Automated Assessment with Claude
One of the most powerful methods is using Claude itself as a judge. This is called LLM-as-judge or model-graded assessment.
Basic Assessment Prompt
You are an assessment judge. Grade the following AI response against the criteria.
INPUT: {input}
AI RESPONSE: {response}
CRITERIA:
1. Is the output valid JSON? (yes/no)
2. Does it contain all required fields (name, price, description)? (yes/no)
3. Is the price accurate? (yes/no)
4. Is the description 1-3 sentences? (yes/no)
5. Does it avoid marketing superlatives? (yes/no)
Return your assessment as JSON:
{
"scores": {
"valid_json": true/false,
"has_required_fields": true/false,
"accurate_price": true/false,
"correct_length": true/false,
"no_superlatives": true/false
},
"overall_pass": true/false,
"reasoning": "Brief explanation"
}
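Wiring this judge prompt to the API takes only a few lines. A minimal sketch, assuming the template above is saved at the hypothetical path prompts/judge.txt with {input} and {response} placeholders:

```python
import json

from anthropic import Anthropic

client = Anthropic()
JUDGE_PROMPT = open("prompts/judge.txt").read()  # the template shown above

def judge(user_input: str, model_response: str) -> dict:
    # str.replace avoids having to escape the literal JSON braces in the template,
    # which str.format would trip over
    filled = (JUDGE_PROMPT
              .replace("{input}", user_input)
              .replace("{response}", model_response))
    result = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        messages=[{"role": "user", "content": filled}],
    )
    # Assumes the judge returns bare JSON, as the template instructs
    return json.loads(result.content[0].text)
```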
Structured Assessment Categories
| Method | Best For | Accuracy |
|---|---|---|
| Exact match | Deterministic outputs (codes, IDs) | Very high |
| Contains/regex | Required keywords or patterns | High |
| LLM-as-judge | Subjective quality, tone, helpfulness | Medium-high |
| Human review | Complex, nuanced assessment | Highest |
| Composite | Combining multiple methods | High |
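The composite row deserves a sketch: run the cheap deterministic checks first and reserve the LLM judge for subjective criteria. This assumes an llm_judge callable (for example, a wrapper like the judge sketch above, adapted to take only the response) that returns the judge's JSON:

```python
import json
import re

def composite_grade(response: str, expected: dict, llm_judge) -> dict:
    """Deterministic checks first; call the LLM judge only if they pass."""
    checks = {"valid_json": False, "price_exact": False}
    try:
        data = json.loads(response)
        checks["valid_json"] = isinstance(data, dict)
        if checks["valid_json"]:
            checks["price_exact"] = data.get("price") == expected["price_value"]
    except json.JSONDecodeError:
        pass
    checks["has_keywords"] = all(
        re.search(re.escape(kw), response, re.IGNORECASE)
        for kw in expected.get("must_contain", [])
    )
    # The judge is the slowest, costliest check, so run it last and only if needed
    if all(checks.values()):
        checks["subjective_pass"] = bool(llm_judge(response).get("overall_pass"))
    else:
        checks["subjective_pass"] = False
    checks["pass"] = all(checks.values())
    return checks
```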
Scoring Methods
Binary Scoring
Simple pass/fail for each criterion.
def binary_score(response, criteria):
    # is_valid_json, has_required_fields, and check_length are your own
    # deterministic helpers (json.loads, key checks, len bounds, etc.)
    results = {}
    results["valid_json"] = is_valid_json(response)
    results["has_fields"] = has_required_fields(response, criteria["fields"])
    results["correct_length"] = check_length(response, criteria["min"], criteria["max"])
    results["pass"] = all(results.values())  # overall pass requires every check to pass
    return results
Likert Scale Scoring
Rate each criterion on a 1-5 scale for more nuance.
def likert_score(response, criteria):
    # Use Claude as a judge (call_claude is an assumed wrapper that sends
    # the prompt and returns the parsed JSON scores)
assessment_prompt = f"""
Rate the following response on each criterion from 1-5:
1 = Very poor, 2 = Poor, 3 = Acceptable, 4 = Good, 5 = Excellent
Response: {response}
Criteria:
- Accuracy: How factually correct is the response?
- Completeness: Does it cover all required information?
- Clarity: How easy is it to understand?
- Tone: Does it match the desired voice?
Return JSON: {{ "accuracy": N, "completeness": N, "clarity": N, "tone": N }}
"""
    return call_claude(assessment_prompt)
Weighted Scoring
Assign weights to different criteria based on importance.
WEIGHTS = {
"accuracy": 0.40,
"completeness": 0.25,
"format": 0.15,
"tone": 0.10,
"length": 0.10,
}
def weighted_score(scores):
    """Collapse per-criterion 1-5 ratings into one weighted score.

    Example: weighted_score({"accuracy": 5, "completeness": 4,
                             "format": 5, "tone": 4, "length": 5}) == 4.65
    """
    total = sum(scores[k] * WEIGHTS[k] for k in WEIGHTS)
    return round(total, 2)
A/B Testing Prompts
When you have two prompt variants, run them against the same test set and compare. If an LLM judges the pairs, consider randomizing which variant appears as A, since judges can favor whichever response they read first.
A/B Test Framework
import json

from anthropic import Anthropic
client = Anthropic()
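# One possible judge_prompt template (illustrative; adapt the task description).
# run_ab_test below parses "winner" and "reasoning" from the judge's JSON, so
# whatever template you pass in must request exactly those fields. Literal
# braces are doubled because the template goes through str.format.
AB_JUDGE_PROMPT = """You are comparing two AI responses to the same input.
INPUT: {input}
RESPONSE A: {response_a}
RESPONSE B: {response_b}
Decide which response better satisfies the task. Return JSON only:
{{ "winner": "A" | "B" | "tie", "reasoning": "one sentence" }}"""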
def run_ab_test(prompt_a, prompt_b, test_cases, judge_prompt):
results = {"a_wins": 0, "b_wins": 0, "ties": 0, "details": []}
for case in test_cases:
# Run both prompts
response_a = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
messages=[{"role": "user", "content": prompt_a + "\n" + case["input"]}],
).content[0].text
response_b = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
messages=[{"role": "user", "content": prompt_b + "\n" + case["input"]}],
).content[0].text
# Judge both responses
judgment = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
messages=[{
"role": "user",
"content": judge_prompt.format(
input=case["input"],
response_a=response_a,
response_b=response_b,
),
}],
).content[0].text
verdict = json.loads(judgment)
if verdict["winner"] == "A":
results["a_wins"] += 1
elif verdict["winner"] == "B":
results["b_wins"] += 1
else:
results["ties"] += 1
results["details"].append({
"case_id": case["id"],
"winner": verdict["winner"],
"reasoning": verdict["reasoning"],
})
    return results
Tracking Performance Over Time
Testing is not a one-time event. Track metrics across prompt versions to spot regressions and measure improvement.
Metrics to Track
| Metric | Formula | Target |
|---|---|---|
| Pass rate | Passed / Total | > 95% for production |
| Average score | Sum of scores / Count | > 4.0 on 5-point scale |
| Regression rate | Cases that used to pass but now fail | 0% |
| Latency p50/p95 | Median and 95th percentile response time | < 2s / < 5s |
| Cost per run | Total tokens x price per token | Within budget |
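The regression metric is worth automating. A minimal sketch, assuming the results JSON written by the prompt_test.py script shown later in this article (per-case "details" entries with "case_id" and "passed"):

```python
import json

def find_regressions(old_path: str, new_path: str) -> list[str]:
    """Return IDs of cases that passed in the old run but fail in the new run."""
    def passed_by_id(path: str) -> dict[str, bool]:
        with open(path) as f:
            return {d["case_id"]: d["passed"] for d in json.load(f)["details"]}
    old, new = passed_by_id(old_path), passed_by_id(new_path)
    return [cid for cid in old if old[cid] and not new.get(cid, False)]
```

Regression rate is then len(find_regressions(...)) divided by the total case count; anything above zero should block the release.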
Version Tracking Table
| Version | Date | Pass Rate | Avg Score | Regressions | Notes |
|---------|------------|-----------|-----------|-------------|--------------------------|
| v1.0 | 2025-01-15 | 72% | 3.4 | - | Initial prompt |
| v1.1 | 2025-01-22 | 81% | 3.8 | 2 | Added few-shot examples |
| v1.2 | 2025-02-01 | 89% | 4.1 | 0 | Restructured format |
| v1.3 | 2025-02-10 | 94% | 4.4 | 1 | Added edge case handling |
| v2.0 | 2025-03-01 | 97% | 4.7 | 0 | Complete rewrite |
The Testing Loop
Testing prompts is an iterative cycle:
┌──────────────┐
│ 1. Write │
│ Prompt │
└──────┬───────┘
│
┌──────▼───────┐
│ 2. Run │
│ Test Suite │
└──────┬───────┘
│
┌──────▼───────┐
│ 3. Analyze │
│ Results │
└──────┬───────┘
│
┌──────▼───────┐ ┌──────────────┐
│ 4. Pass? │──No──>│ 5. Refine │
│ (>95%) │ │ Prompt │──┐
└──────┬───────┘ └──────────────┘ │
│ │
Yes │
│ ┌──────────────┐ │
┌──────▼───────┐ │ 6. Add New │ │
│ Deploy │ │ Test Cases │───┘
└──────────────┘ └──────────────┘
Key Principles
- Never deploy without passing tests. Treat failing tests like failing unit tests.
- Add test cases for every bug. When you find a bad output, add it to the test suite.
- Version your prompts. Use git or a similar system to track prompt changes.
- Automate the loop. Run tests in CI/CD pipelines automatically (see the sketch after this list).
- Review regressions immediately. A regression means your change broke something.
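A minimal sketch of such a CI gate, assuming the run_tests function from the prompt_test.py script in the next section; the file name ci_gate.py and the 95% threshold are illustrative:

```python
# ci_gate.py -- fail the pipeline when prompt quality drops below threshold
import sys

from prompt_test import run_tests  # the script shown in the next section

results = run_tests("tests/products.json", "prompts/v2.txt", "results/latest.json")
if results["summary"]["pass_rate"] < 95:
    sys.exit(1)  # a nonzero exit fails the CI job and blocks the change
```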
Tools for Prompt Testing
| Tool | Type | Description |
|---|---|---|
| Anthropic Console | Cloud | Built-in testing tools for Claude prompts |
| promptfoo | Open-source | CLI tool for testing and grading prompts |
| Braintrust | Platform | Logging, grading, and prompt management |
| LangSmith | Platform | Tracing and grading for LLM apps |
| Weights & Biases | Platform | Experiment tracking for ML and LLM |
| Custom scripts | DIY | Python/TypeScript scripts using the Claude API |
Practical Example: Python Testing Script
Here is a complete testing script you can adapt for your own projects:
"""
prompt_test.py — Automated prompt testing framework
Usage: python prompt_test.py --test-set tests/product_descriptions.json --prompt prompts/v2.txt
"""
import argparse
import json
import time
from datetime import datetime
from pathlib import Path
from anthropic import Anthropic
client = Anthropic()
MODEL = "claude-sonnet-4-20250514"
def load_test_set(path: str) -> dict:
"""Load a test set from a JSON file."""
with open(path) as f:
return json.load(f)
def load_prompt(path: str) -> str:
"""Load a prompt template from a text file."""
return Path(path).read_text()
def run_prompt(prompt: str, user_input: str) -> tuple[str, float]:
"""Run a prompt and return the response with latency."""
start = time.time()
response = client.messages.create(
model=MODEL,
max_tokens=1024,
messages=[{"role": "user", "content": f"{prompt}\n\nInput: {user_input}"}],
)
latency = time.time() - start
return response.content[0].text, latency
def assess_response(
user_input: str,
response: str,
expected: dict,
criteria: list[str],
) -> dict:
"""Use Claude as a judge to assess a response."""
judge_prompt = f"""You are a strict judge. Assess the AI response against
the expected output and criteria.
USER INPUT: {user_input}
AI RESPONSE: {response}
EXPECTED OUTPUT SPEC: {json.dumps(expected)}
CRITERIA:
{chr(10).join(f"- {c}" for c in criteria)}
Return your assessment as JSON:
{{
"scores": {{
<criterion_name>: {{ "pass": true/false, "score": 1-5, "reason": "..." }}
}},
"overall_pass": true/false,
"overall_score": <1-5 float>,
"summary": "Brief summary"
}}
Be strict. Only pass criteria that are clearly met."""
result = client.messages.create(
model=MODEL,
max_tokens=1024,
messages=[{"role": "user", "content": judge_prompt}],
)
return json.loads(result.content[0].text)
def run_tests(test_set_path: str, prompt_path: str, output_path: str | None = None):
"""Run a full test suite and generate a report."""
test_set = load_test_set(test_set_path)
prompt = load_prompt(prompt_path)
results = {
"test_set": test_set.get("test_set", "unknown"),
"prompt_file": prompt_path,
"timestamp": datetime.now().isoformat(),
"model": MODEL,
"summary": {},
"details": [],
}
total_pass = 0
total_score = 0.0
total_latency = 0.0
for case in test_set["test_cases"]:
print(f" Running: {case['id']}...", end=" ")
response, latency = run_prompt(prompt, case["input"])
assessment = assess_response(
case["input"],
response,
case["expected_output"],
test_set.get("criteria", ["accuracy", "format", "completeness"]),
)
passed = assessment.get("overall_pass", False)
score = assessment.get("overall_score", 0)
if passed:
total_pass += 1
print("PASS")
else:
print("FAIL")
total_score += score
total_latency += latency
results["details"].append({
"case_id": case["id"],
"input": case["input"],
"response": response,
"assessment": assessment,
"latency_seconds": round(latency, 2),
"passed": passed,
})
count = len(test_set["test_cases"])
results["summary"] = {
"total_cases": count,
"passed": total_pass,
"failed": count - total_pass,
"pass_rate": round(total_pass / count * 100, 1) if count else 0,
"avg_score": round(total_score / count, 2) if count else 0,
"avg_latency": round(total_latency / count, 2) if count else 0,
}
# Print summary
print("\n" + "=" * 50)
print(f" TEST RESULTS: {results['test_set']}")
print("=" * 50)
print(f" Total: {count}")
print(f" Passed: {total_pass}")
print(f" Failed: {count - total_pass}")
print(f" Pass Rate: {results['summary']['pass_rate']}%")
print(f" Avg Score: {results['summary']['avg_score']}/5")
print(f" Avg Latency: {results['summary']['avg_latency']}s")
print("=" * 50)
# Save results
if output_path:
with open(output_path, "w") as f:
json.dump(results, f, indent=2)
print(f"\n Results saved to: {output_path}")
return results
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Test prompts")
parser.add_argument("--test-set", required=True, help="Path to test set JSON")
parser.add_argument("--prompt", required=True, help="Path to prompt template")
parser.add_argument("--output", default=None, help="Path to save results JSON")
args = parser.parse_args()
    run_tests(args.test_set, args.prompt, args.output)
Test Set Template
Save this as your starting point for any new test suite:
{
"test_set": "my-feature-name",
"version": "1.0",
"criteria": [
"Output is valid JSON",
"All required fields are present and non-empty",
"Factual accuracy matches expected values",
"Tone is professional and neutral",
"Length is within specified bounds"
],
"test_cases": [
{
"id": "tc-001",
"input": "Your test input here",
"expected_output": {
"format": "json",
"required_fields": ["field1", "field2"],
"constraints": "Describe what correct looks like"
},
"tags": ["basic"]
},
{
"id": "tc-002",
"input": "Edge case input",
"expected_output": {
"format": "json",
"required_fields": ["field1", "field2"],
"constraints": "Expected behavior for edge case"
},
"tags": ["edge-case"]
}
]
}
Running the Tests
# Basic run
python prompt_test.py --test-set tests/products.json --prompt prompts/v2.txt
# Save results for comparison
python prompt_test.py --test-set tests/products.json --prompt prompts/v2.txt --output results/v2_results.json
# Compare two versions
python prompt_test.py --test-set tests/products.json --prompt prompts/v1.txt --output results/v1.json
python prompt_test.py --test-set tests/products.json --prompt prompts/v2.txt --output results/v2.json
Key Takeaways
- Treat prompts like code — test them rigorously before deploying
- Define clear, measurable success criteria before writing tests
- Build test suites with real data, edge cases, and adversarial inputs
- Use Claude as a judge for scalable automated grading
- Choose the right scoring method — binary, Likert, or weighted
- A/B test prompt variants against the same test set
- Track performance over time to catch regressions
- Follow the testing loop: write, test, analyze, refine, repeat
- Automate everything — manual testing does not scale