# 📈 Project: AI Data Dashboard

Analyze CSV data with Claude's structured outputs to generate insights.

## Overview
In this project you will build a complete AI-powered data analysis pipeline in Python. The pipeline reads any CSV file, sends the data to Claude, and produces a structured JSON analysis plus a human-readable Markdown report.
By the end you will have a reusable tool that can:
- Detect column types automatically
- Identify trends, outliers, and statistical summaries
- Generate chart descriptions (even without a charting library)
- Output a polished Markdown report ready for stakeholders
## Prerequisites

| Requirement | Why |
|---|---|
| Python 3.10+ | async / structural pattern matching |
| anthropic SDK | Claude API access |
| pandas | CSV reading and quick stats |
| An Anthropic API key | Set as ANTHROPIC_API_KEY |
Install the dependencies:

```bash
pip install anthropic pandas
```

## Step 1 — Project Structure
Create a folder and the files we need:

```text
ai-data-dashboard/
├── dashboard.py         # Main script
├── analysis_schema.py   # Pydantic-style JSON schema
├── report_generator.py  # Markdown report builder
├── sample_data.csv      # Any CSV for testing
└── output/
    ├── analysis.json
    └── report.md
```
## Step 2 — Define the Structured Output Schema
We want Claude to return a predictable JSON structure so the rest of the code can consume it without guessing.
```python
# analysis_schema.py
ANALYSIS_SCHEMA = {
    "type": "object",
    "properties": {
        "dataset_summary": {
            "type": "object",
            "properties": {
                "row_count": {"type": "integer"},
                "column_count": {"type": "integer"},
                "columns": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "dtype": {"type": "string"},
                            "missing_pct": {"type": "number"},
                            "unique_values": {"type": "integer"}
                        },
                        "required": ["name", "dtype", "missing_pct", "unique_values"]
                    }
                }
            },
            "required": ["row_count", "column_count", "columns"]
        },
        "statistics": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "column": {"type": "string"},
                    "mean": {"type": "number"},
                    "median": {"type": "number"},
                    "std_dev": {"type": "number"},
                    "min": {"type": "number"},
                    "max": {"type": "number"}
                },
                "required": ["column", "mean", "median", "std_dev", "min", "max"]
            }
        },
        "trends": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "confidence": {"type": "string"},
                    "affected_columns": {
                        "type": "array",
                        "items": {"type": "string"}
                    }
                },
                "required": ["description", "confidence", "affected_columns"]
            }
        },
        "outliers": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "column": {"type": "string"},
                    "value": {"type": "number"},
                    "row_index": {"type": "integer"},
                    "reason": {"type": "string"}
                },
                "required": ["column", "value", "row_index", "reason"]
            }
        },
        "chart_suggestions": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "chart_type": {"type": "string"},
                    "title": {"type": "string"},
                    "x_axis": {"type": "string"},
                    "y_axis": {"type": "string"},
                    "description": {"type": "string"}
                },
                "required": ["chart_type", "title", "x_axis", "y_axis", "description"]
            }
        },
        "key_insights": {
            "type": "array",
            "items": {"type": "string"}
        }
    },
    "required": [
        "dataset_summary",
        "statistics",
        "trends",
        "outliers",
        "chart_suggestions",
        "key_insights"
    ]
}
```

### Why a schema?
When you pass this schema to Claude via the structured output feature, the model is forced to return valid JSON that matches every required field. No manual parsing or regex needed.
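Even so, it costs nothing to verify the guarantee where the JSON crosses a module boundary. Here is a minimal sketch of a downstream sanity check; the helper name `check_required` is my own, not part of the tutorial's code:

```python
# Names of the top-level sections the schema marks as required.
REQUIRED_SECTIONS = [
    "dataset_summary", "statistics", "trends",
    "outliers", "chart_suggestions", "key_insights",
]

def check_required(analysis: dict) -> list[str]:
    """Return the names of any required top-level sections that are missing."""
    return [key for key in REQUIRED_SECTIONS if key not in analysis]
```

Calling `check_required(analysis)` right after parsing gives you an empty list on success and a readable error message otherwise.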
## Step 3 — Read the CSV and Build a Data Summary

```python
# dashboard.py (part 1)
import pandas as pd

from analysis_schema import ANALYSIS_SCHEMA


def read_csv(path: str) -> pd.DataFrame:
    """Read a CSV and return a DataFrame."""
    df = pd.read_csv(path)
    print(f"Loaded {len(df)} rows, {len(df.columns)} columns from {path}")
    return df


def build_data_summary(df: pd.DataFrame) -> str:
    """Create a text summary Claude can consume."""
    lines = []
    lines.append(f"Dataset: {len(df)} rows x {len(df.columns)} columns")
    lines.append("")
    lines.append("Column info:")
    for col in df.columns:
        dtype = str(df[col].dtype)
        missing = df[col].isna().sum()
        unique = df[col].nunique()
        lines.append(f"  - {col}: type={dtype}, missing={missing}, unique={unique}")
    lines.append("")
    lines.append("First 5 rows (CSV):")
    lines.append(df.head().to_csv(index=False))
    lines.append("")
    lines.append("Descriptive statistics:")
    lines.append(df.describe(include="all").to_string())
    return "\n".join(lines)
```

### Key decisions
- We send the first 5 rows so Claude understands the shape of the data.
- We include `describe()` output so the model has raw stats to reference.
- Keeping the payload text-based avoids token-heavy base64 images.
## Step 4 — Call Claude with Structured Outputs
```python
# dashboard.py (part 2)
import json

import anthropic


def analyze_with_claude(data_summary: str) -> dict:
    """Send the data summary to Claude and get structured analysis."""
    client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY env var

    system_prompt = """You are a senior data analyst AI.
You will receive a summary of a CSV dataset including column metadata,
sample rows, and descriptive statistics.

Your task:
1. Summarize the dataset structure.
2. Compute or confirm key statistics for every numeric column.
3. Identify trends (time-based, correlations, patterns).
4. Flag outliers with a clear reason.
5. Suggest charts that would best visualize the data.
6. Provide 3-5 key insights a business user would care about.

Be precise. Use the numbers from the provided statistics.
If you are unsure about a trend, say so in the confidence field."""

    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system=system_prompt,
        messages=[
            {
                "role": "user",
                "content": (
                    "Analyze this dataset and return a structured JSON analysis.\n\n"
                    + data_summary
                ),
            }
        ],
        # --- STRUCTURED OUTPUT ---
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "data_analysis",
                "schema": ANALYSIS_SCHEMA,
                "strict": True,
            },
        },
    )

    # The response text is guaranteed valid JSON matching our schema
    result = json.loads(message.content[0].text)
    return result
```

### How structured outputs work
- You pass `response_format` with your JSON schema.
- Claude constrains its generation so the output always matches the schema.
- `strict: True` means Claude will not add extra keys.
- You can safely call `json.loads()` without try/except for malformed JSON.
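If you ever run the same prompt without structured outputs enabled, the model may wrap its JSON in a markdown fence. A small defensive parser covers that case; this is a sketch, and the helper name `parse_analysis` is my own, not part of the tutorial's code:

```python
import json

def parse_analysis(raw: str) -> dict:
    """Parse a model response into a dict, tolerating stray markdown fences."""
    text = raw.strip()
    if text.startswith("```"):
        # Drop an opening fence like ``` or ```json, then a trailing fence.
        text = text.split("\n", 1)[1]
        text = text.rsplit("```", 1)[0]
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data
```

With structured outputs on, `parse_analysis` degenerates to a plain `json.loads` plus a type check, so it is cheap insurance either way.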
## Step 5 — Generate the Markdown Report
```python
# report_generator.py
from datetime import datetime


def generate_report(analysis: dict, csv_name: str) -> str:
    """Turn the structured analysis into a Markdown report."""
    lines = []
    now = datetime.now().strftime("%Y-%m-%d %H:%M")

    # Header
    lines.append("# Data Analysis Report")
    lines.append(f"**Source:** {csv_name}  ")
    lines.append(f"**Generated:** {now}  ")
    lines.append("")

    # Dataset summary
    ds = analysis["dataset_summary"]
    lines.append("## Dataset Overview")
    lines.append(f"- **Rows:** {ds['row_count']}")
    lines.append(f"- **Columns:** {ds['column_count']}")
    lines.append("")
    lines.append("| Column | Type | Missing % | Unique |")
    lines.append("|--------|------|-----------|--------|")
    for col in ds["columns"]:
        lines.append(
            f"| {col['name']} | {col['dtype']} | {col['missing_pct']:.1f}% | {col['unique_values']} |"
        )
    lines.append("")

    # Statistics
    lines.append("## Key Statistics")
    lines.append("| Column | Mean | Median | Std Dev | Min | Max |")
    lines.append("|--------|------|--------|---------|-----|-----|")
    for s in analysis["statistics"]:
        lines.append(
            f"| {s['column']} | {s['mean']:.2f} | {s['median']:.2f} "
            f"| {s['std_dev']:.2f} | {s['min']:.2f} | {s['max']:.2f} |"
        )
    lines.append("")

    # Trends
    lines.append("## Identified Trends")
    for t in analysis["trends"]:
        cols = ", ".join(t["affected_columns"])
        lines.append(f"- **{t['description']}**  ")
        lines.append(f"  Confidence: {t['confidence']} | Columns: {cols}")
    lines.append("")

    # Outliers
    lines.append("## Outliers Detected")
    if analysis["outliers"]:
        lines.append("| Column | Value | Row | Reason |")
        lines.append("|--------|-------|-----|--------|")
        for o in analysis["outliers"]:
            lines.append(f"| {o['column']} | {o['value']} | {o['row_index']} | {o['reason']} |")
    else:
        lines.append("No significant outliers detected.")
    lines.append("")

    # Chart suggestions
    lines.append("## Recommended Charts")
    for i, c in enumerate(analysis["chart_suggestions"], 1):
        lines.append(f"### Chart {i}: {c['title']}")
        lines.append(f"- **Type:** {c['chart_type']}")
        lines.append(f"- **X-axis:** {c['x_axis']}")
        lines.append(f"- **Y-axis:** {c['y_axis']}")
        lines.append(f"- {c['description']}")
        lines.append("")

    # Key insights
    lines.append("## Key Insights")
    for insight in analysis["key_insights"]:
        lines.append(f"- {insight}")
    lines.append("")

    return "\n".join(lines)
```

## Step 6 — Wire Everything Together
```python
# dashboard.py (part 3 — append to the same file)
import json
import os
import sys

from report_generator import generate_report


def main():
    if len(sys.argv) < 2:
        print("Usage: python dashboard.py <path-to-csv>")
        sys.exit(1)

    csv_path = sys.argv[1]
    csv_name = os.path.basename(csv_path)

    # 1. Read
    df = read_csv(csv_path)

    # 2. Summarize
    summary = build_data_summary(df)

    # 3. Analyze
    print("Sending data to Claude for analysis...")
    analysis = analyze_with_claude(summary)

    # 4. Save JSON
    os.makedirs("output", exist_ok=True)
    json_path = "output/analysis.json"
    with open(json_path, "w") as f:
        json.dump(analysis, f, indent=2)
    print(f"Structured analysis saved to {json_path}")

    # 5. Generate report
    report = generate_report(analysis, csv_name)
    report_path = "output/report.md"
    with open(report_path, "w") as f:
        f.write(report)
    print(f"Markdown report saved to {report_path}")

    # 6. Print key insights to terminal
    print("\n=== KEY INSIGHTS ===")
    for insight in analysis["key_insights"]:
        print(f"  • {insight}")


if __name__ == "__main__":
    main()
```

## Running the Project
```bash
# Set your API key
export ANTHROPIC_API_KEY="sk-..."

# Run with any CSV
python dashboard.py sample_data.csv
```

### Expected output
```text
Loaded 1200 rows, 8 columns from sample_data.csv
Sending data to Claude for analysis...
Structured analysis saved to output/analysis.json
Markdown report saved to output/report.md

=== KEY INSIGHTS ===
  • Revenue grew 23% quarter-over-quarter driven by the Enterprise segment.
  • Customer churn spiked in March — investigate support ticket volume.
  • The "price" column has 3 outliers above $10,000 that may be data entry errors.
```
## Step 7 — Adding a Sample CSV for Testing
Create a quick test file so you can try the pipeline immediately:
```python
# generate_sample.py
import numpy as np
import pandas as pd

np.random.seed(42)
n = 500

data = {
    "date": pd.date_range("2024-01-01", periods=n, freq="D"),
    "revenue": np.random.normal(5000, 1200, n).round(2),
    "customers": np.random.poisson(150, n),
    "region": np.random.choice(["North", "South", "East", "West"], n),
    "product": np.random.choice(["Basic", "Pro", "Enterprise"], n, p=[0.5, 0.35, 0.15]),
    "satisfaction": np.clip(np.random.normal(4.0, 0.8, n), 1, 5).round(1),
}
df = pd.DataFrame(data)

# Inject a few outliers
df.loc[42, "revenue"] = 25000.00
df.loc[99, "revenue"] = -500.00
df.loc[200, "satisfaction"] = 1.0

df.to_csv("sample_data.csv", index=False)
print(f"Generated sample_data.csv with {len(df)} rows")
```

## Step 8 — Example JSON Output
Here is what `output/analysis.json` looks like (abbreviated):

```json
{
  "dataset_summary": {
    "row_count": 500,
    "column_count": 6,
    "columns": [
      {
        "name": "date",
        "dtype": "datetime",
        "missing_pct": 0.0,
        "unique_values": 500
      },
      {
        "name": "revenue",
        "dtype": "float64",
        "missing_pct": 0.0,
        "unique_values": 498
      }
    ]
  },
  "statistics": [
    {
      "column": "revenue",
      "mean": 5032.14,
      "median": 4985.50,
      "std_dev": 1245.32,
      "min": -500.0,
      "max": 25000.0
    }
  ],
  "trends": [
    {
      "description": "Revenue shows a slight upward trend over the year",
      "confidence": "medium",
      "affected_columns": ["revenue", "date"]
    }
  ],
  "outliers": [
    {
      "column": "revenue",
      "value": 25000.0,
      "row_index": 42,
      "reason": "Value is 16+ standard deviations above the mean"
    },
    {
      "column": "revenue",
      "value": -500.0,
      "row_index": 99,
      "reason": "Negative revenue likely indicates a data entry error"
    }
  ],
  "chart_suggestions": [
    {
      "chart_type": "line",
      "title": "Revenue Over Time",
      "x_axis": "date",
      "y_axis": "revenue",
      "description": "A line chart showing daily revenue to visualize trends and seasonality."
    }
  ],
  "key_insights": [
    "Average daily revenue is approximately $5,032 with moderate variability.",
    "Two significant outliers in the revenue column require investigation.",
    "Customer satisfaction averages 4.0/5.0 across all regions."
  ]
}
```

## How It All Connects
```text
CSV File
    │
    ▼
┌──────────────────┐
│ read_csv()       │ ← pandas reads the file
│ build_summary()  │ ← text summary for Claude
└───────┬──────────┘
        │
        ▼
┌──────────────────────────────────────┐
│ Claude API (structured output)       │
│ - system prompt: "data analyst"      │
│ - response_format: JSON schema       │
│ - returns validated JSON             │
└───────┬──────────────────────────────┘
        │
        ├──► output/analysis.json
        │
        ▼
┌──────────────────────┐
│ generate_report()    │ ← turns JSON into Markdown
└───────┬──────────────┘
        │
        ├──► output/report.md
        │
        ▼
Terminal: key insights printed
```

## Extending the Project
| Enhancement | Description |
|---|---|
| Add charts | Use matplotlib or plotly to render the suggested charts |
| Multi-file | Accept a folder of CSVs and produce a combined report |
| Streaming | Use streaming to show analysis in real time |
| Vision | Convert a chart image to base64, send to Claude for description |
| Web UI | Wrap the pipeline in Flask or Streamlit for a browser dashboard |
## Common Mistakes to Avoid

- **Sending too much data** — Claude has a context window; summarize large CSVs instead of sending every row.
- **Ignoring schema validation** — always use structured outputs so downstream code never breaks on unexpected JSON shapes.
- **Hardcoding column names** — keep the pipeline generic so it works with any CSV.
- **Skipping the system prompt** — without a clear role ("senior data analyst") the analysis will be shallow.
## Recap
- You built a full Python pipeline: CSV in, JSON analysis + Markdown report out.
- Structured outputs guarantee the JSON matches your schema every time.
- The report generator turns machine-readable data into human-readable documents.
- This pattern (read → summarize → analyze → report) is reusable across domains.
You now have a production-ready foundation for AI-powered data analysis.