# 📈 Project: AI Data Dashboard

Analyze CSV data with Claude's structured outputs to generate insights.

## Overview
In this project you will build a complete AI-powered data analysis pipeline in Python. The pipeline reads any CSV file, sends the data to Claude, and produces a structured JSON analysis plus a human-readable Markdown report.
By the end you will have a reusable tool that can:
- Detect column types automatically
- Identify trends, outliers, and statistical summaries
- Generate chart descriptions (even without a charting library)
- Output a polished Markdown report ready for stakeholders
## Prerequisites

| Requirement | Why |
|---|---|
| Python 3.10+ | async / structural pattern matching |
| anthropic SDK | Claude API access |
| pandas | CSV reading and quick stats |
| An Anthropic API key | Set as ANTHROPIC_API_KEY |
Install the dependencies:

```bash
pip install anthropic pandas
```

## Step 1 — Project Structure
Create a folder and the files we need:

```text
ai-data-dashboard/
├── dashboard.py         # Main script
├── analysis_schema.py   # Pydantic-style JSON schema
├── report_generator.py  # Markdown report builder
├── sample_data.csv      # Any CSV for testing
└── output/
    ├── analysis.json
    └── report.md
```
## Step 2 — Define the Structured Output Schema
We want Claude to return a predictable JSON structure so the rest of the code can consume it without guessing.
```python
# analysis_schema.py
ANALYSIS_SCHEMA = {
    "type": "object",
    "properties": {
        "dataset_summary": {
            "type": "object",
            "properties": {
                "row_count": {"type": "integer"},
                "column_count": {"type": "integer"},
                "columns": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "name": {"type": "string"},
                            "dtype": {"type": "string"},
                            "missing_pct": {"type": "number"},
                            "unique_values": {"type": "integer"}
                        },
                        "required": ["name", "dtype", "missing_pct", "unique_values"]
                    }
                }
            },
            "required": ["row_count", "column_count", "columns"]
        },
        "statistics": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "column": {"type": "string"},
                    "mean": {"type": "number"},
                    "median": {"type": "number"},
                    "std_dev": {"type": "number"},
                    "min": {"type": "number"},
                    "max": {"type": "number"}
                },
                "required": ["column", "mean", "median", "std_dev", "min", "max"]
            }
        },
        "trends": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "confidence": {"type": "string"},
                    "affected_columns": {
                        "type": "array",
                        "items": {"type": "string"}
                    }
                },
                "required": ["description", "confidence", "affected_columns"]
            }
        },
        "outliers": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "column": {"type": "string"},
                    "value": {"type": "number"},
                    "row_index": {"type": "integer"},
                    "reason": {"type": "string"}
                },
                "required": ["column", "value", "row_index", "reason"]
            }
        },
        "chart_suggestions": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "chart_type": {"type": "string"},
                    "title": {"type": "string"},
                    "x_axis": {"type": "string"},
                    "y_axis": {"type": "string"},
                    "description": {"type": "string"}
                },
                "required": ["chart_type", "title", "x_axis", "y_axis", "description"]
            }
        },
        "key_insights": {
            "type": "array",
            "items": {"type": "string"}
        }
    },
    "required": [
        "dataset_summary",
        "statistics",
        "trends",
        "outliers",
        "chart_suggestions",
        "key_insights"
    ]
}
```

### Why a schema?
When you pass this schema to Claude via the structured output feature, the model is forced to return valid JSON that matches every required field. No manual parsing or regex needed.
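Even so, it costs nothing to verify the guarantee where the JSON crosses a module boundary. Here is a minimal sketch of a downstream sanity check; the helper name `check_required` is my own, not part of the tutorial's code:

```python
# Names of the top-level sections the schema marks as required.
REQUIRED_SECTIONS = [
    "dataset_summary", "statistics", "trends",
    "outliers", "chart_suggestions", "key_insights",
]

def check_required(analysis: dict) -> list[str]:
    """Return the names of any required top-level sections that are missing."""
    return [key for key in REQUIRED_SECTIONS if key not in analysis]
```

Calling `check_required(analysis)` right after parsing gives you an empty list on success and a readable error message otherwise.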
## Step 3 — Read the CSV and Build a Data Summary

```python
# dashboard.py (part 1)
import pandas as pd

from analysis_schema import ANALYSIS_SCHEMA


def read_csv(path: str) -> pd.DataFrame:
    """Read a CSV and return a DataFrame."""
    df = pd.read_csv(path)
    print(f"Loaded {len(df)} rows, {len(df.columns)} columns from {path}")
    return df


def build_data_summary(df: pd.DataFrame) -> str:
    """Create a text summary Claude can consume."""
    lines = []
    lines.append(f"Dataset: {len(df)} rows x {len(df.columns)} columns")
    lines.append("")
    lines.append("Column info:")
    for col in df.columns:
        dtype = str(df[col].dtype)
        missing = df[col].isna().sum()
        unique = df[col].nunique()
        lines.append(f"  - {col}: type={dtype}, missing={missing}, unique={unique}")
    lines.append("")
    lines.append("First 5 rows (CSV):")
    lines.append(df.head().to_csv(index=False))
    lines.append("")
    lines.append("Descriptive statistics:")
    lines.append(df.describe(include="all").to_string())
    return "\n".join(lines)
```

### Key decisions
- We send the first 5 rows so Claude understands the shape of the data.
- We include `describe()` output so the model has raw stats to reference.
- Keeping the payload text-based avoids token-heavy base64 images.
## Step 4 — Call Claude with Structured Outputs
```python
# dashboard.py (part 2)
import json

import anthropic


def analyze_with_claude(data_summary: str) -> dict:
    """Send the data summary to Claude and get structured analysis."""
    client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY env var

    system_prompt = """You are a senior data analyst AI.
You will receive a summary of a CSV dataset including column metadata,
sample rows, and descriptive statistics.

Your task:
1. Summarize the dataset structure.
2. Compute or confirm key statistics for every numeric column.
3. Identify trends (time-based, correlations, patterns).
4. Flag outliers with a clear reason.
5. Suggest charts that would best visualize the data.
6. Provide 3-5 key insights a business user would care about.

Be precise. Use the numbers from the provided statistics.
If you are unsure about a trend, say so in the confidence field."""

    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system=system_prompt,
        messages=[
            {
                "role": "user",
                "content": (
                    "Analyze this dataset and return a structured JSON analysis.\n\n"
                    + data_summary
                ),
            }
        ],
        # --- STRUCTURED OUTPUT ---
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "data_analysis",
                "schema": ANALYSIS_SCHEMA,
                "strict": True,
            },
        },
    )

    # The response text is guaranteed valid JSON matching our schema
    result = json.loads(message.content[0].text)
    return result
```

### How structured outputs work
- You pass `response_format` with your JSON schema.
- Claude constrains its generation so the output always matches the schema.
- `strict: True` means Claude will not add extra keys.
- You can safely call `json.loads()` without try/except for malformed JSON.
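If you ever run the same prompt without structured outputs enabled, the model may wrap its JSON in a markdown fence. A small defensive parser covers that case; this is a sketch, and the helper name `parse_analysis` is my own, not part of the tutorial's code:

```python
import json

def parse_analysis(raw: str) -> dict:
    """Parse a model response into a dict, tolerating stray markdown fences."""
    text = raw.strip()
    if text.startswith("```"):
        # Drop an opening fence like ``` or ```json, then a trailing fence.
        text = text.split("\n", 1)[1]
        text = text.rsplit("```", 1)[0]
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data
```

With structured outputs on, `parse_analysis` degenerates to a plain `json.loads` plus a type check, so it is cheap insurance either way.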
## Step 5 — Generate the Markdown Report
```python
# report_generator.py
from datetime import datetime


def generate_report(analysis: dict, csv_name: str) -> str:
    """Turn the structured analysis into a Markdown report."""
    lines = []
    now = datetime.now().strftime("%Y-%m-%d %H:%M")

    # Header
    lines.append("# Data Analysis Report")
    lines.append(f"**Source:** {csv_name}  ")
    lines.append(f"**Generated:** {now}  ")
    lines.append("")

    # Dataset summary
    ds = analysis["dataset_summary"]
    lines.append("## Dataset Overview")
    lines.append(f"- **Rows:** {ds['row_count']}")
    lines.append(f"- **Columns:** {ds['column_count']}")
    lines.append("")
    lines.append("| Column | Type | Missing % | Unique |")
    lines.append("|--------|------|-----------|--------|")
    for col in ds["columns"]:
        lines.append(
            f"| {col['name']} | {col['dtype']} | {col['missing_pct']:.1f}% | {col['unique_values']} |"
        )
    lines.append("")

    # Statistics
    lines.append("## Key Statistics")
    lines.append("| Column | Mean | Median | Std Dev | Min | Max |")
    lines.append("|--------|------|--------|---------|-----|-----|")
    for s in analysis["statistics"]:
        lines.append(
            f"| {s['column']} | {s['mean']:.2f} | {s['median']:.2f} "
            f"| {s['std_dev']:.2f} | {s['min']:.2f} | {s['max']:.2f} |"
        )
    lines.append("")

    # Trends
    lines.append("## Identified Trends")
    for t in analysis["trends"]:
        cols = ", ".join(t["affected_columns"])
        lines.append(f"- **{t['description']}**  ")
        lines.append(f"  Confidence: {t['confidence']} | Columns: {cols}")
    lines.append("")

    # Outliers
    lines.append("## Outliers Detected")
    if analysis["outliers"]:
        lines.append("| Column | Value | Row | Reason |")
        lines.append("|--------|-------|-----|--------|")
        for o in analysis["outliers"]:
            lines.append(f"| {o['column']} | {o['value']} | {o['row_index']} | {o['reason']} |")
    else:
        lines.append("No significant outliers detected.")
    lines.append("")

    # Chart suggestions
    lines.append("## Recommended Charts")
    for i, c in enumerate(analysis["chart_suggestions"], 1):
        lines.append(f"### Chart {i}: {c['title']}")
        lines.append(f"- **Type:** {c['chart_type']}")
        lines.append(f"- **X-axis:** {c['x_axis']}")
        lines.append(f"- **Y-axis:** {c['y_axis']}")
        lines.append(f"- {c['description']}")
        lines.append("")

    # Key insights
    lines.append("## Key Insights")
    for insight in analysis["key_insights"]:
        lines.append(f"- {insight}")
    lines.append("")

    return "\n".join(lines)
```

## Step 6 — Wire Everything Together
```python
# dashboard.py (part 3 — append to the same file)
import json
import os
import sys

from report_generator import generate_report


def main():
    if len(sys.argv) < 2:
        print("Usage: python dashboard.py <path-to-csv>")
        sys.exit(1)

    csv_path = sys.argv[1]
    csv_name = os.path.basename(csv_path)

    # 1. Read
    df = read_csv(csv_path)

    # 2. Summarize
    summary = build_data_summary(df)

    # 3. Analyze
    print("Sending data to Claude for analysis...")
    analysis = analyze_with_claude(summary)

    # 4. Save JSON
    os.makedirs("output", exist_ok=True)
    json_path = "output/analysis.json"
    with open(json_path, "w") as f:
        json.dump(analysis, f, indent=2)
    print(f"Structured analysis saved to {json_path}")

    # 5. Generate report
    report = generate_report(analysis, csv_name)
    report_path = "output/report.md"
    with open(report_path, "w") as f:
        f.write(report)
    print(f"Markdown report saved to {report_path}")

    # 6. Print key insights to terminal
    print("\n=== KEY INSIGHTS ===")
    for insight in analysis["key_insights"]:
        print(f"  • {insight}")


if __name__ == "__main__":
    main()
```

## Running the Project
```bash
# Set your API key
export ANTHROPIC_API_KEY="sk-..."

# Run with any CSV
python dashboard.py sample_data.csv
```

### Expected output
```text
Loaded 1200 rows, 8 columns from sample_data.csv
Sending data to Claude for analysis...
Structured analysis saved to output/analysis.json
Markdown report saved to output/report.md

=== KEY INSIGHTS ===
  • Revenue grew 23% quarter-over-quarter driven by the Enterprise segment.
  • Customer churn spiked in March — investigate support ticket volume.
  • The "price" column has 3 outliers above $10,000 that may be data entry errors.
```
## Step 7 — Adding a Sample CSV for Testing
Create a quick test file so you can try the pipeline immediately:
```python
# generate_sample.py
import numpy as np
import pandas as pd

np.random.seed(42)
n = 500

data = {
    "date": pd.date_range("2024-01-01", periods=n, freq="D"),
    "revenue": np.random.normal(5000, 1200, n).round(2),
    "customers": np.random.poisson(150, n),
    "region": np.random.choice(["North", "South", "East", "West"], n),
    "product": np.random.choice(["Basic", "Pro", "Enterprise"], n, p=[0.5, 0.35, 0.15]),
    "satisfaction": np.clip(np.random.normal(4.0, 0.8, n), 1, 5).round(1),
}
df = pd.DataFrame(data)

# Inject a few outliers
df.loc[42, "revenue"] = 25000.00
df.loc[99, "revenue"] = -500.00
df.loc[200, "satisfaction"] = 1.0

df.to_csv("sample_data.csv", index=False)
print(f"Generated sample_data.csv with {len(df)} rows")
```

## Step 8 — Example JSON Output
Here is what `output/analysis.json` looks like (abbreviated):

```json
{
  "dataset_summary": {
    "row_count": 500,
    "column_count": 6,
    "columns": [
      {
        "name": "date",
        "dtype": "datetime",
        "missing_pct": 0.0,
        "unique_values": 500
      },
      {
        "name": "revenue",
        "dtype": "float64",
        "missing_pct": 0.0,
        "unique_values": 498
      }
    ]
  },
  "statistics": [
    {
      "column": "revenue",
      "mean": 5032.14,
      "median": 4985.50,
      "std_dev": 1245.32,
      "min": -500.0,
      "max": 25000.0
    }
  ],
  "trends": [
    {
      "description": "Revenue shows a slight upward trend over the year",
      "confidence": "medium",
      "affected_columns": ["revenue", "date"]
    }
  ],
  "outliers": [
    {
      "column": "revenue",
      "value": 25000.0,
      "row_index": 42,
      "reason": "Value is 16+ standard deviations above the mean"
    },
    {
      "column": "revenue",
      "value": -500.0,
      "row_index": 99,
      "reason": "Negative revenue likely indicates a data entry error"
    }
  ],
  "chart_suggestions": [
    {
      "chart_type": "line",
      "title": "Revenue Over Time",
      "x_axis": "date",
      "y_axis": "revenue",
      "description": "A line chart showing daily revenue to visualize trends and seasonality."
    }
  ],
  "key_insights": [
    "Average daily revenue is approximately $5,032 with moderate variability.",
    "Two significant outliers in the revenue column require investigation.",
    "Customer satisfaction averages 4.0/5.0 across all regions."
  ]
}
```

## How It All Connects
```text
CSV File
    │
    ▼
┌──────────────────┐
│ read_csv()       │ ← pandas reads the file
│ build_summary()  │ ← text summary for Claude
└───────┬──────────┘
        │
        ▼
┌──────────────────────────────────────┐
│ Claude API (structured output)       │
│ - system prompt: "data analyst"      │
│ - response_format: JSON schema       │
│ - returns validated JSON             │
└───────┬──────────────────────────────┘
        │
        ├──► output/analysis.json
        │
        ▼
┌──────────────────────┐
│ generate_report()    │ ← turns JSON into Markdown
└───────┬──────────────┘
        │
        ├──► output/report.md
        │
        ▼
Terminal: key insights printed
```

## Extending the Project
| Enhancement | Description |
|---|---|
| Add charts | Use matplotlib or plotly to render the suggested charts |
| Multi-file | Accept a folder of CSVs and produce a combined report |
| Streaming | Use streaming to show analysis in real time |
| Vision | Convert a chart image to base64, send to Claude for description |
| Web UI | Wrap the pipeline in Flask or Streamlit for a browser dashboard |
## Common Mistakes to Avoid

- **Sending too much data** — Claude has a context window; summarize large CSVs instead of sending every row.
- **Ignoring schema validation** — always use structured outputs so downstream code never breaks on unexpected JSON shapes.
- **Hardcoding column names** — keep the pipeline generic so it works with any CSV.
- **Skipping the system prompt** — without a clear role ("senior data analyst") the analysis will be shallow.
## Recap
- You built a full Python pipeline: CSV in, JSON analysis + Markdown report out.
- Structured outputs guarantee the JSON matches your schema every time.
- The report generator turns machine-readable data into human-readable documents.
- This pattern (read → summarize → analyze → report) is reusable across domains.
You now have a production-ready foundation for AI-powered data analysis.