📊 Data Analysis & Data Science
CSV/JSON analysis, statistical summaries, and AI-powered insights
Data Analysis & Data Science with Claude
Data analysis is one of the most powerful applications of large language models. Claude can ingest raw datasets, compute statistical summaries, detect trends, identify anomalies, clean messy data, and generate human-readable insight reports — all through natural language conversation. This lesson covers every layer of that workflow, from loading a CSV file to producing a polished analysis deck.
1. Why Use Claude for Data Analysis?
Traditional data analysis requires proficiency in Python, R, SQL, or specialised BI tools. Claude removes that barrier. You describe what you want in plain English, and Claude writes the code, executes it, and explains the results — or performs the analysis directly in conversation when datasets are small enough.
Key advantages:
- Zero setup — no local environment, no package installation, no dependency conflicts.
- Natural language queries — ask questions the way you think, not the way a query language demands.
- Iterative exploration — follow-up questions refine the analysis without restarting.
- Automatic explanation — every result comes with context, not just raw numbers.
- Multi-format support — CSV, JSON, TSV, Excel exports, and even pasted tables.
2. Loading and Inspecting Data
The first step in any analysis is understanding the shape of your data. Claude can read files directly when using the code execution tool or accept pasted data in conversation.
2.1 Uploading a CSV File
When working with Claude's code execution capability, you can upload files and ask Claude to inspect them:
Prompt: "I have uploaded sales_2025.csv. Show me the first 10 rows, the column names,
data types, and the number of missing values per column."
Claude will generate and execute Python code similar to:
import pandas as pd

df = pd.read_csv("sales_2025.csv")
print("Shape:", df.shape)
print("\nColumn names:", list(df.columns))
print("\nData types:")
print(df.dtypes)
print("\nMissing values per column:")
print(df.isnull().sum())
print("\nFirst 10 rows:")
df.head(10)
2.2 Pasting JSON Data
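When you want to verify the numbers yourself, the profit-trend question in this section reduces to a few lines of pandas. A sketch using hypothetical figures in the shape of the JSON example below:

```python
import pandas as pd

# Hypothetical records mirroring the pasted JSON structure
records = [
    {"month": "Jan", "revenue": 12000, "expenses": 8500},
    {"month": "Feb", "revenue": 15000, "expenses": 9200},
    {"month": "Mar", "revenue": 14200, "expenses": 9100},
]
df = pd.DataFrame(records)
df["profit"] = df["revenue"] - df["expenses"]
df["profit_change"] = df["profit"].diff()  # month-over-month movement
print(df[["month", "profit", "profit_change"]])
```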
For smaller datasets you can paste JSON directly into the conversation:
Prompt: "Here is my dataset in JSON format:
[
{ "month": "Jan", "revenue": 12000, "expenses": 8500 },
{ "month": "Feb", "revenue": 15000, "expenses": 9200 },
...
]
Summarise the monthly profit trend."
2.3 Handling Large Files
For files that exceed context limits:
- Chunked analysis — ask Claude to process the file in batches.
- Sampling — request a random sample for exploratory analysis first.
- Schema-first approach — send only the schema and ask Claude to write a complete analysis script you run locally.
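The chunked approach can be sketched with pandas' chunksize parameter. Here an in-memory buffer stands in for a file too large to load at once:

```python
import io
import pandas as pd

# Stand-in for a large CSV on disk (a StringIO buffer for illustration)
big_csv = io.StringIO("revenue\n" + "\n".join(str(v) for v in range(1, 1001)))

# Process the file in batches instead of loading it all at once
total, count = 0.0, 0
for chunk in pd.read_csv(big_csv, chunksize=250):
    total += chunk["revenue"].sum()
    count += len(chunk)
print(f"Rows: {count}, mean revenue: {total / count:.2f}")
```

Each chunk is an ordinary DataFrame, so any per-batch aggregation works; only the running totals need to fit in memory.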
3. Statistical Summaries
Claude can produce comprehensive statistical summaries that go beyond simple describe() output.
3.1 Descriptive Statistics
Prompt: "For each numeric column in the dataset, compute the mean, median, standard
deviation, skewness, kurtosis, and the 5th/95th percentiles. Present the results in a
markdown table."
Claude generates:
import pandas as pd
from scipy import stats

df = pd.read_csv("sales_2025.csv")
numeric_cols = df.select_dtypes(include="number").columns
summary = []
for col in numeric_cols:
    data = df[col].dropna()
    summary.append({
        "Column": col,
        "Mean": round(data.mean(), 2),
        "Median": round(data.median(), 2),
        "Std Dev": round(data.std(), 2),
        "Skewness": round(stats.skew(data), 2),
        "Kurtosis": round(stats.kurtosis(data), 2),
        "P5": round(data.quantile(0.05), 2),
        "P95": round(data.quantile(0.95), 2),
    })
pd.DataFrame(summary)
3.2 Correlation Analysis
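For the correlation prompt below, Claude typically produces something along these lines. This sketch substitutes a small hypothetical DataFrame for sales_2025.csv:

```python
import pandas as pd

# Hypothetical numeric columns standing in for the real dataset
df = pd.DataFrame({
    "revenue": [100, 120, 140, 160, 180, 175],
    "orders":  [10, 12, 14, 15, 18, 17],
    "returns": [5, 3, 6, 2, 4, 5],
})
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

# Flag strongly correlated pairs (|r| > 0.7), skipping self-correlations
strong = [
    (a, b, round(pearson.loc[a, b], 2))
    for i, a in enumerate(pearson.columns)
    for b in pearson.columns[i + 1:]
    if abs(pearson.loc[a, b]) > 0.7
]
print(strong)
```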
Prompt: "Compute the Pearson and Spearman correlation matrices for all numeric columns.
Highlight any pair with |r| > 0.7."
3.3 Distribution Analysis
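A sketch of the kind of code a normality request like the prompt below produces. The lognormal sample here is a hypothetical stand-in for a right-skewed revenue column:

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed "revenue" sample
rng = np.random.default_rng(42)
revenue = rng.lognormal(mean=10, sigma=0.8, size=200)

stat, p = stats.shapiro(revenue)
print(f"Shapiro-Wilk: W={stat:.3f}, p={p:.4f}")
if p < 0.05:
    # A log transform often normalises right-skewed revenue data
    log_stat, log_p = stats.shapiro(np.log(revenue))
    print(f"After log transform: W={log_stat:.3f}, p={log_p:.4f}")
```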
Prompt: "For the 'revenue' column, test for normality using the Shapiro-Wilk test,
describe the distribution shape, and suggest an appropriate transformation if it is
significantly non-normal."
4. Trend Detection
Identifying trends is where Claude truly shines — it can combine quantitative analysis with qualitative interpretation.
4.1 Time-Series Trends
import numpy as np
import pandas as pd

df = pd.read_csv("sales_2025.csv", parse_dates=["date"])
df = df.sort_values("date")
# Rolling averages
df["revenue_7d"] = df["revenue"].rolling(7).mean()
df["revenue_30d"] = df["revenue"].rolling(30).mean()
# Linear trend
x = np.arange(len(df))
slope, intercept = np.polyfit(x, df["revenue"].values, 1)
print(f"Linear trend: slope = {slope:.2f} per day")
print(f"Projected annual change: {slope * 365:.2f}")
4.2 Seasonal Decomposition
Prompt: "Decompose the revenue time series into trend, seasonal, and residual components.
Use an additive model with a period of 7 days. Explain each component."
Claude will use statsmodels:
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(df.set_index("date")["revenue"], model="additive", period=7)
print("Trend (last 5):", result.trend.dropna().tail())
print("Seasonal pattern:", result.seasonal.head(7).values)
print("Residual std:", result.resid.dropna().std())
4.3 Growth Rate Analysis
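The growth-rate prompt below maps naturally onto pandas' pct_change. A sketch with hypothetical weekly figures:

```python
import pandas as pd

# Hypothetical weekly revenue totals
weekly_revenue = pd.Series([100.0, 112.0, 98.0, 130.0, 85.0])
growth = weekly_revenue.pct_change() * 100  # week-over-week % change
flagged = growth[(growth < -10) | (growth > 30)]
print(growth.round(1).tolist())
print("Flagged periods:", flagged.round(1).to_dict())
```

Month-over-month is the same idea after resampling to monthly totals (e.g. `df.resample("ME", on="date")["revenue"].sum().pct_change()`).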
Prompt: "Calculate week-over-week and month-over-month growth rates for revenue. Flag any
period where growth dropped below -10% or exceeded +30%."
5. Anomaly Identification
Anomaly detection is critical for data quality and business monitoring.
5.1 Statistical Methods
import pandas as pd

df = pd.read_csv("sales_2025.csv")
# Z-score method
mean = df["revenue"].mean()
std = df["revenue"].std()
df["z_score"] = (df["revenue"] - mean) / std
anomalies_z = df[df["z_score"].abs() > 3]
# IQR method
Q1 = df["revenue"].quantile(0.25)
Q3 = df["revenue"].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
anomalies_iqr = df[(df["revenue"] < lower) | (df["revenue"] > upper)]
print(f"Z-score anomalies: {len(anomalies_z)}")
print(f"IQR anomalies: {len(anomalies_iqr)}")
5.2 Contextual Anomalies
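For prompts like the one below, the z-score is computed within each day-of-week group rather than globally. A sketch with synthetic data, where a planted 50k Sunday stands in for a contextual anomaly:

```python
import numpy as np
import pandas as pd

# Synthetic daily revenue: weekdays around 50k, Sundays around 20k
rng = np.random.default_rng(0)
df = pd.DataFrame({"date": pd.date_range("2025-01-01", periods=140, freq="D")})
df["dow"] = df["date"].dt.day_name()
base = np.where(df["dow"] == "Sunday", 20000, 50000)
df["revenue"] = base + rng.normal(0, 2000, len(df))
df.loc[df[df["dow"] == "Sunday"].index[0], "revenue"] = 50000  # plant an anomaly

# Z-score within each day-of-week group
grp = df.groupby("dow")["revenue"]
df["group_z"] = (df["revenue"] - grp.transform("mean")) / grp.transform("std")
anomalies = df[df["group_z"].abs() > 2]
print(anomalies[["date", "dow", "revenue"]])
```

The planted Sunday is far inside the global revenue range, so a global z-score would miss it; the grouped version catches it.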
Prompt: "Identify revenue anomalies that are unusual for their specific day of week.
A Monday with 50k revenue might be normal, but a Sunday with the same amount is an
anomaly. Group by day of week and flag values outside 2 standard deviations of their
group mean."
5.3 Multi-Dimensional Anomalies
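A sketch of the Isolation Forest approach requested below, using scikit-learn and synthetic data with one deliberately implausible record:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic records; row 0 is planted with a jointly-implausible combination
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "revenue": rng.normal(50000, 5000, n),
    "order_count": rng.normal(400, 40, n),
})
df["avg_order_value"] = df["revenue"] / df["order_count"]
df.loc[0] = [50000, 40, 1250]  # plausible revenue, implausible order count

model = IsolationForest(contamination=0.01, random_state=0)
df["flag"] = model.fit_predict(df[["revenue", "order_count", "avg_order_value"]])
print(df[df["flag"] == -1])
```

Note that row 0's revenue is perfectly ordinary on its own; only the combination of features exposes it, which is exactly what the multi-dimensional method buys you.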
Prompt: "Find records where the combination of revenue, order_count, and avg_order_value
is anomalous. Use the Isolation Forest algorithm and explain which records were flagged
and why."
6. Data Cleaning
Real-world data is messy. Claude can automate cleaning pipelines.
6.1 Missing Value Strategy
import pandas as pd

df = pd.read_csv("sales_2025.csv")
# Analyse missing patterns
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
print("Missing value percentages:")
print(missing_pct[missing_pct > 0])
# Strategy per column
# Numeric: median imputation for skewed, mean for normal
# Categorical: mode imputation if < 5% missing, "Unknown" otherwise
# Date: forward fill for time series
6.2 Duplicate Detection
Prompt: "Find duplicate rows. Also find near-duplicates where the customer name differs
by only 1-2 characters (possible typos) but the email and order amount are the same."
from difflib import SequenceMatcher
import pandas as pd

df = pd.read_csv("sales_2025.csv")
# Exact duplicates
exact_dupes = df[df.duplicated(keep=False)]
print(f"Exact duplicates: {len(exact_dupes)}")
# Near-duplicate detection
def similar(a, b):
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()

near_dupes = []
for i in range(len(df)):
    for j in range(i + 1, len(df)):
        if (df.iloc[i]["email"] == df.iloc[j]["email"] and
                df.iloc[i]["amount"] == df.iloc[j]["amount"] and
                similar(df.iloc[i]["customer"], df.iloc[j]["customer"]) > 0.85):
            near_dupes.append((i, j))
print(f"Near-duplicates found: {len(near_dupes)}")
6.3 Data Type Correction and Standardisation
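Cleaning requests like the prompt below usually come back as vectorised pandas string operations. This sketch covers the date and currency parts; the country mapping is a hypothetical two-entry table (a library such as pycountry would generalise it), and phone normalisation is omitted:

```python
import pandas as pd

# Hypothetical messy records
df = pd.DataFrame({
    "order_date": ["2025-03-01", "2025-03-02"],
    "amount": ["$1,234.56", "$89.00"],
    "country": ["United States", "Germany"],
})

# Date strings -> datetime
df["order_date"] = pd.to_datetime(df["order_date"])
# Currency strings -> floats (strip '$' and thousands separators)
df["amount"] = df["amount"].str.replace(r"[$,]", "", regex=True).astype(float)
# Country names -> ISO 3166-1 alpha-2 codes via an explicit mapping
iso_map = {"United States": "US", "Germany": "DE"}
df["country"] = df["country"].map(iso_map)
print(df.dtypes)
```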
Prompt: "Clean the dataset: convert date strings to datetime, parse currency strings like
'$1,234.56' into floats, standardise country names to ISO 3166-1 alpha-2 codes, and
normalise phone numbers to E.164 format."
7. The Code Execution Tool
Claude's code execution tool (also called the analysis tool) lets Claude write and run Python code within a sandboxed environment. This is the backbone of serious data analysis.
7.1 How It Works
- You upload a file or paste data.
- You ask a question in natural language.
- Claude writes Python code to answer your question.
- The code runs in a sandbox with common libraries pre-installed.
- Claude reads the output and provides a human-friendly explanation.
7.2 Available Libraries
The sandbox typically includes:
- pandas — data manipulation and analysis.
- numpy — numerical computing.
- scipy — scientific computing and statistics.
- matplotlib / seaborn — visualisation (images returned inline).
- scikit-learn — machine learning models.
- statsmodels — statistical modelling and tests.
7.3 Iterative Workflows
The real power is iteration. Each execution builds on previous results:
You: "Load the dataset and show summary statistics."
Claude: [executes code, shows results]
You: "The 'price' column has negative values. Remove those rows."
Claude: [executes cleaning code]
You: "Now run a linear regression of price vs. quantity."
Claude: [executes regression, shows coefficients and R-squared]
You: "Plot the residuals."
Claude: [generates and displays chart]
8. Natural Language Queries
One of Claude's most powerful features is translating plain English questions into precise analytical operations.
8.1 Simple Queries
"What was the best-selling product last quarter?"
"Which region had the highest growth rate?"
"How many customers made more than 3 purchases?"
"What percentage of orders were returned?"
8.2 Complex Queries
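A complex query like the one below typically translates into a groupby-aggregate pipeline. A sketch with a hypothetical orders table and a fixed "today":

```python
import pandas as pd

# Hypothetical orders table
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_date": pd.to_datetime([
        "2025-01-05", "2025-06-20", "2024-08-01",
        "2025-02-10", "2025-04-01", "2025-06-25",
    ]),
    "amount": [120.0, 80.0, 500.0, 60.0, 90.0, 75.0],
})
today = pd.Timestamp("2025-07-01")

per_customer = orders.groupby("customer_id").agg(
    first_purchase=("order_date", "min"),
    last_purchase=("order_date", "max"),
    total_spend=("amount", "sum"),
    order_count=("order_date", "count"),
)
# Exclude customers inactive for more than 90 days, rank by lifetime spend
active = per_customer[(today - per_customer["last_purchase"]).dt.days <= 90]
top = active.sort_values("total_spend", ascending=False).head(10)
print(top)
```

Customer 2 spent the most overall but is filtered out by the 90-day recency rule, which is the subtlety the natural-language phrasing hides.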
"Show me the top 10 customers by lifetime value, excluding anyone who has not purchased
in the last 90 days. Include their first purchase date, total spend, and average order
frequency."
8.3 Comparative Queries
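For the comparative prompt below, the significance test is scipy's two-sample t-test. A sketch with synthetic daily revenue for the two quarters:

```python
import numpy as np
from scipy import stats

# Synthetic daily revenue for two quarters (Q2 deliberately higher)
rng = np.random.default_rng(7)
q1_daily_revenue = rng.normal(1000, 150, 90)
q2_daily_revenue = rng.normal(1300, 150, 91)

t_stat, p_value = stats.ttest_ind(q1_daily_revenue, q2_daily_revenue)
change = q2_daily_revenue.mean() / q1_daily_revenue.mean() - 1
print(f"Revenue change: {change:+.1%}, t = {t_stat:.2f}, p = {p_value:.4f}")
```

As the best-practices section notes, ask Claude to check the test's assumptions (independence, roughly equal variances) before trusting the p-value.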
"Compare Q1 and Q2 performance across all product categories. For each category, show
revenue change, order count change, and whether the difference is statistically
significant (use a two-sample t-test)."
9. Insights Reports
Claude can generate polished, stakeholder-ready reports from raw data.
9.1 Executive Summary
Prompt: "Analyse the sales dataset and produce an executive summary covering:
1. Overall performance vs. last period
2. Top 3 positive trends
3. Top 3 areas of concern
4. Recommended actions
Keep it under 500 words and use a professional tone."
9.2 Structured Report Template
report_template = {
"title": "Monthly Sales Analysis — March 2025",
"sections": [
{
"heading": "Key Metrics",
"metrics": [
{"name": "Total Revenue", "value": "$2.4M", "change": "+12%"},
{"name": "Order Count", "value": "18,432", "change": "+8%"},
{"name": "Avg Order Value", "value": "$130.22", "change": "+3.7%"},
{"name": "Customer Retention", "value": "68%", "change": "-2%"},
],
},
{
"heading": "Segment Analysis",
"content": "Enterprise segment grew 22% while SMB declined 4%..."
},
{
"heading": "Anomalies",
"items": [
"Spike in returns on March 14 (3x normal rate)",
"Region APAC underperformed forecast by 18%",
],
},
{
"heading": "Recommendations",
"items": [
"Investigate March 14 return spike — likely linked to batch #4421",
"Increase marketing spend in APAC to recover momentum",
],
},
],
}
9.3 Automated Weekly Reports
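The core of the weekly-report prompt below, a KPI comparison rendered as markdown, can be sketched as follows. Email delivery (e.g. via the standard smtplib module), error handling, and logging are omitted, and the KPI figures are hypothetical:

```python
# Hypothetical KPIs for the current and previous week
this_week = {"revenue": 52000.0, "orders": 410, "aov": 126.8}
last_week = {"revenue": 48000.0, "orders": 402, "aov": 119.4}

# Build a markdown table of each KPI and its week-over-week change
lines = ["# Weekly Sales Report", "", "| KPI | This week | Change |", "|---|---|---|"]
for kpi, value in this_week.items():
    change = (value - last_week[kpi]) / last_week[kpi] * 100
    lines.append(f"| {kpi} | {value:,} | {change:+.1f}% |")
report = "\n".join(lines)
print(report)
```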
Prompt: "Write a Python script that reads the latest week of sales data, computes KPIs,
compares them with the previous week, generates a markdown report, and sends it via
email using SMTP. Include error handling and logging."
10. Visualisation Descriptions
When working in text-only mode or when charts are not available, Claude can produce detailed descriptions of what a visualisation would look like, or generate the code to create one.
10.1 Chart Recommendations
Prompt: "I have monthly revenue data for 5 product lines over 2 years. What chart types
would best show: (a) overall trend, (b) composition, (c) comparison between lines?"
Claude recommends:
- (a) Overall trend — Line chart with one line per product line, x-axis = month.
- (b) Composition — Stacked area chart showing each line's share of total revenue.
- (c) Comparison — Grouped bar chart with months on x-axis, bars grouped by product.
10.2 Generating Chart Code
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("sales_2025.csv", parse_dates=["date"])
monthly = df.groupby([df["date"].dt.to_period("M"), "product_line"])["revenue"].sum()
monthly = monthly.unstack(fill_value=0)
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Trend
monthly.plot(ax=axes[0], marker="o")
axes[0].set_title("Revenue Trend by Product Line")
axes[0].set_ylabel("Revenue ($)")
# Composition
monthly.plot.area(ax=axes[1], stacked=True, alpha=0.7)
axes[1].set_title("Revenue Composition")
# Comparison (last 6 months)
monthly.tail(6).plot.bar(ax=axes[2])
axes[2].set_title("Last 6 Months Comparison")
axes[2].tick_params(axis="x", rotation=45)
plt.tight_layout()
plt.savefig("analysis_charts.png", dpi=150)
plt.show()
10.3 Accessible Descriptions
Prompt: "Describe the chart in detail for a visually impaired colleague. Include the
overall pattern, notable data points, and the key takeaway."
11. Complete Example — Analysing a Sales Dataset
Let us walk through a full analysis workflow using a hypothetical sales dataset.
Step 1: Load and Inspect
You: "I uploaded sales_data.csv. It has columns: date, region, product, quantity, price,
customer_id, and channel. Give me an overview."
Claude responds with shape, dtypes, missing values, and sample rows.
Step 2: Clean
You: "Clean the data. Remove rows with negative prices. Fill missing regions using the
customer_id mapping from the majority of their orders. Convert dates."
Step 3: Explore
You: "Show revenue by region and channel. Which combination is strongest?"
Step 4: Analyse Trends
You: "Plot monthly revenue trend. Is there a seasonal pattern? What is the overall
growth rate?"
Step 5: Detect Anomalies
You: "Flag any days where revenue was more than 2 standard deviations from the 30-day
rolling mean. What happened on those days?"
Step 6: Deep Dive
You: "Run a cohort analysis on customer_id. Show retention rates by month of first
purchase. Which cohort has the best 90-day retention?"
Step 7: Generate Report
You: "Produce a final analysis report with: executive summary, key metrics, trends,
anomalies, customer insights, and three actionable recommendations."
12. Best Practices
- Start with questions, not techniques — define what you need to know before diving into code.
- Validate early — check data types, nulls, and row counts before running analysis.
- Use sampling for exploration — work with 10% of the data to iterate faster, then run on the full set.
- Ask Claude to explain assumptions — every statistical test has assumptions. Ask Claude to verify them.
- Save intermediate results — export cleaned datasets so you do not repeat work.
- Cross-check surprising results — if a number seems too good or too bad, ask Claude to verify with an alternative method.
- Version your prompts — keep a log of the prompts that produced useful outputs for reuse.
Summary
Claude transforms data analysis from a specialist skill into a conversational workflow. By combining natural language understanding with code execution, it covers the full pipeline: ingestion, cleaning, exploration, statistical analysis, anomaly detection, trend identification, and report generation. The key is to be specific in your prompts, validate results at each step, and iterate toward deeper insights.