📊 Data Analysis & Data Science
CSV/JSON analysis, statistical summaries, and AI-powered insights
Data Analysis & Data Science with Claude
Data analysis is one of the most powerful applications of large language models. Claude can ingest raw datasets, compute statistical summaries, detect trends, identify anomalies, clean messy data, and generate human-readable insight reports — all through natural language conversation. This lesson covers every layer of that workflow, from loading a CSV file to producing a polished analysis deck.
1. Why Use Claude for Data Analysis?
Traditional data analysis requires proficiency in Python, R, SQL, or specialised BI tools. Claude removes that barrier. You describe what you want in plain English, and Claude writes the code, executes it, and explains the results — or performs the analysis directly in conversation when datasets are small enough.
Key advantages:
- Zero setup — no local environment, no package installation, no dependency conflicts.
- Natural language queries — ask questions the way you think, not the way a query language demands.
- Iterative exploration — follow-up questions refine the analysis without restarting.
- Automatic explanation — every result comes with context, not just raw numbers.
- Multi-format support — CSV, JSON, TSV, Excel exports, and even pasted tables.
2. Loading and Inspecting Data
The first step in any analysis is understanding the shape of your data. Claude can read files directly when using the code execution tool or accept pasted data in conversation.
2.1 Uploading a CSV File
When working with Claude's code execution capability, you can upload files and ask Claude to inspect them:
Prompt: "I have uploaded sales_2025.csv. Show me the first 10 rows, the column names,
data types, and the number of missing values per column."
Claude will generate and execute Python code similar to:
import pandas as pd

df = pd.read_csv("sales_2025.csv")
print("Shape:", df.shape)
print("\nColumn names:", list(df.columns))
print("\nData types:")
print(df.dtypes)
print("\nMissing values per column:")
print(df.isnull().sum())
print("\nFirst 10 rows:")
df.head(10)
2.2 Pasting JSON Data
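When you want to verify the numbers yourself, the profit-trend question in this section reduces to a few lines of pandas. A sketch using hypothetical figures in the shape of the JSON example below:

```python
import pandas as pd

# Hypothetical records mirroring the pasted JSON structure
records = [
    {"month": "Jan", "revenue": 12000, "expenses": 8500},
    {"month": "Feb", "revenue": 15000, "expenses": 9200},
    {"month": "Mar", "revenue": 14200, "expenses": 9100},
]
df = pd.DataFrame(records)
df["profit"] = df["revenue"] - df["expenses"]
df["profit_change"] = df["profit"].diff()  # month-over-month movement
print(df[["month", "profit", "profit_change"]])
```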
For smaller datasets you can paste JSON directly into the conversation:
Prompt: "Here is my dataset in JSON format:
[
{ "month": "Jan", "revenue": 12000, "expenses": 8500 },
{ "month": "Feb", "revenue": 15000, "expenses": 9200 },
...
]
Summarise the monthly profit trend."
2.3 Handling Large Files
For files that exceed context limits:
- Chunked analysis — ask Claude to process the file in batches.
- Sampling — request a random sample for exploratory analysis first.
- Schema-first approach — send only the schema and ask Claude to write a complete analysis script you run locally.
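The chunked approach can be sketched with pandas' chunksize parameter. Here an in-memory buffer stands in for a file too large to load at once:

```python
import io
import pandas as pd

# Stand-in for a large CSV on disk (a StringIO buffer for illustration)
big_csv = io.StringIO("revenue\n" + "\n".join(str(v) for v in range(1, 1001)))

# Process the file in batches instead of loading it all at once
total, count = 0.0, 0
for chunk in pd.read_csv(big_csv, chunksize=250):
    total += chunk["revenue"].sum()
    count += len(chunk)
print(f"Rows: {count}, mean revenue: {total / count:.2f}")
```

Each chunk is an ordinary DataFrame, so any per-batch aggregation works; only the running totals need to fit in memory.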
3. Statistical Summaries
Claude can produce comprehensive statistical summaries that go beyond simple describe() output.
3.1 Descriptive Statistics
Prompt: "For each numeric column in the dataset, compute the mean, median, standard
deviation, skewness, kurtosis, and the 5th/95th percentiles. Present the results in a
markdown table."
Claude generates:
import pandas as pd
from scipy import stats

df = pd.read_csv("sales_2025.csv")
numeric_cols = df.select_dtypes(include="number").columns
summary = []
for col in numeric_cols:
    data = df[col].dropna()
    summary.append({
        "Column": col,
        "Mean": round(data.mean(), 2),
        "Median": round(data.median(), 2),
        "Std Dev": round(data.std(), 2),
        "Skewness": round(stats.skew(data), 2),
        "Kurtosis": round(stats.kurtosis(data), 2),
        "P5": round(data.quantile(0.05), 2),
        "P95": round(data.quantile(0.95), 2),
    })
pd.DataFrame(summary)
3.2 Correlation Analysis
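For the correlation prompt below, Claude typically produces something along these lines. This sketch substitutes a small hypothetical DataFrame for sales_2025.csv:

```python
import pandas as pd

# Hypothetical numeric columns standing in for the real dataset
df = pd.DataFrame({
    "revenue": [100, 120, 140, 160, 180, 175],
    "orders":  [10, 12, 14, 15, 18, 17],
    "returns": [5, 3, 6, 2, 4, 5],
})
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

# Flag strongly correlated pairs (|r| > 0.7), skipping self-correlations
strong = [
    (a, b, round(pearson.loc[a, b], 2))
    for i, a in enumerate(pearson.columns)
    for b in pearson.columns[i + 1:]
    if abs(pearson.loc[a, b]) > 0.7
]
print(strong)
```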
Prompt: "Compute the Pearson and Spearman correlation matrices for all numeric columns.
Highlight any pair with |r| > 0.7."
3.3 Distribution Analysis
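A sketch of the kind of code a normality request like the prompt below produces. The lognormal sample here is a hypothetical stand-in for a right-skewed revenue column:

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed "revenue" sample
rng = np.random.default_rng(42)
revenue = rng.lognormal(mean=10, sigma=0.8, size=200)

stat, p = stats.shapiro(revenue)
print(f"Shapiro-Wilk: W={stat:.3f}, p={p:.4f}")
if p < 0.05:
    # A log transform often normalises right-skewed revenue data
    log_stat, log_p = stats.shapiro(np.log(revenue))
    print(f"After log transform: W={log_stat:.3f}, p={log_p:.4f}")
```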
Prompt: "For the 'revenue' column, test for normality using the Shapiro-Wilk test,
describe the distribution shape, and suggest an appropriate transformation if it is
significantly non-normal."
4. Trend Detection
Identifying trends is where Claude truly shines — it can combine quantitative analysis with qualitative interpretation.
4.1 Time-Series Trends
import numpy as np
import pandas as pd

df = pd.read_csv("sales_2025.csv", parse_dates=["date"])
df = df.sort_values("date")
# Rolling averages
df["revenue_7d"] = df["revenue"].rolling(7).mean()
df["revenue_30d"] = df["revenue"].rolling(30).mean()
# Linear trend
x = np.arange(len(df))
slope, intercept = np.polyfit(x, df["revenue"].values, 1)
print(f"Linear trend: slope = {slope:.2f} per day")
print(f"Projected annual change: {slope * 365:.2f}")
4.2 Seasonal Decomposition
Prompt: "Decompose the revenue time series into trend, seasonal, and residual components.
Use an additive model with a period of 7 days. Explain each component."
Claude will use statsmodels:
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(df.set_index("date")["revenue"], model="additive", period=7)
print("Trend (last 5):", result.trend.dropna().tail())
print("Seasonal pattern:", result.seasonal.head(7).values)
print("Residual std:", result.resid.dropna().std())
4.3 Growth Rate Analysis
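The growth-rate prompt below maps naturally onto pandas' pct_change. A sketch with hypothetical weekly figures:

```python
import pandas as pd

# Hypothetical weekly revenue totals
weekly_revenue = pd.Series([100.0, 112.0, 98.0, 130.0, 85.0])
growth = weekly_revenue.pct_change() * 100  # week-over-week % change
flagged = growth[(growth < -10) | (growth > 30)]
print(growth.round(1).tolist())
print("Flagged periods:", flagged.round(1).to_dict())
```

Month-over-month is the same idea after resampling to monthly totals (e.g. `df.resample("ME", on="date")["revenue"].sum().pct_change()`).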
Prompt: "Calculate week-over-week and month-over-month growth rates for revenue. Flag any
period where growth dropped below -10% or exceeded +30%."
5. Anomaly Identification
Anomaly detection is critical for data quality and business monitoring.
5.1 Statistical Methods
import pandas as pd

df = pd.read_csv("sales_2025.csv")
# Z-score method
mean = df["revenue"].mean()
std = df["revenue"].std()
df["z_score"] = (df["revenue"] - mean) / std
anomalies_z = df[df["z_score"].abs() > 3]
# IQR method
Q1 = df["revenue"].quantile(0.25)
Q3 = df["revenue"].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
anomalies_iqr = df[(df["revenue"] < lower) | (df["revenue"] > upper)]
print(f"Z-score anomalies: {len(anomalies_z)}")
print(f"IQR anomalies: {len(anomalies_iqr)}")
5.2 Contextual Anomalies
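For prompts like the one below, the z-score is computed within each day-of-week group rather than globally. A sketch with synthetic data, where a planted 50k Sunday stands in for a contextual anomaly:

```python
import numpy as np
import pandas as pd

# Synthetic daily revenue: weekdays around 50k, Sundays around 20k
rng = np.random.default_rng(0)
df = pd.DataFrame({"date": pd.date_range("2025-01-01", periods=140, freq="D")})
df["dow"] = df["date"].dt.day_name()
base = np.where(df["dow"] == "Sunday", 20000, 50000)
df["revenue"] = base + rng.normal(0, 2000, len(df))
df.loc[df[df["dow"] == "Sunday"].index[0], "revenue"] = 50000  # plant an anomaly

# Z-score within each day-of-week group
grp = df.groupby("dow")["revenue"]
df["group_z"] = (df["revenue"] - grp.transform("mean")) / grp.transform("std")
anomalies = df[df["group_z"].abs() > 2]
print(anomalies[["date", "dow", "revenue"]])
```

The planted Sunday is far inside the global revenue range, so a global z-score would miss it; the grouped version catches it.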
Prompt: "Identify revenue anomalies that are unusual for their specific day of week.
A Monday with 50k revenue might be normal, but a Sunday with the same amount is an
anomaly. Group by day of week and flag values outside 2 standard deviations of their
group mean."
5.3 Multi-Dimensional Anomalies
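A sketch of the Isolation Forest approach requested below, using scikit-learn and synthetic data with one deliberately implausible record:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic records; row 0 is planted with a jointly-implausible combination
rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "revenue": rng.normal(50000, 5000, n),
    "order_count": rng.normal(400, 40, n),
})
df["avg_order_value"] = df["revenue"] / df["order_count"]
df.loc[0] = [50000, 40, 1250]  # plausible revenue, implausible order count

model = IsolationForest(contamination=0.01, random_state=0)
df["flag"] = model.fit_predict(df[["revenue", "order_count", "avg_order_value"]])
print(df[df["flag"] == -1])
```

Note that row 0's revenue is perfectly ordinary on its own; only the combination of features exposes it, which is exactly what the multi-dimensional method buys you.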
Prompt: "Find records where the combination of revenue, order_count, and avg_order_value
is anomalous. Use the Isolation Forest algorithm and explain which records were flagged
and why."
6. Data Cleaning
Real-world data is messy. Claude can automate cleaning pipelines.
6.1 Missing Value Strategy
import pandas as pd

df = pd.read_csv("sales_2025.csv")
# Analyse missing patterns
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
print("Missing value percentages:")
print(missing_pct[missing_pct > 0])
# Strategy per column
# Numeric: median imputation for skewed, mean for normal
# Categorical: mode imputation if < 5% missing, "Unknown" otherwise
# Date: forward fill for time series
6.2 Duplicate Detection
Prompt: "Find duplicate rows. Also find near-duplicates where the customer name differs
by only 1-2 characters (possible typos) but the email and order amount are the same."
from difflib import SequenceMatcher
import pandas as pd

df = pd.read_csv("sales_2025.csv")
# Exact duplicates
exact_dupes = df[df.duplicated(keep=False)]
print(f"Exact duplicates: {len(exact_dupes)}")
# Near-duplicate detection
def similar(a, b):
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()

near_dupes = []
for i in range(len(df)):
    for j in range(i + 1, len(df)):
        if (df.iloc[i]["email"] == df.iloc[j]["email"] and
                df.iloc[i]["amount"] == df.iloc[j]["amount"] and
                similar(df.iloc[i]["customer"], df.iloc[j]["customer"]) > 0.85):
            near_dupes.append((i, j))
print(f"Near-duplicates found: {len(near_dupes)}")
6.3 Data Type Correction and Standardisation
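Cleaning requests like the prompt below usually come back as vectorised pandas string operations. This sketch covers the date and currency parts; the country mapping is a hypothetical two-entry table (a library such as pycountry would generalise it), and phone normalisation is omitted:

```python
import pandas as pd

# Hypothetical messy records
df = pd.DataFrame({
    "order_date": ["2025-03-01", "2025-03-02"],
    "amount": ["$1,234.56", "$89.00"],
    "country": ["United States", "Germany"],
})

# Date strings -> datetime
df["order_date"] = pd.to_datetime(df["order_date"])
# Currency strings -> floats (strip '$' and thousands separators)
df["amount"] = df["amount"].str.replace(r"[$,]", "", regex=True).astype(float)
# Country names -> ISO 3166-1 alpha-2 codes via an explicit mapping
iso_map = {"United States": "US", "Germany": "DE"}
df["country"] = df["country"].map(iso_map)
print(df.dtypes)
```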
Prompt: "Clean the dataset: convert date strings to datetime, parse currency strings like
'$1,234.56' into floats, standardise country names to ISO 3166-1 alpha-2 codes, and
normalise phone numbers to E.164 format."
7. The Code Execution Tool
Claude's code execution tool (also called the analysis tool) lets Claude write and run Python code within a sandboxed environment. This is the backbone of serious data analysis.
7.1 How It Works
- You upload a file or paste data.
- You ask a question in natural language.
- Claude writes Python code to answer your question.
- The code runs in a sandbox with common libraries pre-installed.
- Claude reads the output and provides a human-friendly explanation.
7.2 Available Libraries
The sandbox typically includes:
- pandas — data manipulation and analysis.
- numpy — numerical computing.
- scipy — scientific computing and statistics.
- matplotlib / seaborn — visualisation (images returned inline).
- scikit-learn — machine learning models.
- statsmodels — statistical modelling and tests.
7.3 Iterative Workflows
The real power is iteration. Each execution builds on previous results:
You: "Load the dataset and show summary statistics."
Claude: [executes code, shows results]
You: "The 'price' column has negative values. Remove those rows."
Claude: [executes cleaning code]
You: "Now run a linear regression of price vs. quantity."
Claude: [executes regression, shows coefficients and R-squared]
You: "Plot the residuals."
Claude: [generates and displays chart]
8. Natural Language Queries
One of Claude's most powerful features is translating plain English questions into precise analytical operations.
8.1 Simple Queries
"What was the best-selling product last quarter?"
"Which region had the highest growth rate?"
"How many customers made more than 3 purchases?"
"What percentage of orders were returned?"
8.2 Complex Queries
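A complex query like the one below typically translates into a groupby-aggregate pipeline. A sketch with a hypothetical orders table and a fixed "today":

```python
import pandas as pd

# Hypothetical orders table
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_date": pd.to_datetime([
        "2025-01-05", "2025-06-20", "2024-08-01",
        "2025-02-10", "2025-04-01", "2025-06-25",
    ]),
    "amount": [120.0, 80.0, 500.0, 60.0, 90.0, 75.0],
})
today = pd.Timestamp("2025-07-01")

per_customer = orders.groupby("customer_id").agg(
    first_purchase=("order_date", "min"),
    last_purchase=("order_date", "max"),
    total_spend=("amount", "sum"),
    order_count=("order_date", "count"),
)
# Exclude customers inactive for more than 90 days, rank by lifetime spend
active = per_customer[(today - per_customer["last_purchase"]).dt.days <= 90]
top = active.sort_values("total_spend", ascending=False).head(10)
print(top)
```

Customer 2 spent the most overall but is filtered out by the 90-day recency rule, which is the subtlety the natural-language phrasing hides.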
"Show me the top 10 customers by lifetime value, excluding anyone who has not purchased
in the last 90 days. Include their first purchase date, total spend, and average order
frequency."
8.3 Comparative Queries
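For the comparative prompt below, the significance test is scipy's two-sample t-test. A sketch with synthetic daily revenue for the two quarters:

```python
import numpy as np
from scipy import stats

# Synthetic daily revenue for two quarters (Q2 deliberately higher)
rng = np.random.default_rng(7)
q1_daily_revenue = rng.normal(1000, 150, 90)
q2_daily_revenue = rng.normal(1300, 150, 91)

t_stat, p_value = stats.ttest_ind(q1_daily_revenue, q2_daily_revenue)
change = q2_daily_revenue.mean() / q1_daily_revenue.mean() - 1
print(f"Revenue change: {change:+.1%}, t = {t_stat:.2f}, p = {p_value:.4f}")
```

As the best-practices section notes, ask Claude to check the test's assumptions (independence, roughly equal variances) before trusting the p-value.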
"Compare Q1 and Q2 performance across all product categories. For each category, show
revenue change, order count change, and whether the difference is statistically
significant (use a two-sample t-test)."
9. Insights Reports
Claude can generate polished, stakeholder-ready reports from raw data.
9.1 Executive Summary
Prompt: "Analyse the sales dataset and produce an executive summary covering:
1. Overall performance vs. last period
2. Top 3 positive trends
3. Top 3 areas of concern
4. Recommended actions
Keep it under 500 words and use a professional tone."
9.2 Structured Report Template
report_template = {
"title": "Monthly Sales Analysis — March 2025",
"sections": [
{
"heading": "Key Metrics",
"metrics": [
{"name": "Total Revenue", "value": "$2.4M", "change": "+12%"},
{"name": "Order Count", "value": "18,432", "change": "+8%"},
{"name": "Avg Order Value", "value": "$130.22", "change": "+3.7%"},
{"name": "Customer Retention", "value": "68%", "change": "-2%"},
],
},
{
"heading": "Segment Analysis",
"content": "Enterprise segment grew 22% while SMB declined 4%..."
},
{
"heading": "Anomalies",
"items": [
"Spike in returns on March 14 (3x normal rate)",
"Region APAC underperformed forecast by 18%",
],
},
{
"heading": "Recommendations",
"items": [
"Investigate March 14 return spike — likely linked to batch #4421",
"Increase marketing spend in APAC to recover momentum",
],
},
],
}
9.3 Automated Weekly Reports
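The core of the weekly-report prompt below, a KPI comparison rendered as markdown, can be sketched as follows. Email delivery (e.g. via the standard smtplib module), error handling, and logging are omitted, and the KPI figures are hypothetical:

```python
# Hypothetical KPIs for the current and previous week
this_week = {"revenue": 52000.0, "orders": 410, "aov": 126.8}
last_week = {"revenue": 48000.0, "orders": 402, "aov": 119.4}

# Build a markdown table of each KPI and its week-over-week change
lines = ["# Weekly Sales Report", "", "| KPI | This week | Change |", "|---|---|---|"]
for kpi, value in this_week.items():
    change = (value - last_week[kpi]) / last_week[kpi] * 100
    lines.append(f"| {kpi} | {value:,} | {change:+.1f}% |")
report = "\n".join(lines)
print(report)
```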
Prompt: "Write a Python script that reads the latest week of sales data, computes KPIs,
compares them with the previous week, generates a markdown report, and sends it via
email using SMTP. Include error handling and logging."
10. Visualisation Descriptions
When working in text-only mode or when charts are not available, Claude can produce detailed descriptions of what a visualisation would look like, or generate the code to create one.
10.1 Chart Recommendations
Prompt: "I have monthly revenue data for 5 product lines over 2 years. What chart types
would best show: (a) overall trend, (b) composition, (c) comparison between lines?"
Claude recommends:
- (a) Overall trend — Line chart with one line per product line, x-axis = month.
- (b) Composition — Stacked area chart showing each line's share of total revenue.
- (c) Comparison — Grouped bar chart with months on x-axis, bars grouped by product.
10.2 Generating Chart Code
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("sales_2025.csv", parse_dates=["date"])
monthly = df.groupby([df["date"].dt.to_period("M"), "product_line"])["revenue"].sum()
monthly = monthly.unstack(fill_value=0)
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Trend
monthly.plot(ax=axes[0], marker="o")
axes[0].set_title("Revenue Trend by Product Line")
axes[0].set_ylabel("Revenue ($)")
# Composition
monthly.plot.area(ax=axes[1], stacked=True, alpha=0.7)
axes[1].set_title("Revenue Composition")
# Comparison (last 6 months)
monthly.tail(6).plot.bar(ax=axes[2])
axes[2].set_title("Last 6 Months Comparison")
axes[2].tick_params(axis="x", rotation=45)
plt.tight_layout()
plt.savefig("analysis_charts.png", dpi=150)
plt.show()
10.3 Accessible Descriptions
Prompt: "Describe the chart in detail for a visually impaired colleague. Include the
overall pattern, notable data points, and the key takeaway."
11. Complete Example — Analysing a Sales Dataset
Let us walk through a full analysis workflow using a hypothetical sales dataset.
Step 1: Load and Inspect
You: "I uploaded sales_data.csv. It has columns: date, region, product, quantity, price,
customer_id, and channel. Give me an overview."
Claude responds with shape, dtypes, missing values, and sample rows.
Step 2: Clean
You: "Clean the data. Remove rows with negative prices. Fill missing regions using the
customer_id mapping from the majority of their orders. Convert dates."
Step 3: Explore
You: "Show revenue by region and channel. Which combination is strongest?"
Step 4: Analyse Trends
You: "Plot monthly revenue trend. Is there a seasonal pattern? What is the overall
growth rate?"
Step 5: Detect Anomalies
You: "Flag any days where revenue was more than 2 standard deviations from the 30-day
rolling mean. What happened on those days?"
Step 6: Deep Dive
You: "Run a cohort analysis on customer_id. Show retention rates by month of first
purchase. Which cohort has the best 90-day retention?"
Step 7: Generate Report
You: "Produce a final analysis report with: executive summary, key metrics, trends,
anomalies, customer insights, and three actionable recommendations."
12. Best Practices
- Start with questions, not techniques — define what you need to know before diving into code.
- Validate early — check data types, nulls, and row counts before running analysis.
- Use sampling for exploration — work with 10% of the data to iterate faster, then run on the full set.
- Ask Claude to explain assumptions — every statistical test has assumptions. Ask Claude to verify them.
- Save intermediate results — export cleaned datasets so you do not repeat work.
- Cross-check surprising results — if a number seems too good or too bad, ask Claude to verify with an alternative method.
- Version your prompts — keep a log of the prompts that produced useful outputs for reuse.
Summary
Claude transforms data analysis from a specialist skill into a conversational workflow. By combining natural language understanding with code execution, it covers the full pipeline: ingestion, cleaning, exploration, statistical analysis, anomaly detection, trend identification, and report generation. The key is to be specific in your prompts, validate results at each step, and iterate toward deeper insights.