Intermediate · 15 min read · Module 13, Lesson 5

📊 Data Analysis & Data Science

CSV/JSON analysis, statistical summaries, and AI-powered insights

Data Analysis & Data Science with Claude

Data analysis is one of the most powerful applications of large language models. Claude can ingest raw datasets, compute statistical summaries, detect trends, identify anomalies, clean messy data, and generate human-readable insight reports — all through natural language conversation. This lesson covers every layer of that workflow, from loading a CSV file to producing a polished analysis deck.


1. Why Use Claude for Data Analysis?

Traditional data analysis requires proficiency in Python, R, SQL, or specialised BI tools. Claude removes that barrier. You describe what you want in plain English, and Claude writes the code, executes it, and explains the results — or performs the analysis directly in conversation when datasets are small enough.

Key advantages:

  • Zero setup — no local environment, no package installation, no dependency conflicts.
  • Natural language queries — ask questions the way you think, not the way a query language demands.
  • Iterative exploration — follow-up questions refine the analysis without restarting.
  • Automatic explanation — every result comes with context, not just raw numbers.
  • Multi-format support — CSV, JSON, TSV, Excel exports, and even pasted tables.

2. Loading and Inspecting Data

The first step in any analysis is understanding the shape of your data. Claude can read files directly when using the code execution tool or accept pasted data in conversation.

2.1 Uploading a CSV File

When working with Claude's code execution capability, you can upload files and ask Claude to inspect them:

Output
Prompt: "I have uploaded sales_2025.csv. Show me the first 10 rows, the column names, data types, and the number of missing values per column."

Claude will generate and execute Python code similar to:

Python
import pandas as pd

df = pd.read_csv("sales_2025.csv")
print("Shape:", df.shape)
print("\nColumn names:", list(df.columns))
print("\nData types:")
print(df.dtypes)
print("\nMissing values per column:")
print(df.isnull().sum())
print("\nFirst 10 rows:")
df.head(10)

2.2 Pasting JSON Data

For smaller datasets you can paste JSON directly into the conversation:

Output
Prompt: "Here is my dataset in JSON format:
[
  { "month": "Jan", "revenue": 12000, "expenses": 8500 },
  { "month": "Feb", "revenue": 15000, "expenses": 9200 },
  ...
]
Summarise the monthly profit trend."

2.3 Handling Large Files

For files that exceed context limits:

  • Chunked analysis — ask Claude to process the file in batches.
  • Sampling — request a random sample for exploratory analysis first.
  • Schema-first approach — send only the schema and ask Claude to write a complete analysis script you run locally.
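The chunked approach can be sketched with pandas' `chunksize` parameter, which streams the file in batches instead of loading it whole. The filename and the `revenue` column here are hypothetical:

```python
import pandas as pd

def chunked_revenue_total(path, chunksize=50_000):
    """Accumulate an aggregate across chunks so the full file never sits in memory."""
    total = 0.0
    rows = 0
    for chunk in pd.read_csv(path, chunksize=chunksize):
        total += chunk["revenue"].sum()
        rows += len(chunk)
    return total, rows
```

Each chunk is an ordinary DataFrame, so any aggregation that can be accumulated incrementally (sums, counts, min/max) works the same way.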

3. Statistical Summaries

Claude can produce comprehensive statistical summaries that go beyond simple describe() output.

3.1 Descriptive Statistics

Output
Prompt: "For each numeric column in the dataset, compute the mean, median, standard deviation, skewness, kurtosis, and the 5th/95th percentiles. Present the results in a markdown table."

Claude generates:

Python
import pandas as pd
from scipy import stats

df = pd.read_csv("sales_2025.csv")
numeric_cols = df.select_dtypes(include="number").columns
summary = []
for col in numeric_cols:
    data = df[col].dropna()
    summary.append({
        "Column": col,
        "Mean": round(data.mean(), 2),
        "Median": round(data.median(), 2),
        "Std Dev": round(data.std(), 2),
        "Skewness": round(stats.skew(data), 2),
        "Kurtosis": round(stats.kurtosis(data), 2),
        "P5": round(data.quantile(0.05), 2),
        "P95": round(data.quantile(0.95), 2),
    })
pd.DataFrame(summary)

3.2 Correlation Analysis

Output
Prompt: "Compute the Pearson and Spearman correlation matrices for all numeric columns. Highlight any pair with |r| > 0.7."
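A minimal sketch of what such a request reduces to, assuming a generic numeric DataFrame (the helper name and 0.7 threshold mirror the prompt):

```python
import pandas as pd

def strong_pairs(df, threshold=0.7):
    """Return column pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = df.select_dtypes(include="number").corr(method="pearson")
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skip self-pairs
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    return pairs
```

Swapping `method="spearman"` into the `corr` call gives the rank-based matrix the prompt also asks for.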

3.3 Distribution Analysis

Output
Prompt: "For the 'revenue' column, test for normality using the Shapiro-Wilk test, describe the distribution shape, and suggest an appropriate transformation if it is significantly non-normal."
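One way this might be implemented with `scipy.stats`; the log-transform suggestion for right-skewed data is a simple heuristic, not a universal rule:

```python
from scipy import stats

def normality_check(values, alpha=0.05):
    """Shapiro-Wilk test; suggest a transform when the data looks non-normal."""
    stat, p = stats.shapiro(values)
    if p >= alpha:
        return "approximately normal", p
    # Right-skewed distributions often respond well to a log transform
    suggestion = "log transform" if stats.skew(values) > 0 else "power transform"
    return f"non-normal (consider a {suggestion})", p
```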

4. Trend Detection

Identifying trends is where Claude truly shines — it can combine quantitative analysis with qualitative interpretation.

4.1 Time-Series Trends

Python
import numpy as np
import pandas as pd

df = pd.read_csv("sales_2025.csv", parse_dates=["date"])
df = df.sort_values("date")

# Rolling averages
df["revenue_7d"] = df["revenue"].rolling(7).mean()
df["revenue_30d"] = df["revenue"].rolling(30).mean()

# Linear trend
x = np.arange(len(df))
slope, intercept = np.polyfit(x, df["revenue"].values, 1)
print(f"Linear trend: slope = {slope:.2f} per day")
print(f"Projected annual change: {slope * 365:.2f}")

4.2 Seasonal Decomposition

Output
Prompt: "Decompose the revenue time series into trend, seasonal, and residual components. Use an additive model with a period of 7 days. Explain each component."

Claude will use statsmodels:

Python
from statsmodels.tsa.seasonal import seasonal_decompose

# Assumes df was loaded with parse_dates=["date"] as in the previous step
result = seasonal_decompose(df.set_index("date")["revenue"], model="additive", period=7)
print("Trend (last 5):", result.trend.dropna().tail())
print("Seasonal pattern:", result.seasonal.head(7).values)
print("Residual std:", result.resid.dropna().std())

4.3 Growth Rate Analysis

Output
Prompt: "Calculate week-over-week and month-over-month growth rates for revenue. Flag any period where growth dropped below -10% or exceeded +30%."
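The underlying computation can be sketched with `pct_change`; the thresholds mirror the prompt, and the `n/a` label for the first period (which has no prior value) is an assumption:

```python
import pandas as pd

def flag_growth(revenue, low=-0.10, high=0.30):
    """Period-over-period growth, flagging drops below `low` and spikes above `high`."""
    growth = revenue.pct_change()

    def label(g):
        if pd.isna(g):
            return "n/a"  # first period has no prior value to compare against
        if g < low:
            return "drop"
        if g > high:
            return "spike"
        return "ok"

    return pd.DataFrame({"revenue": revenue, "growth": growth, "flag": growth.apply(label)})
```

Resampling the series to weekly or monthly frequency first yields the week-over-week and month-over-month variants.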

5. Anomaly Identification

Anomaly detection is critical for data quality and business monitoring.

5.1 Statistical Methods

Python
import pandas as pd

df = pd.read_csv("sales_2025.csv")

# Z-score method
mean = df["revenue"].mean()
std = df["revenue"].std()
df["z_score"] = (df["revenue"] - mean) / std
anomalies_z = df[df["z_score"].abs() > 3]

# IQR method
Q1 = df["revenue"].quantile(0.25)
Q3 = df["revenue"].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
anomalies_iqr = df[(df["revenue"] < lower) | (df["revenue"] > upper)]

print(f"Z-score anomalies: {len(anomalies_z)}")
print(f"IQR anomalies: {len(anomalies_iqr)}")

5.2 Contextual Anomalies

Output
Prompt: "Identify revenue anomalies that are unusual for their specific day of week. A Monday with 50k revenue might be normal, but a Sunday with the same amount is an anomaly. Group by day of week and flag values outside 2 standard deviations of their group mean."
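The group-wise z-score logic behind this prompt can be sketched as follows (the `day_of_week` and `revenue` column names are assumptions):

```python
import pandas as pd

def dow_anomalies(df, k=2.0):
    """Flag revenue values more than k std devs from their day-of-week group mean."""
    g = df.groupby("day_of_week")["revenue"]
    mean = g.transform("mean")
    std = g.transform("std")
    df = df.assign(z=(df["revenue"] - mean) / std)
    return df[df["z"].abs() > k]
```

`transform` broadcasts each group's mean and std back to the original rows, so a value is judged only against its own weekday.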

5.3 Multi-Dimensional Anomalies

Output
Prompt: "Find records where the combination of revenue, order_count, and avg_order_value is anomalous. Use the Isolation Forest algorithm and explain which records were flagged and why."
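A minimal scikit-learn sketch of this approach; the contamination rate and column names are illustrative choices:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

def multivariate_anomalies(df, cols, contamination=0.05, seed=42):
    """Flag rows whose joint feature values the Isolation Forest scores as outliers."""
    model = IsolationForest(contamination=contamination, random_state=seed)
    labels = model.fit_predict(df[cols])  # -1 = anomaly, 1 = normal
    return df[labels == -1]
```

Unlike the z-score and IQR methods above, this catches rows where each column is individually plausible but the combination is not.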

6. Data Cleaning

Real-world data is messy. Claude can automate cleaning pipelines.

6.1 Missing Value Strategy

Python
import pandas as pd

df = pd.read_csv("sales_2025.csv")

# Analyse missing patterns
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
print("Missing value percentages:")
print(missing_pct[missing_pct > 0])

# Strategy per column:
#   Numeric: median imputation for skewed, mean for normal
#   Categorical: mode imputation if < 5% missing, "Unknown" otherwise
#   Date: forward fill for time series

6.2 Duplicate Detection

Output
Prompt: "Find duplicate rows. Also find near-duplicates where the customer name differs by only 1-2 characters (possible typos) but the email and order amount are the same."
Python
import pandas as pd
from difflib import SequenceMatcher

df = pd.read_csv("sales_2025.csv")

# Exact duplicates
exact_dupes = df[df.duplicated(keep=False)]
print(f"Exact duplicates: {len(exact_dupes)}")

# Near-duplicate detection (O(n^2) pairwise scan, fine for small datasets)
def similar(a, b):
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()

near_dupes = []
for i in range(len(df)):
    for j in range(i + 1, len(df)):
        if (df.iloc[i]["email"] == df.iloc[j]["email"]
                and df.iloc[i]["amount"] == df.iloc[j]["amount"]
                and similar(df.iloc[i]["customer"], df.iloc[j]["customer"]) > 0.85):
            near_dupes.append((i, j))
print(f"Near-duplicates found: {len(near_dupes)}")

6.3 Data Type Correction and Standardisation

Output
Prompt: "Clean the dataset: convert date strings to datetime, parse currency strings like '$1,234.56' into floats, standardise country names to ISO 3166-1 alpha-2 codes, and normalise phone numbers to E.164 format."
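The currency-parsing step alone might look like this; the regex only covers `$`-prefixed, comma-grouped values, so other locales would need extending:

```python
import pandas as pd

def parse_currency(series):
    """Convert strings like '$1,234.56' to floats; unparseable entries become NaN."""
    cleaned = series.astype(str).str.replace(r"[$,\s]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")
```

Using `errors="coerce"` rather than raising keeps the pipeline running and lets you inspect the NaN rows afterwards.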

7. The Code Execution Tool

Claude's code execution tool (also called the analysis tool) lets Claude write and run Python code within a sandboxed environment. This is the backbone of serious data analysis.

7.1 How It Works

  1. You upload a file or paste data.
  2. You ask a question in natural language.
  3. Claude writes Python code to answer your question.
  4. The code runs in a sandbox with common libraries pre-installed.
  5. Claude reads the output and provides a human-friendly explanation.

7.2 Available Libraries

The sandbox typically includes:

  • pandas — data manipulation and analysis.
  • numpy — numerical computing.
  • scipy — scientific computing and statistics.
  • matplotlib / seaborn — visualisation (images returned inline).
  • scikit-learn — machine learning models.
  • statsmodels — statistical modelling and tests.

7.3 Iterative Workflows

The real power is iteration. Each execution builds on previous results:

Output
You: "Load the dataset and show summary statistics."
Claude: [executes code, shows results]
You: "The 'price' column has negative values. Remove those rows."
Claude: [executes cleaning code]
You: "Now run a linear regression of price vs. quantity."
Claude: [executes regression, shows coefficients and R-squared]
You: "Plot the residuals."
Claude: [generates and displays chart]

8. Natural Language Queries

One of Claude's most powerful features is translating plain English questions into precise analytical operations.

8.1 Simple Queries

Output
"What was the best-selling product last quarter?"
"Which region had the highest growth rate?"
"How many customers made more than 3 purchases?"
"What percentage of orders were returned?"

8.2 Complex Queries

Output
"Show me the top 10 customers by lifetime value, excluding anyone who has not purchased in the last 90 days. Include their first purchase date, total spend, and average order frequency."
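In pandas terms, such a query might reduce to something like this; the column names (`customer_id`, `amount`, `date`) and the cutoff logic are assumptions:

```python
import pandas as pd

def top_customers(df, n=10, inactive_days=90, as_of=None):
    """Top-n customers by total spend, excluding those inactive for `inactive_days`."""
    as_of = as_of or df["date"].max()
    stats = df.groupby("customer_id").agg(
        total_spend=("amount", "sum"),
        first_purchase=("date", "min"),
        last_purchase=("date", "max"),
        orders=("amount", "count"),
    )
    cutoff = as_of - pd.Timedelta(days=inactive_days)
    active = stats[stats["last_purchase"] >= cutoff]
    return active.sort_values("total_spend", ascending=False).head(n)
```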

8.3 Comparative Queries

Output
"Compare Q1 and Q2 performance across all product categories. For each category, show revenue change, order count change, and whether the difference is statistically significant (use a two-sample t-test)."
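The significance check could be sketched with Welch's t-test from `scipy.stats`; Welch's variant is used here because it does not assume equal variances between the two periods:

```python
from scipy import stats

def compare_periods(q1_values, q2_values, alpha=0.05):
    """Welch's two-sample t-test for a difference in means between two periods."""
    t_stat, p_value = stats.ttest_ind(q1_values, q2_values, equal_var=False)
    return {"t": t_stat, "p": p_value, "significant": p_value < alpha}
```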

9. Insights Reports

Claude can generate polished, stakeholder-ready reports from raw data.

9.1 Executive Summary

Output
Prompt: "Analyse the sales dataset and produce an executive summary covering: 1. Overall performance vs. last period 2. Top 3 positive trends 3. Top 3 areas of concern 4. Recommended actions Keep it under 500 words and use a professional tone."

9.2 Structured Report Template

Python
report_template = {
    "title": "Monthly Sales Analysis — March 2025",
    "sections": [
        {
            "heading": "Key Metrics",
            "metrics": [
                {"name": "Total Revenue", "value": "$2.4M", "change": "+12%"},
                {"name": "Order Count", "value": "18,432", "change": "+8%"},
                {"name": "Avg Order Value", "value": "$130.22", "change": "+3.7%"},
                {"name": "Customer Retention", "value": "68%", "change": "-2%"},
            ],
        },
        {
            "heading": "Segment Analysis",
            "content": "Enterprise segment grew 22% while SMB declined 4%...",
        },
        {
            "heading": "Anomalies",
            "items": [
                "Spike in returns on March 14 (3x normal rate)",
                "Region APAC underperformed forecast by 18%",
            ],
        },
        {
            "heading": "Recommendations",
            "items": [
                "Investigate March 14 return spike — likely linked to batch #4421",
                "Increase marketing spend in APAC to recover momentum",
            ],
        },
    ],
}

9.3 Automated Weekly Reports

Output
Prompt: "Write a Python script that reads the latest week of sales data, computes KPIs, compares them with the previous week, generates a markdown report, and sends it via email using SMTP. Include error handling and logging."

10. Visualisation Descriptions

When working in text-only mode or when charts are not available, Claude can produce detailed descriptions of what a visualisation would look like, or generate the code to create one.

10.1 Chart Recommendations

Output
Prompt: "I have monthly revenue data for 5 product lines over 2 years. What chart types would best show: (a) overall trend, (b) composition, (c) comparison between lines?"

Claude recommends:

  • (a) Overall trend — Line chart with one line per product line, x-axis = month.
  • (b) Composition — Stacked area chart showing each line's share of total revenue.
  • (c) Comparison — Grouped bar chart with months on x-axis, bars grouped by product.

10.2 Generating Chart Code

Python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales_2025.csv", parse_dates=["date"])
monthly = df.groupby([df["date"].dt.to_period("M"), "product_line"])["revenue"].sum()
monthly = monthly.unstack(fill_value=0)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Trend
monthly.plot(ax=axes[0], marker="o")
axes[0].set_title("Revenue Trend by Product Line")
axes[0].set_ylabel("Revenue ($)")

# Composition
monthly.plot.area(ax=axes[1], stacked=True, alpha=0.7)
axes[1].set_title("Revenue Composition")

# Comparison (last 6 months)
monthly.tail(6).plot.bar(ax=axes[2])
axes[2].set_title("Last 6 Months Comparison")
axes[2].tick_params(axis="x", rotation=45)

plt.tight_layout()
plt.savefig("analysis_charts.png", dpi=150)
plt.show()

10.3 Accessible Descriptions

Output
Prompt: "Describe the chart in detail for a visually impaired colleague. Include the overall pattern, notable data points, and the key takeaway."

11. Complete Example — Analysing a Sales Dataset

Let us walk through a full analysis workflow using a hypothetical sales dataset.

Step 1: Load and Inspect

Output
You: "I uploaded sales_data.csv. It has columns: date, region, product, quantity, price, customer_id, and channel. Give me an overview."

Claude responds with shape, dtypes, missing values, and sample rows.

Step 2: Clean

Output
You: "Clean the data. Remove rows with negative prices. Fill missing regions using the customer_id mapping from the majority of their orders. Convert dates."
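The region-filling step might be implemented as a majority-vote mapping per customer; this is a sketch, and the column names are assumptions:

```python
import pandas as pd

def fill_region_from_customer(df):
    """Fill missing regions with each customer's most frequent known region."""
    mapping = (
        df.dropna(subset=["region"])
          .groupby("customer_id")["region"]
          .agg(lambda s: s.mode().iloc[0])
    )
    filled = df["region"].fillna(df["customer_id"].map(mapping))
    return df.assign(region=filled)
```

Customers with no known region anywhere in the data are left as NaN, which is worth surfacing rather than guessing.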

Step 3: Explore

Output
You: "Show revenue by region and channel. Which combination is strongest?"

Step 4: Analyse Trends

Output
You: "Plot monthly revenue trend. Is there a seasonal pattern? What is the overall growth rate?"

Step 5: Detect Anomalies

Output
You: "Flag any days where revenue was more than 2 standard deviations from the 30-day rolling mean. What happened on those days?"

Step 6: Deep Dive

Output
You: "Run a cohort analysis on customer_id. Show retention rates by month of first purchase. Which cohort has the best 90-day retention?"
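A compact sketch of the cohort computation, assuming `customer_id` and `date` columns; retention is measured as the share of each first-purchase-month cohort active in each later month:

```python
import pandas as pd

def cohort_retention(df):
    """Retention matrix indexed by cohort month, columns = months since first purchase."""
    df = df.copy()
    df["order_month"] = df["date"].dt.to_period("M")
    df["cohort"] = df.groupby("customer_id")["order_month"].transform("min")
    df["months_since"] = (df["order_month"] - df["cohort"]).apply(lambda p: p.n)
    counts = (
        df.groupby(["cohort", "months_since"])["customer_id"]
          .nunique()
          .unstack(fill_value=0)
    )
    # Divide each row by its month-0 size to turn counts into retention rates
    return counts.div(counts[0], axis=0)
```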

Step 7: Generate Report

Output
You: "Produce a final analysis report with: executive summary, key metrics, trends, anomalies, customer insights, and three actionable recommendations."

12. Best Practices

  • Start with questions, not techniques — define what you need to know before diving into code.
  • Validate early — check data types, nulls, and row counts before running analysis.
  • Use sampling for exploration — work with 10% of the data to iterate faster, then run on the full set.
  • Ask Claude to explain assumptions — every statistical test has assumptions. Ask Claude to verify them.
  • Save intermediate results — export cleaned datasets so you do not repeat work.
  • Cross-check surprising results — if a number seems too good or too bad, ask Claude to verify with an alternative method.
  • Version your prompts — keep a log of the prompts that produced useful outputs for reuse.

Summary

Claude transforms data analysis from a specialist skill into a conversational workflow. By combining natural language understanding with code execution, it covers the full pipeline: ingestion, cleaning, exploration, statistical analysis, anomaly detection, trend identification, and report generation. The key is to be specific in your prompts, validate results at each step, and iterate toward deeper insights.