Exploratory data analysis notebook template that scales
Views 5.7K · Copies 1.3K · Likes 798 · Comments 0 · Copy rate 23.0%
Prompt
You are a senior data analyst. Write a structured EDA notebook template for a new dataset called [dataset_name] with roughly [row_count] rows.
The template should guide an analyst through:
1. Sanity checks and shape exploration
2. Per-column profiling (types, nulls, cardinality, outliers)
3. Univariate and bivariate analysis
4. Target-variable focused analysis
5. Data quality issues log
6. A summary that answers "is this dataset ready for modeling?"
Make it opinionated about what to skip when the data is large. Focus on clarity over novelty.
Example output
# Exploratory Data Analysis Template
This notebook is the shared starting point for any new dataset dropped into our analytics repo. Fill it in top to bottom; keep every section even if some are short. "Nothing to report" is a valid finding and saves the next analyst time.
## 0. Context (fill this first, before any code)
Before running a single cell, write two paragraphs in plain English:
- What is this dataset, where does it come from, and how often is it refreshed?
- What business question are we trying to answer? What decision depends on the output?
If you cannot answer these, stop and ask. EDA without a question is a scenic walk, not an investigation.
## 1. Sanity Checks
Load the data with an explicit schema where possible. Print shape, memory footprint, and the first and last 5 rows. For datasets over 5M rows, sample down to 500k for exploration — full-scale profiling wastes time and rarely changes the conclusions. Document the sampling method.
Check for obvious disasters: entirely empty columns, duplicate rows, index collisions, timezone mismatches.
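The checks above can be sketched in a few lines of pandas. This is a minimal sketch, not a prescribed implementation: the 5M/500k thresholds come from the text, while the function name and random-sampling choice are assumptions you should adapt per project.

```python
import pandas as pd

# Thresholds from the template text; tune per project.
SAMPLE_THRESHOLD = 5_000_000
SAMPLE_SIZE = 500_000

def sanity_check(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Print shape/memory, flag obvious disasters, and down-sample large frames."""
    print(f"shape: {df.shape}, memory: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
    print(df.head(5))
    print(df.tail(5))

    # Obvious disasters: empty columns, duplicate rows, index collisions.
    empty_cols = [c for c in df.columns if df[c].isna().all()]
    n_dupes = df.duplicated().sum()
    n_index_collisions = df.index.duplicated().sum()
    print(f"empty cols: {empty_cols}, dup rows: {n_dupes}, index collisions: {n_index_collisions}")

    if len(df) > SAMPLE_THRESHOLD:
        # Simple random sample -- document this choice in the notebook.
        df = df.sample(n=SAMPLE_SIZE, random_state=seed)
    return df
```

Timezone mismatches are harder to automate; at minimum, eyeball the `dtype` of every timestamp column for a `tz` attribute.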
## 2. Per-Column Profile
Build a single table with one row per column listing: dtype, % null, n unique, example values, min/max (for numeric and date), and a flag for suspected issues. This table is your map for the rest of the notebook.
Skip heavy profilers (pandas-profiling, ydata-profiling) on large data — they can run for hours. The per-column table above gives 90% of the value in 10% of the time.
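A lightweight version of that per-column table can be built by hand. The sketch below is one way to do it under the assumptions of the text; the `suspect` heuristic (mostly-null or constant columns) is an illustrative choice, not an exhaustive rule.

```python
import pandas as pd

def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, % null, cardinality, examples, min/max, issue flag."""
    rows = []
    for col in df.columns:
        s = df[col]
        # min/max only make sense for numeric and datetime columns.
        is_ordered = (pd.api.types.is_numeric_dtype(s)
                      or pd.api.types.is_datetime64_any_dtype(s))
        rows.append({
            "column": col,
            "dtype": str(s.dtype),
            "pct_null": round(100 * s.isna().mean(), 2),
            "n_unique": s.nunique(),
            "examples": s.dropna().unique()[:3].tolist(),
            "min": s.min() if is_ordered else None,
            "max": s.max() if is_ordered else None,
            # Cheap heuristic flag; extend with project-specific checks.
            "suspect": bool(s.isna().mean() > 0.5 or s.nunique() <= 1),
        })
    return pd.DataFrame(rows)
```

Sort the result by `pct_null` descending and work the worst columns first.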
## 3. Univariate Analysis
For numeric columns: histogram, boxplot, and summary stats (mean, median, 5th/95th percentile, skew). Flag anything with an absolute skew above 3 as a transformation candidate. For categorical columns: top 20 values by count, and a note on the long tail. For timestamps: counts per day or week and a visualization of gaps.
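The numeric summary and the skew flag can be computed in one pass. A minimal sketch, assuming pandas and the skew-above-3 threshold from the text:

```python
import pandas as pd

def numeric_summary(df: pd.DataFrame, skew_threshold: float = 3.0) -> pd.DataFrame:
    """Per-numeric-column stats with a transform-candidate flag for heavy skew."""
    num = df.select_dtypes(include="number")
    out = pd.DataFrame({
        "mean": num.mean(),
        "median": num.median(),
        "p05": num.quantile(0.05),
        "p95": num.quantile(0.95),
        "skew": num.skew(),
    })
    # Use absolute skew so heavy left tails are flagged too.
    out["transform_candidate"] = out["skew"].abs() > skew_threshold
    return out
```

Pair this table with the histograms rather than replacing them; the stats tell you where to look, the plots tell you why.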
## 4. Bivariate Analysis
Focus on pairs that matter for the business question. Resist the urge to correlate everything. A 100-column correlation heatmap is almost never a useful artifact; a targeted 6-column heatmap usually is.
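One way to enforce the shortlist discipline is to hard-code it. The column names below are hypothetical placeholders; the point is that the shortlist is a deliberate, reviewed artifact, not `df.corr()` on everything.

```python
import pandas as pd

# Hypothetical shortlist -- replace with the handful of columns tied to
# the business question stated in section 0.
FOCUS_COLS = ["price", "quantity", "discount"]

def focused_corr(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """Correlation matrix restricted to a deliberate shortlist of columns."""
    present = [c for c in cols if c in df.columns]
    return df[present].corr(numeric_only=True)
```

Feed the result to your plotting library of choice; a 6x6 heatmap fits on one screen and in one reviewer's head.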
## 5. Target-Variable Analysis
If there is a target variable, dedicate a full section to it: distribution, class balance, relationship to each top-level feature. Note any leakage risks explicitly.
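Class balance and a first-pass leakage screen can be automated. The sketch below uses a crude heuristic — a numeric feature almost perfectly correlated with the target is suspicious — which catches only one kind of leakage; the 0.98 threshold is an assumption, not a standard.

```python
import pandas as pd

def target_overview(df: pd.DataFrame, target: str,
                    leak_threshold: float = 0.98) -> dict:
    """Class balance plus a crude leakage screen over numeric features."""
    balance = df[target].value_counts(normalize=True).to_dict()
    num = df.select_dtypes(include="number")
    # Flag numeric features near-perfectly correlated with the target.
    corr = num.corrwith(num[target]) if target in num.columns else pd.Series(dtype=float)
    leaks = [c for c, v in corr.items()
             if c != target and abs(v) >= leak_threshold]
    return {"class_balance": balance, "leakage_suspects": leaks}
```

Temporal leakage (features computed after the target event) will not show up here; it needs a manual review of each feature's provenance.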
## 6. Data Quality Log
List every issue you found with severity (blocker, warning, cosmetic) and a recommended owner. This becomes the handoff artifact to data engineering.
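Keeping the log as a small structured table (rather than free-text bullets) makes the handoff filterable. A minimal sketch — the example entries and owner names are invented placeholders:

```python
import pandas as pd

# Hypothetical entries; severity is one of blocker / warning / cosmetic.
issues = pd.DataFrame([
    {"column": "signup_ts", "issue": "timezone-naive timestamps",
     "severity": "blocker", "owner": "data-eng"},
    {"column": "country", "issue": "mixed ISO-2 and ISO-3 codes",
     "severity": "warning", "owner": "analytics"},
])

# The blocker subset is the list that gates the modeling verdict below.
blockers = issues[issues.severity == "blocker"]
```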
## 7. Modeling Readiness Verdict
End with a single paragraph answering: is this data ready for modeling? If yes, what caveats remain? If no, what must be fixed first? Be direct — hedged conclusions help no one.