Exploratory data analysis notebook template that scales
Views 5.7K · Copies 1.3K · Likes 798 · Comments 0 · Copy rate 23.0%
Prompt
You are a senior data analyst. Write a structured EDA notebook template for a new dataset called [dataset_name] with roughly [row_count] rows.
The template should guide an analyst through:
1. Sanity checks and shape exploration
2. Per-column profiling (types, nulls, cardinality, outliers)
3. Univariate and bivariate analysis
4. Target-variable focused analysis
5. Data quality issues log
6. A summary that answers "is this dataset ready for modeling?"
Make it opinionated about what to skip when the data is large. Focus on clarity over novelty.
Example output
# Exploratory Data Analysis Template
This notebook is the shared starting point for any new dataset dropped into our analytics repo. Fill it in top to bottom; keep every section even if some are short. "Nothing to report" is a valid finding and saves the next analyst time.
## 0. Context (fill this first, before any code)
Before running a single cell, write two paragraphs in plain English:
- What is this dataset, where does it come from, and how often is it refreshed?
- What business question are we trying to answer? What decision depends on the output?
If you cannot answer these, stop and ask. EDA without a question is a scenic walk, not an investigation.
## 1. Sanity Checks
Load the data with an explicit schema where possible. Print shape, memory footprint, and the first and last 5 rows. For datasets over 5M rows, sample down to 500k for exploration — full-scale profiling wastes time and rarely changes the conclusions. Document the sampling method.
Check for obvious disasters: entirely empty columns, duplicate rows, index collisions, timezone mismatches.
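The checks above can be sketched in a few lines of pandas. This is a minimal sketch, not a prescribed implementation: the 5M/500k thresholds come from the text, while the function name and random-sampling choice are assumptions you should adapt per project.

```python
import pandas as pd

# Thresholds from the template text; tune per project.
SAMPLE_THRESHOLD = 5_000_000
SAMPLE_SIZE = 500_000

def sanity_check(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Print shape/memory, flag obvious disasters, and down-sample large frames."""
    print(f"shape: {df.shape}, memory: {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
    print(df.head(5))
    print(df.tail(5))

    # Obvious disasters: empty columns, duplicate rows, index collisions.
    empty_cols = [c for c in df.columns if df[c].isna().all()]
    n_dupes = df.duplicated().sum()
    n_index_collisions = df.index.duplicated().sum()
    print(f"empty cols: {empty_cols}, dup rows: {n_dupes}, index collisions: {n_index_collisions}")

    if len(df) > SAMPLE_THRESHOLD:
        # Simple random sample -- document this choice in the notebook.
        df = df.sample(n=SAMPLE_SIZE, random_state=seed)
    return df
```

Timezone mismatches are harder to automate; at minimum, eyeball the `dtype` of every timestamp column for a `tz` attribute.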
## 2. Per-Column Profile
Build a single table with one row per column listing: dtype, % null, n unique, example values, min/max (for numeric and date), and a flag for suspected issues. This table is your map for the rest of the notebook.
Skip heavy profilers (pandas-profiling, ydata-profiling) on large data — they can run for hours. The per-column table above gives 90% of the value in 10% of the time.
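A lightweight version of that per-column table can be built by hand. The sketch below is one way to do it under the assumptions of the text; the `suspect` heuristic (mostly-null or constant columns) is an illustrative choice, not an exhaustive rule.

```python
import pandas as pd

def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, % null, cardinality, examples, min/max, issue flag."""
    rows = []
    for col in df.columns:
        s = df[col]
        # min/max only make sense for numeric and datetime columns.
        is_ordered = (pd.api.types.is_numeric_dtype(s)
                      or pd.api.types.is_datetime64_any_dtype(s))
        rows.append({
            "column": col,
            "dtype": str(s.dtype),
            "pct_null": round(100 * s.isna().mean(), 2),
            "n_unique": s.nunique(),
            "examples": s.dropna().unique()[:3].tolist(),
            "min": s.min() if is_ordered else None,
            "max": s.max() if is_ordered else None,
            # Cheap heuristic flag; extend with project-specific checks.
            "suspect": bool(s.isna().mean() > 0.5 or s.nunique() <= 1),
        })
    return pd.DataFrame(rows)
```

Sort the result by `pct_null` descending and work the worst columns first.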
## 3. Univariate Analysis
For numeric columns: histogram, boxplot, and summary stats (mean, median, 5th/95th percentile, skew). Flag anything with an absolute skew above 3 as a transformation candidate. For categorical columns: top 20 values by count, and a note on the long tail. For timestamps: counts per day or week and a visualization of gaps.
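The numeric summary and the skew flag can be computed in one pass. A minimal sketch, assuming pandas and the skew-above-3 threshold from the text:

```python
import pandas as pd

def numeric_summary(df: pd.DataFrame, skew_threshold: float = 3.0) -> pd.DataFrame:
    """Per-numeric-column stats with a transform-candidate flag for heavy skew."""
    num = df.select_dtypes(include="number")
    out = pd.DataFrame({
        "mean": num.mean(),
        "median": num.median(),
        "p05": num.quantile(0.05),
        "p95": num.quantile(0.95),
        "skew": num.skew(),
    })
    # Use absolute skew so heavy left tails are flagged too.
    out["transform_candidate"] = out["skew"].abs() > skew_threshold
    return out
```

Pair this table with the histograms rather than replacing them; the stats tell you where to look, the plots tell you why.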
## 4. Bivariate Analysis
Focus on pairs that matter for the business question. Resist the urge to correlate everything. A 100-column correlation heatmap is almost never a useful artifact; a targeted 6-column heatmap usually is.
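One way to enforce the shortlist discipline is to hard-code it. The column names below are hypothetical placeholders; the point is that the shortlist is a deliberate, reviewed artifact, not `df.corr()` on everything.

```python
import pandas as pd

# Hypothetical shortlist -- replace with the handful of columns tied to
# the business question stated in section 0.
FOCUS_COLS = ["price", "quantity", "discount"]

def focused_corr(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """Correlation matrix restricted to a deliberate shortlist of columns."""
    present = [c for c in cols if c in df.columns]
    return df[present].corr(numeric_only=True)
```

Feed the result to your plotting library of choice; a 6x6 heatmap fits on one screen and in one reviewer's head.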
## 5. Target-Variable Analysis
If there is a target variable, dedicate a full section to it: distribution, class balance, relationship to each top-level feature. Note any leakage risks explicitly.
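Class balance and a first-pass leakage screen can be automated. The sketch below uses a crude heuristic — a numeric feature almost perfectly correlated with the target is suspicious — which catches only one kind of leakage; the 0.98 threshold is an assumption, not a standard.

```python
import pandas as pd

def target_overview(df: pd.DataFrame, target: str,
                    leak_threshold: float = 0.98) -> dict:
    """Class balance plus a crude leakage screen over numeric features."""
    balance = df[target].value_counts(normalize=True).to_dict()
    num = df.select_dtypes(include="number")
    # Flag numeric features near-perfectly correlated with the target.
    corr = num.corrwith(num[target]) if target in num.columns else pd.Series(dtype=float)
    leaks = [c for c, v in corr.items()
             if c != target and abs(v) >= leak_threshold]
    return {"class_balance": balance, "leakage_suspects": leaks}
```

Temporal leakage (features computed after the target event) will not show up here; it needs a manual review of each feature's provenance.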
## 6. Data Quality Log
List every issue you found with severity (blocker, warning, cosmetic) and a recommended owner. This becomes the handoff artifact to data engineering.
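Keeping the log as a small structured table (rather than free-text bullets) makes the handoff filterable. A minimal sketch — the example entries and owner names are invented placeholders:

```python
import pandas as pd

# Hypothetical entries; severity is one of blocker / warning / cosmetic.
issues = pd.DataFrame([
    {"column": "signup_ts", "issue": "timezone-naive timestamps",
     "severity": "blocker", "owner": "data-eng"},
    {"column": "country", "issue": "mixed ISO-2 and ISO-3 codes",
     "severity": "warning", "owner": "analytics"},
])

# The blocker subset is the list that gates the modeling verdict below.
blockers = issues[issues.severity == "blocker"]
```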
## 7. Modeling Readiness Verdict
End with a single paragraph answering: is this data ready for modeling? If yes, what caveats remain? If no, what must be fixed first? Be direct — hedged conclusions help no one.