---
title: Sample Datasets
description: Documentation for built-in sample datasets including Palmer Penguins, Gapminder, Auto MPG, World Bank, Bike Sharing, Earthquakes, Iris, Heart Failure, Dose Response, Assembly Line, Injection Molding, and Student's Sleep. Includes column descriptions, use cases, and license information.
priority: 0.6
---

# Sample Datasets {#sample-datasets}

MIDAS includes sample data that you can use to learn data analysis and visualization.

Datasets licensed under CC BY 4.0 require attribution when you redistribute or publish the data or adaptations of it. You can use the attribution text listed in each applicable section as is. Datasets under CC0 or in the public domain carry no attribution requirement.

## How to Open Sample Data {#how-to-open-sample-data}

1. Open MIDAS to see the launcher screen
2. Click the dataset you want from the "Sample Data" section in the left sidebar
3. The data loads and the project screen opens

## Palmer Penguins {#palmer-penguins}

Measurement data of three penguin species observed in Antarctica (344 rows, 8 columns).

**Columns**
- `species`: Penguin species (Adelie, Chinstrap, Gentoo)
- `island`: Island name
- `bill_length_mm`: Bill length
- `bill_depth_mm`: Bill depth
- `flipper_length_mm`: Flipper length
- `body_mass_g`: Body mass
- `sex`: Sex
- `year`: Survey year

Contains some missing values.

You can draw scatter plots colored by species in Graph Builder, or compare statistics by species in the Statistics tab.

**Data source**: https://allisonhorst.github.io/palmerpenguins/

**License**: CC0 (Public Domain)

## Gapminder {#gapminder}

Country-level data for 142 countries from 1952 to 2007 at 5-year intervals (1,704 rows, 6 columns). You can analyze trends in life expectancy, population, and GDP per capita.

**Columns**
- `country`: Country name
- `continent`: Continent
- `year`: Year
- `lifeExp`: Life expectancy
- `pop`: Population
- `gdpPercap`: GDP per capita (PPP, constant 2005 international dollars)

**Data source**: https://www.gapminder.org/data/

**License**: CC BY 4.0

**Attribution**: "Data from Gapminder Foundation, https://www.gapminder.org/data/, CC BY 4.0"

## Auto MPG {#auto-mpg}

Automobile fuel efficiency data from 1970 to 1982 (398 rows, 9 columns).

**Columns**
- `mpg`: Fuel efficiency (miles per gallon)
- `cylinders`: Number of cylinders (3, 4, 5, 6, 8)
- `displacement`: Engine displacement (cubic inches)
- `horsepower`: Horsepower
- `weight`: Vehicle weight (pounds)
- `acceleration`: Time to accelerate from 0 to 60 mph (seconds)
- `model_year`: Model year (70 = 1970, 82 = 1982)
- `origin`: Country of origin (usa, europe, japan)
- `name`: Vehicle model name

Contains some missing values.

You can run a regression with `mpg` as the response variable in the Linear Regression tab, or examine correlations in the Statistics tab.

**Data source**: https://archive.ics.uci.edu/dataset/9/auto+mpg

**License**: Public Domain

## World Bank {#world-bank}

Development indicators for 52 major countries (52 rows, 10 columns, 2021-2022 data).

**Columns**
- `country`: Country name
- `country_code`: Country code
- `region`: Region
- `income_group`: Income group
- `population_2022`: Population (2022)
- `gdp_usd_billions_2022`: GDP (billions USD, 2022)
- `gdp_per_capita_2022`: GDP per capita (2022, current USD)
- `life_expectancy_2021`: Life expectancy (2021)
- `urban_population_percent_2022`: Urban population percentage (2022)
- `internet_users_percent_2021`: Internet usage rate (2021)

You can compare statistics by income group in the Statistics tab, or visualize relationships between indicators in Graph Builder.

**Data source**: https://data.worldbank.org/

**License**: CC BY 4.0

**Attribution**: "Data from World Bank Open Data, https://data.worldbank.org/, CC BY 4.0"

## Bike Sharing {#bike-sharing}

Washington D.C. bike sharing data (2011-2012). Available in two versions: daily (731 rows) and hourly (17,379 rows). These appear in the launcher as two separate entries: "Bike Sharing (Daily)" and "Bike Sharing (Hourly)".

**Time Variables**
- `instant`: Record ID
- `dteday`: Date (YYYY-MM-DD)
- `season`: Season (1: Winter, 2: Spring, 3: Summer, 4: Fall)
- `yr`: Year (0: 2011, 1: 2012)
- `mnth`: Month (1-12)
- `hr`: Hour (0-23, hourly data only)
- `weekday`: Day of week (0: Sunday, 6: Saturday)
- `holiday`: Holiday flag (0: Regular day, 1: Holiday)
- `workingday`: Working day flag (1: Weekday, 0: Weekend or holiday)

**Weather Variables**
- `weathersit`: Weather condition
  - 1: Clear, few clouds, partly cloudy
  - 2: Mist + cloudy, mist + broken clouds
  - 3: Light snow, light rain + thunderstorm + scattered clouds
  - 4: Heavy rain + ice pellets + thunderstorm + mist
- `temp`: Normalized temperature (Celsius divided by 41)
- `atemp`: Normalized feeling temperature (Celsius divided by 50)
- `hum`: Normalized humidity (humidity divided by 100)
- `windspeed`: Normalized wind speed (divided by max wind speed of 67 km/h)
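
The normalized weather columns can be converted back to physical units by multiplying by the divisors listed above. The following is a minimal sketch in plain Python, outside MIDAS; the `denormalize` helper and the example record are hypothetical and only illustrate the arithmetic.

```python
# Divisors taken from the column descriptions above.
SCALE = {"temp": 41.0, "atemp": 50.0, "hum": 100.0, "windspeed": 67.0}


def denormalize(row):
    """Return a copy of row with the normalized weather columns rescaled."""
    restored = dict(row)
    for column, factor in SCALE.items():
        if column in restored:
            restored[column] = restored[column] * factor
    return restored


# Hypothetical record, not taken from the dataset.
example = {"temp": 0.5, "atemp": 0.5, "hum": 0.8, "windspeed": 0.25}
restored = denormalize(example)
print(restored)
```

Columns that are not in `SCALE` pass through unchanged, so the same helper works for both daily and hourly records.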

**Usage Counts**
- `casual`: Casual user count
- `registered`: Registered user count
- `cnt`: Total count (casual + registered)

The usage counts are count data, so [overdispersion](glossary#overdispersion) (variance exceeding the mean) is expected. This makes the dataset a good exercise for fitting a Poisson regression in the GLM tab and diagnosing overdispersion.
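
As a rough first check before fitting any model, you can compare the sample variance of a count column with its mean; under a Poisson distribution the two are equal, so a ratio well above 1 hints at overdispersion. The sketch below uses hypothetical counts and a hypothetical `dispersion_index` helper. It looks at the raw counts only and is not the residual-based diagnostic a fitted GLM would report.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by the sample mean (about 1 for Poisson data)."""
    return variance(counts) / mean(counts)


# Hypothetical hourly rental counts, not taken from the dataset.
counts = [5, 12, 40, 150, 310, 95, 20, 8]
ratio = dispersion_index(counts)
print(f"variance / mean = {ratio:.1f}")
```

Part of the raw variance is explained by predictors such as hour or weather, so this ratio overstates what remains after fitting a model.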

**Data source**: https://archive.ics.uci.edu/dataset/275/bike+sharing+dataset

**License**: CC0 (Public Domain)

## Earthquakes {#earthquakes}

Worldwide earthquake data from September 2024 (1,041 rows, 7 columns, magnitude 4.0+).

**Columns**
- `time`: Occurrence datetime
- `latitude`, `longitude`: Location
- `depth`: Hypocentral depth (km)
- `mag`: Magnitude
- `magType`: Magnitude type (mb: body-wave magnitude, mww: moment magnitude (W-phase), etc.)
- `place`: Location description

You can visualize how earthquake frequency changes over time with the time series plot or datetime histogram in Graph Builder, or check epicenter locations with a latitude-longitude scatter plot.

**Data source**: https://www.usgs.gov/programs/earthquake-hazards

**License**: Public Domain (USGS Data)

## Iris {#iris}

Measurement data of three iris species, a classic classification dataset (150 rows, 5 columns).

**Columns**
- `sepal_length`, `sepal_width`: Sepal dimensions
- `petal_length`, `petal_width`: Petal dimensions
- `species`: Species

You can run a classification with `species` as the response variable in the Random Forest tab, or draw scatter plots colored by species in Graph Builder.

**Data source**: https://archive.ics.uci.edu/dataset/53/iris

**License**: Public Domain

## Heart Failure {#heart-failure}

Clinical records of 299 heart failure patients (299 rows, 13 columns).

**Columns**
- `age`: Age
- `anaemia`: Anaemia status (0: No, 1: Yes)
- `creatinine_phosphokinase`: CPK enzyme level (U/L)
- `diabetes`: Diabetes status (0: No, 1: Yes)
- `ejection_fraction`: Ejection fraction (%)
- `high_blood_pressure`: High blood pressure status (0: No, 1: Yes)
- `platelets`: Platelet count (kiloplatelets/mL)
- `serum_creatinine`: Serum creatinine (mg/dL)
- `serum_sodium`: Serum sodium (mEq/L)
- `sex`: Sex (0: Female, 1: Male)
- `smoking`: Smoking status (0: No, 1: Yes)
- `time`: Follow-up period (days)
- `DEATH_EVENT`: Death event (0: Survived, 1: Died)

From the Analysis menu, open Survival Analysis and select the Kaplan-Meier tab. Set `time` as the Time Variable and `DEATH_EVENT` as the Event Variable to generate Kaplan-Meier survival curves. See the [Survival Analysis with the Kaplan-Meier Method tutorial](tutorial-kaplan-meier) for step-by-step instructions.
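
To show what the Kaplan-Meier curve computes from the `time` and `DEATH_EVENT` columns, here is a minimal product-limit estimator in plain Python. It is an illustrative sketch, not the MIDAS implementation; the `kaplan_meier` function and the follow-up values are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] for each distinct time with at least one event.

    times  -- follow-up time per patient
    events -- 1 if the event occurred at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all records that share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored records leave the risk set.
        at_risk -= leaving
    return curve


# Hypothetical follow-up data, not taken from the dataset.
times = [5, 8, 8, 12, 20, 20, 30]
events = [1, 1, 0, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t = {t:>2}  S(t) = {s:.3f}")
```

Censored patients (event 0) do not lower the curve, but they shrink the risk set used at later event times.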

**Data source**: https://archive.ics.uci.edu/dataset/519/heart+failure+clinical+records

**License**: CC BY 4.0

**Attribution**: "Chicco, D., & Jurman, G. (2020). Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making, 20, 16. https://doi.org/10.1186/s12911-020-1023-5"

## Dose Response {#dose-response}

Insecticide dose-response data (8 rows, 4 columns).

**Columns**
- `dose`: Insecticide concentration (mg/L)
- `exposed`: Number of insects exposed at each dose (trials)
- `dead`: Number of insects that died (successes)
- `mortality_rate`: Mortality rate (`dead` / `exposed`)

In the GLM tab, select the Binomial family, switch Response format to Grouped, and set `dead` as Successes and `exposed` as Trials. See the [Grouped Binomial GLM Tutorial](glm-grouped-binomial) for step-by-step instructions.
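
The grouped format stores one row per dose, with successes and trials, instead of one row per insect. The sketch below, in plain Python with hypothetical rows, shows how `mortality_rate` follows from `dead` and `exposed`, and what the equivalent ungrouped 0/1 outcomes would look like. The helper names are assumptions for illustration only.

```python
# Hypothetical rows, not taken from the dataset.
rows = [
    {"dose": 1.0, "exposed": 50, "dead": 5},
    {"dose": 2.0, "exposed": 50, "dead": 20},
    {"dose": 4.0, "exposed": 50, "dead": 45},
]


def mortality_rate(row):
    """Proportion of exposed insects that died at this dose."""
    return row["dead"] / row["exposed"]


def expand_to_binary(row):
    """Ungrouped equivalent: one 0/1 outcome per exposed insect."""
    survivors = row["exposed"] - row["dead"]
    return [1] * row["dead"] + [0] * survivors


rates = [mortality_rate(r) for r in rows]
for row, rate in zip(rows, rates):
    print(f"dose {row['dose']}: {rate:.2f}")
```

Both representations carry the same information for a binomial model; the grouped form is simply more compact.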

**Data source**: Synthetic data created by the MIDAS project

**License**: CC0 (Public Domain)

## Assembly Line {#assembly-line}

Dimensional inspection data from an automotive parts assembly plant (300 rows, 7 columns). Records dimension errors and environmental conditions across 3 production lines, 2 shifts, and 5 operators.

**Columns**
- `line`: Assembly line (A, B, C)
- `shift`: Shift (Day, Night)
- `operator`: Operator ID (Op1-Op5)
- `temperature`: Ambient temperature (°C)
- `humidity`: Humidity (%)
- `cycle_time`: Cycle time (seconds)
- `dimension_error`: Deviation from target dimension (mm)

Use the ANOVA tab with `line` as the factor and `dimension_error` as the response to estimate differences between lines, or the Linear Regression tab with the environmental variables as predictors to analyze contributing factors. See the [Assembly Line Dimension Error Analysis tutorial](tutorial-manufacturing) for step-by-step instructions.

**Data source**: Synthetic data created by the MIDAS project

**License**: CC0 (Public Domain)

## Injection Molding {#injection-molding}

Synthetic data representing a factorial design of experiments (DoE) for injection molding (16 rows, 4 columns).

**Columns**
- `Temperature`: Mold temperature
- `Pressure`: Injection pressure
- `CycleTime`: Cycle time
- `Strength`: Strength of the molded part (response variable)

Designed as a full factorial experiment over combinations of factor levels, suitable for practicing estimation of main effects and interactions.
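
As an illustration of main-effect estimation in a full factorial design, the sketch below builds a hypothetical two-level design over the three factors and computes each main effect as the mean response at the high level minus the mean response at the low level. The coded levels (-1 and +1), the number of runs, and the response formula are all assumptions made for this example and do not describe the actual dataset.

```python
from itertools import product
from statistics import mean

factors = ["Temperature", "Pressure", "CycleTime"]

# Every combination of coded levels (-1 = low, +1 = high): 2^3 = 8 runs.
design = [dict(zip(factors, levels)) for levels in product([-1, 1], repeat=3)]

# Hypothetical response with two main effects and one interaction.
for run in design:
    run["Strength"] = (
        50
        + 4 * run["Temperature"]
        + 2 * run["Pressure"]
        + 1.5 * run["Temperature"] * run["Pressure"]
    )


def main_effect(runs, factor):
    """Mean response at the high level minus mean response at the low level."""
    high = mean(r["Strength"] for r in runs if r[factor] == 1)
    low = mean(r["Strength"] for r in runs if r[factor] == -1)
    return high - low


effects = {f: main_effect(design, f) for f in factors}
print(effects)
```

Because the design is balanced, the interaction term averages out of each main effect, which is what makes full factorial data convenient for this kind of exercise.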

**Data source**: Synthetic data created by the MIDAS project

**License**: CC0 (Public Domain)

## Student's Sleep {#students-sleep}

Data published in 1908 by William Sealy Gosset under the pseudonym "Student", in the same paper in which he derived the t-distribution (20 rows, 3 columns). Each of 10 subjects received two soporific drugs, and the increase in sleep relative to an unmedicated baseline was recorded for each drug. Because every subject received both drugs, this is a paired design.

**Columns**
- `ID`: Subject identifier (1-10)
- `extra`: Increase in hours of sleep compared to unmedicated baseline
- `group`: Drug administered (Drug 1, Drug 2)

You can compare statistics of `extra` by drug in the Statistics tab, or visualize the difference in distributions between the drugs in Graph Builder.
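
Because the design is paired, the per-subject difference between the two drugs is the natural quantity to summarize. The sketch below computes those differences and a paired t statistic in plain Python. The values are the ones distributed with R's built-in `sleep` data, ordered by subject `ID`; it is an assumption that the MIDAS file contains the same values, so check them against the loaded data.

```python
from math import sqrt
from statistics import mean, stdev

# `extra` by subject ID 1-10, as in R's `sleep` data (assumed to match MIDAS).
drug1 = [0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0]
drug2 = [1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4]

# Pair the two measurements of each subject and take the difference.
diffs = [b - a for a, b in zip(drug1, drug2)]

mean_diff = mean(diffs)
std_error = stdev(diffs) / sqrt(len(diffs))
t_stat = mean_diff / std_error
df = len(diffs) - 1

print(f"mean difference = {mean_diff:.2f} h, t = {t_stat:.2f}, df = {df}")
```

Treating the two groups as independent samples would ignore the subject-to-subject variation that the pairing removes.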

**Data source**: Student (1908). The Probable Error of a Mean. *Biometrika*, 6(1), 1-25.

**License**: Public Domain (published 1908)
