Sample Datasets

MIDAS includes sample data that you can use to learn data analysis and visualization.

Datasets licensed under CC BY 4.0 require attribution when you redistribute or publish the data or adaptations of it. You can use the attribution text listed in each applicable section as is. Datasets under CC0 or in the public domain carry no attribution requirement.

How to Open Sample Data

1. Open MIDAS. The launcher screen appears.
2. In the left sidebar, click the dataset you want under "Sample Data".
3. The data loads and the project screen opens.

Palmer Penguins

Measurement data of three penguin species observed in Antarctica (344 rows, 8 columns).

Columns

species: Penguin species (Adelie, Chinstrap, Gentoo)
island: Island name
bill_length_mm: Bill length
bill_depth_mm: Bill depth
flipper_length_mm: Flipper length
body_mass_g: Body mass
sex: Sex
year: Survey year

Contains some missing values.

You can draw scatter plots colored by species in Graph Builder, or compare statistics by species in the Statistics tab.

Data source: https://allisonhorst.github.io/palmerpenguins/

License: CC0 (Public Domain)

Gapminder

Data for 142 countries from 1952 to 2007 (1,704 rows, 6 columns, 5-year intervals). Analyze trends in life expectancy, population, and GDP.

Columns

country: Country name
continent: Continent
year: Year
lifeExp: Life expectancy
pop: Population
gdpPercap: GDP per capita (PPP, constant 2005 international dollars)

Data source: https://www.gapminder.org/data/

License: CC BY 4.0

Attribution: "Data from Gapminder Foundation, https://www.gapminder.org/data/, CC BY 4.0"

Auto MPG

Automobile fuel efficiency data from 1970 to 1982 (398 rows, 9 columns).

Columns

mpg: Fuel efficiency (miles per gallon)
cylinders: Number of cylinders (3, 4, 5, 6, 8)
displacement: Engine displacement (cubic inches)
horsepower: Horsepower
weight: Vehicle weight (pounds)
acceleration: Time to accelerate from 0 to 60 mph (seconds)
model_year: Model year (70 = 1970, 82 = 1982)
origin: Country of origin (usa, europe, japan)
name: Vehicle model name

Contains some missing values.

You can run a regression with mpg as the response variable in the Linear Regression tab, or examine correlations in the Statistics tab.

Data source: https://archive.ics.uci.edu/dataset/9/auto+mpg

License: Public Domain

World Bank

Development indicators for 52 major countries (52 rows, 10 columns, 2021-2022 data).

Columns

country: Country name
country_code: Country code
region: Region
income_group: Income group
population_2022: Population (2022)
gdp_usd_billions_2022: GDP (billions USD, 2022)
gdp_per_capita_2022: GDP per capita (2022, current USD)
life_expectancy_2021: Life expectancy (2021)
urban_population_percent_2022: Urban population percentage (2022)
internet_users_percent_2021: Internet usage rate (2021)

You can compare statistics by income group in the Statistics tab, or visualize relationships between indicators in Graph Builder.

Data source: https://data.worldbank.org/

License: CC BY 4.0

Attribution: "Data from World Bank Open Data, https://data.worldbank.org/, CC BY 4.0"

Bike Sharing

Washington D.C. bike sharing data (2011-2012), available in two versions: daily (731 rows) and hourly (17,379 rows). They appear in the launcher as two separate entries: "Bike Sharing (Daily)" and "Bike Sharing (Hourly)".

Time Variables

instant: Record ID
dteday: Date (YYYY-MM-DD)
season: Season (1: Winter, 2: Spring, 3: Summer, 4: Fall)
yr: Year (0: 2011, 1: 2012)
mnth: Month (1-12)
hr: Hour (0-23, hourly data only)
weekday: Day of week (0: Sunday, 6: Saturday)
holiday: Holiday flag (0: Regular day, 1: Holiday)
workingday: Working day flag (1: Weekday, 0: Weekend or holiday)

Weather Variables

weathersit: Weather condition
- 1: Clear, few clouds, partly cloudy
- 2: Mist + cloudy, mist + broken clouds
- 3: Light snow, light rain + thunderstorm + scattered clouds
- 4: Heavy rain + ice pellets + thunderstorm + mist
temp: Normalized temperature (Celsius divided by 41)
atemp: Normalized feeling temperature (Celsius divided by 50)
hum: Normalized humidity (humidity divided by 100)
windspeed: Normalized wind speed (divided by max wind speed of 67 km/h)
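
If you want to read the weather variables in physical units, you can multiply each one by its divisor. The sketch below assumes the scaling constants are exactly the ones listed above. The function name and the input values are hypothetical, chosen only for illustration.

```python
# Scaling constants taken from the column descriptions above.
TEMP_SCALE_C = 41.0        # temp is Celsius divided by 41
ATEMP_SCALE_C = 50.0       # atemp is Celsius divided by 50
HUM_SCALE_PCT = 100.0      # hum is humidity divided by 100
WINDSPEED_SCALE_KMH = 67.0 # windspeed is divided by 67 km/h


def denormalize_weather(temp, atemp, hum, windspeed):
    """Convert normalized weather values back to physical units."""
    return {
        "temp_c": temp * TEMP_SCALE_C,
        "atemp_c": atemp * ATEMP_SCALE_C,
        "humidity_pct": hum * HUM_SCALE_PCT,
        "windspeed_kmh": windspeed * WINDSPEED_SCALE_KMH,
    }


# Illustrative inputs, not taken from the dataset.
row = denormalize_weather(temp=0.5, atemp=0.5, hum=0.5, windspeed=0.25)
print(row)
```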

Usage Counts

casual: Casual user count
registered: Registered user count
cnt: Total count (casual + registered)

The usage counts are count data in which overdispersion (variance exceeding the mean) is expected. This makes the dataset a good exercise in diagnosing overdispersion with Poisson regression in the GLM tab.
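
As a rough first check before fitting a model, you can compare the sample variance of a count column with its mean. The sketch below is only a marginal check: it ignores predictors, so it is not the dispersion estimate a fitted Poisson model would report. The function name and the counts are hypothetical.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Sample variance divided by the mean.

    A value near 1 is what a Poisson distribution would give;
    values well above 1 suggest overdispersion.
    """
    return variance(counts) / mean(counts)


# Illustrative counts, not taken from the dataset.
counts = [10, 50, 120, 300, 20]
ratio = dispersion_ratio(counts)
print(f"variance / mean = {ratio:.1f}")
```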

Data source: https://archive.ics.uci.edu/dataset/275/bike+sharing+dataset

License: CC0 (Public Domain)

Earthquakes

Worldwide earthquake data from September 2024 (1,041 rows, 7 columns, magnitude 4.0+).

Columns

time: Occurrence datetime
latitude, longitude: Location
depth: Hypocentral depth (km)
mag: Magnitude
magType: Magnitude type (mb: body-wave magnitude, mww: moment magnitude (W-phase), etc.)
place: Location description

You can visualize how earthquake frequency changes over time with the time series plot or datetime histogram in Graph Builder, or check epicenter locations with a latitude-longitude scatter plot.

Data source: https://www.usgs.gov/programs/earthquake-hazards

License: Public Domain (USGS Data)

Iris

Measurement data of three iris species, a classic classification dataset (150 rows, 5 columns).

Columns

sepal_length, sepal_width: Sepal dimensions
petal_length, petal_width: Petal dimensions
species: Species

You can run a classification with species as the response variable in the Random Forest tab, or draw scatter plots colored by species in Graph Builder.

Data source: https://archive.ics.uci.edu/dataset/53/iris

License: Public Domain

Heart Failure

Clinical records of 299 heart failure patients (299 rows, 13 columns).

Columns

age: Age
anaemia: Anaemia status (0: No, 1: Yes)
creatinine_phosphokinase: CPK enzyme level (U/L)
diabetes: Diabetes status (0: No, 1: Yes)
ejection_fraction: Ejection fraction (%)
high_blood_pressure: High blood pressure status (0: No, 1: Yes)
platelets: Platelet count (kiloplatelets/mL)
serum_creatinine: Serum creatinine (mg/dL)
serum_sodium: Serum sodium (mEq/L)
sex: Sex (0: Female, 1: Male)
smoking: Smoking status (0: No, 1: Yes)
time: Follow-up period (days)
DEATH_EVENT: Death event (0: Survived, 1: Died)

From the Analysis menu, open Survival Analysis and select the Kaplan-Meier tab. Set time as the Time Variable and DEATH_EVENT as the Event Variable to generate Kaplan-Meier survival curves. See the Survival Analysis with the Kaplan-Meier Method tutorial for step-by-step instructions.
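
To see what the curve represents, here is a minimal sketch of the Kaplan-Meier product-limit estimate in plain Python. It does not show how MIDAS computes the curve internally. The function name and the input values are hypothetical, and the event coding follows DEATH_EVENT (1: event observed, 0: censored).

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] for each distinct time with at least one event.

    times:  follow-up durations
    events: 1 if the event was observed, 0 if the record is censored
    """
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        removed = 0
        # Group all records that share the same time.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored records leave the risk set.
        at_risk -= removed
    return curve


# Illustrative values, not taken from the dataset.
curve = kaplan_meier(times=[5, 8, 8, 12, 20], events=[1, 1, 0, 1, 0])
for t, s in curve:
    print(t, round(s, 3))
```

Censored records never lower the curve themselves, but they shrink the risk set, so later events cause larger drops.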

Data source: https://archive.ics.uci.edu/dataset/519/heart+failure+clinical+records

License: CC BY 4.0

Attribution: "Chicco, D., & Jurman, G. (2020). Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making, 20, 16. https://doi.org/10.1186/s12911-020-1023-5"

Dose Response

Insecticide dose-response data (8 rows, 4 columns).

Columns

dose: Insecticide concentration (mg/L)
exposed: Number of insects exposed at each dose (trials)
dead: Number of insects that died (successes)
mortality_rate: Mortality rate (calculated from dead / exposed)

In the GLM tab, select the Binomial family, switch Response format to Grouped, and set dead as Successes and exposed as Trials. See the Grouped Binomial GLM Tutorial for step-by-step instructions.
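
The relationship between the grouped columns can be sketched as follows: each row contributes a number of successes out of a number of trials, and mortality_rate is their ratio. The function name and the values are hypothetical, not taken from the dataset.

```python
def mortality_rates(dead, exposed):
    """Compute dead / exposed for each dose group."""
    rates = []
    for d, n in zip(dead, exposed):
        if n <= 0 or not 0 <= d <= n:
            raise ValueError("need 0 <= dead <= exposed and exposed > 0")
        rates.append(d / n)
    return rates


# Illustrative grouped counts, one entry per dose level.
exposed = [50, 50, 50]
dead = [5, 20, 45]
rates = mortality_rates(dead, exposed)
print(rates)
```

A grouped binomial model uses dead and exposed directly rather than the precomputed rate, because the number of trials determines how much weight each row carries.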

Data source: Synthetic data created by the MIDAS project

License: CC0 (Public Domain)

Assembly Line

Dimensional inspection data from an automotive parts assembly plant (300 rows, 7 columns). Records dimension errors and environmental conditions across 3 production lines, 2 shifts, and 5 operators.

Columns

line: Assembly line (A, B, C)
shift: Shift (Day, Night)
operator: Operator ID (Op1-Op5)
temperature: Ambient temperature (°C)
humidity: Humidity (%)
cycle_time: Cycle time (seconds)
dimension_error: Deviation from target dimension (mm)

In the ANOVA tab, use dimension_error as the response and line as the factor to estimate differences between lines, or use the Linear Regression tab with the environmental variables as predictors to analyze contributing factors. See the Assembly Line Dimension Error Analysis tutorial for step-by-step instructions.

Data source: Synthetic data created by the MIDAS project

License: CC0 (Public Domain)

Injection Molding

Synthetic data representing a factorial design of experiments (DoE) for injection molding (16 rows, 4 columns).

Columns

Temperature: Mold temperature
Pressure: Injection pressure
CycleTime: Cycle time
Strength: Strength of the molded part (response variable)

Designed as a full factorial experiment over combinations of factor levels, suitable for practicing estimation of main effects and interactions.
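
As an illustration of what main effects and interactions mean, the sketch below builds a two-level full factorial design and estimates effects as differences of means. The coded levels (-1, +1), the single replicate of 8 runs, and the response formula are all assumptions made for this example. They do not describe the actual levels or values in the sample data.

```python
from itertools import product
from statistics import mean

factors = ["Temperature", "Pressure", "CycleTime"]

# Assumption: two coded levels per factor, one run per combination (8 runs).
design = list(product([-1, 1], repeat=len(factors)))

# Hypothetical responses generated from a made-up model.
response = [50 + 4 * t + 2 * p + 1 * t * p for t, p, _ in design]


def main_effect(j):
    """Mean response at the high level minus mean at the low level."""
    high = [y for run, y in zip(design, response) if run[j] == 1]
    low = [y for run, y in zip(design, response) if run[j] == -1]
    return mean(high) - mean(low)


def interaction_effect(j, k):
    """Same contrast, using the product of two coded columns."""
    same = [y for run, y in zip(design, response) if run[j] * run[k] == 1]
    diff = [y for run, y in zip(design, response) if run[j] * run[k] == -1]
    return mean(same) - mean(diff)


effects = {name: main_effect(j) for j, name in enumerate(factors)}
temp_by_pressure = interaction_effect(0, 1)
print(effects, temp_by_pressure)
```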

Data source: Synthetic data created by the MIDAS project

License: CC0 (Public Domain)

Student's Sleep

Data published in 1908 by William Sealy Gosset under the pseudonym "Student", in the same paper in which he derived the t-distribution (20 rows, 3 columns). Each of 10 subjects received two soporific drugs, and the increase in sleep compared to an unmedicated baseline was recorded. This is a paired design: each subject received both drugs.

Columns

ID: Subject identifier (1-10)
extra: Increase in hours of sleep compared to unmedicated baseline
group: Drug administered (Drug 1, Drug 2)

You can compare statistics of extra by drug in the Statistics tab, or visualize the difference in distributions between the drugs in Graph Builder.
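
Because the design is paired, the quantity of interest is the per-subject difference rather than the two group distributions taken separately. The sketch below pairs long-format rows by ID and computes a paired t statistic. The function names and the values are hypothetical, not the 1908 measurements. The group labels follow the column description above.

```python
from math import sqrt
from statistics import mean, stdev


def paired_differences(rows):
    """rows: (ID, extra, group) tuples; returns Drug 2 minus Drug 1 per subject."""
    by_subject = {}
    for subject_id, extra, group in rows:
        by_subject.setdefault(subject_id, {})[group] = extra
    return [v["Drug 2"] - v["Drug 1"] for _, v in sorted(by_subject.items())]


def paired_t(diffs):
    """Return (mean difference, t statistic, degrees of freedom)."""
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return mean(diffs), t, n - 1


# Illustrative long-format rows, not the actual dataset values.
rows = [
    (1, 1.0, "Drug 1"), (2, 2.0, "Drug 1"), (3, 3.0, "Drug 1"), (4, 4.0, "Drug 1"),
    (1, 2.0, "Drug 2"), (2, 2.5, "Drug 2"), (3, 4.5, "Drug 2"), (4, 5.0, "Drug 2"),
]
diffs = paired_differences(rows)
mean_diff, t_stat, df = paired_t(diffs)
print(diffs, mean_diff, round(t_stat, 3), df)
```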

Data source: Student (1908). The Probable Error of a Mean. Biometrika, 6(1), 1-25.

License: Public domain (published 1908)
