Sample Datasets

MIDAS includes sample data that you can use to learn data analysis and visualization.

Datasets licensed under CC BY 4.0 require attribution when you redistribute or publish the data or adaptations of it. You can use the attribution text listed in each applicable section as is. Datasets under CC0 or in the public domain carry no attribution requirement.

How to Open Sample Data

  1. Open MIDAS to see the launcher screen
  2. Click the dataset you want from the "Sample Data" section in the left sidebar
  3. The data loads and the project screen opens

Palmer Penguins

Measurement data of three penguin species observed in Antarctica (344 rows, 8 columns).

Columns

  • species: Penguin species (Adelie, Chinstrap, Gentoo)
  • island: Island name
  • bill_length_mm: Bill length
  • bill_depth_mm: Bill depth
  • flipper_length_mm: Flipper length
  • body_mass_g: Body mass
  • sex: Sex
  • year: Survey year

Contains some missing values.

You can draw scatter plots colored by species in Graph Builder, or compare statistics by species in the Statistics tab.

Data source: https://allisonhorst.github.io/palmerpenguins/

License: CC0 (Public Domain)

Gapminder

Data for 142 countries from 1952 to 2007 (1,704 rows, 6 columns, 5-year intervals). Analyze trends in life expectancy, population, and GDP.
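The row count follows directly from the survey schedule: one row per country per survey year. A quick arithmetic check, using only the figures stated above:

```python
# Survey years run from 1952 to 2007 in 5-year steps.
years = list(range(1952, 2008, 5))
n_countries = 142

# One row per country per survey year.
n_rows = n_countries * len(years)
print(len(years), n_rows)
```

This gives 12 survey years and 1,704 rows, matching the dataset dimensions.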

Columns

  • country: Country name
  • continent: Continent
  • year: Year
  • lifeExp: Life expectancy
  • pop: Population
  • gdpPercap: GDP per capita (PPP, constant 2005 international dollars)

Data source: https://www.gapminder.org/data/

License: CC BY 4.0

Attribution: "Data from Gapminder Foundation, https://www.gapminder.org/data/, CC BY 4.0"

Auto MPG

Automobile fuel efficiency data from 1970 to 1982 (398 rows, 9 columns).

Columns

  • mpg: Fuel efficiency (miles per gallon)
  • cylinders: Number of cylinders (3, 4, 5, 6, 8)
  • displacement: Engine displacement (cubic inches)
  • horsepower: Horsepower
  • weight: Vehicle weight (pounds)
  • acceleration: Time to accelerate from 0 to 60 mph (seconds)
  • model_year: Model year (70 = 1970, 82 = 1982)
  • origin: Country of origin (usa, europe, japan)
  • name: Vehicle model name

Contains some missing values.

You can run a regression with mpg as the response variable in the Linear Regression tab, or examine correlations in the Statistics tab.

Data source: https://archive.ics.uci.edu/dataset/9/auto+mpg

License: Public Domain

World Bank

Development indicators for 52 major countries (52 rows, 10 columns, 2021-2022 data).

Columns

  • country: Country name
  • country_code: Country code
  • region: Region
  • income_group: Income group
  • population_2022: Population (2022)
  • gdp_usd_billions_2022: GDP (billions USD, 2022)
  • gdp_per_capita_2022: GDP per capita (2022, current USD)
  • life_expectancy_2021: Life expectancy (2021)
  • urban_population_percent_2022: Urban population percentage (2022)
  • internet_users_percent_2021: Internet usage rate (2021)

You can compare statistics by income group in the Statistics tab, or visualize relationships between indicators in Graph Builder.

Data source: https://data.worldbank.org/

License: CC BY 4.0

Attribution: "Data from World Bank Open Data, https://data.worldbank.org/, CC BY 4.0"

Bike Sharing

Washington, D.C. bike sharing data (2011-2012). Available in two versions: daily (731 rows) and hourly (17,379 rows). These appear in the launcher as two separate entries: "Bike Sharing (Daily)" and "Bike Sharing (Hourly)".

Time Variables

  • instant: Record ID
  • dteday: Date (YYYY-MM-DD)
  • season: Season (1: Winter, 2: Spring, 3: Summer, 4: Fall)
  • yr: Year (0: 2011, 1: 2012)
  • mnth: Month (1-12)
  • hr: Hour (0-23, hourly data only)
  • weekday: Day of week (0: Sunday, 6: Saturday)
  • holiday: Holiday flag (0: Regular day, 1: Holiday)
  • workingday: Working day flag (1: Weekday, 0: Weekend or holiday)

Weather Variables

  • weathersit: Weather condition
    • 1: Clear, few clouds, partly cloudy
    • 2: Mist + cloudy, mist + broken clouds
    • 3: Light snow, light rain + thunderstorm + scattered clouds
    • 4: Heavy rain + ice pellets + thunderstorm + mist
  • temp: Normalized temperature (Celsius divided by 41)
  • atemp: Normalized apparent ("feels like") temperature (Celsius divided by 50)
  • hum: Normalized humidity (humidity divided by 100)
  • windspeed: Normalized wind speed (divided by max wind speed of 67 km/h)
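The normalized weather columns can be converted back to physical units by reversing the divisions listed above. The following is a minimal sketch that assumes the scaling constants given in this section (41, 50, 100, and 67). The function name `denormalize_weather` and the sample values are hypothetical and are not part of MIDAS or the dataset.

```python
# Scaling constants taken from the column descriptions in this section.
TEMP_SCALE_C = 41.0      # temp  = Celsius / 41
ATEMP_SCALE_C = 50.0     # atemp = Celsius / 50
HUM_SCALE_PCT = 100.0    # hum   = percent / 100
WIND_SCALE_KMH = 67.0    # windspeed = km/h / 67


def denormalize_weather(temp, atemp, hum, windspeed):
    """Convert normalized weather values back to physical units."""
    return {
        "temp_c": temp * TEMP_SCALE_C,
        "atemp_c": atemp * ATEMP_SCALE_C,
        "humidity_pct": hum * HUM_SCALE_PCT,
        "windspeed_kmh": windspeed * WIND_SCALE_KMH,
    }


# Made-up normalized values, for illustration only.
reading = denormalize_weather(temp=0.5, atemp=0.5, hum=0.8, windspeed=0.25)
print(reading)
```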

Usage Counts

  • casual: Casual user count
  • registered: Registered user count
  • cnt: Total count (casual + registered)

The usage counts are count data in which overdispersion (variance exceeding the mean) is expected. This makes the dataset a good exercise in fitting a Poisson regression in the GLM tab and diagnosing overdispersion.
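As a rough check before fitting a model, the variance-to-mean ratio of a count column hints at whether overdispersion is likely, since a Poisson variable has a ratio near 1. The sketch below uses only the Python standard library. The counts are made-up illustrative values, not rows from the dataset, and `dispersion_ratio` is a hypothetical helper.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Sample variance divided by sample mean (about 1 for Poisson data)."""
    return variance(counts) / mean(counts)


# Made-up counts for illustration only.
example_counts = [120, 340, 95, 610, 430, 88, 1020, 275]
ratio = dispersion_ratio(example_counts)
print(f"variance / mean = {ratio:.1f}")
```

This is a marginal check that ignores predictors. A fitted Poisson regression assesses dispersion conditional on the model, so the two can disagree.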

Data source: https://archive.ics.uci.edu/dataset/275/bike+sharing+dataset

License: CC0 (Public Domain)

Earthquakes

Worldwide earthquake data from September 2024 (1,041 rows, 7 columns, magnitude 4.0+).

Columns

  • time: Occurrence datetime
  • latitude, longitude: Location
  • depth: Hypocentral depth (km)
  • mag: Magnitude
  • magType: Magnitude type (mb: body-wave magnitude, mww: moment magnitude (W-phase), etc.)
  • place: Location description

You can visualize how earthquake frequency changes over time with the time series plot or datetime histogram in Graph Builder, or check epicenter locations with a latitude-longitude scatter plot.

Data source: https://www.usgs.gov/programs/earthquake-hazards

License: Public Domain (USGS Data)

Iris

Measurement data of three iris species, a classic classification dataset (150 rows, 5 columns).

Columns

  • sepal_length, sepal_width: Sepal dimensions
  • petal_length, petal_width: Petal dimensions
  • species: Species

You can run a classification with species as the response variable in the Random Forest tab, or draw scatter plots colored by species in Graph Builder.

Data source: https://archive.ics.uci.edu/dataset/53/iris

License: Public Domain

Heart Failure

Clinical records of 299 heart failure patients (299 rows, 13 columns).

Columns

  • age: Age
  • anaemia: Anaemia status (0: No, 1: Yes)
  • creatinine_phosphokinase: CPK enzyme level (U/L)
  • diabetes: Diabetes status (0: No, 1: Yes)
  • ejection_fraction: Ejection fraction (%)
  • high_blood_pressure: High blood pressure status (0: No, 1: Yes)
  • platelets: Platelet count (kiloplatelets/mL)
  • serum_creatinine: Serum creatinine (mg/dL)
  • serum_sodium: Serum sodium (mEq/L)
  • sex: Sex (0: Female, 1: Male)
  • smoking: Smoking status (0: No, 1: Yes)
  • time: Follow-up period (days)
  • DEATH_EVENT: Death event (0: Survived, 1: Died)

From the Analysis menu, open Survival Analysis and select the Kaplan-Meier tab. Set time as the Time Variable and DEATH_EVENT as the Event Variable to generate Kaplan-Meier survival curves. See the Survival Analysis with the Kaplan-Meier Method tutorial for step-by-step instructions.
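To show what the Time Variable and Event Variable contribute, here is a minimal sketch of the Kaplan-Meier product-limit estimate in plain Python. It is an illustration of the method, not the MIDAS implementation. The function name `kaplan_meier` and the five toy records are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    times  : follow-up time per subject
    events : 1 if the event occurred at that time, 0 if censored
    """
    pairs = sorted(zip(times, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        leaving = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored subjects leave the risk set.
        at_risk -= leaving
    return curve


# Toy records (not from the dataset): days of follow-up and event flag.
toy_times = [5, 8, 8, 12, 20]
toy_events = [1, 0, 1, 1, 0]
curve = kaplan_meier(toy_times, toy_events)
for t, s in curve:
    print(t, round(s, 3))
```

Censored subjects (event flag 0) never lower the curve themselves. They only shrink the number at risk for later event times.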

Data source: https://archive.ics.uci.edu/dataset/519/heart+failure+clinical+records

License: CC BY 4.0

Attribution: "Chicco, D., & Jurman, G. (2020). Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making, 20, 16. https://doi.org/10.1186/s12911-020-1023-5"

Dose Response

Insecticide dose-response data (8 rows, 4 columns).

Columns

  • dose: Insecticide concentration (mg/L)
  • exposed: Number of insects exposed at each dose (trials)
  • dead: Number of insects that died (successes)
  • mortality_rate: Mortality rate (calculated as dead / exposed)

In the GLM tab, select the Binomial family, switch Response format to Grouped, and set dead as Successes and exposed as Trials. See the Grouped Binomial GLM Tutorial for step-by-step instructions.
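In grouped format, each row summarizes a batch of binary outcomes: dead counts the successes among exposed trials. The sketch below shows how a grouped row relates to its ungrouped 0/1 equivalent and to mortality_rate. The three rows are hypothetical values, not the dataset's contents, and the helper names are made up for illustration.

```python
# Hypothetical grouped rows using the dataset's column names.
rows = [
    {"dose": 1.0, "exposed": 20, "dead": 2},
    {"dose": 2.0, "exposed": 20, "dead": 9},
    {"dose": 4.0, "exposed": 20, "dead": 17},
]


def mortality_rate(row):
    """Observed proportion of successes: dead / exposed."""
    return row["dead"] / row["exposed"]


def expand_grouped(row):
    """Ungrouped equivalent: one 0/1 outcome per exposed insect."""
    return [1] * row["dead"] + [0] * (row["exposed"] - row["dead"])


for row in rows:
    outcomes = expand_grouped(row)
    print(row["dose"], mortality_rate(row), sum(outcomes), len(outcomes))
```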

Data source: Synthetic data created by the MIDAS project

License: CC0 (Public Domain)

Assembly Line

Dimensional inspection data from an automotive parts assembly plant (300 rows, 7 columns). Records dimension errors and environmental conditions across 3 production lines, 2 shifts, and 5 operators.

Columns

  • line: Assembly line (A, B, C)
  • shift: Shift (Day, Night)
  • operator: Operator ID (Op1-Op5)
  • temperature: Ambient temperature (°C)
  • humidity: Humidity (%)
  • cycle_time: Cycle time (seconds)
  • dimension_error: Deviation from target dimension (mm)

Use the ANOVA tab with line as the grouping factor and dimension_error as the response to estimate differences between lines, or the Linear Regression tab with environmental variables such as temperature and humidity as predictors to analyze contributing factors. See the Assembly Line Dimension Error Analysis tutorial for step-by-step instructions.

Data source: Synthetic data created by the MIDAS project

License: CC0 (Public Domain)

Injection Molding

Synthetic data representing a factorial design of experiments (DoE) for injection molding (16 rows, 4 columns).

Columns

  • Temperature: Mold temperature
  • Pressure: Injection pressure
  • CycleTime: Cycle time
  • Strength: Strength of the molded part (response variable)

Designed as a full factorial experiment over combinations of factor levels, suitable for practicing estimation of main effects and interactions.
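For a two-level factorial design, a main effect is the difference between the mean response at a factor's high and low levels, and an interaction effect is the same contrast applied to the product of the coded levels. The sketch below illustrates this on a hypothetical 2 x 2 design with levels coded as -1 and +1. The response values are made up, and the actual dataset's factor levels and layout may differ.

```python
from statistics import mean

# Hypothetical runs: (factor A level, factor B level, response).
runs = [
    (-1, -1, 20.0),
    (+1, -1, 30.0),
    (-1, +1, 25.0),
    (+1, +1, 45.0),
]


def contrast(runs, sign_of):
    """Mean response where sign_of(run) is +1 minus mean where it is -1."""
    high = [y for a, b, y in runs if sign_of(a, b) > 0]
    low = [y for a, b, y in runs if sign_of(a, b) < 0]
    return mean(high) - mean(low)


effect_a = contrast(runs, lambda a, b: a)       # main effect of A
effect_b = contrast(runs, lambda a, b: b)       # main effect of B
effect_ab = contrast(runs, lambda a, b: a * b)  # A x B interaction
print(effect_a, effect_b, effect_ab)
```

A nonzero interaction means that the effect of one factor depends on the level of the other.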

Data source: Synthetic data created by the MIDAS project

License: CC0 (Public Domain)

Student's Sleep

Data published in 1908 by William Sealy Gosset under the pseudonym "Student", in the same paper in which he derived the t-distribution (20 rows, 3 columns). Each of 10 subjects received two soporific drugs, and the increase in sleep compared to an unmedicated baseline was recorded. This is a paired design: each subject received both drugs, so the two measurements for a subject can be compared directly.

Columns

  • ID: Subject identifier (1-10)
  • extra: Increase in hours of sleep compared to unmedicated baseline
  • group: Drug administered (Drug 1, Drug 2)

You can compare statistics of extra by drug in the Statistics tab, or visualize the difference in distributions between the drugs in Graph Builder.
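Because the design is paired, the natural quantity to analyze is the per-subject difference in extra between the two drugs. The sketch below computes those differences and the paired t statistic outside MIDAS, using only the Python standard library. The values are those commonly distributed for this dataset (for example as R's `sleep` data), listed in subject order. Treat them as an assumption and check them against the loaded data.

```python
from math import sqrt
from statistics import mean, stdev

# extra (hours) for subjects 1-10; assumed values, verify against the data.
drug1 = [0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0]
drug2 = [1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4]

# Per-subject difference (Drug 2 minus Drug 1).
diffs = [b - a for a, b in zip(drug1, drug2)]
mean_diff = mean(diffs)

# Paired t statistic: mean difference over its standard error.
t_stat = mean_diff / (stdev(diffs) / sqrt(len(diffs)))
print(f"mean difference = {mean_diff:.2f} h, "
      f"paired t = {t_stat:.2f} (df = {len(diffs) - 1})")
```

Comparing the two groups as if they were independent would ignore the subject-to-subject variation that the pairing removes.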

Data source: Student (1908). The Probable Error of a Mean. Biometrika, 6(1), 1-25.

License: Public Domain (published 1908)