Sample Datasets

MIDAS includes sample data that you can use to learn data analysis and visualization.

How to Open Sample Data

  1. Open MIDAS to see the launcher screen
  2. Click the dataset you want from the "Sample Data" section in the left sidebar
  3. The data loads and the project screen opens

Palmer Penguins

Measurement data of three penguin species observed in Antarctica (344 rows, 8 columns).

Columns

  • species: Penguin species (Adelie, Chinstrap, Gentoo)
  • island: Island name
  • bill_length_mm: Bill length
  • bill_depth_mm: Bill depth
  • flipper_length_mm: Flipper length
  • body_mass_g: Body mass
  • sex: Sex
  • year: Survey year

Some rows have missing values in the measurement columns and in sex.

Data source: https://allisonhorst.github.io/palmerpenguins/

License: CC0 (Public Domain)

Gapminder

Country-level data from 1952 to 2007 (1,704 rows, 6 columns). Analyze trends in life expectancy, population, and GDP.

Columns

  • country: Country name
  • continent: Continent
  • year: Year
  • lifeExp: Life expectancy
  • pop: Population
  • gdpPercap: GDP per capita

Data source: https://www.gapminder.org/data/

License: CC BY 4.0

Attribution: "Data from Gapminder Foundation, https://www.gapminder.org/data/, CC BY 4.0"

Auto MPG

Automobile fuel efficiency data from 1970 to 1982 (398 rows, 9 columns).

Columns

  • mpg: Fuel efficiency (miles per gallon)
  • cylinders: Number of cylinders (mostly 4, 6, or 8; a few vehicles have 3 or 5)
  • displacement: Engine displacement (cubic inches)
  • horsepower: Horsepower
  • weight: Vehicle weight (pounds)
  • acceleration: Acceleration (0-60 mph time in seconds)
  • model_year: Model year (70 = 1970, 82 = 1982)
  • origin: Country of origin (usa, europe, japan)
  • name: Vehicle model name

The horsepower column contains some missing values.

Data source: https://archive.ics.uci.edu/dataset/9/auto+mpg

License: Public Domain

World Bank

Development indicators for 50 major countries (50 rows, 10 columns, 2021-2022 data).

Columns

  • country: Country name
  • country_code: Country code
  • region: Region
  • income_group: Income group
  • population_2022: Population (2022)
  • gdp_usd_billions_2022: GDP (billions USD, 2022)
  • gdp_per_capita_2022: GDP per capita (2022)
  • life_expectancy_2021: Life expectancy (2021)
  • urban_population_percent_2022: Urban population percentage (2022)
  • internet_users_percent_2021: Internet usage rate (2021)

Data source: https://data.worldbank.org/

License: CC BY 4.0

Attribution: "Data from World Bank Open Data, https://data.worldbank.org/, CC BY 4.0"

Bike Sharing

Washington D.C. bike sharing data (2011-2012). Available in two versions: daily (731 rows) and hourly (17,379 rows).

Time Variables

  • instant: Record ID
  • dteday: Date (YYYY-MM-DD)
  • season: Season (1: Spring, 2: Summer, 3: Fall, 4: Winter)
  • yr: Year (0: 2011, 1: 2012)
  • mnth: Month (1-12)
  • hr: Hour (0-23, hourly data only)
  • weekday: Day of week (0: Sunday, 6: Saturday)
  • holiday: Holiday flag (0: Regular day, 1: Holiday)
  • workingday: Working day flag (1: Weekday, 0: Weekend or holiday)

Weather Variables

  • weathersit: Weather condition
    • 1: Clear, few clouds, partly cloudy
    • 2: Mist + cloudy, mist + broken clouds
    • 3: Light snow, light rain + thunderstorm + scattered clouds
    • 4: Heavy rain + ice pellets + thunderstorm + mist
  • temp: Normalized temperature (Celsius divided by 41)
  • atemp: Normalized feeling temperature (Celsius divided by 50)
  • hum: Normalized humidity (humidity divided by 100)
  • windspeed: Normalized wind speed (wind speed divided by 67)
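
Because the weather variables are stored in normalized form, you may want to convert them back before reading values off a chart. The following is a minimal sketch in Python, assuming the divisors listed above; the function name and the example row are hypothetical and not part of MIDAS.

```python
# Divisors taken from the variable descriptions above.
SCALE = {"temp": 41.0, "atemp": 50.0, "hum": 100.0, "windspeed": 67.0}


def denormalize(row):
    """Return a copy of `row` with the weather columns multiplied back out."""
    out = dict(row)
    for col, factor in SCALE.items():
        if col in out:
            out[col] = out[col] * factor
    return out


# Hypothetical row for illustration only, not a record from the dataset.
example = {"temp": 0.5, "atemp": 0.48, "hum": 0.8, "windspeed": 0.2}
restored = denormalize(example)
print(restored)
```

With this hypothetical row, a normalized temp of 0.5 corresponds to 20.5 degrees Celsius.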

Usage Counts

  • casual: Casual user count
  • registered: Registered user count
  • cnt: Total count (casual + registered)

The usage counts are count data in which overdispersion (variance greater than the mean) is expected.
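
One quick way to check for overdispersion is to compare the sample variance of a count column with its mean. The sketch below uses Python's standard library; the function name and the counts are hypothetical and only illustrate the calculation.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Sample variance divided by the mean; roughly 1 for Poisson-like data."""
    return variance(counts) / mean(counts)


# Hypothetical counts for illustration only, not values from the dataset.
counts = [12, 40, 7, 95, 150, 33, 61, 210]
ratio = dispersion_ratio(counts)
overdispersed = ratio > 1
print(ratio, overdispersed)
```

A ratio well above 1 suggests that a model allowing extra variance may describe the counts better than a plain Poisson model.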

Data source: https://archive.ics.uci.edu/dataset/275/bike+sharing+dataset

License: CC0 (Public Domain)

Earthquakes

Worldwide earthquake data from September 2024 (1,041 rows, 7 columns, magnitude 4.0+).

Columns

  • time: Occurrence datetime
  • latitude, longitude: Location
  • depth: Depth
  • mag: Magnitude
  • place: Location description

Data source: https://www.usgs.gov/programs/earthquake-hazards

License: Public Domain (USGS Data)

Iris

Measurement data of three iris species, a classic classification dataset (150 rows, 5 columns).

Columns

  • sepal_length, sepal_width: Sepal dimensions
  • petal_length, petal_width: Petal dimensions
  • species: Species

Data source: https://archive.ics.uci.edu/dataset/53/iris

License: Public Domain

Heart Failure

Clinical records of 299 heart failure patients (299 rows, 13 columns).

Columns

  • age: Age
  • anaemia: Anaemia status (0: No, 1: Yes)
  • creatinine_phosphokinase: CPK enzyme level (mcg/L)
  • diabetes: Diabetes status (0: No, 1: Yes)
  • ejection_fraction: Ejection fraction (%)
  • high_blood_pressure: High blood pressure status (0: No, 1: Yes)
  • platelets: Platelet count (kiloplatelets/mL)
  • serum_creatinine: Serum creatinine (mg/dL)
  • serum_sodium: Serum sodium (mEq/L)
  • sex: Sex (0: Female, 1: Male)
  • smoking: Smoking status (0: No, 1: Yes)
  • time: Follow-up period (days)
  • DEATH_EVENT: Death event (0: Survived, 1: Died)

In the Survival Analysis tab, select time as the Time Variable and DEATH_EVENT as the Event Variable to generate Kaplan-Meier survival curves.
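
To show what the two variables contribute, here is a minimal Kaplan-Meier sketch in plain Python. It is not MIDAS's implementation; the function name and the sample values are hypothetical. The times list plays the role of time, and events plays the role of DEATH_EVENT (1 means the event occurred, 0 means the subject was censored).

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] for each distinct time with at least one event."""
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        # Subjects whose follow-up ends at t, and how many of them had the event.
        leaving = sum(1 for ti in times if ti == t)
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical follow-up data for illustration only.
times = [5, 8, 8, 12, 20, 20]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 4))
```

Censored subjects never lower the curve directly; they only shrink the number at risk for later time points.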

Data source: https://archive.ics.uci.edu/dataset/519/heart+failure+clinical+records

License: CC BY 4.0

Attribution: "Chicco, D., Jurman, G. (2020). BMC Medical Informatics and Decision Making. https://doi.org/10.1186/s12911-020-1023-5"

Dose Response

Insecticide dose-response data (8 rows, 4 columns).

Columns

  • dose: Insecticide concentration (mg/L)
  • exposed: Number of insects exposed at each dose (trials)
  • dead: Number of insects that died (successes)
  • mortality_rate: Mortality rate (for reference)

In the GLM tab, select the Binomial family, switch Response format to Grouped, and set dead as Successes and exposed as Trials. See the Grouped Binomial GLM Tutorial for step-by-step instructions.
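
For readers curious about what a grouped binomial fit does with Successes and Trials, the following sketch fits logit(p) = b0 + b1 * dose by Newton-Raphson in plain Python. It is an illustration under stated assumptions, not the routine MIDAS uses: the function name is hypothetical, the numbers are made up rather than copied from the sample file, and the raw dose is used as the predictor.

```python
import math


def fit_grouped_logit(x, successes, trials, iterations=25):
    """Fit logit(p) = b0 + b1 * x to grouped binomial data by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, ni in zip(x, successes, trials):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            resid = yi - ni * p      # observed minus expected successes
            w = ni * p * (1.0 - p)   # binomial variance weight
            g0 += resid
            g1 += resid * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Solve the 2x2 Newton system H * step = g.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical values for illustration only, not the rows of the sample file.
dose = [1, 2, 3, 4, 5, 6, 7, 8]
exposed = [50] * 8
dead = [2, 5, 12, 20, 30, 39, 45, 48]

b0, b1 = fit_grouped_logit(dose, dead, exposed)
ld50 = -b0 / b1  # dose at which the fitted mortality is 50%
print(b0, b1, ld50)
```

Each row contributes in proportion to its number of trials, which is why the grouped format needs both columns rather than mortality_rate alone.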

Data source: Synthetic data (inspired by Bliss, 1935)

License: CC0 (Public Domain)

Student's Sleep

Data published in 1908 by William Sealy Gosset under the pseudonym "Student", in the same paper that introduced the t-test (20 rows, 3 columns). The dataset records the extra hours of sleep gained by each of 10 subjects under two soporific drugs, compared to a control.

Columns

  • ID: Subject identifier (1-10)
  • extra: Increase in hours of sleep compared to control
  • group: Drug administered (Drug 1, Drug 2)
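
Because each subject appears once per drug, the natural analysis is a paired t-test on the per-subject differences. The sketch below computes the statistic with Python's standard library. The values are those distributed in R's built-in sleep dataset, ordered by subject ID; that they match the MIDAS copy exactly is an assumption.

```python
import math
from statistics import mean, stdev

# `extra` values per subject (ID 1-10), as in R's built-in `sleep` dataset.
# Assumed, not verified, to match the copy bundled with MIDAS.
drug1 = [0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0]
drug2 = [1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4]

# Paired design: work with the within-subject differences.
diffs = [b - a for a, b in zip(drug1, drug2)]
n = len(diffs)
t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
df = n - 1
print(round(t_stat, 4), df)
```

With these values the mean difference is 1.58 hours and the t statistic is about 4.06 on 9 degrees of freedom.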

Data source: Student (1908). The Probable Error of a Mean. Biometrika, 6(1), 1-25.

License: Public Domain (published 1908)