Sample Datasets
MIDAS includes sample data that you can use to learn data analysis and visualization.
Datasets licensed under CC BY 4.0 require attribution when you redistribute or publish the data or adaptations of it. You can use the attribution text listed in each applicable section as is. Datasets under CC0 or in the public domain carry no attribution requirement.
How to Open Sample Data
- Open MIDAS to see the launcher screen
- Click the dataset you want from the "Sample Data" section in the left sidebar
- The data loads and the project screen opens
Palmer Penguins
Measurement data of three penguin species observed in Antarctica (344 rows, 8 columns).
Columns
- species: Penguin species (Adelie, Chinstrap, Gentoo)
- island: Island name
- bill_length_mm: Bill length
- bill_depth_mm: Bill depth
- flipper_length_mm: Flipper length
- body_mass_g: Body mass
- sex: Sex
- year: Survey year
Contains some missing values.
You can draw scatter plots colored by species in Graph Builder, or compare statistics by species in the Statistics tab.
Data source: https://allisonhorst.github.io/palmerpenguins/
License: CC0 (Public Domain)
Gapminder
Data for 142 countries from 1952 to 2007 (1,704 rows, 6 columns, 5-year intervals). Analyze trends in life expectancy, population, and GDP.
Columns
- country: Country name
- continent: Continent
- year: Year
- lifeExp: Life expectancy
- pop: Population
- gdpPercap: GDP per capita (PPP, constant 2005 international dollars)
Data source: https://www.gapminder.org/data/
License: CC BY 4.0
Attribution: "Data from Gapminder Foundation, https://www.gapminder.org/data/, CC BY 4.0"
Auto MPG
Automobile fuel efficiency data from 1970 to 1982 (398 rows, 9 columns).
Columns
- mpg: Fuel efficiency (miles per gallon)
- cylinders: Number of cylinders (3, 4, 5, 6, 8)
- displacement: Engine displacement (cubic inches)
- horsepower: Horsepower
- weight: Vehicle weight (pounds)
- acceleration: Time to accelerate from 0 to 60 mph (seconds)
- model_year: Model year (70 = 1970, 82 = 1982)
- origin: Country of origin (usa, europe, japan)
- name: Vehicle model name
Contains some missing values.
You can run a regression with mpg as the response variable in the Linear Regression tab, or examine correlations in the Statistics tab.
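As a rough illustration of what a regression with mpg as the response fits, the following Python sketch computes an ordinary least squares line by hand for a single predictor. The `fit_line` helper and the weight and mpg values are hypothetical (the values are chosen to lie on an exact line and are not rows from the dataset); this is not a description of how MIDAS computes the fit.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical values for illustration; not rows from the Auto MPG data.
weight = [2000, 3000, 4000]  # pounds
mpg = [30, 22, 14]           # miles per gallon

intercept, slope = fit_line(weight, mpg)
print(f"mpg = {intercept:.1f} + ({slope:.4f}) * weight")
```

A negative slope here means that heavier vehicles are predicted to have lower fuel efficiency.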
Data source: https://archive.ics.uci.edu/dataset/9/auto+mpg
License: Public Domain
World Bank
Development indicators for 52 major countries (52 rows, 10 columns, 2021-2022 data).
Columns
- country: Country name
- country_code: Country code
- region: Region
- income_group: Income group
- population_2022: Population (2022)
- gdp_usd_billions_2022: GDP (billions USD, 2022)
- gdp_per_capita_2022: GDP per capita (2022, current USD)
- life_expectancy_2021: Life expectancy (2021)
- urban_population_percent_2022: Urban population percentage (2022)
- internet_users_percent_2021: Internet usage rate (2021)
You can compare statistics by income group in the Statistics tab, or visualize relationships between indicators in Graph Builder.
Data source: https://data.worldbank.org/
License: CC BY 4.0
Attribution: "Data from World Bank Open Data, https://data.worldbank.org/, CC BY 4.0"
Bike Sharing
Washington D.C. bike sharing data (2011-2012). Available in two versions: daily (731 rows) and hourly (17,379 rows). These appear in the launcher as two separate entries: "Bike Sharing (Daily)" and "Bike Sharing (Hourly)".
Time Variables
- instant: Record ID
- dteday: Date (YYYY-MM-DD)
- season: Season (1: Winter, 2: Spring, 3: Summer, 4: Fall)
- yr: Year (0: 2011, 1: 2012)
- mnth: Month (1-12)
- hr: Hour (0-23, hourly data only)
- weekday: Day of week (0: Sunday, 6: Saturday)
- holiday: Holiday flag (0: Regular day, 1: Holiday)
- workingday: Working day flag (1: Weekday, 0: Weekend or holiday)
Weather Variables
- weathersit: Weather condition
  - 1: Clear, few clouds, partly cloudy
  - 2: Mist + cloudy, mist + broken clouds
  - 3: Light snow, light rain + thunderstorm + scattered clouds
  - 4: Heavy rain + ice pellets + thunderstorm + mist
- temp: Normalized temperature (Celsius divided by 41)
- atemp: Normalized feeling temperature (Celsius divided by 50)
- hum: Normalized humidity (humidity divided by 100)
- windspeed: Normalized wind speed (divided by max wind speed of 67 km/h)
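To read the normalized weather variables in physical units, the scaling described above can be reversed by multiplication. The Python sketch below assumes the divisors are exactly those listed (41, 50, 100, and 67); the `denormalize` helper, its output key names, and the example record are hypothetical and not part of MIDAS or the dataset.

```python
# Divisors taken from the variable descriptions above.
TEMP_SCALE_C = 41.0     # temp = Celsius / 41
ATEMP_SCALE_C = 50.0    # atemp = Celsius / 50
HUM_SCALE_PCT = 100.0   # hum = percent / 100
WIND_SCALE_KMH = 67.0   # windspeed = km/h / 67


def denormalize(record):
    """Return a new dict with the weather fields converted to physical units."""
    return {
        "temp_c": record["temp"] * TEMP_SCALE_C,
        "atemp_c": record["atemp"] * ATEMP_SCALE_C,
        "humidity_pct": record["hum"] * HUM_SCALE_PCT,
        "windspeed_kmh": record["windspeed"] * WIND_SCALE_KMH,
    }


# Hypothetical record for illustration; not a row from the dataset.
example = {"temp": 0.5, "atemp": 0.5, "hum": 0.6, "windspeed": 0.25}
converted = denormalize(example)
print(converted)
```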
Usage Counts
- casual: Casual user count
- registered: Registered user count
- cnt: Total count (casual + registered)
The usage counts are count data in which overdispersion (variance exceeding the mean) is expected. This makes the dataset a good exercise in diagnosing overdispersion with Poisson regression in the GLM tab.
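A quick first look at overdispersion is the ratio of the sample variance to the sample mean, which would be close to 1 for Poisson-distributed counts. The Python sketch below shows that ratio on made-up numbers; the `dispersion_index` helper and the counts are hypothetical. It is only a marginal check: a fitted Poisson regression is usually diagnosed from its residuals after accounting for the predictors, which this sketch does not do.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by sample mean (about 1 for Poisson counts)."""
    return variance(counts) / mean(counts)


# Hypothetical daily totals for illustration; not values from the dataset.
illustrative_counts = [120, 450, 980, 1500, 2300, 3100, 4200, 5100]

ratio = dispersion_index(illustrative_counts)
print(f"variance / mean = {ratio:.1f}")
```

A ratio far above 1 suggests that a plain Poisson model is likely to understate the variability in the counts.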
Data source: https://archive.ics.uci.edu/dataset/275/bike+sharing+dataset
License: CC0 (Public Domain)
Earthquakes
Worldwide earthquake data from September 2024 (1,041 rows, 7 columns, magnitude 4.0+).
Columns
- time: Occurrence datetime
- latitude, longitude: Location
- depth: Hypocentral depth (km)
- mag: Magnitude
- magType: Magnitude type (mb: body-wave magnitude, mww: moment magnitude (W-phase), etc.)
- place: Location description
You can visualize how earthquake frequency changes over time with the time series plot or datetime histogram in Graph Builder, or check epicenter locations with a latitude-longitude scatter plot.
Data source: https://www.usgs.gov/programs/earthquake-hazards
License: Public Domain (USGS Data)
Iris
Measurement data of three iris species, a classic classification dataset (150 rows, 5 columns).
Columns
- sepal_length, sepal_width: Sepal dimensions
- petal_length, petal_width: Petal dimensions
- species: Species
You can run a classification with species as the response variable in the Random Forest tab, or draw scatter plots colored by species in Graph Builder.
Data source: https://archive.ics.uci.edu/dataset/53/iris
License: Public Domain
Heart Failure
Clinical records of 299 heart failure patients (299 rows, 13 columns).
Columns
- age: Age
- anaemia: Anaemia status (0: No, 1: Yes)
- creatinine_phosphokinase: CPK enzyme level (U/L)
- diabetes: Diabetes status (0: No, 1: Yes)
- ejection_fraction: Ejection fraction (%)
- high_blood_pressure: High blood pressure status (0: No, 1: Yes)
- platelets: Platelet count (kiloplatelets/mL)
- serum_creatinine: Serum creatinine (mg/dL)
- serum_sodium: Serum sodium (mEq/L)
- sex: Sex (0: Female, 1: Male)
- smoking: Smoking status (0: No, 1: Yes)
- time: Follow-up period (days)
- DEATH_EVENT: Death event (0: Survived, 1: Died)
From the Analysis menu, open Survival Analysis and select the Kaplan-Meier tab. Set time as the Time Variable and DEATH_EVENT as the Event Variable to generate Kaplan-Meier survival curves. See the Survival Analysis with the Kaplan-Meier Method tutorial for step-by-step instructions.
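To make the roles of the Time Variable and Event Variable concrete, here is a minimal Python sketch of the Kaplan-Meier product-limit estimate. The `kaplan_meier` function and the toy follow-up values are hypothetical; they are not rows from the Heart Failure data and this is not a description of how MIDAS implements the method.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where an event occurred.

    times  -- follow-up time for each subject
    events -- 1 if the event was observed at that time, 0 if censored
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        # Subjects still under observation just before time t.
        at_risk = sum(1 for ti in times if ti >= t)
        # Events observed exactly at time t (censored subjects excluded).
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei == 1)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
    return curve


# Hypothetical follow-up data (days) for illustration only.
toy_time = [5, 8, 8, 12, 20, 20]
toy_event = [1, 0, 1, 1, 0, 0]

km_curve = kaplan_meier(toy_time, toy_event)
for t, s in km_curve:
    print(f"day {t}: S(t) = {s:.3f}")
```

Censored subjects (event value 0) never lower the curve directly, but they do leave the at-risk set, which makes later drops larger.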
Data source: https://archive.ics.uci.edu/dataset/519/heart+failure+clinical+records
License: CC BY 4.0
Attribution: "Chicco, D., & Jurman, G. (2020). Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making, 20, 16. https://doi.org/10.1186/s12911-020-1023-5"
Dose Response
Insecticide dose-response data (8 rows, 4 columns).
Columns
- dose: Insecticide concentration (mg/L)
- exposed: Number of insects exposed at each dose (trials)
- dead: Number of insects that died (successes)
- mortality_rate: Mortality rate (calculated from dead / exposed)
In the GLM tab, select the Binomial family, switch Response format to Grouped, and set dead as Successes and exposed as Trials. See the Grouped Binomial GLM Tutorial for step-by-step instructions.
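The grouped format pairs a success count with a trial count for each row. The Python sketch below shows how the observed proportion relates to those two columns and, assuming a logit link (the usual default for the Binomial family, though this page does not state which link MIDAS uses), the scale on which the model is linear. The rows and helper functions are hypothetical and are not the values in the sample file.

```python
import math

# Hypothetical grouped rows for illustration; not the sample file's values.
rows = [
    {"dose": 1.0, "exposed": 50, "dead": 5},
    {"dose": 2.0, "exposed": 50, "dead": 20},
    {"dose": 4.0, "exposed": 50, "dead": 40},
]


def mortality_rate(dead, exposed):
    """Observed proportion of successes (deaths) among trials (exposed)."""
    return dead / exposed


def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1 - p))


rates = [mortality_rate(r["dead"], r["exposed"]) for r in rows]
for row, rate in zip(rows, rates):
    print(f"dose {row['dose']}: rate = {rate:.2f}, logit = {logit(rate):+.2f}")
```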
Data source: Synthetic data created by the MIDAS project
License: CC0 (Public Domain)
Assembly Line
Dimensional inspection data from an automotive parts assembly plant (300 rows, 7 columns). Records dimension errors and environmental conditions across 3 production lines, 2 shifts, and 5 operators.
Columns
- line: Assembly line (A, B, C)
- shift: Shift (Day, Night)
- operator: Operator ID (Op1 to Op5)
- temperature: Ambient temperature (°C)
- humidity: Humidity (%)
- cycle_time: Cycle time (seconds)
- dimension_error: Deviation from target dimension (mm)
In the ANOVA tab, use line as the factor and dimension_error as the response to estimate differences between lines, or use the Linear Regression tab with the environmental variables as predictors to analyze contributing factors. See the Assembly Line Dimension Error Analysis tutorial for step-by-step instructions.
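For readers who want to see what a one-way comparison of lines computes, the following Python sketch derives the F statistic from between-group and within-group sums of squares. The `one_way_anova_f` function and the error values are hypothetical; they are not taken from the sample data, and the output MIDAS reports may include more than this single statistic.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of groups of values."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical dimension errors (mm) per line, for illustration only.
errors_by_line = {
    "A": [0.01, 0.02, 0.03],
    "B": [0.02, 0.03, 0.04],
    "C": [0.05, 0.06, 0.07],
}

f_stat = one_way_anova_f(list(errors_by_line.values()))
print(f"F = {f_stat:.2f}")
```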
Data source: Synthetic data created by the MIDAS project
License: CC0 (Public Domain)
Injection Molding
Synthetic data representing a factorial design of experiments (DoE) for injection molding (16 rows, 4 columns).
Columns
- Temperature: Mold temperature
- Pressure: Injection pressure
- CycleTime: Cycle time
- Strength: Strength of the molded part (response variable)
Designed as a full factorial experiment over combinations of factor levels, suitable for practicing estimation of main effects and interactions.
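As a sketch of what a full factorial layout and a main-effect estimate look like, the Python example below builds every combination of three two-level factors in coded units (-1 for low, +1 for high). The number of levels, the coding, and the response values are assumptions for illustration: this page does not state the levels or replication used in the 16-row sample file, and `main_effect` is a hypothetical helper.

```python
from itertools import product

factors = ["Temperature", "Pressure", "CycleTime"]

# Every combination of low (-1) and high (+1) for each factor: 2**3 = 8 runs.
design = [dict(zip(factors, combo)) for combo in product([-1, 1], repeat=len(factors))]


def main_effect(design, response, factor):
    """Mean response at the high level minus mean response at the low level."""
    high = [y for run, y in zip(design, response) if run[factor] == 1]
    low = [y for run, y in zip(design, response) if run[factor] == -1]
    return sum(high) / len(high) - sum(low) / len(low)


# Hypothetical responses built from a known rule so the effects are easy to check.
strength = [50 + 4 * run["Temperature"] + 2 * run["Pressure"] for run in design]

for factor in factors:
    print(factor, main_effect(design, strength, factor))
```

Because the hypothetical response ignores CycleTime, its estimated main effect is zero, while Temperature and Pressure recover twice their coded coefficients.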
Data source: Synthetic data created by the MIDAS project
License: CC0 (Public Domain)
Student's Sleep
Data published in 1908 by William Sealy Gosset under the pseudonym "Student", in the same paper in which he derived the t-distribution (20 rows, 3 columns). Each of 10 subjects received two soporific drugs, and the increase in sleep compared to an unmedicated baseline was recorded for each drug. Because every subject received both drugs, this is a paired design.
Columns
- ID: Subject identifier (1-10)
- extra: Increase in hours of sleep compared to unmedicated baseline
- group: Drug administered (Drug 1, Drug 2)
You can compare statistics of extra by drug in the Statistics tab, or visualize the difference in distributions between the drugs in Graph Builder.
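Because the design is paired, the natural analysis works on the per-subject difference between the two drugs rather than on two independent groups. The Python sketch below computes the paired t statistic by hand. The values are those commonly reproduced from Student's paper (for example in R's built-in `sleep` dataset); that the MIDAS copy matches them exactly, and in this subject order, is an assumption.

```python
from math import sqrt
from statistics import mean, stdev

# Extra hours of sleep per subject (subjects 1-10), assumed to match the sample file.
drug1 = [0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0]
drug2 = [1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4]

# Paired analysis: one difference per subject.
differences = [b - a for a, b in zip(drug1, drug2)]
n = len(differences)

mean_diff = mean(differences)
t_stat = mean_diff / (stdev(differences) / sqrt(n))

print(f"mean difference = {mean_diff:.2f} h, t = {t_stat:.2f}, df = {n - 1}")
```

Treating the two drugs as independent groups would ignore the subject-to-subject variation that the pairing removes, and would give a noticeably weaker test on the same numbers.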
Data source: Student (1908). The Probable Error of a Mean. Biometrika, 6(1), 1-25.
License: Public domain (published 1908)