Datasets

A MIDAS project consists of data loaded from CSV files and derived data generated through SQL queries and other transformations. This page describes the different dataset types and how they behave.

Dataset Types

Primary Dataset

When you load a CSV or TSV file, a Primary Dataset is created. It stores the imported data as-is and supports direct cell editing and row exclusion.

Project Overview displays metadata such as the original file name, import date, and file size.

Derived Dataset

Derived Datasets are generated by transformation operations including SQL Editor, Crosstab, Reshape, and Dummy Coding, or by Save as Dataset in Filtered Data. Each Derived Dataset records which parent dataset(s) it depends on and which operation produced it.

You cannot directly edit data in a Derived Dataset. To change the data, modify the transformation operation or update the parent dataset.

Ephemeral Dataset

An Ephemeral Dataset is a temporary dataset that is not a permanent part of the project. Unlike Primary and Derived Datasets, which are saved in the project, an Ephemeral Dataset exists only while its tab is open and is discarded when you close the tab. Filter results in the Filtered Data tab are an example.

However, if you save the project while the tab is still open, the Ephemeral Dataset's data is included in the project file so that the tab can be restored. Ephemeral Datasets not referenced by any open tab are excluded when you save. To keep a filter result permanently, save it as a Derived Dataset with Save as Dataset in Filtered Data or Save Filtered Data in Data Table.

Automatic Data Type Inference

When you load a CSV file, MIDAS automatically determines the data type for each column. See Data Preparation and Import for the supported data types (boolean, int64, float64, date, datetime, string, enum).

Inference follows a priority order: boolean is checked first, then numeric types (int64/float64), then date types (date/datetime). If none match, the column is assigned string type. The enum type is never auto-inferred -- create it by manually converting from string type.

Empty cells and missing values are treated as null and do not affect type inference.
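
The priority order above can be illustrated with a minimal Python sketch. This is not MIDAS's actual implementation: the function name infer_type, the accepted boolean spellings ("true"/"false"), the use of ISO 8601 for dates, and the choice of string for an all-empty column are all assumptions made for illustration.

```python
from datetime import date, datetime


def _parses(parser, text):
    """Return True if parser(text) succeeds without raising ValueError."""
    try:
        parser(text)
        return True
    except ValueError:
        return False


# Checks listed in the documented priority order. The accepted spellings
# (e.g. "true"/"false", ISO 8601 dates) are assumptions for illustration.
CHECKS = [
    ("boolean", lambda s: s.lower() in {"true", "false"}),
    ("int64", lambda s: _parses(int, s)),
    ("float64", lambda s: _parses(float, s)),
    ("date", lambda s: _parses(date.fromisoformat, s)),
    ("datetime", lambda s: _parses(datetime.fromisoformat, s)),
]


def infer_type(cells):
    """Infer a column type from raw CSV cell strings (hypothetical helper)."""
    # Empty cells are treated as null and skipped, so they never affect the result.
    values = [c.strip() for c in cells if c is not None and c.strip() != ""]
    if not values:
        return "string"  # assumption: an all-null column falls back to string
    for type_name, matches in CHECKS:
        if all(matches(v) for v in values):
            return type_name
    return "string"  # nothing matched; enum is never inferred
```

A type is chosen only when every non-null cell passes its check, which is why a single non-numeric value turns an otherwise numeric column into string.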

Schema Inheritance in Derived Datasets

When you create a Derived Dataset via SQL Editor or other operations, the resulting columns inherit metadata from the parent datasets. Specifically, the measurement scale (nominal, ordinal, interval, ratio), data type, and enum name are inherited.

The inheritance rules are:

  • If a result column has the same name as a column in the parent dataset, the parent's settings are inherited
  • The measurement scale, data type, and enum name are inherited independently of one another. When a column appears in several parents (e.g., in a JOIN), the dataset listed first in the FROM clause takes priority, and any attribute that an earlier parent does not define is filled in from a later one
  • If no matching parent column exists, the type is determined automatically from the query result

For example, if you changed a zip code column from interval to nominal scale in the parent dataset, running SELECT zip_code FROM ... in SQL Editor preserves the nominal scale setting.

However, when a transformation like CAST(category AS INTEGER) changes the result type, the parent's measurement scale is not inherited and is determined automatically from the result type instead. Adjust it manually if the automatic result does not match your intent.
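
The inheritance rules can be sketched in Python as follows. The ColumnMeta class and the inherit function are hypothetical names, not part of MIDAS, and None is used here to mean "determined automatically". Dropping the enum name when the type changes is an assumption; the text above only states that the measurement scale is not inherited in that case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnMeta:
    scale: Optional[str] = None      # nominal, ordinal, interval, ratio
    dtype: Optional[str] = None      # e.g. "int64", "string"
    enum_name: Optional[str] = None


def inherit(column, parents, result_dtype):
    """Resolve metadata for a result column (hypothetical helper).

    parents is a list of {column name: ColumnMeta} dicts in FROM-clause order.
    """
    merged = ColumnMeta()
    found = False
    for schema in parents:
        meta = schema.get(column)
        if meta is None:
            continue
        found = True
        # Each attribute is inherited independently: the first parent that
        # defines it wins, and later parents only fill in the gaps.
        for attr in ("scale", "dtype", "enum_name"):
            if getattr(merged, attr) is None:
                setattr(merged, attr, getattr(meta, attr))
    if not found:
        # No matching parent column: everything comes from the query result.
        return ColumnMeta(dtype=result_dtype)
    if merged.dtype is not None and merged.dtype != result_dtype:
        # The transformation changed the type (e.g. CAST), so the scale is
        # not inherited. Assumption: the enum name is dropped as well.
        return ColumnMeta(dtype=result_dtype)
    merged.dtype = result_dtype
    return merged
```

For instance, with two parents that both contain zip_code, the nominal scale set on the first parent is kept, while an enum name defined only on the second parent is still picked up.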

Cascade Deletion

Deleting a dataset cascades to dependent resources.

Deleted

  • Derived Datasets that depend on the deleted dataset (transitively, including grandchildren)
  • Models fitted on the deleted dataset or on any Derived Dataset that is deleted along with it

Not deleted

  • Reports (though report elements referencing deleted datasets are automatically removed from the report)

For example, if Primary Dataset "A" has Derived Dataset "B", and "B" has Derived Dataset "C" and Model "M", deleting "A" also deletes "B", "C", and "M".
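
The cascade in this example amounts to a transitive walk over the dependency graph. The following Python sketch shows the idea; the function cascade_delete and its two dictionary arguments are hypothetical and only mirror the rules listed above, not MIDAS's internal data structures.

```python
from collections import deque


def cascade_delete(target, derived_parents, model_dataset):
    """Return (datasets, models) that would be deleted with target.

    derived_parents maps each Derived Dataset name to its parent names.
    model_dataset maps each Model name to the dataset it was fitted on.
    """
    deleted = {target}
    queue = deque([target])
    while queue:
        current = queue.popleft()
        # Any Derived Dataset with a deleted parent is deleted as well,
        # which makes the cascade transitive (children, grandchildren, ...).
        for child, parents in derived_parents.items():
            if current in parents and child not in deleted:
                deleted.add(child)
                queue.append(child)
    # Models go away when the dataset they were fitted on is deleted.
    models = {m for m, d in model_dataset.items() if d in deleted}
    return deleted, models


# The example from the text: A -> B -> C, with Model M fitted on B.
datasets, models = cascade_delete("A", {"B": ["A"], "C": ["B"]}, {"M": "B"})
```

Here datasets contains "A", "B", and "C", and models contains "M". Deleting only "C" would leave "A", "B", and "M" untouched.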

To review the impact before deleting, check the dependency graph in Project Lineage.

Lazy Evaluation and Caching

Derived Datasets do not compute their data until it is needed. For example, immediately after opening a project file, Derived Dataset data has not yet been computed. The computation runs when you first reference the dataset -- such as opening a Data Table tab or rendering a graph -- and the result is cached in memory.

When a parent dataset is updated (via reload, type change, cell edit, etc.), all downstream Derived Dataset caches are discarded. Data is automatically recomputed the next time it is needed.

Models that depend on an updated dataset are re-estimated automatically. Their estimation results are hidden until re-estimation completes.
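
The compute-on-first-use and invalidate-on-update behavior of datasets can be sketched as below. The classes PrimaryDataset and DerivedDataset and their methods are hypothetical stand-ins for illustration, and model re-estimation is not modeled.

```python
class PrimaryDataset:
    def __init__(self, rows):
        self.rows = rows
        self.children = []

    def data(self):
        return self.rows

    def update(self, rows):
        # A reload, type change, or cell edit discards every downstream cache.
        self.rows = rows
        for child in self.children:
            child.invalidate()


class DerivedDataset:
    def __init__(self, parents, compute):
        self.parents = parents
        self.compute = compute
        self.children = []
        self._cached = False
        self._value = None
        for parent in parents:
            parent.children.append(self)

    def data(self):
        # Nothing is computed until the data is first requested.
        if not self._cached:
            self._value = self.compute(*[p.data() for p in self.parents])
            self._cached = True
        return self._value

    def invalidate(self):
        # Drop the cache here and in all descendants; recompute on next access.
        self._cached = False
        self._value = None
        for child in self.children:
            child.invalidate()
```

In this sketch, creating a DerivedDataset runs no computation. The first call to data() computes and caches the result, later calls reuse it, and calling update() on a parent forces a recomputation on the next access.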

Materialized View

By default, Derived Dataset data is not saved in the project file (MDS). It is recomputed from parent datasets each time the project is opened.

Enabling Materialized View includes the computed data in the MDS file:

  • Disabled (default): Smaller file size. Data is recomputed when needed
  • Enabled: Larger file size. Data is available immediately when the project is opened

This is useful for datasets produced by expensive SQL queries or large data transformations. When Materialized View is enabled, the MDS file contains the actual data, so be mindful of the contents when sharing the file with others. Configure it from the Datasets section in Project Overview.

Some datasets created from analysis results, such as prediction intervals and DoE plot data, cannot be recomputed from their operation. For these datasets, Materialized View is always enabled and cannot be turned off.

Renaming Datasets and Automatic SQL Updates

When you rename a dataset in Project Overview, SQL queries in Derived Datasets that reference the old name are automatically updated.

For example, renaming "sales_2024" to "sales" changes FROM "sales_2024" to FROM "sales" in any dependent SQL. The update is based on SQL parsing, so occurrences in string literals or comments are not affected.

The update applies to table references enclosed in double quotes, such as FROM "sales_2024". References written without quotes, such as FROM sales_2024, are not updated -- edit the SQL manually after renaming.
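
The effect of these rules can be approximated with a small Python sketch. It uses a regular-expression tokenizer rather than a real SQL parser, so it is a simplification: it would also rename a double-quoted column that happens to share the old name, which a parser-based update can avoid. The function name rename_table is hypothetical.

```python
import re

# Tokens that must be recognized so that only quoted identifiers are touched.
TOKEN = re.compile(
    r"'(?:[^']|'')*'"      # single-quoted string literal
    r"|--[^\n]*"           # line comment
    r"|/\*.*?\*/"          # block comment
    r'|"(?:[^"]|"")*"',    # double-quoted identifier
    re.DOTALL,
)


def rename_table(sql, old, new):
    """Rename double-quoted references to old; leave everything else as-is."""

    def swap(match):
        token = match.group(0)
        # Only double-quoted identifiers are candidates. String literals and
        # comments are matched too, but returned unchanged.
        if token.startswith('"') and token[1:-1].replace('""', '"') == old:
            return '"' + new.replace('"', '""') + '"'
        return token

    return TOKEN.sub(swap, sql)


sql = (
    "SELECT 'from \"sales_2024\"' AS note -- \"sales_2024\"\n"
    'FROM "sales_2024" JOIN sales_2024 AS u ON 1 = 1'
)
renamed = rename_table(sql, "sales_2024", "sales")
```

In renamed, only the FROM "sales_2024" reference becomes FROM "sales". The string literal, the comment, and the unquoted sales_2024 in the JOIN are left unchanged, matching the behavior described above.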

After renaming, affected Derived Dataset caches are discarded and recomputed on next access.
