Datasets

A MIDAS project consists of data loaded from CSV or TSV files and derived data generated through SQL queries and other transformations. This page describes the different dataset types and how they behave.

Dataset Types

Primary Dataset

When you load a CSV or TSV file, a Primary Dataset is created. It stores the imported data as-is and supports direct cell editing and row exclusion.

Project Overview displays metadata such as the original file name, import date, and file size.

Derived Dataset

Derived Datasets are generated by transformation operations including SQL Editor, Crosstab, Reshape, and Dummy Coding. Each Derived Dataset records which parent dataset(s) it depends on and which operation produced it.

You cannot directly edit data in a Derived Dataset. To change the data, modify the transformation operation or update the parent dataset.

Temporary data generated by filter operations (Ephemeral Datasets) is not saved in project files. Use "Save as Dataset" in Data Table to convert it to a Derived Dataset that persists in the project.

Automatic Data Type Inference

When you load a CSV file, MIDAS automatically determines the data type for each column. See Data Preparation and Import for the supported data types (boolean, int64, float64, date, datetime, string, enum).

Inference follows a priority order: boolean is checked first, then numeric types (int64/float64), then date types (date/datetime). If none match, the column is assigned string type. The enum type is never auto-inferred; to create an enum column, manually convert a string column.

Empty cells and missing values are treated as null and do not affect type inference.
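
The priority order above can be illustrated with a minimal Python sketch. This is not MIDAS's actual implementation: the function name infer_column_type, the accepted boolean literals (true/false), the ISO 8601 date formats, and the fallback to string for an all-empty column are all assumptions made for illustration.

```python
from datetime import date, datetime


def infer_column_type(values):
    """Guess a column type from raw CSV strings (hypothetical helper)."""
    # Empty cells are treated as null and skipped during inference.
    present = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not present:
        return "string"  # assumption: an all-null column falls back to string

    def parse_bool(text):
        # Assumption: only "true"/"false" (case-insensitive) count as boolean.
        if text.lower() not in ("true", "false"):
            raise ValueError(text)

    def all_parse(parser):
        try:
            for text in present:
                parser(text)
        except ValueError:
            return False
        return True

    # Checked in priority order: boolean, numeric, then date types.
    checks = [
        ("boolean", parse_bool),
        ("int64", int),
        ("float64", float),
        ("date", date.fromisoformat),
        ("datetime", datetime.fromisoformat),
    ]
    for type_name, parser in checks:
        if all_parse(parser):
            return type_name
    return "string"  # enum is never inferred; it requires manual conversion
```

Because each candidate type must parse every non-empty value, a single non-matching cell pushes the column down to the next type in the list and eventually to string.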

Schema Inheritance in Derived Datasets

When you create a Derived Dataset via SQL Editor or other operations, the resulting columns inherit metadata from the parent datasets. Specifically, the measurement scale (nominal, ordinal, interval, ratio), data type, and enum name are inherited.

The inheritance rules are:

  • If a result column has the same name as a column in the parent dataset, the parent's settings are inherited
  • When multiple parents exist (e.g., JOIN), the first dataset in the FROM clause takes priority
  • If no matching parent column exists, the type is determined automatically from the query result

For example, if you changed a zip code column from ratio to nominal scale in the parent dataset, running SELECT zip_code FROM ... in SQL Editor preserves the nominal scale setting.

However, transformations like CAST(category AS INTEGER) change the semantic meaning of the data, so the inherited scale may be incorrect. Adjust it manually in such cases.
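
The inheritance rules in this section can be sketched as a simple name lookup. The function inherit_metadata and the dictionary layout below are hypothetical and only illustrate the matching order; they do not describe MIDAS internals.

```python
def inherit_metadata(result_columns, parents):
    """Resolve metadata for each result column (hypothetical helper).

    parents is a list of dicts mapping column name -> metadata,
    ordered as the datasets appear in the FROM clause.
    """
    resolved = {}
    for column in result_columns:
        # None means "no parent match": the type is inferred from the result.
        resolved[column] = None
        for parent in parents:
            if column in parent:
                # The first parent in FROM-clause order wins.
                resolved[column] = dict(parent[column])
                break
    return resolved


orders = {"zip_code": {"scale": "nominal", "dtype": "string", "enum": None}}
customers = {
    "zip_code": {"scale": "ratio", "dtype": "int64", "enum": None},
    "age": {"scale": "ratio", "dtype": "int64", "enum": None},
}
result = inherit_metadata(["zip_code", "age", "total"], [orders, customers])
```

Here zip_code takes its settings from orders (listed first), age comes from customers, and total has no parent match, so it would be typed from the query result.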

Cascade Deletion

Deleting a dataset cascades to dependent resources.

Deleted

  • Derived Datasets that depend on the deleted dataset (transitively, including grandchildren)
  • Models trained on the deleted dataset

Not deleted

  • Reports (though report elements referencing deleted datasets will show errors)

For example, if Primary Dataset "A" has Derived Dataset "B", and "B" has Derived Dataset "C" and Model "M", deleting "A" also deletes "B", "C", and "M".
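
The transitive behavior in this example amounts to a graph traversal. The following Python sketch assumes the dependencies are available as two plain dictionaries (derived_children and models, both hypothetical names); it only illustrates which resources a deletion would reach.

```python
from collections import deque


def cascade_targets(root, derived_children, models):
    """Return (datasets, models) removed when root is deleted (sketch)."""
    datasets = []
    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        datasets.append(current)
        # Follow Derived Datasets transitively (children, grandchildren, ...).
        for child in derived_children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    # Models trained on any deleted dataset are deleted as well.
    deleted_models = [m for d in datasets for m in models.get(d, [])]
    return datasets, deleted_models


derived_children = {"A": ["B"], "B": ["C"]}
models = {"B": ["M"]}
deleted_datasets, deleted_models = cascade_targets("A", derived_children, models)
```

Reports are deliberately absent from the traversal, matching the "Not deleted" list above.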

To review the impact before deleting, check the dependency graph in Project Lineage.

Lazy Evaluation and Caching

Derived Datasets do not compute their data until it is needed. For example, immediately after opening a project file, Derived Dataset data has not yet been computed. The computation runs when you first reference the dataset (for example, by opening a Data Table tab or rendering a graph), and the result is cached in memory.

When a parent dataset is updated (via reload, type change, cell edit, etc.), all downstream Derived Dataset caches are discarded. Data is automatically recomputed the next time it is needed.

Dependent Models are marked as stale and cannot be used for predictions until retrained.
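
As a rough illustration of this behavior, the sketch below models lazy computation with an in-memory cache and downstream invalidation. The class names, the single-parent simplification, and the compute_count counter are assumptions made for the example, not MIDAS's actual design; stale Models are left out.

```python
class PrimaryDataset:
    def __init__(self, rows):
        self.rows = rows
        self.children = []

    def data(self):
        return self.rows

    def update(self, rows):
        # Any update discards every downstream cache.
        self.rows = rows
        for child in self.children:
            child.invalidate()


class DerivedDataset:
    def __init__(self, parent, transform):
        self.parent = parent
        self.transform = transform
        self.children = []
        self.compute_count = 0  # for demonstration only
        self._cache = None
        self._is_cached = False
        parent.children.append(self)

    def data(self):
        # Compute on first access, then serve from the cache.
        if not self._is_cached:
            self._cache = self.transform(self.parent.data())
            self.compute_count += 1
            self._is_cached = True
        return self._cache

    def invalidate(self):
        self._cache = None
        self._is_cached = False
        for child in self.children:
            child.invalidate()


a = PrimaryDataset([1, 2, 3])
b = DerivedDataset(a, lambda rows: [r * 2 for r in rows])
c = DerivedDataset(b, lambda rows: sum(rows))
```

Nothing is computed when b and c are created. The first call to c.data() computes b and then c; later calls reuse the cache until a.update(...) invalidates both.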

Materialized View

By default, Derived Dataset data is not saved in the project file (MDS). It is recomputed from parent datasets each time the project is opened.

Enabling Materialized View includes the computed data in the MDS file:

  • Disabled (default): Smaller file size. Data is recomputed when needed
  • Enabled: Larger file size. Data is available immediately when the project is opened

This is useful for datasets produced by expensive SQL queries or large data transformations. When Materialized View is enabled, the MDS file contains the actual data, so be mindful of the contents when sharing the file with others. Configure it from the Datasets section in Project Overview.

Renaming Datasets and Automatic SQL Updates

When you rename a dataset in Project Overview, SQL queries in Derived Datasets that reference the old name are automatically updated.

For example, renaming "sales_2024" to "sales" changes FROM "sales_2024" to FROM "sales" in any dependent SQL. The update is based on SQL parsing, so occurrences in string literals or comments are not affected.

After renaming, affected Derived Dataset caches are discarded and recomputed on next access.
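
To show why string literals and comments are left untouched, here is a simplified Python sketch. It uses a small tokenizer rather than a full SQL parser, and it handles only double-quoted identifiers; the function rename_dataset_in_sql is hypothetical. Unlike a real parser, it would also rewrite a quoted column that happens to share the dataset's name.

```python
import re

# Tokens that must be recognized so their contents are not rewritten.
_TOKEN = re.compile(
    r"""
    (?P<string>'(?:[^']|'')*')       # single-quoted string literal
  | (?P<line_comment>--[^\n]*)       # line comment
  | (?P<block_comment>/\*.*?\*/)     # block comment
  | (?P<ident>"(?:[^"]|"")*")        # double-quoted identifier
    """,
    re.VERBOSE | re.DOTALL,
)


def rename_dataset_in_sql(sql, old_name, new_name):
    """Rename a double-quoted identifier outside strings and comments."""
    old_quoted = '"' + old_name.replace('"', '""') + '"'
    new_quoted = '"' + new_name.replace('"', '""') + '"'

    def swap(match):
        # Only identifier tokens are candidates for renaming.
        if match.lastgroup == "ident" and match.group() == old_quoted:
            return new_quoted
        return match.group()

    return _TOKEN.sub(swap, sql)


query = (
    'SELECT * FROM "sales_2024" '
    "WHERE note = 'see \"sales_2024\"' -- from \"sales_2024\""
)
renamed = rename_dataset_in_sql(query, "sales_2024", "sales")
```

In this example only the identifier after FROM changes; the occurrences inside the string literal and the trailing comment are consumed as whole tokens and returned unchanged.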
