Dummy Coding

The Dummy Coding tab converts categorical variables (nominal/ordinal scale) into numeric dummy variables (0/1). Use this to prepare categorical variables as predictors for regression analysis or GLM.

Basic Usage

Opening Dummy Coding

Select Data > Dummy Coding... from the menu bar to open a new Dummy Coding tab.

Configuring the Transformation

  1. Select the target dataset from the Dataset dropdown
  2. Configure each column's Scale and Action in the column table
  3. Review the transformation in the Encoding Preview section
  4. Click Create Dataset
  5. Enter a dataset name and click OK

Sample Data Used in This Page

The examples on this page use a survey dataset (survey.csv) with five respondents. The blood_type and education columns are categorical variables.

name   age  blood_type  education
Alice  28   A           college
Bob    35   B           graduate
Carol  42   O           college
Dave   31   A           high_school
Eve    26   AB          graduate

Column Configuration

Configure the Scale and Action for each column.

Scale

Scale     Description     Dummy Coding
nominal   Nominal scale   Available
ordinal   Ordinal scale   Available
interval  Interval scale  Not available (used as numeric)
ratio     Ratio scale     Not available (used as numeric)

Change the Scale from the dropdown as needed.

Action

Action                      Description
Not included                Exclude from the output dataset
Include as-is               Keep column as-is in the output
Dummy code                  Convert to dummy variables (original column excluded)
Dummy code (keep original)  Convert to dummy variables and keep the original column

The Dummy code options are available only for categorical columns (nominal/ordinal). They are not offered for Boolean columns.

How It Works

For a column with k categories, k-1 dummy variables are generated; one category is omitted as the reference category [1].

  1. Extract unique category values
  2. Sort alphabetically
  3. Exclude the first category as the reference
  4. Generate a dummy variable for each of the remaining k-1 categories
  5. Code 1 for matching rows, 0 otherwise
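
The steps above can be sketched in plain Python. This is only an illustration, not the application's own code: the function name dummy_code is hypothetical, and Python's default string ordering is assumed to stand in for "alphabetical" (the application may treat case or locale differently).

```python
def dummy_code(column_name, values):
    """Return {dummy_column_name: [0/1, ...]} with k-1 dummy columns.

    Illustrative sketch only; assumes no missing values in `values`.
    """
    # Steps 1-2: extract the unique category values and sort them.
    categories = sorted(set(values))
    # Step 3: the first category is the reference and gets no column.
    remaining = categories[1:]
    # Steps 4-5: one column per remaining category, 1 where the row matches.
    return {
        f"{column_name}_{category}": [1 if v == category else 0 for v in values]
        for category in remaining
    }


blood_type = ["A", "B", "O", "A", "AB"]
dummies = dummy_code("blood_type", blood_type)
for name, column in dummies.items():
    print(name, column)
```

Rows whose value is the reference category get 0 in every generated column, which is why no separate column is needed for it.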

Example

Converting the blood_type column (unique values: A, AB, B, O):

  • Reference category: A (alphabetically first)
  • Dummy variables generated: blood_type_AB, blood_type_B, blood_type_O

blood_type  blood_type_AB  blood_type_B  blood_type_O
A           0              0             0
B           0              1             0
O           0              0             1
A           0              0             0
AB          1              0             0

Rows with the reference category A have all dummy variables set to 0.

Output Dataset

With the sample data above, setting name to Not included, age to Include as-is, and both blood_type and education to Dummy code produces the following output.

age  blood_type_AB  blood_type_B  blood_type_O  education_graduate  education_high_school
28   0              0             0             0                   0
35   0              1             0             1                   0
42   0              0             1             0                   0
31   0              0             0             0                   1
26   1              0             0             1                   0

blood_type (4 categories) produces 3 dummy variables, and education (3 categories: college, graduate, high_school; reference: college) produces 2. Column names follow the {original_column_name}_{category_name} format, with data type int64 (0 or 1). Row count stays the same. The original dataset is not modified; results are saved as a new derived dataset.
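
As a rough sketch of how the per-column Action settings combine into the output dataset, the following plain-Python example rebuilds the table above from the sample data. The function name create_dataset and the dictionary layout are hypothetical. The position of the original column relative to its dummy columns under "Dummy code (keep original)" is an assumption, since this page does not specify it.

```python
rows = [
    {"name": "Alice", "age": 28, "blood_type": "A", "education": "college"},
    {"name": "Bob", "age": 35, "blood_type": "B", "education": "graduate"},
    {"name": "Carol", "age": 42, "blood_type": "O", "education": "college"},
    {"name": "Dave", "age": 31, "blood_type": "A", "education": "high_school"},
    {"name": "Eve", "age": 26, "blood_type": "AB", "education": "graduate"},
]
actions = {
    "name": "Not included",
    "age": "Include as-is",
    "blood_type": "Dummy code",
    "education": "Dummy code",
}


def create_dataset(rows, actions):
    """Build a new list of rows; the input rows are not modified."""
    output = [{} for _ in rows]
    for column, action in actions.items():
        if action == "Not included":
            continue
        if action in ("Include as-is", "Dummy code (keep original)"):
            # Assumption: a kept original column precedes its dummy columns.
            for out, row in zip(output, rows):
                out[column] = row[column]
        if action.startswith("Dummy code"):
            categories = sorted({row[column] for row in rows})
            for category in categories[1:]:  # skip the reference category
                for out, row in zip(output, rows):
                    out[f"{column}_{category}"] = int(row[column] == category)
    return output


output = create_dataset(rows, actions)
for row in output:
    print(row)
```

The row count is unchanged, and each dummy column is named {original_column_name}_{category_name}, as described above.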

Notes

  • The reference category is always the alphabetically first category and cannot be manually selected.
  • Missing values are preserved as missing in dummy variables and are not counted as unique values.
  • Columns with only one unique category value cannot be dummy coded.
  • Boolean columns are already equivalent to 0/1, so use Include as-is to include them directly.
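
The handling of missing values and single-category columns described in these notes might look like the following sketch. It is an assumption-laden illustration: missing values are represented by None, the function names are hypothetical, and raising ValueError stands in for however the application actually rejects such a column.

```python
def categories_of(values):
    """Unique non-missing category values, sorted; None marks a missing value."""
    return sorted({v for v in values if v is not None})


def dummy_code(column_name, values):
    categories = categories_of(values)
    if len(categories) < 2:
        # A single category leaves nothing to code once the reference is dropped.
        raise ValueError(f"{column_name}: at least two categories are required")
    return {
        # Missing input stays missing instead of being coded as 0.
        f"{column_name}_{c}": [None if v is None else int(v == c) for v in values]
        for c in categories[1:]
    }


result = dummy_code("blood_type", ["A", None, "B", "O"])
print(result)

try:
    dummy_code("blood_type", ["A", "A", None])
except ValueError as error:
    print("rejected:", error)
```

Keeping a missing value as missing, rather than 0, avoids silently assigning that row to the reference category.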

Footnotes

  1. If dummy variables were created for all k categories, they would sum to 1 in every row and would therefore be linearly dependent with the intercept, so one category is omitted.
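
The linear dependence can be written out as a short worked equation. The notation is introduced here for illustration only: d_{ij} is the dummy for category j in row i, and the boldface 1 is the intercept column of ones. Each row with a non-missing value belongs to exactly one category, so coding all k categories gives:

```latex
\sum_{j=1}^{k} d_{ij} = 1 \quad \text{for every row } i
\qquad\Longrightarrow\qquad
\mathbf{d}_1 + \mathbf{d}_2 + \cdots + \mathbf{d}_k = \mathbf{1}
```

The intercept column is thus an exact linear combination of the k dummy columns, and the regression coefficients are not uniquely determined. Dropping one dummy column removes the dependence, and each remaining coefficient is then read as a difference from the reference category.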