Dummy Coding

The Dummy Coding tab converts categorical variables (nominal/ordinal scale) into numeric dummy variables (0/1). Use this tab to prepare categorical variables as predictors for regression analysis or a generalized linear model (GLM).

Basic Usage

Opening Dummy Coding

Select Data > Dummy Coding... from the menu bar to open a new Dummy Coding tab.

Configuring the Transformation

Select the target dataset from the Dataset dropdown
Configure each column's Scale and Action in the column table
Review the transformation in the Encoding Preview section
Click Create Dataset
Enter a dataset name and click OK

Sample Data Used in This Page

The examples on this page use a survey dataset (survey.csv) with five respondents. The blood_type and education columns are categorical variables.

name	age	blood_type	education
Alice	28	A	college
Bob	35	B	graduate
Carol	42	O	college
Dave	31	A	high_school
Eve	26	AB	graduate

Column Configuration

Configure the Scale and Action for each column.

Scale

Scale	Description	Dummy Coding
nominal	Nominal scale	Available
ordinal	Ordinal scale	Available
interval	Interval scale	Not available (used as numeric)
ratio	Ratio scale	Not available (used as numeric)

Change the Scale from the dropdown as needed.

Action

Action	Description
Not included	Exclude from the output dataset
Include as-is	Keep column as-is in the output
Dummy code	Convert to dummy variables (original column excluded)
Dummy code (keep original)	Convert to dummy variables and keep the original column

Dummy code options are only available for categorical columns (nominal/ordinal). Boolean columns are excluded.
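The availability rule above can be written as a small check. This is a minimal sketch, not the application's code: the function name available_actions and its is_boolean parameter are hypothetical, and only the scale and action labels come from the tables on this page.

```python
# Hypothetical helper: which Action choices a column offers, per the rules above.
CATEGORICAL_SCALES = {"nominal", "ordinal"}

def available_actions(scale, is_boolean=False):
    """Return the Action labels available for a column with the given scale."""
    actions = ["Not included", "Include as-is"]
    # Dummy coding is offered only for categorical, non-Boolean columns.
    if scale in CATEGORICAL_SCALES and not is_boolean:
        actions += ["Dummy code", "Dummy code (keep original)"]
    return actions

print(available_actions("nominal"))
print(available_actions("ratio"))
print(available_actions("nominal", is_boolean=True))
```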

How It Works

A column with k categories produces k-1 dummy variables; one category is omitted and serves as the reference category.

Extract unique category values
Sort alphabetically
Exclude the first category as the reference
Generate a dummy variable for each of the remaining k-1 categories
Code 1 for matching rows, 0 otherwise
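The steps above can be sketched in plain Python. This is an illustration under stated assumptions rather than the application's implementation: the function name dummy_code is hypothetical, and "alphabetically" is taken to mean Python's default string ordering (the page does not specify case or locale handling). The input is the blood_type column from the sample data.

```python
def dummy_code(values, column_name):
    """Return (reference_category, {dummy_column_name: [0/1, ...]})."""
    # Steps 1-2: extract the unique category values and sort them.
    # Assumption: default Python string ordering stands in for "alphabetical".
    categories = sorted(set(values))
    # Step 3: the first category is the reference and gets no column.
    reference, remaining = categories[0], categories[1:]
    # Steps 4-5: one column per remaining category, 1 where the row matches.
    dummies = {
        f"{column_name}_{category}": [1 if v == category else 0 for v in values]
        for category in remaining
    }
    return reference, dummies

blood_type = ["A", "B", "O", "A", "AB"]
reference, dummies = dummy_code(blood_type, "blood_type")
print("reference:", reference)
for name, column in dummies.items():
    print(name, column)
```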

Example

Converting the blood_type column (unique values: A, AB, B, O):

Reference category: A (alphabetically first)
Dummy variables generated: blood_type_AB, blood_type_B, blood_type_O

blood_type	blood_type_AB	blood_type_B	blood_type_O
A	0	0	0
B	0	1	0
O	0	0	1
A	0	0	0
AB	1	0	0

Rows with the reference category A have all dummy variables set to 0.

Output Dataset

With the sample data above, setting name to Not included, age to Include as-is, and both blood_type and education to Dummy code produces the following output.

age	blood_type_AB	blood_type_B	blood_type_O	education_graduate	education_high_school
28	0	0	0	0	0
35	0	1	0	1	0
42	0	0	1	0	0
31	0	0	0	0	1
26	1	0	0	1	0

blood_type (4 categories) produces 3 dummy variables, and education (3 categories: college, graduate, high_school; reference: college) produces 2.
Column names follow the {original_column_name}_{category_name} format, with data type int64 (0 or 1).
The row count stays the same.
The original dataset is not modified; results are saved as a new derived dataset.
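The output table above can be reproduced with a short sketch that applies one Action per column. The function name build_output and the dictionary-based row layout are hypothetical. The position of a kept original column relative to its dummy variables is also an assumption, since this page does not state it.

```python
rows = [
    {"name": "Alice", "age": 28, "blood_type": "A", "education": "college"},
    {"name": "Bob", "age": 35, "blood_type": "B", "education": "graduate"},
    {"name": "Carol", "age": 42, "blood_type": "O", "education": "college"},
    {"name": "Dave", "age": 31, "blood_type": "A", "education": "high_school"},
    {"name": "Eve", "age": 26, "blood_type": "AB", "education": "graduate"},
]
actions = {
    "name": "Not included",
    "age": "Include as-is",
    "blood_type": "Dummy code",
    "education": "Dummy code",
}

def build_output(rows, actions):
    """Build new rows; the input rows are left unmodified."""
    output = [{} for _ in rows]
    for column, action in actions.items():
        if action == "Not included":
            continue
        values = [row[column] for row in rows]
        # Keep the original column when the action asks for it.
        # Assumption: the original is placed before its dummy variables.
        if action in ("Include as-is", "Dummy code (keep original)"):
            for out, value in zip(output, values):
                out[column] = value
        if action.startswith("Dummy code"):
            # Skip the first sorted category: it is the reference.
            for category in sorted(set(values))[1:]:
                for out, value in zip(output, values):
                    out[f"{column}_{category}"] = int(value == category)
    return output

output = build_output(rows, actions)
print("\t".join(output[0]))
for row in output:
    print("\t".join(str(v) for v in row.values()))
```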

Notes

The reference category is always the alphabetically first category and cannot be manually selected.
Missing values are preserved as missing in dummy variables and are not counted as unique values.
Columns with only one unique category value cannot be dummy coded.
Boolean columns are already equivalent to 0/1, so use Include as-is to include them directly.
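The missing-value and single-category rules can be illustrated as follows. This is a sketch under assumptions: None stands in for a missing value, the function name dummy_code_with_missing is hypothetical, and raising ValueError is only one possible way to refuse a single-category column (this page does not say how the application reports that case).

```python
def dummy_code_with_missing(values, column_name):
    """Dummy code a column in which None marks a missing value."""
    # Missing values are not counted as categories.
    categories = sorted({v for v in values if v is not None})
    # Assumption: a column with fewer than two categories is rejected with an error.
    if len(categories) < 2:
        raise ValueError(f"{column_name} needs at least two categories")
    # Missing stays missing in every dummy variable; other rows get 0 or 1.
    return {
        f"{column_name}_{category}": [
            None if v is None else int(v == category) for v in values
        ]
        for category in categories[1:]
    }

print(dummy_code_with_missing(["A", None, "B", "A"], "blood_type"))
```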

Next steps

Linear Regression - Regression analysis using dummy variables
Generalized Linear Model (GLM) - GLM using dummy variables