Dummy Coding

The Dummy Coding tab converts categorical variables (nominal/ordinal scale) into numeric dummy variables (0/1). Use this to prepare categorical variables as predictors for regression analysis or GLM.

Basic Usage

Opening Dummy Coding

Select Data > Dummy Coding... from the menu bar to open a new Dummy Coding tab.

Configuring the Transformation

1. Select the target dataset from the Dataset dropdown
2. Configure each column's Scale and Action, and optionally the Reference (reference category), in the column table
3. Review each column's reference category and the number of dummy variables in the Encoding Preview section
4. Click Create Dataset
5. Enter a dataset name and click OK

In the Encoding Preview, check that the unique value count is what you expect, with no extras from typos or inconsistent labels, and that the reference category is the one you intend. A column with only one unique value cannot be encoded and shows an error.
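
The same checks can be run on the raw data before it is loaded. The following is a minimal sketch in pandas, not the application's own code: `preview_encoding` is a hypothetical helper, the sample values include a deliberate typo ("collage") for illustration, and dropping missing values before counting is an assumption.

```python
import pandas as pd


def preview_encoding(series: pd.Series) -> dict:
    """Summarize a categorical column before dummy coding (hypothetical helper)."""
    # Assumption: missing values are ignored when counting categories.
    categories = sorted(series.dropna().unique())
    if len(categories) < 2:
        raise ValueError(
            f"{series.name!r} has fewer than 2 unique values and cannot be dummy coded"
        )
    return {
        "categories": categories,
        "reference": categories[0],        # alphabetically first by default
        "n_dummies": len(categories) - 1,  # k - 1 dummy variables
    }


# Illustrative values only: "collage" is a deliberate typo of "college".
education = pd.Series(
    ["college", "graduate", "collage", "high_school", "graduate"],
    name="education",
)
print(preview_encoding(education))
```

Here the typo inflates the category count from 3 to 4 and also becomes the default reference category, which is exactly the kind of problem the preview is meant to surface.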

Sample Data Used in This Page

The examples on this page use a small survey dataset (survey.csv) with 5 respondents. The blood_type and education columns are categorical variables.

name	age	blood_type	education
Alice	28	A	college
Bob	35	B	graduate
Carol	42	O	college
Dave	31	A	high_school
Eve	26	AB	graduate

Column Configuration

Configure the Scale, Action, and Reference for each column.

Scale

Scale	Description	Dummy Coding
nominal	Nominal scale	Available
ordinal	Ordinal scale	Available
interval	Interval scale	Not available (used as numeric)
ratio	Ratio scale	Not available (used as numeric)

Even on a column set to the ordinal scale, the defined order (low → mid → high, for example) is not used in dummy coding. Both the order in which dummy variables are generated and the default reference category follow the alphabetical order of the category names. To base the coding on a specific category, select it in Reference.

The initial Scale is the measurement scale set on the column. For columns without a scale, it is inferred from the data type: string and boolean columns become nominal, and numeric and date/time columns become interval. See Data Types and Measurement Scales for details. Change the Scale from the dropdown as needed. A code recorded as a number — an equipment or line number, for example — is inferred as interval, but changing its Scale to nominal makes it available for dummy coding.
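
The inference rule can be sketched in pandas as follows. This is an illustration under stated assumptions, not the application's implementation: `infer_scale` and the `line_no` column are hypothetical names.

```python
import pandas as pd
from pandas.api import types as ptypes


def infer_scale(series: pd.Series) -> str:
    """Guess a default measurement scale from the dtype (hypothetical helper)."""
    # Check booleans first, because pandas also counts bool as numeric.
    if ptypes.is_bool_dtype(series):
        return "nominal"
    if ptypes.is_numeric_dtype(series) or ptypes.is_datetime64_any_dtype(series):
        return "interval"
    return "nominal"  # strings and anything else


df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [28, 35],
    "line_no": [1, 2],  # hypothetical numeric code, really a label
})
scales = {col: infer_scale(df[col]) for col in df.columns}
print(scales)  # line_no is inferred as interval

# Manual override, mirroring a change made in the Scale dropdown.
scales["line_no"] = "nominal"
```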

Action

Action	Description
Not included	Exclude from the output dataset
Include as-is	Keep column as-is in the output
Dummy code	Convert to dummy variables (original column excluded)
Dummy code (keep original)	Convert to dummy variables and keep the original column. Choose this to use the original categorical column for graphs or crosstabs in the output dataset

Dummy code options are only available for categorical columns (nominal/ordinal). Boolean columns are excluded¹.

Reference

The reference category is the group that dummy variable coefficients are compared against. Choosing a meaningful baseline — a control group, a standard condition, or a pre-intervention level — makes the coefficients easier to interpret.

For columns with the Action set to Dummy code or Dummy code (keep original), select the reference category from the Reference dropdown. The default is the alphabetically first category.
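
For comparison, a non-default reference category can be reproduced in pandas by generating a dummy column for every category and then dropping the one chosen as the reference. This is a sketch of the idea only; the variable names are illustrative.

```python
import pandas as pd

education = pd.Series(
    ["college", "graduate", "college", "high_school", "graduate"],
    name="education",
)
reference = "high_school"  # chosen baseline instead of the default "college"

# One 0/1 column per category, then remove the reference category's column.
dummies = pd.get_dummies(education, prefix="education", dtype="int64")
dummies = dummies.drop(columns=f"education_{reference}")
print(dummies)
```

Rows whose education is high_school now have 0 in both remaining columns, so the coefficients of education_college and education_graduate are read relative to high_school.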

How It Works

From k categories, k-1 dummy variables are generated, omitting one category as the reference category². This scheme is called treatment coding (coding relative to a reference category).

1. Extract unique category values³
2. Sort alphabetically
3. Exclude the reference category. The default is the alphabetically first category, which can be changed from the Reference dropdown
4. Generate a dummy variable for each of the remaining k-1 categories
5. Code 1 for matching rows, 0 otherwise
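
The steps above can be sketched in plain Python. This is a minimal illustration of treatment coding, not the application's implementation; `dummy_code` is a hypothetical function name.

```python
def dummy_code(values, column, reference=None):
    """Treatment-code a list of category labels into 0/1 columns (hypothetical helper)."""
    # Extract the unique categories and sort them alphabetically.
    categories = sorted(set(values))
    if len(categories) < 2:
        raise ValueError("at least 2 unique values are required")
    # The default reference is the alphabetically first category.
    if reference is None:
        reference = categories[0]
    if reference not in categories:
        raise ValueError(f"unknown reference category: {reference!r}")
    # Exclude the reference, leaving k - 1 categories.
    kept = [c for c in categories if c != reference]
    # One column per remaining category: 1 for matching rows, 0 otherwise.
    return {
        f"{column}_{c}": [1 if v == c else 0 for v in values]
        for c in kept
    }


blood_type = ["A", "B", "O", "A", "AB"]
encoded = dummy_code(blood_type, "blood_type")
for name, column_values in encoded.items():
    print(name, column_values)
```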

Example

Converting the blood_type column (unique values: A, AB, B, O):

Reference category: A (the default, alphabetically first)
Dummy variables generated: blood_type_AB, blood_type_B, blood_type_O

blood_type	blood_type_AB	blood_type_B	blood_type_O
A	0	0	0
B	0	1	0
O	0	0	1
A	0	0	0
AB	1	0	0

Rows with the reference category A have all dummy variables set to 0.

When the dummy variables are used in a regression model, each dummy variable's coefficient estimates the difference in the expected response between that category and the reference category. In this example, the coefficient of blood_type_B estimates the difference in the response variable between type B and the reference type A. In a model with other predictors, this is the difference with those predictors held constant. In a GLM, the difference is expressed on the scale of the link function.
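
This interpretation can be checked numerically. The sketch below fits an ordinary least-squares model with NumPy, using age from the sample data as a stand-in response purely for illustration (the page does not define a response variable). With no other predictors, the intercept equals the mean of the reference group and each coefficient equals that group's mean minus the reference mean.

```python
import numpy as np

# age is used as an illustrative response only.
age = np.array([28, 35, 42, 31, 26], dtype=float)

# Columns: intercept, blood_type_AB, blood_type_B, blood_type_O (reference: A)
X = np.array([
    [1, 0, 0, 0],  # Alice, A
    [1, 0, 1, 0],  # Bob, B
    [1, 0, 0, 1],  # Carol, O
    [1, 0, 0, 0],  # Dave, A
    [1, 1, 0, 0],  # Eve, AB
], dtype=float)

coef, *_ = np.linalg.lstsq(X, age, rcond=None)
intercept, b_ab, b_b, b_o = coef

mean_a = age[[0, 3]].mean()  # mean response in the reference group
print(f"intercept={intercept:.1f} (mean of A={mean_a:.1f})")
print(f"blood_type_B coefficient={b_b:.1f} (B mean - A mean={age[1] - mean_a:.1f})")
```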

Output Dataset

With the sample data above, setting name to Not included, age to Include as-is, and both blood_type and education to Dummy code produces the following output.

age	blood_type_AB	blood_type_B	blood_type_O	education_graduate	education_high_school
28	0	0	0	0	0
35	0	1	0	1	0
42	0	0	1	0	0
31	0	0	0	0	1
26	1	0	0	1	0

blood_type (4 categories) produces 3 dummy variables, and education (3 categories: college, graduate, high_school; reference: college) produces 2. Column names follow the {original_column_name}_{category_name} format, with data type int64 (0 or 1) and a ratio measurement scale. The dummy variables can be used directly as predictors in regression or GLM and are not dummy coded again.

The row count stays the same. The original dataset is not modified; the results are saved as a new derived dataset.
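
For readers who want to reproduce this output outside the application, the following pandas sketch yields the same columns from the sample data. It is an equivalent illustration, not the application's own code; `drop_first=True` drops the alphabetically first category of each encoded column, which matches the default reference here.

```python
import pandas as pd

survey = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave", "Eve"],
    "age": [28, 35, 42, 31, 26],
    "blood_type": ["A", "B", "O", "A", "AB"],
    "education": ["college", "graduate", "college", "high_school", "graduate"],
})

output = pd.get_dummies(
    survey.drop(columns=["name"]),        # name: Not included
    columns=["blood_type", "education"],  # Dummy code (original columns excluded)
    drop_first=True,                      # omit the first category as the reference
    dtype="int64",
)
print(output)
```

`survey` itself is left unchanged, mirroring how the result is saved as a new derived dataset.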

Next steps

Linear Regression - Regression analysis using dummy variables
Generalized Linear Model (GLM) - GLM using dummy variables