Dummy Coding
The Dummy Coding tab converts categorical variables (nominal/ordinal scale) into numeric dummy variables (0/1). Use this to prepare categorical variables as predictors for regression analysis or GLM.
Basic Usage
Opening Dummy Coding
Select Data > Dummy Coding... from the menu bar to open a new Dummy Coding tab.
Configuring the Transformation
- Select the target dataset from the Dataset dropdown
- Configure each column's Scale and Action in the column table
- Review the transformation in the Encoding Preview section
- Click Create Dataset
- Enter a dataset name and click OK
Sample Data Used in This Page
The examples on this page use a small survey dataset (survey.csv) with 5 respondents. The blood_type and education columns are categorical variables.
| name | age | blood_type | education |
|---|---|---|---|
| Alice | 28 | A | college |
| Bob | 35 | B | graduate |
| Carol | 42 | O | college |
| Dave | 31 | A | high_school |
| Eve | 26 | AB | graduate |
Column Configuration
Configure the Scale and Action for each column.
Scale
| Scale | Description | Dummy Coding |
|---|---|---|
| nominal | Nominal scale | Available |
| ordinal | Ordinal scale | Available |
| interval | Interval scale | Not available (used as numeric) |
| ratio | Ratio scale | Not available (used as numeric) |
Change the Scale from the dropdown as needed.
Action
| Action | Description |
|---|---|
| Not included | Exclude from the output dataset |
| Include as-is | Keep column as-is in the output |
| Dummy code | Convert to dummy variables (original column excluded) |
| Dummy code (keep original) | Convert to dummy variables and keep the original column |
Dummy code options are only available for categorical columns (nominal/ordinal). Boolean columns are excluded.
How It Works
From k categories, k-1 dummy variables are generated, omitting one category as the reference category [1].
- Extract unique category values
- Sort alphabetically
- Exclude the first category as the reference
- Generate a dummy variable for each of the remaining k-1 categories
- Code 1 for matching rows, 0 otherwise
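The steps above can be sketched in a few lines of plain Python. This is only an illustration of the procedure, not the application's actual implementation: the function name `dummy_code` is hypothetical, and "alphabetical" order is assumed to match Python's default string ordering.

```python
def dummy_code(values, column_name):
    """Return (reference_category, {dummy_column_name: [0/1, ...]})."""
    # Extract the unique category values and sort them
    # (assumption: Python's default string ordering stands in for "alphabetical").
    categories = sorted(set(values))

    # The first category becomes the reference and gets no dummy variable.
    reference, remaining = categories[0], categories[1:]

    # One dummy variable per remaining category: 1 for matching rows, 0 otherwise.
    dummies = {
        f"{column_name}_{category}": [int(value == category) for value in values]
        for category in remaining
    }
    return reference, dummies


# blood_type column from the sample data
blood_type = ["A", "B", "O", "A", "AB"]
reference, dummies = dummy_code(blood_type, "blood_type")

print(reference)
for name, column in dummies.items():
    print(name, column)
```

With four unique blood types, the sketch yields one reference category and three dummy columns.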
Example
Converting the blood_type column (unique values: A, AB, B, O):
- Reference category: A (alphabetically first)
- Dummy variables generated: blood_type_AB, blood_type_B, blood_type_O
| blood_type | blood_type_AB | blood_type_B | blood_type_O |
|---|---|---|---|
| A | 0 | 0 | 0 |
| B | 0 | 1 | 0 |
| O | 0 | 0 | 1 |
| A | 0 | 0 | 0 |
| AB | 1 | 0 | 0 |
Rows with the reference category A have all dummy variables set to 0.
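The reason one category is dropped can be made concrete with this example. Suppose a fourth variable, blood_type_A, were also created (hypothetical; the tab does not generate it). Writing d for each dummy variable, the four dummies would sum to 1 in every row, which is exactly the intercept column of a regression model:

```latex
d_{\mathrm{A}} + d_{\mathrm{AB}} + d_{\mathrm{B}} + d_{\mathrm{O}} = 1
\quad \text{for every row}
\;\Longrightarrow\;
d_{\mathrm{A}} = 1 - d_{\mathrm{AB}} - d_{\mathrm{B}} - d_{\mathrm{O}}
```

Since the fourth dummy is fully determined by the intercept and the other three, it carries no additional information. In a model with an intercept, the coefficients of the remaining dummies are then read as differences from the reference category A.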
Output Dataset
With the sample data above, setting name to Not included, age to Include as-is, and both blood_type and education to Dummy code produces the following output.
| age | blood_type_AB | blood_type_B | blood_type_O | education_graduate | education_high_school |
|---|---|---|---|---|---|
| 28 | 0 | 0 | 0 | 0 | 0 |
| 35 | 0 | 1 | 0 | 1 | 0 |
| 42 | 0 | 0 | 1 | 0 | 0 |
| 31 | 0 | 0 | 0 | 0 | 1 |
| 26 | 1 | 0 | 0 | 1 | 0 |
blood_type (4 categories) produces 3 dummy variables, and education (3 categories: college, graduate, high_school; reference: college) produces 2. Column names follow the {original_column_name}_{category_name} format, with data type int64 (0 or 1). Row count stays the same. The original dataset is not modified; results are saved as a new derived dataset.
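As a rough cross-check of the table above, the following sketch applies the same column actions to the sample rows in plain Python. The action keys (`"not_included"`, `"include"`, `"dummy"`, `"dummy_keep"`) and the function `build_output` are hypothetical names chosen for this illustration, and placing a kept original column before its dummies is an assumption.

```python
rows = [
    {"name": "Alice", "age": 28, "blood_type": "A", "education": "college"},
    {"name": "Bob", "age": 35, "blood_type": "B", "education": "graduate"},
    {"name": "Carol", "age": 42, "blood_type": "O", "education": "college"},
    {"name": "Dave", "age": 31, "blood_type": "A", "education": "high_school"},
    {"name": "Eve", "age": 26, "blood_type": "AB", "education": "graduate"},
]

# Hypothetical keys standing in for the Action dropdown values.
actions = {
    "name": "not_included",
    "age": "include",
    "blood_type": "dummy",
    "education": "dummy",
}


def build_output(rows, actions):
    """Build new output rows; the input rows are left untouched."""
    output = [{} for _ in rows]
    for column, action in actions.items():
        if action == "not_included":
            continue
        if action in ("include", "dummy_keep"):
            # Copy the original column (assumed to precede its dummies).
            for out_row, row in zip(output, rows):
                out_row[column] = row[column]
        if action in ("dummy", "dummy_keep"):
            # Sorted unique categories; the first one is the reference.
            categories = sorted({row[column] for row in rows})
            for category in categories[1:]:
                for out_row, row in zip(output, rows):
                    out_row[f"{column}_{category}"] = int(row[column] == category)
    return output


output = build_output(rows, actions)
print(list(output[0].keys()))
for out_row in output:
    print(list(out_row.values()))
```

Changing an action to `"dummy_keep"` in this sketch mirrors Dummy code (keep original): the original column is retained alongside its dummy variables.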
Notes
- The reference category is always the alphabetically first category and cannot be manually selected.
- Missing values are preserved as missing in dummy variables and are not counted as unique values.
- Columns with only one unique category value cannot be dummy coded.
- Boolean columns are already equivalent to 0/1, so use Include as-is to include them directly.
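The notes on missing values and single-category columns can be illustrated with a small sketch. It is not the application's code: the function name `dummy_code_with_missing` is hypothetical, missing values are represented by Python's `None`, and raising `ValueError` is just one assumed way to signal that a column cannot be dummy coded.

```python
def dummy_code_with_missing(values, column_name):
    """Dummy code a column, keeping missing values (None) as missing."""
    # Missing values are not counted as categories.
    categories = sorted({value for value in values if value is not None})

    # A single category leaves nothing after the reference is removed.
    if len(categories) < 2:
        raise ValueError(f"{column_name!r} needs at least two categories")

    # Skip the reference (first) category; missing rows stay missing.
    return {
        f"{column_name}_{category}": [
            None if value is None else int(value == category) for value in values
        ]
        for category in categories[1:]
    }


result = dummy_code_with_missing(["A", None, "B", "A"], "blood_type")
print(result)
```

Here the second row stays missing in the generated column rather than being coded as 0, so it is not mistaken for the reference category.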
Next steps
- Linear Regression - Regression analysis using dummy variables
- Generalized Linear Model (GLM) - GLM using dummy variables
See also
- Column Type Conversion - Converting data types
Footnotes
1. Creating dummy variables for all k categories would make them linearly dependent with the intercept, so one is omitted.