Dummy Coding
The Dummy Coding tab converts categorical variables (nominal/ordinal scale) into numeric dummy variables (0/1). Use this to prepare categorical variables as predictors for regression analysis or GLM.
Basic Usage
Opening Dummy Coding
Select Data > Dummy Coding... from the menu bar to open a new Dummy Coding tab.
Configuring the Transformation
- Select the target dataset from the Dataset dropdown
- Configure each column's Scale and Action, and optionally the Reference (reference category), in the column table
- Review each column's reference category and the number of dummy variables in the Encoding Preview section
- Click Create Dataset
- Enter a dataset name and click OK
In the Encoding Preview, check that the unique value count is what you expect, with no extras from typos or inconsistent labels, and that the reference category is the one you intend. A column with only one unique value cannot be encoded and shows an error.
Sample Data Used in This Page
The examples on this page use a survey dataset (survey.csv) of 5 people. blood_type and education are categorical variables.
| name | age | blood_type | education |
|---|---|---|---|
| Alice | 28 | A | college |
| Bob | 35 | B | graduate |
| Carol | 42 | O | college |
| Dave | 31 | A | high_school |
| Eve | 26 | AB | graduate |
Column Configuration
Configure the Scale, Action, and Reference for each column.
Scale
| Scale | Description | Dummy Coding |
|---|---|---|
| nominal | Nominal scale | Available |
| ordinal | Ordinal scale | Available |
| interval | Interval scale | Not available (used as numeric) |
| ratio | Ratio scale | Not available (used as numeric) |
Even on a column set to the ordinal scale, the defined order (low → mid → high, for example) is not used in dummy coding. Both the order in which dummy variables are generated and the default reference category follow the alphabetical order of the category names. To base the coding on a specific category, select it in Reference.
The initial Scale is the measurement scale set on the column. For columns without a scale, it is inferred from the data type: string and boolean columns become nominal, and numeric and date/time columns become interval. See Data Types and Measurement Scales for details. Change the Scale from the dropdown as needed. A code recorded as a number (an equipment or line number, for example) is inferred as interval, but changing its Scale to nominal makes it available for dummy coding.
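The inference rule can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not the application's code: the function name `infer_scale` and the use of `None` for missing values are assumptions.

```python
from datetime import date, datetime, time


def infer_scale(values):
    """Guess a measurement scale from the Python types of the non-missing values."""
    present = [v for v in values if v is not None]  # assumption: None marks a missing value
    # bool is a subclass of int in Python, so it must be tested before the numeric check.
    if all(isinstance(v, (str, bool)) for v in present):
        return "nominal"
    if all(isinstance(v, (int, float, date, datetime, time)) for v in present):
        return "interval"
    raise ValueError("mixed or unsupported value types")


print(infer_scale(["A", "B", "O"]))  # string column
print(infer_scale([True, False]))    # boolean column
print(infer_scale([3, 1, 2, 3]))     # line numbers stored as integers
```

The last call returns interval even though the values are really labels, which is the case where the Scale needs to be changed to nominal by hand.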
Action
| Action | Description |
|---|---|
| Not included | Exclude from the output dataset |
| Include as-is | Keep column as-is in the output |
| Dummy code | Convert to dummy variables (original column excluded) |
| Dummy code (keep original) | Convert to dummy variables and keep the original column. Choose this to use the original categorical column for graphs or crosstabs in the output dataset |
Dummy code options are only available for categorical columns (nominal/ordinal). Boolean columns are excluded[^1].
Reference
The reference category is the group that dummy variable coefficients are compared against. Choosing a meaningful baseline (a control group, a standard condition, or a pre-intervention level) makes the coefficients easier to interpret.
For columns with the Action set to Dummy code or Dummy code (keep original), select the reference category from the Reference dropdown. The default is the alphabetically first category.
How It Works
From k categories, k-1 dummy variables are generated, omitting one category as the reference category[^2]. This scheme is called treatment coding (coding relative to a reference category).
- Extract unique category values[^3]
- Sort alphabetically
- Exclude the reference category. The default is the alphabetically first category, which can be changed from the Reference dropdown
- Generate a dummy variable for each of the remaining k-1 categories
- Code 1 for matching rows, 0 otherwise
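The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the application's implementation: the function name `dummy_code` is hypothetical, `None` stands in for a missing value, and "alphabetical" is approximated by Python's default string ordering.

```python
def dummy_code(column, values, reference=None):
    """Treatment-code one categorical column into k-1 columns of 0/1."""
    # Extract the unique non-missing categories and sort them.
    categories = sorted({v for v in values if v is not None})
    if len(categories) < 2:
        raise ValueError(f"{column!r} needs at least two unique values")
    # Exclude the reference category (default: first in sorted order).
    if reference is None:
        reference = categories[0]
    if reference not in categories:
        raise ValueError(f"{reference!r} is not a category of {column!r}")
    kept = [c for c in categories if c != reference]
    # One column per remaining category: 1 for matching rows, 0 otherwise.
    # Rows with a missing original value stay missing in every dummy column.
    return {
        f"{column}_{c}": [None if v is None else int(v == c) for v in values]
        for c in kept
    }


# blood_type values from the sample data on this page
blood_type = ["A", "B", "O", "A", "AB"]
for name, codes in dummy_code("blood_type", blood_type).items():
    print(name, codes)
```

Passing `reference="O"` instead would keep columns for A, AB, and B and leave type O rows as all zeros.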
Example
Converting the blood_type column (unique values: A, AB, B, O):
- Reference category: A (the default, alphabetically first)
- Dummy variables generated: blood_type_AB, blood_type_B, blood_type_O
| blood_type | blood_type_AB | blood_type_B | blood_type_O |
|---|---|---|---|
| A | 0 | 0 | 0 |
| B | 0 | 1 | 0 |
| O | 0 | 0 | 1 |
| A | 0 | 0 | 0 |
| AB | 1 | 0 | 0 |
Rows with the reference category A have all dummy variables set to 0.
When the dummy variables are used in a regression model, each dummy variable coefficient is an estimate of the difference from the reference category. In this example, the coefficient of blood_type_B estimates the difference in the response variable between type B and the reference type A. In a model with other predictors, this is the difference holding them constant.
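As a worked illustration of this interpretation, consider a linear model whose only predictors are the three blood_type dummies. The symbols are notation introduced here, not taken from the application: y is the response, D_AB, D_B, and D_O stand for blood_type_AB, blood_type_B, and blood_type_O, and the error term ε is assumed to have mean zero within each group.

```latex
\begin{aligned}
y &= \beta_0 + \beta_{AB}\,D_{AB} + \beta_{B}\,D_{B} + \beta_{O}\,D_{O} + \varepsilon \\
E[y \mid \text{type A}] &= \beta_0 \\
E[y \mid \text{type B}] &= \beta_0 + \beta_{B} \\
\beta_{B} &= E[y \mid \text{type B}] - E[y \mid \text{type A}]
\end{aligned}
```

The intercept is the expected response for the reference category, and each dummy coefficient is the shift from that baseline.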
Output Dataset
With the sample data above, setting name to Not included, age to Include as-is, and both blood_type and education to Dummy code produces the following output.
| age | blood_type_AB | blood_type_B | blood_type_O | education_graduate | education_high_school |
|---|---|---|---|---|---|
| 28 | 0 | 0 | 0 | 0 | 0 |
| 35 | 0 | 1 | 0 | 1 | 0 |
| 42 | 0 | 0 | 1 | 0 | 0 |
| 31 | 0 | 0 | 0 | 0 | 1 |
| 26 | 1 | 0 | 0 | 1 | 0 |
blood_type (4 categories) produces 3 dummy variables, and education (3 categories: college, graduate, high_school; reference: college) produces 2. Column names follow the {original_column_name}_{category_name} format; each dummy column has data type int64 (0 or 1) and a ratio measurement scale. The dummy columns can be used directly as predictors in regression or GLM and are not dummy coded again. The row count is unchanged. The original dataset is not modified; the result is saved as a new derived dataset.
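The output table can be reproduced with a short Python sketch that applies one action per column to the sample data. The action labels (`"exclude"`, `"keep"`, `"dummy"`) and the function name `build_output` are hypothetical stand-ins for the Action settings. The sketch assumes no missing values and always uses the default reference category.

```python
rows = [
    {"name": "Alice", "age": 28, "blood_type": "A", "education": "college"},
    {"name": "Bob", "age": 35, "blood_type": "B", "education": "graduate"},
    {"name": "Carol", "age": 42, "blood_type": "O", "education": "college"},
    {"name": "Dave", "age": 31, "blood_type": "A", "education": "high_school"},
    {"name": "Eve", "age": 26, "blood_type": "AB", "education": "graduate"},
]

# Hypothetical labels for: Not included, Include as-is, Dummy code
actions = {"name": "exclude", "age": "keep", "blood_type": "dummy", "education": "dummy"}


def build_output(rows, actions):
    """Return new rows; the input rows are left untouched."""
    out = [{} for _ in rows]
    for column, action in actions.items():
        values = [row[column] for row in rows]
        if action == "keep":
            for target, v in zip(out, values):
                target[column] = v
        elif action == "dummy":
            categories = sorted(set(values))
            for category in categories[1:]:  # skip the first (reference) category
                for target, v in zip(out, values):
                    target[f"{column}_{category}"] = int(v == category)
        # "exclude": the column is simply not copied
    return out


output = build_output(rows, actions)
print(list(output[0]))  # column names
for row in output:
    print(list(row.values()))
```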
Next steps
- Linear Regression - Regression analysis using dummy variables
- Generalized Linear Model (GLM) - GLM using dummy variables
See also
- Column Type Conversion - Converting data types
Footnotes
[^1]: Boolean columns are already equivalent to 0/1, so select Include as-is to use them directly.
[^2]: Creating dummy variables for all k categories would make their sum equal 1 in every row. The intercept is also 1 in every row, so the two would be linearly dependent and the regression coefficients could not be estimated uniquely. Omitting one category avoids this problem.
[^3]: Missing values are not counted as unique values. Rows where the original column is missing have missing values in all generated dummy variables and are dropped by listwise deletion in later regression or GLM analyses. Columns with only one unique value cannot be dummy coded.