---
title: Dummy Coding
description: Convert categorical variables (nominal/ordinal) into binary dummy variables for use in regression analysis and GLM. Uses k-1 coding to avoid linear dependence with the intercept.
priority: 0.6
---

# Dummy Coding {#dummy-coding}

The Dummy Coding tab converts categorical variables (nominal/ordinal scale) into numeric dummy variables (0/1). Use this to prepare categorical variables as predictors for regression analysis or GLM.

## Basic Usage {#basic-usage}

### Opening Dummy Coding {#opening-dummy-coding}

Select **Data > Dummy Coding...** from the menu bar to open a new Dummy Coding tab.

### Configuring the Transformation {#configuring-the-transformation}

1. Select the target dataset from the **Dataset** dropdown
2. Configure each column's **Scale** and **Action**, and optionally the **Reference** (reference category), in the column table
3. Review each column's reference category and the number of dummy variables in the Encoding Preview section
4. Click **Create Dataset**
5. Enter a dataset name and click **OK**

In the Encoding Preview, check that the unique value count is what you expect, with no extras from typos or inconsistent labels, and that the reference category is the one you intend. A column with only one unique value cannot be encoded and shows an error.
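
The same check can be run on raw values before loading them. The following Python sketch is illustrative only and is not how the Encoding Preview is implemented: `preview_categories` is a hypothetical helper, and the column deliberately contains a mistyped label (`a` instead of `A`) that is not part of the sample data.

```python
from collections import Counter

def preview_categories(values):
    """Count each distinct non-missing value, sorted by label (hypothetical helper)."""
    counts = Counter(v for v in values if v is not None)
    return dict(sorted(counts.items()))

# Hypothetical column: "a" is a typo for "A" and shows up as an extra category.
blood_type = ["A", "B", "O", "a", "AB"]
preview = preview_categories(blood_type)

print(preview)       # {'A': 1, 'AB': 1, 'B': 1, 'O': 1, 'a': 1}
print(len(preview))  # 5 unique values where 4 were expected
```

An unexpected unique value count like this one usually means a label needs cleaning before the column is dummy coded.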

## Sample Data Used in This Page {#sample-data-used-in-this-page}

The examples on this page use a survey dataset ([survey.csv](../shared/files/dummy-coding-survey.csv)) with responses from 5 people. `blood_type` and `education` are categorical variables.

| name  | age | blood_type | education   |
|-------|-----|------------|-------------|
| Alice | 28  | A          | college     |
| Bob   | 35  | B          | graduate    |
| Carol | 42  | O          | college     |
| Dave  | 31  | A          | high_school |
| Eve   | 26  | AB         | graduate    |

## Column Configuration {#column-configuration}

Configure the Scale, Action, and Reference for each column.

### Scale {#scale}

| Scale    | Description    | Dummy Coding |
|----------|----------------|--------------|
| nominal  | Nominal scale  | Available    |
| ordinal  | Ordinal scale  | Available    |
| interval | Interval scale | Not available (used as numeric) |
| ratio    | Ratio scale    | Not available (used as numeric) |

Dummy coding does not use the order defined on an ordinal column (low → mid → high, for example). Both the order in which dummy variables are generated and the default reference category follow the alphabetical order of the category names. To use a specific category as the baseline, select it in [Reference](#reference).

The initial Scale is the measurement scale already set on the column. For columns without a scale, it is inferred from the data type: string and boolean columns become nominal, and numeric and date/time columns become interval. See [Data Types and Measurement Scales](concepts-data-types#auto-inference-from-data-types-to-measurement-scales) for details. Change the Scale from the dropdown as needed. For example, a code stored as a number, such as an equipment or line number, is inferred as interval; changing its Scale to nominal makes it available for dummy coding.
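
The inference rule and its effect on availability can be written as a small lookup. This is a minimal sketch, not the application's code: the function names and the data type labels (`"string"`, `"boolean"`, `"numeric"`, `"date/time"`) are assumptions chosen to mirror the wording on this page.

```python
def infer_scale(dtype):
    """Default measurement scale for a column that has no scale set (hypothetical helper)."""
    if dtype in {"string", "boolean"}:
        return "nominal"
    if dtype in {"numeric", "date/time"}:
        return "interval"
    raise ValueError(f"unknown data type: {dtype}")

def can_dummy_code(scale, dtype):
    """Dummy coding needs a categorical scale; boolean columns are excluded."""
    return scale in {"nominal", "ordinal"} and dtype != "boolean"

# A line number stored as a number is inferred as interval ...
scale = infer_scale("numeric")
print(scale, can_dummy_code(scale, "numeric"))   # interval False

# ... and becomes available once its Scale is changed to nominal.
print(can_dummy_code("nominal", "numeric"))      # True
```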

### Action {#action}

| Action                     | Description                                             |
|----------------------------|---------------------------------------------------------|
| Not included               | Exclude from the output dataset                         |
| Include as-is              | Keep column as-is in the output                         |
| Dummy code                 | Convert to dummy variables (original column excluded)   |
| Dummy code (keep original) | Convert to dummy variables and keep the original column. Choose this to use the original categorical column for graphs or crosstabs in the output dataset |

Dummy code options are only available for categorical columns (nominal/ordinal). Boolean columns are excluded even though they are inferred as nominal[^2].

### Reference {#reference}

The reference category is the group that dummy variable coefficients are compared against. Choosing a meaningful baseline, such as a control group, a standard condition, or a pre-intervention level, makes the coefficients easier to interpret.

For columns with the Action set to Dummy code or Dummy code (keep original), select the reference category from the **Reference** dropdown. The default is the alphabetically first category.

## How It Works {#how-it-works}

From k categories, k-1 dummy variables are generated, omitting one category as the reference category[^3]. This scheme is called treatment coding (coding relative to a reference category).

[^3]: Creating dummy variables for all k categories would make their sum equal 1 in every row. The intercept column is also 1 in every row, so the dummy variables and the intercept would be linearly dependent and the regression coefficients could not be estimated uniquely. Omitting one category avoids this problem.
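
The reason for omitting one category can be checked on the `blood_type` values from the sample data. The Python sketch below is illustrative only; the variable names are assumptions, and it only demonstrates the row-sum identity rather than fitting a model.

```python
# blood_type values from the sample data on this page.
rows = ["A", "B", "O", "A", "AB"]
categories = sorted(set(rows))            # ['A', 'AB', 'B', 'O']

# One indicator column per category (k columns, no category omitted).
full = [[int(value == cat) for cat in categories] for value in rows]
intercept = [1] * len(rows)

# Every row has exactly one 1, so the k columns add up to the intercept column.
full_row_sums = [sum(row) for row in full]
print(full_row_sums == intercept)         # True

# Dropping the reference category (here the first one, "A") breaks that identity.
reduced = [row[1:] for row in full]
reduced_row_sums = [sum(row) for row in reduced]
print(reduced_row_sums)                   # [0, 1, 1, 0, 1]
```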

1. Extract unique category values[^4]
2. Sort alphabetically
3. Exclude the reference category. The default is the alphabetically first category, which can be changed from the [Reference](#reference) dropdown
4. Generate a dummy variable for each of the remaining k-1 categories
5. Code 1 for matching rows, 0 otherwise
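
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the application's implementation: `dummy_code` is a hypothetical function, missing values are represented as `None`, and Python's default string sort stands in for "alphabetical" (it orders by code point, so it may differ from the application for mixed-case or non-ASCII labels).

```python
def dummy_code(values, column, reference=None):
    """Treatment-code one categorical column (hypothetical helper).

    Returns the reference category and a dict of k-1 dummy columns.
    """
    # Steps 1-2: unique non-missing values, sorted.
    categories = sorted({v for v in values if v is not None})
    if len(categories) < 2:
        raise ValueError("a column with only one unique value cannot be dummy coded")

    # Step 3: the default reference is the first category after sorting.
    if reference is None:
        reference = categories[0]
    elif reference not in categories:
        raise ValueError(f"unknown reference category: {reference}")

    # Steps 4-5: one 0/1 column per remaining category; missing stays missing.
    dummies = {}
    for cat in categories:
        if cat == reference:
            continue
        dummies[f"{column}_{cat}"] = [
            None if v is None else int(v == cat) for v in values
        ]
    return reference, dummies

reference, dummies = dummy_code(["A", "B", "O", "A", "AB"], "blood_type")
print(reference)       # A
print(list(dummies))   # ['blood_type_AB', 'blood_type_B', 'blood_type_O']
```

Passing `reference="O"` instead would keep `blood_type_A`, `blood_type_AB`, and `blood_type_B`, and rows with type O would be all zeros.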

### Example {#example}

Converting the `blood_type` column (unique values: A, AB, B, O):

- Reference category: `A` (the default, alphabetically first)
- Dummy variables generated: `blood_type_AB`, `blood_type_B`, `blood_type_O`

| blood_type | blood_type_AB | blood_type_B | blood_type_O |
|------------|---------------|--------------|--------------|
| A          | 0             | 0            | 0            |
| B          | 0             | 1            | 0            |
| O          | 0             | 0            | 1            |
| A          | 0             | 0            | 0            |
| AB         | 1             | 0            | 0            |

Rows with the reference category `A` have all dummy variables set to 0.

When the dummy variables are used in a regression model, each dummy variable coefficient is an estimate of the difference from the reference category. In this example, the coefficient of `blood_type_B` estimates the difference in the response variable between type B and the reference type A. In a model with other predictors, it is that difference with the other predictors held constant.
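
In the simplest case, a least-squares model with an intercept and only the dummy variables of one column, each coefficient equals that category's mean response minus the reference category's mean. The sketch below computes those differences directly, using `age` from the sample data as the response purely for the arithmetic (it is not a meaningful analysis, and five rows are far too few to fit anything). `differences_from_reference` is a hypothetical helper.

```python
from statistics import mean

# Columns from the sample data on this page.
blood_type = ["A", "B", "O", "A", "AB"]
age = [28, 35, 42, 31, 26]

def differences_from_reference(groups, response, reference):
    """Group mean minus the reference group's mean, for every other group."""
    by_group = {}
    for group, y in zip(groups, response):
        by_group.setdefault(group, []).append(y)
    base = mean(by_group[reference])
    return {
        group: mean(ys) - base
        for group, ys in sorted(by_group.items())
        if group != reference
    }

# Default reference "A" (mean age 29.5).
print(differences_from_reference(blood_type, age, "A"))
# {'AB': -3.5, 'B': 5.5, 'O': 12.5}

# Changing the reference changes what every coefficient is compared against.
print(differences_from_reference(blood_type, age, "O"))
# {'A': -12.5, 'AB': -16, 'B': -7}
```

The fitted group means are the same under either reference; only the baseline of the comparison moves.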

[^2]: Boolean columns are already equivalent to 0/1, so select Include as-is to use them directly.
[^4]: Missing values are not counted as unique values. Rows where the original column is missing have missing values in all generated dummy variables and are dropped by [listwise deletion](concepts-missing-data) in later regression or GLM analyses. Columns with only one unique value cannot be dummy coded.

## Output Dataset {#output-dataset}

With the sample data above, setting `name` to Not included, `age` to Include as-is, and both `blood_type` and `education` to Dummy code produces the following output.

| age | blood_type_AB | blood_type_B | blood_type_O | education_graduate | education_high_school |
|-----|---------------|--------------|--------------|--------------------|-----------------------|
| 28  | 0             | 0            | 0            | 0                  | 0                     |
| 35  | 0             | 1            | 0            | 1                  | 0                     |
| 42  | 0             | 0            | 1            | 0                  | 0                     |
| 31  | 0             | 0            | 0            | 0                  | 1                     |
| 26  | 1             | 0            | 0            | 1                  | 0                     |

`blood_type` (4 categories) produces 3 dummy variables, and `education` (3 categories: college, graduate, high_school; reference: college) produces 2. Dummy variables are named `{original_column_name}_{category_name}`, have data type `int64` (0 or 1), and are assigned the ratio measurement scale. They can be used directly as predictors in regression or GLM and are not dummy coded again.

The row count stays the same. The original dataset is not modified; results are saved as a new [derived dataset](datasets#derived-dataset).
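
The whole transformation for this example can be reproduced with a short Python sketch. It is an illustration under stated assumptions rather than the application's code: `build_output` and the action keys (`"not_included"`, `"include"`, `"dummy"`, `"dummy_keep"`) are hypothetical names, the default reference is taken as the first category after sorting, and placing a kept original column before its dummy variables is a guess at the column order.

```python
# Sample data from this page.
rows = [
    {"name": "Alice", "age": 28, "blood_type": "A", "education": "college"},
    {"name": "Bob", "age": 35, "blood_type": "B", "education": "graduate"},
    {"name": "Carol", "age": 42, "blood_type": "O", "education": "college"},
    {"name": "Dave", "age": 31, "blood_type": "A", "education": "high_school"},
    {"name": "Eve", "age": 26, "blood_type": "AB", "education": "graduate"},
]

# Hypothetical keys standing in for the Action dropdown choices.
actions = {
    "name": "not_included",
    "age": "include",
    "blood_type": "dummy",
    "education": "dummy",
}

def build_output(rows, actions):
    """Apply one action per column and return new rows; the input is not modified."""
    out = [{} for _ in rows]
    for col, action in actions.items():
        if action == "not_included":
            continue
        if action in ("include", "dummy_keep"):
            for new, old in zip(out, rows):
                new[col] = old[col]
        if action in ("dummy", "dummy_keep"):
            cats = sorted({r[col] for r in rows if r[col] is not None})
            for cat in cats[1:]:  # cats[0] is the default reference category
                for new, old in zip(out, rows):
                    value = old[col]
                    new[f"{col}_{cat}"] = None if value is None else int(value == cat)
    return out

output = build_output(rows, actions)
print(list(output[0]))
for row in output:
    print(list(row.values()))
```

Printed row by row, the values match the output table above, and `rows` itself is left unchanged.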

## Next steps {#next-steps}

- **[Linear Regression](linear-regression)** - Regression analysis using dummy variables
- **[Generalized Linear Model (GLM)](glm)** - GLM using dummy variables

## See also {#see-also}

- **[Column Type Conversion](column-type-conversion)** - Converting data types
