Generalized Linear Model (GLM)
The GLM tab performs regression analysis using generalized linear models. A GLM is defined by a distribution family, a linear predictor $\eta = X\beta$, and a link function $g(\mu) = \eta$, extending OLS to the exponential family of distributions. See GLM Fundamentals for the mathematical background.
OLS regression in the Linear Regression tab is a special case of GLM (Gaussian family with identity link). See Linear Regression for details on OLS.
Distribution Families and Link Functions
The distribution families and link functions available in MIDAS are listed below.
Choosing a Distribution Family
| Family | UI Label | Variance Function | Use Case |
|---|---|---|---|
| Gaussian | Gaussian (Normal) | $V(\mu) = 1$ | Continuous response variable. Equivalent to standard linear regression |
| Binomial | Binomial (Logistic) | $V(\mu) = \mu(1 - \mu)$ | Binary data (0/1) or proportion data. Logistic regression |
| Poisson | Poisson (Count) | $V(\mu) = \mu$ | Count data (event occurrences). Assumes variance equals the mean |
| Gamma | Gamma (Positive Continuous) | $V(\mu) = \mu^2$ | Positive continuous values with right-skewed distributions (wait times, costs) |
| Negative Binomial | Negative Binomial (Overdispersed Count) | $V(\mu) = \mu + \mu^2/\theta$ | Overdispersed count data. Use when the Poisson equidispersion assumption does not hold |
The choice of family is driven by the nature of the response variable. Binary outcomes call for Binomial, non-negative integers for Poisson (or Negative Binomial if overdispersed), and positive continuous values with variance proportional to the square of the mean for Gamma.
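As a rough illustration of how the families differ, the sketch below encodes the standard textbook variance functions in Python. The function name `variance` and the family keys are hypothetical and are not part of MIDAS; the Negative Binomial form assumes the parameterization in which large $\theta$ approaches Poisson.

```python
def variance(family: str, mu: float, theta: float = 1.0) -> float:
    """Return V(mu) for a family; theta is used only by the Negative Binomial."""
    if family == "gaussian":
        return 1.0                      # constant variance
    if family == "binomial":
        return mu * (1.0 - mu)          # largest at mu = 0.5
    if family == "poisson":
        return mu                       # variance equals the mean
    if family == "gamma":
        return mu ** 2                  # variance grows with the squared mean
    if family == "negative_binomial":
        return mu + mu ** 2 / theta     # Poisson variance plus an overdispersion term
    raise ValueError(f"unknown family: {family}")


# Compare how quickly the variance grows at the same mean.
for fam in ("poisson", "gamma", "negative_binomial"):
    print(fam, variance(fam, mu=4.0, theta=2.0))
```

Comparing the printed values at a few different means gives a feel for which mean-variance relationship resembles the data at hand.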
The Binomial family supports both individual 0/1 data (Binary) and aggregated successes/trials data (Grouped). See Grouped Binomial GLM with Dose-Response Data for details.
Link Functions
The link function is a monotonic function connecting the linear predictor to the expected value of the response. Each family has a default link function, which is the canonical link for every family except Negative Binomial (where the log link is the conventional default).
| Family | Default Link | Available Links |
|---|---|---|
| Gaussian | Identity | Identity, Log |
| Binomial | Logit | Logit, Probit |
| Poisson | Log | Log, Identity |
| Gamma | Inverse | Inverse, Log, Identity |
| Negative Binomial | Log | Log |
| Link Function | Formula | Description |
|---|---|---|
| Identity | $g(\mu) = \mu$ | No transformation. Canonical link for Gaussian |
| Logit | $g(\mu) = \log\frac{\mu}{1 - \mu}$ | Log-odds transformation. Canonical link for Binomial |
| Log | $g(\mu) = \log \mu$ | Log transformation. Canonical link for Poisson and default link for Negative Binomial. Ensures $\mu > 0$ |
| Inverse | $g(\mu) = 1/\mu$ | Reciprocal transformation. Canonical link for Gamma |
| Probit | $g(\mu) = \Phi^{-1}(\mu)$ | Inverse CDF of the standard normal distribution. Corresponds to a latent normal variable model |
The canonical link provides stable maximum likelihood estimation. Non-canonical links may be chosen for easier coefficient interpretation but can lead to convergence issues. See GLM Fundamentals for the mathematical properties of canonical links.
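The relationship between a link and its inverse can be sketched in a few lines of Python. This is only an illustration using the standard library; the `LINKS` table and `round_trip` helper are hypothetical names, not MIDAS functions.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for the probit link

# Each entry maps a link name to (g, g_inverse): g takes mu to eta, g_inverse takes eta back to mu.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
    "log": (math.log, math.exp),
    "inverse": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
    "probit": (_STD_NORMAL.inv_cdf, _STD_NORMAL.cdf),
}


def round_trip(name: str, mu: float) -> float:
    """Apply the link and then its inverse; the result should equal mu."""
    g, g_inv = LINKS[name]
    return g_inv(g(mu))


for name in LINKS:
    g, _ = LINKS[name]
    print(f"{name:>8}: eta = {g(0.3):+.4f}, back to mu = {round_trip(name, 0.3):.4f}")
```

Predictions on the response scale are always obtained by applying the inverse link to the linear predictor, which is why the inverse matters as much as the link itself.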
Basic Usage
The examples below use the Auto MPG dataset.
Opening GLM
Select Analysis > Generalized Linear Model (GLM)... from the menu bar.
Setting Up Variables
Dataset selects the dataset to analyze.
Dependent Variable (Y) selects the response variable. Numeric columns (int64, float64) and boolean columns are available. Boolean values are automatically converted to true=1, false=0. For the Binomial family, use a 0/1 or boolean column.
Independent Variables (X) selects predictor variables using checkboxes. Columns with categorical scales (nominal/ordinal) or date/datetime types are not selectable. To use categorical variables, convert them to numeric dummy variables using the Dummy Coding tab first (see Notes).
Distribution Family selects the distribution family. Changing the family automatically switches the link function to the default link for that family.
Link Function selects the link function. Available options depend on the selected family.
Include intercept toggles the intercept term. Enabled by default.

Negative Binomial Settings
When the Negative Binomial family is selected, options for the shape parameter $\theta$ appear. The Negative Binomial variance is $\mathrm{Var}(Y) = \mu + \mu^2/\theta$, where $\theta$ controls the degree of overdispersion.
- Automatic (default): $\theta$ is estimated using profile likelihood. An outer loop optimizes $\theta$ while an inner IRLS loop estimates $\beta$
- Manual: Check Manually specify θ and enter a value (0.1 to 100, default 1.0). Useful for sensitivity analysis or model comparison
Interpreting $\theta$:
- $\theta \to \infty$: Converges to Poisson ($\mathrm{Var}(Y) = \mu$)
- : Moderate overdispersion
- : Strong overdispersion
- : Extreme overdispersion
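To get a feel for how the shape parameter changes the variance, the following sketch assumes the usual parameterization $\mathrm{Var}(Y) = \mu + \mu^2/\theta$ and prints the variance-to-mean ratio for a few illustrative values of $\theta$. The values are arbitrary examples, not MIDAS thresholds.

```python
def nb_variance(mu: float, theta: float) -> float:
    """Negative Binomial variance under the mu + mu^2 / theta parameterization."""
    return mu + mu ** 2 / theta


mu = 5.0
for theta in (0.5, 5.0, 50.0, 5000.0):
    ratio = nb_variance(mu, theta) / mu  # a ratio of 1.0 would be exactly Poisson
    print(f"theta={theta:>7}: Var/mean = {ratio:.3f}")
```

The ratio equals $1 + \mu/\theta$, so the same $\theta$ implies more overdispersion at larger means.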
Advanced Options
- Max Iterations: Maximum number of IRLS iterations (default: 100)
- Convergence Tolerance: Convergence threshold based on maximum absolute change in coefficients (default: 1e-6)
Running the Analysis
Click the Run GLM button.
Parameter estimation uses IRLS (Iteratively Reweighted Least Squares; see algorithm details). The progress dialog shows the deviance at each iteration. Click Cancel to stop the analysis, and use Save as Dataset to save the convergence history.
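The idea behind IRLS can be sketched for the Poisson family with a log link. This is a minimal pure-Python illustration, not the MIDAS implementation: the names `solve` and `irls_poisson` are hypothetical, the first column of `X` is assumed to be the intercept, and the stopping rule mirrors the maximum-absolute-coefficient-change criterion described under Advanced Options.

```python
import math


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def irls_poisson(X, y, max_iter=100, tol=1e-6):
    """Fit a Poisson GLM with log link; returns (beta, iterations, converged)."""
    p = len(X[0])
    beta = [0.0] * p
    beta[0] = math.log(sum(y) / len(y))  # start at the intercept-only fit
    for it in range(1, max_iter + 1):
        eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
        mu = [math.exp(e) for e in eta]
        # For Poisson with log link: weight w = mu, working response z = eta + (y - mu) / mu.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        xtwx = [[sum(w * r[j] * r[k] for w, r in zip(mu, X)) for k in range(p)]
                for j in range(p)]
        xtwz = [sum(w * r[j] * zi for w, r, zi in zip(mu, X, z)) for j in range(p)]
        new_beta = solve(xtwx, xtwz)  # weighted least squares step
        change = max(abs(nb - ob) for nb, ob in zip(new_beta, beta))
        beta = new_beta
        if change < tol:
            return beta, it, True
    return beta, max_iter, False


x = [0, 1, 2, 3, 4]
y = [1, 2, 4, 7, 12]
X = [[1.0, xi] for xi in x]
beta, iterations, converged = irls_poisson(X, y)
print(beta, iterations, converged)
```

Each pass solves an ordinary weighted least squares problem, which is why poorly scaled predictors or extreme fitted values can destabilize the iteration.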
Understanding Results

Model Summary
| Metric | Description |
|---|---|
| Convergence | Whether IRLS converged (with iteration count) |
| Deviance | Residual deviance $D = 2\,[\ell_{\text{sat}} - \ell(\hat{\beta})]$. A goodness-of-fit measure based on the log-likelihood difference from the saturated model |
| AIC | Akaike Information Criterion $\mathrm{AIC} = -2\,\ell(\hat{\beta}) + 2k$, where $k$ is the number of estimated parameters. Used for comparing models. Lower values indicate a better trade-off between fit and complexity |
| Shape Parameter ($\theta$) | Negative Binomial only. Indicates whether $\theta$ was estimated or manually specified |
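For the Poisson family, deviance and AIC can be computed by hand as in the sketch below. The function names are hypothetical, and the parameter count passed to `aic` is an assumption of the example; how MIDAS counts parameters (for instance, whether an estimated shape parameter is included) is not stated here.

```python
import math


def poisson_log_likelihood(y, mu):
    """Sum of log P(Y = y_i) for Poisson means mu_i."""
    return sum(yi * math.log(mi) - mi - math.lgamma(yi + 1) for yi, mi in zip(y, mu))


def poisson_deviance(y, mu):
    """Twice the log-likelihood gap to the saturated model (which sets mu_i = y_i)."""
    total = 0.0
    for yi, mi in zip(y, mu):
        term = yi * math.log(yi / mi) if yi > 0 else 0.0  # y * log(y / mu) is 0 at y = 0
        total += term - (yi - mi)
    return 2.0 * total


def aic(log_lik: float, n_params: int) -> float:
    return -2.0 * log_lik + 2.0 * n_params


y = [1, 2, 4, 7, 12]
mu = [1.2, 2.1, 3.9, 7.1, 11.7]  # hypothetical fitted means
ll = poisson_log_likelihood(y, mu)
print(f"deviance = {poisson_deviance(y, mu):.4f}, AIC = {aic(ll, n_params=2):.4f}")
```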
Coefficients
| Column | Description |
|---|---|
| Variable | Variable name (intercept shown as "(Intercept)") |
| Estimate | Estimated regression coefficient (on link function scale) |
| Std. Error | Wald standard error |
| z value | Wald statistic $z = \hat{\beta}_j / \mathrm{SE}(\hat{\beta}_j)$. Asymptotically follows the standard normal distribution |
| Pr(>|z|) | Two-sided p-value from the standard normal distribution |
| (Significance marks) | *** p<0.001, ** p<0.01, * p<0.05, . p<0.1 |
| Lower 95% / Upper 95% | Wald-based 95% confidence interval |
Unlike OLS t-tests (exact in finite samples when the errors are normal), the Wald test in GLM is an asymptotic approximation. Profile likelihood confidence intervals are more accurate for small samples, but MIDAS uses the Wald approximation.
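The Wald quantities in the coefficients table follow directly from an estimate and its standard error. The sketch below uses hypothetical numbers and a hypothetical helper name `wald_summary`; it relies only on `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist


def wald_summary(estimate: float, std_error: float, level: float = 0.95):
    """Return (z, two-sided p-value, lower, upper) for one coefficient."""
    normal = NormalDist()
    z = estimate / std_error
    p_value = 2.0 * (1.0 - normal.cdf(abs(z)))
    crit = normal.inv_cdf(0.5 + level / 2.0)  # about 1.96 for a 95% interval
    return z, p_value, estimate - crit * std_error, estimate + crit * std_error


z, p, lower, upper = wald_summary(estimate=0.8, std_error=0.25)
print(f"z = {z:.2f}, p = {p:.4f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Because the interval is symmetric around the estimate on the link scale, it becomes asymmetric once transformed back to the response scale.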
Interpreting Coefficients
Coefficients are estimated on the link function scale, so interpretation requires considering the inverse link function.
- Identity link: $\beta_j$ is the change in $\mu$ per unit change in $x_j$ (same as OLS)
- Logit link: $\beta_j$ is the change in log-odds. $e^{\beta_j}$ is the odds ratio
- Log link: $\beta_j$ is the change in $\log \mu$. $e^{\beta_j}$ is the multiplicative change in $\mu$ (rate ratio)
- Inverse / Probit link: Direct interpretation is difficult; interpretation through predicted values is more practical
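As a small worked example of the logit and log cases, exponentiating a coefficient turns it into a ratio. The coefficient values below are hypothetical and chosen only for illustration.

```python
import math

beta_logit = 0.405   # hypothetical coefficient from a logit-link model
beta_log = -0.105    # hypothetical coefficient from a log-link model

odds_ratio = math.exp(beta_logit)  # multiplicative change in the odds per unit of x
rate_ratio = math.exp(beta_log)    # multiplicative change in the mean per unit of x

print(f"odds ratio per unit increase: {odds_ratio:.2f}")
print(f"rate ratio per unit increase: {rate_ratio:.2f} "
      f"({(rate_ratio - 1.0) * 100:.1f}% change in the mean)")
```

The same transformation applied to the confidence limits gives an interval for the ratio.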
The coefficients table can be saved as a dataset using the Save as Dataset button for export to CSV.
Saving and Diagnostics
Saving the Model
Enter a model name in the Model Name field and click Save Model. The model name defaults to the format "GLM: Y ~ X1 + X2 (Family, link)".
If a saved model with the same configuration (dataset, response variable, predictor variables, family, link function) already exists, a dialog asks for confirmation before overwriting it.
Data Generated on Save
Saving a model automatically creates a derived dataset that adds diagnostic columns to the original data.
| Column | Symbol | Description |
|---|---|---|
| fitted_values | $\hat{\mu}_i$ | Predicted values (on the response scale) |
| deviance_residuals | $r_{D,i}$ | Deviance residuals |
| pearson_residuals | $r_{P,i}$ | Pearson residuals |
| standardized_residuals | $r_{S,i}$ | Standardized residuals (deviance-based) |
| leverage | $h_{ii}$ | Leverage (diagonal of the hat matrix) |
| cooks_distance | $D_i$ | Cook's Distance |
Diagnostics and Details
After saving the model, two buttons appear:
- View Model Details - Opens the Model Detail tab showing detailed model information
- View Diagnostics - Opens the GLM Diagnostics tab showing diagnostic plots
Diagnostic Plots
Clicking View Diagnostics displays four diagnostic plots. As with OLS, check linearity, constant variance, and outlier influence.

Residual Type Selection
Select the residual type: Deviance (default) or Pearson. Switching updates all four plots immediately.
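- Deviance Residuals: $r_{D,i} = \operatorname{sign}(y_i - \hat{\mu}_i)\sqrt{d_i}$, where $d_i$ is the contribution of observation $i$ to the deviance. Likelihood-based residuals and the default in MIDAS
- Pearson Residuals: $r_{P,i} = (y_i - \hat{\mu}_i)/\sqrt{V(\hat{\mu}_i)}$. Observed-minus-expected scaled by the square root of the variance function. Useful for diagnosing overdispersion, as the Pearson $\chi^2$ statistic is used to estimate the dispersion parameter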
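The two residual types can be compared on a single observation. The sketch below covers the Poisson family only and uses the standard textbook definitions; `poisson_residuals` is a hypothetical helper, not a MIDAS function.

```python
import math


def poisson_residuals(y: float, mu: float):
    """Return (deviance residual, Pearson residual) for one Poisson observation."""
    # Unit deviance d = 2 * [y * log(y / mu) - (y - mu)], with y * log(y / mu) = 0 when y = 0.
    term = y * math.log(y / mu) if y > 0 else 0.0
    d = 2.0 * (term - (y - mu))
    deviance_res = math.copysign(math.sqrt(max(d, 0.0)), y - mu)
    pearson_res = (y - mu) / math.sqrt(mu)  # V(mu) = mu for Poisson
    return deviance_res, pearson_res


for y_obs, mu_hat in [(0, 2.0), (2, 2.0), (4, 2.0), (9, 2.0)]:
    dev, pea = poisson_residuals(y_obs, mu_hat)
    print(f"y={y_obs}, mu={mu_hat}: deviance={dev:+.3f}, Pearson={pea:+.3f}")
```

The two agree closely near the fitted mean and diverge in the tails, where the Pearson residual grows faster for large counts.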
Residuals vs Fitted
Plots residuals against fitted values $\hat{\mu}$. Random scatter around zero indicates adequate model fit.
- Curved pattern: The link function may be inappropriate, or nonlinear effects of predictors may be missing
- Funnel-shaped pattern: The variance function may be inappropriate (e.g., Poisson's $V(\mu) = \mu$ does not match the data)
Normal Q-Q Plot
Shown only for Gaussian family. Plots standardized residual quantiles against theoretical normal quantiles.
For non-Gaussian families, deviance residuals are not guaranteed to be asymptotically normal (particularly for binary Binomial data). Instead of the plot, the message "This plot is only shown for Gaussian family GLMs." is displayed.
Scale-Location
Plots $\sqrt{\lvert r_i \rvert}$ (the square root of the absolute standardized residuals) against fitted values. Constant variance is indicated by points spreading evenly in the horizontal direction.
An upward trend suggests variance depends on fitted values. Since GLM explicitly models the mean-variance relationship through the variance function $V(\mu)$, patterns in this plot suggest the chosen family's variance function does not match the data well.
Residuals vs Leverage
Plots standardized residuals against leverage $h_{ii}$ (diagonal elements of the hat matrix). Cook's (1977) distance contours are displayed at $D = 0.5$ (orange dashed) and $D = 1$ (red dashed).
- Leverage: Measures how far an observation's predictor values are from others. $h_{ii} > 2p/n$ ($p$ = number of parameters, $n$ = number of observations) indicates high leverage
- Cook's Distance: $D_i = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}$, where $r_i$ is the standardized residual. $D_i > 0.5$ warrants attention; $D_i > 1$ indicates strong influence
Observations outside the contour lines may substantially change the model estimates if removed.
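The way leverage and residual size combine into Cook's distance can be sketched as follows. The formula is the common textbook form based on a standardized residual, and the cutoffs ($2p/n$ for leverage, 0.5 and 1 for Cook's distance) are the widely used rules of thumb; both are assumptions of this example, and the function names are hypothetical.

```python
def cooks_distance(std_residual: float, leverage: float, n_params: int) -> float:
    """Textbook Cook's distance from a standardized residual and a leverage value."""
    return (std_residual ** 2 / n_params) * leverage / (1.0 - leverage)


def influence_flags(std_residual: float, leverage: float, n_params: int, n_obs: int) -> dict:
    d = cooks_distance(std_residual, leverage, n_params)
    return {
        "cooks_distance": d,
        "high_leverage": leverage > 2.0 * n_params / n_obs,  # assumed 2p/n rule of thumb
        "warrants_attention": d > 0.5,                       # assumed cutoff
        "strong_influence": d > 1.0,                         # assumed cutoff
    }


# A large residual at a high-leverage point versus the same residual at a typical point.
print(influence_flags(std_residual=2.0, leverage=0.5, n_params=2, n_obs=20))
print(influence_flags(std_residual=2.0, leverage=0.1, n_params=2, n_obs=20))
```

A large residual alone is not enough to make a point influential; it also needs enough leverage to pull the fit toward itself.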
Point Selection
Click or rectangle-select data points on any plot to display details (fitted values, residuals, leverage, Cook's Distance, etc.) in a table below the plots. Selection state is synchronized across all four plots.
Deviance Goodness-of-Fit
For Poisson and Binomial families, the residual deviance approximately follows a $\chi^2$ distribution when the model is correctly specified. The Deviance Goodness-of-Fit chart displays the $\chi^2$ density curve and marks the observed deviance, allowing you to visually assess whether the deviance falls within the bulk of the distribution or in the extreme tail.
A deviance in the right tail suggests the model does not adequately capture the variability in the data. Consider whether important predictors are missing or whether the distributional assumption is appropriate. For Poisson data, switching to the Negative Binomial family may help. See GLM Fundamentals for the theoretical background.
For Binomial models with binary response data (trial size = 1), the $\chi^2$ approximation may not hold well. Use the other diagnostic plots to assess model fit in that case.
Prediction
Use a saved GLM model to generate predictions on new data.

Running Predictions
- Open the Model Detail tab via View Model Details
- Click the Predict button to open the GLM Prediction tab
- Select a dataset for prediction (only datasets with matching predictor column names are available)
- Configure output settings:
  - Output Dataset Name: Name for the prediction results dataset
  - Include Original Data: Whether to include original columns in the output
  - Confidence Levels: Confidence interval levels (90%, 95%, 99%)
  - Prediction Levels: Prediction interval levels (90%, 95%, 99%)
- Click Run Prediction to execute
Prediction Output
Prediction results are saved as a dataset containing:
- Predicted values (on the response scale)
- Confidence intervals (computed on the linear predictor scale, then transformed via the inverse link)
- Prediction intervals
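The confidence-interval construction described above (interval on the linear predictor scale, then inverse link) can be sketched for a log link. The inputs are hypothetical, and `mean_ci_log_link` is an illustrative helper rather than a MIDAS function.

```python
import math
from statistics import NormalDist


def mean_ci_log_link(eta: float, se_eta: float, level: float = 0.95):
    """Return (mu, lower, upper): a Wald interval on eta mapped through exp()."""
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    lower_eta = eta - crit * se_eta
    upper_eta = eta + crit * se_eta
    # The inverse link is monotonic, so the transformed endpoints bound the mean.
    return math.exp(eta), math.exp(lower_eta), math.exp(upper_eta)


mu, lower, upper = mean_ci_log_link(eta=1.0, se_eta=0.2)
print(f"predicted mean = {mu:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Building the interval on the link scale keeps the limits inside the valid range of the mean (positive for a log link, between 0 and 1 for a logit link), at the cost of an interval that is no longer symmetric around the prediction.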
When the prediction dataset contains the response variable, accuracy metrics (R², RMSE, MAE) are automatically calculated and displayed.
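For reference, the three accuracy metrics can be computed as in the sketch below. It assumes the usual definition $R^2 = 1 - SS_{res}/SS_{tot}$ on the response scale; whether MIDAS uses exactly this definition for GLM predictions is an assumption, and `accuracy_metrics` is a hypothetical name.

```python
import math


def accuracy_metrics(y_true, y_pred):
    """Return R^2, RMSE, and MAE for observed versus predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return {
        "r2": 1.0 - ss_res / ss_tot,          # assumed 1 - SS_res / SS_tot definition
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
    }


print(accuracy_metrics([1, 2, 3, 4], [2, 2, 3, 3]))
```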
Notes
Using Categorical Variables
GLM only accepts numeric variables. To use categorical (nominal/ordinal) or date/datetime variables as predictors, convert them to numeric dummy variables using the Dummy Coding tab before running the analysis.
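Conceptually, dummy coding turns one categorical column into several 0/1 columns. The sketch below shows the general technique with a dropped reference level; the `is_` column naming and the choice of reference level are assumptions of the example and may differ from what the Dummy Coding tab produces.

```python
def dummy_code(values, drop_first=True):
    """Return a dict of 0/1 indicator columns, one per level of the input."""
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]  # the first level becomes the reference category
    return {f"is_{level}": [1 if v == level else 0 for v in values] for level in levels}


origin = ["usa", "japan", "europe", "usa"]
print(dummy_code(origin))
```

Dropping one level avoids perfect collinearity with the intercept; each remaining coefficient is then interpreted relative to the reference category.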
Automatic Exclusion of Missing and Invalid Values
Rows containing missing values (null), non-numeric values, or infinity are automatically excluded from the analysis. The number of excluded rows is displayed in the Model Summary.
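The exclusion rule can be pictured as a simple row filter. The sketch below is an illustration of the idea only; the helper names are hypothetical, and treating NaN as invalid is an assumption of this example.

```python
import math


def is_usable(value) -> bool:
    """True for finite numbers; booleans count as numeric (True=1, False=0)."""
    return isinstance(value, (int, float)) and math.isfinite(value)


def drop_invalid_rows(rows):
    """Return (kept_rows, number_of_excluded_rows)."""
    kept = [row for row in rows if all(is_usable(v) for v in row)]
    return kept, len(rows) - len(kept)


rows = [
    [18.0, 8, 307.0],
    [None, 4, 120.0],          # missing value
    [22.0, "n/a", 98.0],       # non-numeric value
    [float("inf"), 6, 250.0],  # infinity
    [float("nan"), 4, 97.0],   # NaN, treated as invalid in this sketch
    [31.0, 4, 79.0],
]
kept, excluded = drop_invalid_rows(rows)
print(f"kept {len(kept)} rows, excluded {excluded}")
```

Because a row is dropped if any selected column is invalid, adding a sparsely populated predictor can shrink the analysis sample considerably.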
Convergence Issues
If IRLS fails to converge, check the following:
- Iteration count: Increase Max Iterations (e.g., 100 → 500)
- Tolerance: Relax Convergence Tolerance (e.g., 1e-6 → 1e-4)
- Scaling: Large differences in predictor scales can cause numerical instability. Consider standardizing
- Perfect separation: In logistic regression, when a predictor perfectly separates the response classes, the maximum likelihood estimate does not converge to a finite value (Albert & Anderson, 1984). Remove the offending predictor or verify the data
- Excess zeros: When count data contains an extreme number of zeros, Poisson or Negative Binomial models may struggle to fit adequately
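Perfect separation is easy to reproduce on a toy dataset. In the sketch below (hypothetical data, single predictor, no intercept), every negative `x` has `y = 0` and every positive `x` has `y = 1`, so the logistic log-likelihood keeps increasing as the coefficient grows and never reaches a maximum.

```python
import math


def log_likelihood(beta: float, x, y) -> float:
    """Logistic log-likelihood for a single-coefficient model without intercept."""
    total = 0.0
    for xi, yi in zip(x, y):
        eta = beta * xi
        # log p = -log(1 + exp(-eta)) and log(1 - p) = -log(1 + exp(eta))
        total += -math.log1p(math.exp(-eta)) if yi == 1 else -math.log1p(math.exp(eta))
    return total


x = [-2.0, -1.0, 1.0, 2.0]
y = [0, 0, 1, 1]  # x < 0 always gives 0, x > 0 always gives 1
for beta in (1.0, 5.0, 10.0, 20.0):
    print(f"beta={beta:>5}: log-likelihood = {log_likelihood(beta, x, y):.6g}")
```

In practice this shows up as very large coefficient estimates with enormous standard errors, or as IRLS hitting the iteration limit.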
References
- Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society: Series A, 135(3), 370-384.
- McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Chapman and Hall/CRC.
- Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics, 19(1), 15-18.
- Albert, A., & Anderson, J. A. (1984). On the existence of maximum likelihood estimates in logistic regression models. Biometrika, 71(1), 1-10.