Generalized Linear Model (GLM)

The GLM tab performs regression analysis using generalized linear models. A GLM is defined by a distribution family, a linear predictor $\eta = X\beta + \text{offset}$, and a link function $g(\mu) = \eta$, extending OLS to the exponential family of distributions. The offset term defaults to zero when not specified. See GLM Fundamentals for the mathematical background.

OLS regression in the Linear Regression tab is a special case of GLM (Gaussian family with identity link). See Linear Regression for details on OLS.

The distribution families and link functions available in MIDAS are listed below.

Choosing a Distribution Family

| Family | UI Label | Variance Function $V(\mu)$ | Use Case |
|---|---|---|---|
| Gaussian | Gaussian (Normal) | $1$ | Continuous response variable. Equivalent to standard linear regression |
| Binomial | Binomial (Logistic) | $\mu(1 - \mu)$ | Binary data (0/1) or proportion data. Logistic regression |
| Poisson | Poisson (Count) | $\mu$ | Count data (event occurrences). Assumes variance equals the mean |
| Gamma | Gamma (Positive Continuous) | $\mu^2$ | Positive continuous values with right-skewed distributions (wait times, costs) |
| Negative Binomial | Negative Binomial (Overdispersed Count) | $\mu + \mu^2/\theta$ | Overdispersed count data. Use when the Poisson equidispersion assumption $\operatorname{Var}(Y) = \mu$ does not hold |

The choice of family is driven by the nature of the response variable. Binary outcomes call for Binomial, non-negative integer counts for Poisson (or Negative Binomial if overdispersed), and positive continuous values whose variance grows with the square of the mean (constant coefficient of variation) for Gamma.
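
The variance functions in the table above can be written down directly; the following is a pure-Python sketch, where the dictionary keys are illustrative names rather than MIDAS identifiers:

```python
# Variance functions V(mu) for each family (theta is the Negative Binomial
# shape parameter). Names are illustrative, not MIDAS API identifiers.
variance_fn = {
    "gaussian":          lambda mu: 1.0,
    "binomial":          lambda mu: mu * (1.0 - mu),
    "poisson":           lambda mu: mu,
    "gamma":             lambda mu: mu ** 2,
    "negative_binomial": lambda mu, theta: mu + mu ** 2 / theta,
}

# With mu = 2 and theta = 1, the Negative Binomial variance (6.0) far
# exceeds the Poisson variance (2.0), reflecting overdispersion.
```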

The Binomial family supports both individual 0/1 data (Binary) and aggregated successes/trials data (Grouped). See Grouped Binomial GLM with Dose-Response Data for details.

The link function is a monotonic function $\eta = g(\mu)$ connecting the linear predictor to the expected value of the response. Except for Negative Binomial, each family's default link is its canonical link. The canonical link for Negative Binomial is $\log(\mu/(\mu+\theta))$, but when $\theta$ is estimated the link function changes at each iteration, making estimation unstable, so the log link is used as the default in practice.

| Family | Default Link | Available Links |
|---|---|---|
| Gaussian | Identity | Identity, Log |
| Binomial | Logit | Logit, Probit |
| Poisson | Log | Log, Identity |
| Gamma | Inverse | Inverse, Log, Identity |
| Negative Binomial | Log | Log |

| Link Function | Formula | Description |
|---|---|---|
| Identity | $\eta = \mu$ | No transformation. Canonical link for Gaussian |
| Logit | $\eta = \log\bigl(\mu / (1 - \mu)\bigr)$ | Log-odds transformation. Canonical link for Binomial |
| Log | $\eta = \log(\mu)$ | Log transformation. Canonical link for Poisson. Ensures $\mu > 0$ |
| Inverse | $\eta = 1/\mu$ | Reciprocal transformation. Canonical link for Gamma |
| Probit | $\eta = \Phi^{-1}(\mu)$ | Inverse CDF of the standard normal distribution. Corresponds to a latent normal variable model |

The canonical link provides stable maximum likelihood estimation. Non-canonical links may be chosen for easier coefficient interpretation but can lead to convergence issues. See GLM Fundamentals for the mathematical properties of canonical links.
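
For reference, the links in the table can be paired with their inverses using only the Python standard library. This is an illustrative sketch, not MIDAS code:

```python
import math
from statistics import NormalDist

# Each entry maps a link name to (g, g_inverse). Names are illustrative.
links = {
    "identity": (lambda mu: mu,                      lambda eta: eta),
    "logit":    (lambda mu: math.log(mu / (1 - mu)), lambda eta: 1 / (1 + math.exp(-eta))),
    "log":      (lambda mu: math.log(mu),            lambda eta: math.exp(eta)),
    "inverse":  (lambda mu: 1 / mu,                  lambda eta: 1 / eta),
    "probit":   (NormalDist().inv_cdf,               NormalDist().cdf),
}

# Every link round-trips: g_inverse(g(mu)) recovers mu on the valid domain.
for g, g_inv in links.values():
    assert abs(g_inv(g(0.3)) - 0.3) < 1e-9
```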

Basic Usage

The examples below use the Auto MPG dataset.

Opening GLM

Select Analysis > Generalized Linear Model (GLM)... from the menu bar.

Setting Up Variables

Dataset selects the dataset to analyze.

Dependent Variable (Y) selects the response variable. Numeric columns (int64, float64) and boolean columns are available. Boolean values are automatically converted to true=1, false=0. For the Binomial family, use a 0/1 or boolean column.

Independent Variables (X) selects predictor variables using checkboxes. Columns with categorical scales (nominal/ordinal) or date/datetime types are not selectable. To use categorical variables, convert them to numeric dummy variables using the Dummy Coding tab first (see Notes).

Distribution Family selects the distribution family. Changing the family automatically switches the link function to that family's default link (the canonical link, except for Negative Binomial, which defaults to Log).

Link Function selects the link function. Available options depend on the selected family.

Include intercept toggles the intercept term. Enabled by default.

GLM Form

Negative Binomial Settings

When the Negative Binomial family is selected, options for the shape parameter $\theta$ appear. The Negative Binomial variance is $\operatorname{Var}(Y) = \mu + \mu^2/\theta$, where $\theta$ controls the degree of overdispersion.

  • Automatic (default): $\theta$ is estimated using profile likelihood. An outer loop optimizes $\theta$ while an inner IRLS loop estimates $\beta$
  • Manual: Check Manually specify θ and enter a value (0.1 to 100, default 1.0). Useful for sensitivity analysis or model comparison

Interpreting $\theta$:

  • $\theta \to \infty$: Converges to Poisson ($\operatorname{Var}(Y) \to \mu$)
  • $\theta \approx 10\text{--}100$: Moderate overdispersion
  • $\theta \approx 1\text{--}10$: Strong overdispersion
  • $\theta < 1$: Extreme overdispersion
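
To build intuition for the automatic estimation, here is a deliberately simplified sketch of profiling $\theta$ for an intercept-only model, using only the standard library. The real procedure refits $\beta$ via IRLS at each candidate $\theta$; here the mean is fixed at the sample mean:

```python
import math

def nb_loglik(y, mu, theta):
    """Negative Binomial log-likelihood at mean mu and shape theta."""
    return sum(
        math.lgamma(yi + theta) - math.lgamma(theta) - math.lgamma(yi + 1)
        + theta * math.log(theta / (theta + mu))
        + yi * math.log(mu / (theta + mu))
        for yi in y
    )

# Strongly overdispersed toy counts: mean 5, variance 25 (Poisson implies 5).
y = [0, 0, 0, 10, 10, 10]
mu = sum(y) / len(y)

# Profile a grid of theta values; the best one models the excess variance.
grid = [0.25, 0.5, 1.0, 1.25, 2.0, 5.0, 20.0, 100.0]
best_theta = max(grid, key=lambda t: nb_loglik(y, mu, t))

# A small theta wins here; a near-Poisson theta (100) fits far worse.
assert nb_loglik(y, mu, best_theta) > nb_loglik(y, mu, 100.0)
```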

Offset Variable

Offset Variable adds a known quantity to the linear predictor with a fixed coefficient of 1. The offset is not estimated from data; it is a fixed value for each observation.

A typical use case is rate modeling in Poisson regression. Set the count as the response variable and $\log(\text{exposure})$ as the offset to model rates instead of counts:

$\eta_i = X_i\beta + \log(\text{exposure}_i)$

This parameterization means that $\exp(\beta)$ is interpretable as a rate ratio.

Setting an offset also affects the null deviance. The null model becomes intercept + offset, so the null deviance differs from the case without an offset.

When predicting with a saved model that includes an offset, the prediction dataset must contain an offset column with the same name. The linear predictor for prediction is $\hat\eta_i = X_i\hat\beta + \text{offset}_i$.
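
The role of the offset can be seen in a small sketch. The coefficient values below are made up for illustration, not output from a real fit:

```python
import math

# Assumed (hypothetical) fitted Poisson coefficients for a rate model.
beta0, beta1 = -2.0, 0.3

def expected_count(x, exposure):
    # The offset enters the linear predictor with a fixed coefficient of 1.
    eta = beta0 + beta1 * x + math.log(exposure)
    return math.exp(eta)

# Doubling exposure doubles the expected count at fixed x ...
assert math.isclose(expected_count(1.0, 2.0), 2 * expected_count(1.0, 1.0))
# ... and exp(beta1) acts as a rate ratio per unit increase in x.
assert math.isclose(expected_count(2.0, 1.0) / expected_count(1.0, 1.0),
                    math.exp(beta1))
```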

Advanced Options

  • Max Iterations: Maximum number of IRLS iterations (default: 100)
  • Convergence Tolerance: Convergence threshold based on maximum absolute change in coefficients (default: 1e-6)

Running the Analysis

Click the Run GLM button.

Parameter estimation uses IRLS (Iteratively Reweighted Least Squares; see algorithm details). The progress dialog shows the deviance at each iteration. Click Cancel to stop the analysis, and use Save as Dataset to save the convergence history.
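
The IRLS update can be illustrated for the simplest case, a Poisson GLM with log link and a single predictor. This is a teaching sketch in pure Python, not MIDAS's implementation (which handles all families and links):

```python
import math

def irls_poisson_log(x, y, max_iter=100, tol=1e-6):
    """Teaching sketch of IRLS for a Poisson GLM with log link and one
    predictor (intercept + slope)."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0    # simple starting values
    for _ in range(max_iter):
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        w = mu                                 # IRLS weight for the canonical log link
        z = [e + (yi - mi) / mi                # working response
             for e, mi, yi in zip(eta, mu, y)]
        # Solve the 2x2 weighted normal equations X'WX b = X'Wz.
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx * swx
        nb0 = (swxx * swz - swx * swxz) / det
        nb1 = (sw * swxz - swx * swz) / det
        # Convergence: maximum absolute change in coefficients, as in MIDAS.
        if max(abs(nb0 - b0), abs(nb1 - b1)) < tol:
            return nb0, nb1
        b0, b1 = nb0, nb1
    return b0, b1

# y doubles with each unit of x, so the fit recovers slope log(2), intercept 0.
b0, b1 = irls_poisson_log([0, 1, 2, 3], [1, 2, 4, 8])
```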

Understanding Results

GLM Results

Model Summary

| Metric | Description |
|---|---|
| Convergence | Whether IRLS converged (with iteration count) |
| Deviance | Residual deviance $D = 2\bigl[\ell(y;\,y) - \ell(y;\,\hat\mu)\bigr]$. A goodness-of-fit measure based on the log-likelihood difference from the saturated model |
| AIC | Akaike Information Criterion $\text{AIC} = -2\ell + 2k$, where $k$ is the total number of estimated parameters. Used for comparing models within the same family; comparing AIC across different families is not recommended because the constant terms in the log-likelihood differ. Lower values indicate a better fit-complexity trade-off |
| Shape Parameter ($\theta$) | Negative Binomial only. $\theta$ controls the degree of overdispersion: smaller values indicate stronger overdispersion, while $\theta \to \infty$ converges to Poisson (see Negative Binomial Settings). Indicates whether $\theta$ was estimated or manually specified |

$k$ in the AIC formula is the number of regression coefficients (including the intercept). When $\theta$ is automatically estimated for Negative Binomial, $\theta$ is included in $k$ ($k = \text{number of regression coefficients} + 1$). When $\theta$ is manually specified, it is not included because it is not an estimated parameter. Be aware of this difference when comparing AIC between $\theta$-estimated and $\theta$-fixed models.
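
The parameter count can be made explicit in a short sketch (the function name is illustrative):

```python
def glm_aic(loglik, n_coef, theta_estimated=False):
    """AIC = -2*loglik + 2*k, where k counts the regression coefficients
    (including intercept) plus 1 if the NB theta was estimated from data."""
    k = n_coef + (1 if theta_estimated else 0)
    return -2.0 * loglik + 2.0 * k

# Same log-likelihood, theta estimated vs. manually fixed: AIC differs by 2.
```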

Coefficients

| Column | Description |
|---|---|
| Variable | Variable name (intercept shown as "(Intercept)") |
| Estimate | Estimated regression coefficient $\hat\beta$ (on the link function scale) |
| Std. Error | Wald standard error $\sqrt{\hat\phi \cdot \operatorname{diag}\bigl((X'\hat WX)^{-1}\bigr)}$. $\hat\phi$ is the dispersion parameter ($\hat\phi = 1$ for Poisson, Binomial, and Negative Binomial with estimated $\theta$) |
| z value / t value | Test statistic $\hat\beta / \operatorname{SE}(\hat\beta)$. For families that estimate the dispersion parameter $\phi$ from data (Gaussian, Gamma, Negative Binomial with fixed $\theta$), the reference distribution is $t(n-p)$ and the column header displays "t value". For families with $\phi = 1$ (Poisson, Binomial, Negative Binomial with estimated $\theta$), the reference distribution is standard normal and the column header displays "z value". Except for Gaussian + Identity, the test statistic follows the reference distribution only approximately in finite samples |
| Pr(>\|z\|) / Pr(>\|t\|) | Two-sided p-value. Based on the $t$-distribution for dispersion-estimating families, or the standard normal distribution for others |
| Lower N% / Upper N% | Confidence interval $\hat\beta \pm c \times \operatorname{SE}(\hat\beta)$, where N is the selected confidence level. $c$ is $t_{1-\alpha/2,\,n-p}$ for dispersion-estimating families, or $z_{1-\alpha/2}$ for others (e.g., $z_{0.975} = 1.96$ at the 95% level) |

For Gaussian + Identity, the test statistic follows $t(n-p)$ exactly when errors are normally distributed. For other dispersion-estimating families (Gaussian with non-identity link, Gamma, Negative Binomial with fixed $\theta$), the $t$-distribution is an approximation that accounts for the additional uncertainty from estimating $\phi$. For families with known dispersion ($\phi = 1$), the Wald test is an asymptotic approximation to the standard normal. Neither approximation is exact in finite samples; both improve as the sample size grows. Interpret results near borderline significance levels (p-values around 0.05) with caution.

For Negative Binomial with estimated $\theta$, $\phi = 1$ because the overdispersion is already modeled by the $\mu^2/\theta$ term in the variance function: there is no remaining overdispersion for $\phi$ to absorb, so standard errors are computed with $\phi = 1$. When $\theta$ is manually fixed, the specified value may not fully capture the data's overdispersion, so $\hat\phi = \text{Pearson }\chi^2/(n-p)$ is estimated instead. This follows R's MASS::glm.nb.
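
The Pearson-based dispersion estimate mentioned above is straightforward to sketch:

```python
def pearson_dispersion(residuals, n_params):
    """phi-hat = Pearson chi-square / (n - p); sketch of the estimator
    used for dispersion-estimating families."""
    n = len(residuals)
    return sum(r * r for r in residuals) / (n - n_params)

# Four Pearson residuals, two parameters: chi-square = 6, df = 2, phi-hat = 3.
```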

Interpreting Coefficients

Coefficients are estimated on the link function scale, so interpretation requires considering the inverse link function.

  • Identity link: $\beta$ is the change in $E[Y]$ per unit change in $X$ (same as OLS)
  • Logit link: $\beta$ is the change in log-odds; $\exp(\beta)$ is the odds ratio
  • Log link: $\beta$ is the change in $\log(\mu)$; $\exp(\beta)$ is the multiplicative change in $E[Y]$ (rate ratio)
  • Inverse / Probit link: Direct interpretation is difficult; interpretation through predicted values is more practical
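
A quick numeric sketch of these back-transformations, with coefficient values made up for illustration:

```python
import math

beta_logit = 0.7                  # hypothetical logit-link coefficient
odds_ratio = math.exp(beta_logit)

# The odds ratio equals the ratio of odds at x+1 vs x, for any baseline eta:
def odds(eta):
    p = 1 / (1 + math.exp(-eta))
    return p / (1 - p)

eta0 = -0.4                       # arbitrary baseline linear predictor
assert math.isclose(odds(eta0 + beta_logit) / odds(eta0), odds_ratio)

beta_log = 0.3                    # hypothetical log-link coefficient
rate_ratio = math.exp(beta_log)   # multiplicative change in E[Y] per unit of X
assert math.isclose(math.exp(2.0 + beta_log) / math.exp(2.0), rate_ratio)
```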

The coefficients table can be saved as a dataset using the Save as Dataset button for export to CSV. The model must be saved first (using the Save Model button), because the coefficient dataset is linked to that specific model: deleting the model also deletes the derived coefficient dataset and any report element that references it, and refitting the model updates the dataset contents to reflect the new fit.

In the saved dataset, column names differ from the UI labels: z value / t value is saved as Test Statistic and Pr(>|z|) / Pr(>|t|) as P-value. The saved dataset also includes a Distribution column (value: normal or t) and a DF column. When Distribution is normal, DF is empty.

Saving and Diagnostics

Saving the Model

Enter a model name in the Model Name field and click Save Model. The model name defaults to the format "GLM: Y ~ X1 + X2 (Family, link)".

If an existing model with the same configuration (dataset, response variable, predictor variables, family, link function) exists, a confirmation dialog for overwriting is displayed.

Data Generated on Save

Saving a model automatically creates a derived dataset that adds diagnostic columns to the original data.

| Column | Symbol | Description |
|---|---|---|
| fitted_values | $\hat\mu_i = g^{-1}(x_i'\hat\beta)$ | Predicted values (on the response scale) |
| deviance_residuals | $d_i$ | Deviance residuals |
| pearson_residuals | $r_i = (y_i - \hat\mu_i) / \sqrt{V(\hat\mu_i)/w_i}$ | Pearson residuals. $w_i$ is the prior weight ($w_i = 1$ for binary data) |
| standardized_residuals | $r_i^* = d_i / \sqrt{\phi(1 - h_i)}$ | Standardized residuals (deviance-based) |
| leverage | $h_i$ | Leverage (diagonal of the hat matrix) |
| cooks_distance | $D_i$ | Cook's Distance |

The $\phi$ used for computing standardized residuals and Cook's Distance differs by family. For Poisson, Binomial, and Negative Binomial, $\phi = 1$: the variance function $V(\mu)$ specifies the theoretical variance, so overdispersion is not absorbed into the diagnostic statistics. For Negative Binomial, $\phi = 1$ regardless of whether $\theta$ is estimated or fixed. For Gaussian, $\phi = \text{Deviance}/(n-p)$; for Gamma, $\phi = \text{Pearson }\chi^2/(n-p)$. This follows R's stats::glm and MASS::glm.nb. Note that for Negative Binomial with fixed $\theta$, this differs from the $\hat\phi = \text{Pearson }\chi^2/(n-p)$ used for standard errors in the coefficients table.
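
A sketch of how the residual and influence columns relate, using the formulas from the table above (names are illustrative):

```python
import math

def pearson_residual(y, mu, v_mu, w=1.0):
    """(y - mu) / sqrt(V(mu) / w); v_mu is the variance function at mu."""
    return (y - mu) / math.sqrt(v_mu / w)

def standardized_residual(d, phi, h):
    """d / sqrt(phi * (1 - h)) for deviance residual d and leverage h."""
    return d / math.sqrt(phi * (1.0 - h))

def cooks_distance(r_std, h, p):
    """(r*^2 / p) * h / (1 - h)."""
    return (r_std ** 2 / p) * h / (1.0 - h)

# Poisson example: y = 3, mu = 2, so V(mu) = mu = 2 and phi = 1.
r = pearson_residual(3.0, 2.0, v_mu=2.0)   # about 0.707
```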

Diagnostics and Details

After saving the model, two buttons appear:

  • View Model Details - Opens the Model Detail tab showing detailed model information
  • View Diagnostics - Opens the GLM Diagnostics tab showing diagnostic plots

Diagnostic Plots

Clicking View Diagnostics displays four diagnostic plots. As with OLS, check linearity, constant variance, and outlier influence.

GLM Diagnostics

Residual Type Selection

Select the residual type: Deviance (default) or Pearson. Switching updates all four plots immediately.

  • Deviance Residuals: $d_i = \operatorname{sign}(y_i - \hat\mu_i) \times \sqrt{2\bigl[\ell(y_i;\,y_i) - \ell(y_i;\,\hat\mu_i)\bigr]}$, where $\ell(y_i;\,y_i)$ is the log-likelihood under the saturated model ($\mu_i = y_i$). Likelihood-based residuals and the default in MIDAS
  • Pearson Residuals: $r_i = (y_i - \hat\mu_i) / \sqrt{V(\hat\mu_i)/w_i}$. $w_i$ is the prior weight ($w_i = 1$ for binary data; $w_i$ is the number of trials for grouped Binomial). Observed-minus-expected scaled by the variance function. Useful for diagnosing overdispersion, as Pearson $\chi^2 = \sum r_i^2$ is used to estimate the dispersion parameter $\phi$
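
For the Poisson family the two residual types take a concrete form, shown in this sketch (the $y\log(y/\mu)$ term is taken as 0 when $y = 0$):

```python
import math

def poisson_deviance_residual(y, mu):
    # Unit deviance 2[y*log(y/mu) - (y - mu)], signed by y - mu.
    term = y * math.log(y / mu) if y > 0 else 0.0
    return math.copysign(math.sqrt(2.0 * (term - (y - mu))), y - mu)

def poisson_pearson_residual(y, mu):
    return (y - mu) / math.sqrt(mu)   # V(mu) = mu, w = 1

# Both residuals vanish at a perfect fit and share the sign of y - mu.
```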

Residuals vs Fitted

Plots residuals against fitted values μ^\hat\mu. Random scatter around zero indicates adequate model fit.

  • Curved pattern: The link function may be inappropriate, or nonlinear effects of predictors may be missing
  • Funnel-shaped pattern: The variance function may be inappropriate (e.g., Poisson's $\operatorname{Var} = \mu$ does not match the data)

Normal Q-Q Plot

Shown only for the Gaussian family. Plots standardized residual quantiles against theoretical normal quantiles.

For non-Gaussian families, deviance residuals are not guaranteed to be asymptotically normal (particularly for binary Binomial data). Instead of the plot, the message "This plot is only shown for Gaussian family GLMs." is displayed.

Scale-Location

Plots $\sqrt{|\text{standardized residuals}|}$ against fitted values. Constant variance is indicated by points spreading evenly in the horizontal direction.

An upward trend suggests variance depends on fitted values. Since GLM explicitly models the mean-variance relationship through the variance function $V(\mu)$, patterns in this plot suggest the chosen family's variance function does not match the data well.

Residuals vs Leverage

Plots standardized residuals against leverage $h_i = \operatorname{diag}(H)_i$ (diagonal elements of the hat matrix). Cook's (1977) distance contours are displayed at $D = 0.5$ (orange dashed) and $D = 1.0$ (red dashed).

  • Leverage: Measures how far an observation's predictor values are from the others. $h_i > 2p/n$ ($p$ = number of parameters, $n$ = number of observations) indicates high leverage
  • Cook's Distance: $D_i = \dfrac{r_i^{*2}}{p} \cdot \dfrac{h_i}{1 - h_i}$. $D_i > 0.5$ warrants attention; $D_i > 1.0$ indicates strong influence

Observations outside the contour lines may substantially change the model estimates if removed.

Point Selection

Click or rectangle-select data points on any plot to display details (fitted values, residuals, leverage, Cook's Distance, etc.) in a table below the plots. Selection state is synchronized across all four plots.

Deviance Goodness-of-Fit

For Poisson and Binomial families, the residual deviance approximately follows a $\chi^2(n-p)$ distribution when the model is correctly specified. The Deviance Goodness-of-Fit chart displays the $\chi^2$ density curve and marks the observed deviance, allowing you to visually assess whether the deviance falls within the bulk of the distribution or in the extreme tail.

A deviance in the right tail suggests the model does not adequately capture the variability in the data. Consider whether important predictors are missing or whether the distributional assumption is appropriate. For Poisson data, switching to the Negative Binomial family may help. See GLM Fundamentals for the theoretical background.

For Binomial models with binary response data (trial size = 1), the assumptions for the $\chi^2$ approximation are not met, so this test is unreliable. Use the other diagnostic plots to assess model fit in that case.
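
The tail probability behind this chart has a closed form when $n - p$ is even, which is enough for a quick sketch (odd degrees of freedom need a numerical routine):

```python
import math

def chi2_sf_even(x, df):
    """P(X > x) for X ~ chi-square(df), df even (closed-form series)."""
    assert df % 2 == 0 and df > 0
    m = df // 2
    return math.exp(-x / 2.0) * sum((x / 2.0) ** k / math.factorial(k)
                                    for k in range(m))

# A residual deviance of 40 on 20 df sits deep in the right tail,
# hinting at lack of fit (e.g. overdispersion for a Poisson model).
p = chi2_sf_even(40.0, 20)
```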

Prediction

Use a saved GLM model to generate predictions on new data.

GLM Prediction

Running Predictions

  1. Open the Model Detail tab via View Model Details
  2. Click the Make Predictions button to open the GLM Prediction tab
  3. Select a dataset for prediction (only datasets with matching predictor column names are available)
  4. Configure output settings:
    • Output Dataset Name: Name for the prediction results dataset
    • Include original data: Whether to include original columns in the output
    • Confidence Interval Levels: Confidence interval levels (90%, 95%, 99%)
    • Prediction Interval Levels: Prediction interval levels (90%, 95%, 99%)
  5. Click Run Prediction to execute

Prediction Output

Prediction results are saved as a dataset containing:

  • Predicted values $\hat\mu = g^{-1}(X\hat\beta)$ (on the response scale)
  • Confidence intervals for the mean response $E[Y \mid X]$, which capture the uncertainty in estimating the population mean at a given set of predictor values
  • Prediction intervals for a new observation $Y_\text{new}$, which capture the uncertainty in a single future value, including observation-level variability
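
Confidence intervals for the mean are commonly formed on the link scale and back-transformed through the inverse link, which a brief sketch makes concrete. This is the standard construction; MIDAS's exact formulas are documented in GLM Fundamentals, and the function name here is illustrative:

```python
import math

def mean_ci(eta_hat, se_eta, crit, inv_link):
    """Interval on the link scale, back-transformed to the response scale."""
    return inv_link(eta_hat - crit * se_eta), inv_link(eta_hat + crit * se_eta)

# Log link: eta = 0.0 +/- 1.96 * 0.5 back-transforms to a CI around mu = 1.
lo, hi = mean_ci(0.0, 0.5, 1.96, math.exp)
```

The back-transformed interval is asymmetric on the response scale but always respects the range of the mean (e.g. $\mu > 0$ for a log link).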

Reference Distribution for Intervals

For families that estimate the dispersion parameter $\phi$ from data (Gaussian with any link, Gamma, Negative Binomial with fixed $\theta$), confidence and prediction intervals use the $t$ distribution with $n - p$ degrees of freedom, where $n$ is the number of training observations and $p$ is the total number of estimated parameters including the intercept. For Gaussian with identity link, this is the exact finite-sample result under the assumption that the errors are normally distributed. For other dispersion-estimating families, the $t$-distribution accounts for the additional uncertainty from estimating $\phi$.

For families with known dispersion ($\phi = 1$: Poisson, Binomial, Negative Binomial with estimated $\theta$), intervals use the standard normal ($z$) distribution as the reference. This is an asymptotic approximation that becomes more accurate as the sample size increases.

The reference distribution and degrees of freedom are displayed in the Prediction tab.

Prediction Interval Methods

Prediction interval computation depends on the family. Gaussian with identity link uses an analytical formula that includes estimation uncertainty, while other combinations use plug-in methods. Plug-in methods do not account for parameter estimation uncertainty, so coverage probability may fall below the stated confidence level in small samples or for extrapolation points. See GLM Fundamentals for the formulas.

When the prediction dataset contains the response variable, accuracy metrics (R², RMSE, MAE) are automatically calculated and displayed.

Notes

Using Categorical Variables

GLM only accepts numeric variables. To use categorical (nominal/ordinal) or date/datetime variables as predictors, convert them to numeric dummy variables using the Dummy Coding tab before running the analysis.

Automatic Exclusion of Missing and Invalid Values

Rows containing missing values (null), non-numeric values, or infinity are automatically excluded from the analysis. The number of excluded rows is displayed in the Model Summary.

Convergence Issues

If IRLS fails to converge, check the following:

  • Iteration count: Increase Max Iterations (e.g., 100 → 500)
  • Tolerance: Relax Convergence Tolerance (e.g., 1e-6 → 1e-4)
  • Scaling: Large differences in predictor scales can cause numerical instability. Consider standardizing
  • Perfect separation: In logistic regression, when a predictor perfectly separates the response classes, the maximum likelihood estimate does not converge to a finite value (Albert & Anderson, 1984). Remove the offending predictor or verify the data
  • Excess zeros: When count data contains an extreme number of zeros, Poisson or Negative Binomial models may struggle to fit adequately

References