
I've been using mnlogit in R to fit a multivariable logistic regression model. My original set of variables produced a singular-matrix error, i.e.

Error in solve.default(hessian, gradient, tol = 1e-24) : 
system is computationally singular: reciprocal condition number = 7.09808e-25

It turns out that several "sparse" columns (variables that are 0 for most sampled individuals) cause this singularity. I need a systematic way of removing the variables that lead to the error while retaining those that allow the model to be estimated: something analogous to using the function step to select variables minimizing AIC via stepwise addition, but here removing variables that make the matrix singular.

Is there some way to do this, since checking each variable by hand (there are several hundred predictor variables) would be incredibly tedious?

1 Answer


If X is the design matrix from your model which you can obtain using

X <- model.matrix(formula, data = data)

then you can find a (non-unique) set of variables that would give you a non-singular model using the QR decomposition. For example,

x <- 1:3
X <- model.matrix(~ x + I(x^2) + I(x^3)) # 4 columns but only 3 rows, so rank 3
QR <- qr(crossprod(X))                 # Pivoted QR decomposition of t(X) %*% X
vars <- QR$pivot[seq_len(QR$rank)]     # Column numbers of a linearly independent set
names <- rownames(QR$qr)[vars]         # Corresponding variable names
names
#> [1] "(Intercept)" "x"           "I(x^2)"
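
Once you have the retained names, you can rebuild a formula from them and refit. A minimal sketch, assuming your predictors live in a data frame called data with the response in a column named y (both are placeholders for your actual objects):

keep <- setdiff(names, "(Intercept)")          # drop the intercept term
f <- reformulate(keep, response = "y")         # builds y ~ x + I(x^2) + ...
fit <- glm(f, data = data, family = binomial)  # or your mnlogit call

Note that for factor predictors the design-matrix column names (e.g. "groupB") are not valid term labels, so you may need to map them back to the original variables first.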

This is subject to numerical error and may not agree with whatever code you are using, for two reasons.

First, it doesn't do any weighting, whereas logistic regression is normally fit by iteratively reweighted least squares (IRLS).
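
If you wanted to mimic the weighting, the matrix that matters in IRLS is t(X) %*% W %*% X, so one option is to fold working weights into the cross-product. A sketch, assuming w is a vector of working weights (e.g. the weights component of a converged glm fit, if you can obtain one):

QRw <- qr(crossprod(X * sqrt(w)))  # t(X) %*% diag(w) %*% X without forming diag(w)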

Second, it might not use the same tolerance as the other code. You can change its sensitivity via the tol argument of qr(), whose default is 1e-07. Larger values will cause more variables to be omitted from names.
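For example, with X as above (whether additional columns get dropped depends on your data):

QR2 <- qr(crossprod(X), tol = 1e-3)              # looser tolerance drops more nearly-collinear columns
rownames(QR2$qr)[QR2$pivot[seq_len(QR2$rank)]]   # names retained at this tolerance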

  • I will give it a try. Alternatively, perhaps I can try filtering according to an (admittedly arbitrary) sparsity criterion, i.e. excluding variables that are nonzero for fewer than some fraction of observations (see the sketch after these comments).
    – Max
    Commented Dec 4, 2019 at 18:52
  • That seems unlikely to identify aliased columns. Unless you set the fraction of non-zero observations really small (e.g. less than 1e-6), sparsity would be more or less unrelated to collinearity.
    Commented Dec 4, 2019 at 19:22
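
For reference, the sparsity filter mentioned in the first comment might look like the following sketch, where X is the design matrix and the 1% cutoff is an arbitrary placeholder:

frac_nonzero <- colMeans(X != 0)                       # fraction of nonzero entries per column
X_filtered <- X[, frac_nonzero >= 0.01, drop = FALSE]  # keep columns above the cutoff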
