Why is it inadvisable to get statistical summary information for regression coefficients from glmnet model?
I have a regression model with binary outcome. I fitted the model with glmnet and got the selected variables and their coefficients.
Since glmnet doesn't calculate variable importance, I would like to feed the exact output (selected variables and their coefficients) to glm to get the information (Standard errors, etc).
I searched r documents, it seems I can use "method" option in glm to specify user defined function. But I failed to do so, could someone help me with this?
"It is a very natural question to ask for standard errors of regression coefficients or other estimated quantities. In principle such standard errors can easily be calculated, e.g. using the bootstrap.
Still, this package deliberately does not provide them. The reason for this is that standard errors are not very meaningful for strongly biased estimates such as arise from penalized estimation methods. Penalized estimation is a procedure that reduces the variance of estimators by introducing substantial bias. The bias of each estimator is therefore a major component of its mean squared error, whereas its variance may contribute only a small part.
Unfortunately, in most applications of penalized regression it is impossible to obtain a sufficiently precise estimate of the bias. Any bootstrap-based calculations can only give an assessment of the variance of the estimates. Reliable estimates of the bias are only available if reliable unbiased estimates are available, which is typically not the case in situations in which penalized estimates are used.
Reporting a standard error of a penalized estimate therefore tells only part of the story. It can give a mistaken impression of great precision, completely ignoring the inaccuracy caused by the bias. It is certainly a mistake to make confidence statements that are only based on an assessment of the variance of the estimates, such as bootstrap-based confidence intervals do."
There is CRAN packages hdi and selectiveInference which provide inference for high-dimensional models, you might want to take a look at those... I've also seen people just run a glm using the predictors selected by glmnet, but this doesn't take into account the uncertainty produced by the selection process of the best model itself...