# Predicting/imputing the missing values of a Poisson GLM Regression in R?

I'm exploring ways of imputing missing values in a data set. My data contain counts of an occurrence (Unnatural, Natural, and their sum, Total) by Year (2001-2009), Month (1-12), Gender (M/F) and AgeGroup (4 groups).

One of the imputation techniques I'm exploring is (Poisson) regression imputation.

Say my data looks like this:

        Year Month Gender AgeGroup Unnatural Natural Total
    569 2006     5   Male     15up       278     820  1098
    570 2006     6   Male     15up       273     851  1124
    571 2006     7   Male     15up       304     933  1237
    572 2006     8   Male     15up       296    1064  1360
    573 2006     9   Male     15up       298     899  1197
    574 2006    10   Male     15up       271     819  1090
    575 2006    11   Male     15up       251     764  1015
    576 2006    12   Male     15up       345     792  1137
    577 2007     1 Female        0        NA      NA    NA
    578 2007     2 Female        0        NA      NA    NA
    579 2007     3 Female        0        NA      NA    NA
    580 2007     4 Female        0        NA      NA    NA
    581 2007     5 Female        0        NA      NA    NA
    ...

After fitting a basic GLM, 96 observations were dropped because their values are missing.

Is there a package or function in R that will use the coefficients of this GLM to 'predict' (i.e. impute) the missing values of Total (even if it just stores them in a separate data frame; I will use Excel to merge them)? I know I can apply the coefficients by hand to each combination of Year, Month, Gender and AgeGroup, but that will take forever. Hopefully there's a one-step function or method?

    Call:
    glm(formula = Total ~ Year + Month + Gender + AgeGroup, family = poisson)

    Deviance Residuals:
          Min         1Q     Median         3Q        Max
    -13.85467   -1.13541   -0.04279    1.07133   10.33728

    Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
    (Intercept)   13.3433865  1.7541626   7.607 2.81e-14 ***
    Year          -0.0047630  0.0008750  -5.443 5.23e-08 ***
    Month          0.0134598  0.0006671  20.178  < 2e-16 ***
    GenderMale     0.2265806  0.0046320  48.916  < 2e-16 ***
    AgeGroup01-4  -1.4608048  0.0224708 -65.009  < 2e-16 ***
    AgeGroup05-14 -1.7247276  0.0250743 -68.785  < 2e-16 ***
    AgeGroup15up   2.8062812  0.0100424 279.444  < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for poisson family taken to be 1)

        Null deviance: 403283.7  on 767  degrees of freedom
    Residual deviance:   4588.5  on 761  degrees of freedom
      (96 observations deleted due to missingness)
    AIC: 8986.8

    Number of Fisher Scoring iterations: 4

## Answers

First, be very careful about the assumption of missing at random. In your example, missingness appears to co-occur with Gender (Female) and AgeGroup. You should really test whether missingness is related to any predictors (and check whether any predictors are themselves missing). If it is, the imputed responses could be biased.
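One rough way to check this is to cross-tabulate a missingness indicator against each predictor. The sketch below uses simulated data, not the question's: the counts are made up, and the missing block (Female, age group 0, from 2002 onwards) is only an assumption suggested by the excerpt. The names `df`, `gap` and `miss` are illustrative.

```r
# Simulated stand-in for the question's data (values are made up).
set.seed(2)
df <- expand.grid(
  Year     = 2001:2009,
  Month    = 1:12,
  Gender   = c("Female", "Male"),
  AgeGroup = c("0", "01-4", "05-14", "15up")
)
df$Total <- rpois(nrow(df), lambda = 100)

# Assumed pattern: one Gender x AgeGroup cell is missing from 2002 onwards.
gap <- df$Gender == "Female" & df$AgeGroup == "0" & df$Year >= 2002
df$Total[gap] <- NA

# Indicator of missingness, cross-tabulated against each predictor.
df$miss <- is.na(df$Total)
tab_gender <- table(Missing = df$miss, Gender = df$Gender)
tab_age    <- table(Missing = df$miss, AgeGroup = df$AgeGroup)
print(tab_gender)
print(tab_age)

# Chi-squared test of independence; a small p-value suggests that
# missingness depends on that predictor.
p_gender <- chisq.test(tab_gender)$p.value
p_age    <- chisq.test(tab_age)$p.value
```

If a whole cell of the design is missing, as assumed here, the tables make that obvious before any test does.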

Second, the function you are looking for is most likely predict, which accepts a fitted glm object. See ?predict.glm for more guidance; note that it returns values on the link (log) scale unless you ask for type = "response". You may want to fit a cascade of models (i.e. nested models) to address missing values.
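A minimal sketch of that approach, assuming a data frame laid out like the one in the question. The data here are simulated, the 96 missing rows are chosen at random purely for illustration, and `Total_imputed` is a made-up column name.

```r
# Simulated stand-in for the question's data (values are made up).
set.seed(1)
df <- expand.grid(
  Year     = 2001:2009,
  Month    = 1:12,
  Gender   = c("Female", "Male"),
  AgeGroup = c("0", "01-4", "05-14", "15up")
)
mu <- exp(5 + 0.2 * (df$Gender == "Male") + 2.8 * (df$AgeGroup == "15up"))
df$Total <- rpois(nrow(df), mu)

# Blank out 96 counts at random to mimic the missing observations.
df$Total[sample(nrow(df), 96)] <- NA

# glm() drops rows with a missing response by default.
fit <- glm(Total ~ Year + Month + Gender + AgeGroup,
           family = poisson, data = df)

# Predict expected counts (response scale) for the rows with missing Total
# and store them in a separate column next to the observed values.
miss <- is.na(df$Total)
df$Total_imputed <- df$Total
df$Total_imputed[miss] <- predict(fit, newdata = df[miss, ],
                                  type = "response")
```

The predictions are expected counts, so they are generally not integers. Rounding them, or drawing from `rpois()` with those means, are two options; either way a single regression fill-in understates the uncertainty in the imputed values.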

The mice package provides a function of the same name that allows each missing value to be predicted using a regression scheme based on the other values. It can cope with predictors also being missing because it uses an iterative MCMC algorithm.

I don't think Poisson regression is an option there, but if all of your counts are as large as in the example, normal regression should offer a reasonable approximation.
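A hedged sketch of what that could look like with mice, again on simulated data rather than the question's. It assumes the mice package is installed and uses its "norm" method (Bayesian linear regression) for the normal approximation mentioned above; the number of imputations and the seed are arbitrary choices.

```r
library(mice)

# Simulated stand-in for the question's data (values are made up).
set.seed(3)
df <- expand.grid(
  Year     = 2001:2009,
  Month    = 1:12,
  Gender   = c("Female", "Male"),
  AgeGroup = c("0", "01-4", "05-14", "15up")
)
mu <- exp(5 + 0.2 * (df$Gender == "Male") + 2.8 * (df$AgeGroup == "15up"))
df$Total <- rpois(nrow(df), mu)
df$Total[sample(nrow(df), 96)] <- NA

# Five imputed data sets; Total is imputed from the other columns.
imp <- mice(df, m = 5, method = "norm", seed = 3, printFlag = FALSE)

# Extract the first completed data set.
completed <- complete(imp, 1)
```

Because "norm" treats Total as continuous, the imputed values are not integers and are not guaranteed to be non-negative, which is why the approximation is only reasonable when the counts are large.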