# Determine missing proportions for a set of variables in a data frame by a grouping

I need help finding a succinct way to determine missing proportions for a set of variables in a data frame by a grouping. Consider, for example, the Soybean data in the mlbench package.

    data(Soybean, package = "mlbench")

I would like to compute the proportion missing for each of the variables (columns 2 to 36) for each value of Soybean$Class.

Ideally the output would look something like the following (the numbers are not real):

    Class                 date  plant.stand  precip  ...
    2-4-d-injury           0.0          5.1    19.4
    alternarialeaf-spot   12.5          2.3     1.2
    anthracnose            1.4          0.0    11.2
    bacterial-blight       0.3          0.0     0.5
    ...

I have tried the following:

    myf <- function(df) {
      apply(df, 2, function(x) sum(is.na(x)) / nrow(df) * 100)
    }
    by(Soybean, Soybean$Class, function(y) myf(y))

But (i) I don't want to divide by the total number of rows of the data frame, i.e. nrow(df) is the wrong denominator; and (ii) the output is difficult to digest.

It seems like this is a simple thing to do, and I am afraid I am missing something obvious. I am relatively new to R, and I appreciate any help.

## Answers

This is fairly straightforward sapply and tapply fodder.

Take this simple example:

    dat <- data.frame(
      Class = rep(letters[1:3], each = 2),
      var1 = c(1, 2, 3, NA, 4, NA),
      var2 = c(NA, NA, 1, 2, NA, 3)
    )
    #  Class var1 var2
    #1     a    1   NA
    #2     a    2   NA
    #3     b    3    1
    #4     b   NA    2
    #5     c    4   NA
    #6     c   NA    3

Then try this:

    sapply(dat[-1], function(x) {
      tapply(x, dat$Class, FUN = function(y) sum(is.na(y)) / length(y) * 100)
    })

Result:

    #  var1 var2
    #a    0  100
    #b   50    0
    #c   50   50
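If a data frame with Class as its own column is easier to digest than a matrix, the same calculation can be sketched with base R's `aggregate()`. This is only an illustration on the toy data above: `pct_missing` and `res` are names made up for the example, and it relies on the fact that `mean()` of a logical vector gives the proportion of `TRUE` values.

```r
# Toy data from the example above
dat <- data.frame(
  Class = rep(letters[1:3], each = 2),
  var1  = c(1, 2, 3, NA, 4, NA),
  var2  = c(NA, NA, 1, 2, NA, 3)
)

# Hypothetical helper: percentage of NA values in a vector
pct_missing <- function(x) 100 * mean(is.na(x))

# Data frame method of aggregate(): one row per Class, one column per variable
res <- aggregate(dat[-1], by = list(Class = dat$Class), FUN = pct_missing)
res
```

The data frame method is used here on purpose: the formula interface of `aggregate()` drops rows containing NA by default (`na.action = na.omit`), which would hide exactly the values being counted.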

This should work:

    library(dplyr)

    # percentage of missing values in a vector
    pmiss <- function(x) 100 * sum(is.na(x)) / length(x)

    # group by Class, then summarise the chosen columns
    Soybean %>%
      group_by(Class) %>%
      summarise(
        date = pmiss(date),
        plant.stand = pmiss(plant.stand)
      )
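Naming every column by hand gets tedious with 35 variables. A possible sketch, assuming dplyr 1.0 or later (where `across()` is available) and using a small made-up data frame `dat` rather than Soybean:

```r
library(dplyr)

# Toy data standing in for Soybean (hypothetical)
dat <- data.frame(
  Class = rep(letters[1:3], each = 2),
  var1  = c(1, 2, 3, NA, 4, NA),
  var2  = c(NA, NA, 1, 2, NA, 3)
)

# percentage of missing values in a vector
pmiss <- function(x) 100 * sum(is.na(x)) / length(x)

# across(everything(), ...) applies pmiss to every non-grouping column
res <- dat %>%
  group_by(Class) %>%
  summarise(across(everything(), pmiss))
res
```

The grouping column is left out of `everything()` inside `summarise()`, so Class is kept as the row label and is not passed to `pmiss`.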

Using data.table, you can apply the pmiss function (defined above) to all columns:

    library(data.table)
    DT <- data.table(Soybean)
    DT[, lapply(.SD, pmiss), by = Class]