Determine missing proportions for a set of variables in a data frame by a grouping

I need help finding a succinct way to determine the proportion of missing values for a set of variables in a data frame, by group. Consider, for example, the Soybean data in the package mlbench.

data(Soybean, package="mlbench")

I would like to compute the proportion of missing values in each of the variables (columns 2 to 36) for each value of Soybean$Class.

Ideally the output would look something like the following (the numbers are not real):

Class                   date    plant.stand       precip    ...
2-4-d-injury             0.0            5.1         19.4
alternarialeaf-spot     12.5            2.3          1.2
anthracnose              1.4            0.0         11.2
bacterial-blight         0.3            0.0          0.5  
...  

I have tried the following:

myf <- function(df) {
  apply(df, 2, function(x) sum(is.na(x)) / nrow(df) * 100)
}   

by(Soybean, Soybean$Class, function(y) myf(y))

But (i) I don't want to divide by the total number of rows of the data frame, i.e. nrow(df) is incorrect; and (ii) the output is difficult to digest.

It seems like this is a simple thing to do, and I am afraid I am missing something obvious. I am relatively new to R, and I appreciate any help.

Answers


This is fairly straightforward sapply and tapply fodder.

Take this simple example:

dat <- data.frame(
 Class=rep(letters[1:3],each=2),
 var1=c(1,2,3,NA,4,NA),
 var2=c(NA,NA,1,2,NA,3)
)

#  Class var1 var2
#1     a    1   NA
#2     a    2   NA
#3     b    3    1
#4     b   NA    2
#5     c    4   NA
#6     c   NA    3

Then try this:

sapply(
 dat[-1],
 function(x) {
  tapply(x,dat$Class,FUN=function(y) sum(is.na(y))/length(y) * 100 )
 }
)

Result:

#  var1 var2
#a    0  100
#b   50    0
#c   50   50
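
For comparison, here is a minimal base-R sketch of the same idea using split() and colMeans() instead of sapply() and tapply(). It is not part of the answer above; the name pct_missing is made up for illustration, and it relies on the fact that the mean of a logical vector is the proportion of TRUE values.

```r
# Toy data from the example above
dat <- data.frame(
  Class = rep(letters[1:3], each = 2),
  var1 = c(1, 2, 3, NA, 4, NA),
  var2 = c(NA, NA, 1, 2, NA, 3)
)

# Split the non-grouping columns by Class, then take the column-wise
# mean of is.na() within each piece (mean of a logical = proportion TRUE)
pct_missing <- t(sapply(
  split(dat[-1], dat$Class),
  function(g) colMeans(is.na(g)) * 100
))

pct_missing
```

The result is a numeric matrix with one row per class and one column per variable, so it can be passed to round() or as.data.frame() if a tidier display is wanted.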

This should work:

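library(dplyr)

# Percentage of missing values in a vector
pmiss <- function(x) 100 * sum(is.na(x)) / length(x)

# One summary row per Class, one named column per variable
Soybean %>%
  group_by(Class) %>%
  summarise(
    date = pmiss(date),
    plant.stand = pmiss(plant.stand)
  )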
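
Listing every column by hand gets tedious with 35 variables. As a hedged sketch, assuming dplyr 1.0.0 or later is installed, across() can apply pmiss to every non-grouping column. The toy data frame dat and the name res are only for illustration and are not part of the original answer.

```r
library(dplyr)

# Small stand-in for Soybean so the example runs without mlbench
dat <- data.frame(
  Class = rep(letters[1:3], each = 2),
  var1 = c(1, 2, 3, NA, 4, NA),
  var2 = c(NA, NA, 1, 2, NA, 3)
)

# Percentage of missing values in a vector
pmiss <- function(x) 100 * sum(is.na(x)) / length(x)

# everything() inside across() skips the grouping column (Class)
res <- dat %>%
  group_by(Class) %>%
  summarise(across(everything(), pmiss))

res
```

With the real data, the same pipeline should work by replacing dat with Soybean.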

Using data.table, you can apply the pmiss function to all columns at once:

library(data.table)
DT <- data.table(Soybean)
DT[, lapply(.SD, pmiss), by = Class] 
