Filtering rows with specific characteristics relative to other rows
I have a dataset that consists of two columns: one consists of a list of names and the other identifies a group to which the corresponding name belongs (example data and code for reproducing example data below).
group name 1 aaa 1 aaa 1 aaa 2 NA 2 NA 3 NA 3 bbb 4 ccc 4 ccc 5 ddd 5 eee 6 fff 6 ffg > x <- c(1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6) > y <- c("aaa", "aaa", "aaa", NA, NA, NA, "bbb", "ccc", "ccc", "ddd", "eee", "fff", "ffg") > mydataframe <- data.frame( "group" = x, "name" = y)
The dataset consists of more than 500,000 rows (as well as other columns which are irrelevant to this question). Groups may include may have one member (one row) or many members (multiple rows). When a group has multiple members, the names of the members may be: a) the same (e.g., group 1 where all members are named "aaa"); b) blank (e.g., group 2 where all members are NA); c) a mix of names and blanks (e.g., group 3 that includes one blank name and one name "bbb"); or d) different names (e.g., group 5 that has "ddd" and "eee"). When there are different names they may be very different (e.g., group 5) or similar (group 6)
I need to filter the dataset to identify rows that fall into category d - those rows that have the same group number but different (non-NA) names (in the example below, groups 5 and 6).
The filter would produce a new dataset that looks looks like this:
group name 5 ddd 5 eee 6 fff 6 ffg
Bonus points if someone can help me not only produce the result above, but also distinguish between groups that have members with totally different names (group 5) and groups that have members with similar, but different names (group 6).
Thank you in advance!
Here's another data.table approach:
require(data.table) udt <- data.table(na.omit(unique(mydataframe))) multrows <- udt[,.I[.N>1],by=group]$V1 udt[multrows]
How it works: .N is the number of rows in the group, equal to the number of names; and .I is the row number in udt. So, the new variable .I[.N>1] is either (i) empty if there is only one name (.N==1) or (ii) the row numbers for the group if there are multiple names.
The new variable is named V1 by default, so $V1 extracts it. And we can subset udt by selecting these relevant row numbers.
I copied this method from an answer by @eddi. The two steps can also be combined: udt[udt[,.I[.N>1],by=group]$V1].
Here's a possible data.table v >= 1.9.5 solution
library(data.table) setDT(mydataframe)[, .SD[uniqueN(na.omit(name)) == .N], by = group] # group name # 1: 5 ddd # 2: 5 eee # 3: 6 fff # 4: 6 ffg
Basically we are comparing the number of unique groups (while NAs removed) to the actual group size (.N) and subsetting it out of .SD (which stands for Sub Data) which is our actual data set subseted by condition.
Alternatively with dplyr where we using n_distinct and n() instead (for some reason this solution seem not to work without converting to data.table object first using setDT(mydataframe), so if you haven't done so yet, do it before running this code)
library(dplyr) mydataframe %>% group_by(group) %>% filter(n_distinct(na.omit(name)) == n()) # Source: local data table [4 x 2] # Groups: group # # group name # 1 5 ddd # 2 5 eee # 3 6 fff # 4 6 ffg
For the bonus points you could also try agrep (you can set the max.distance as you wish- 1 is the default). 1- means similar, NA- means no similarity found.
mydataframe %>% group_by(group) %>% filter(n_distinct(na.omit(name)) == n()) %>% mutate(similarity = agrep(name[1L], name[-1L])) # Source: local data table [4 x 3] # # group name similarity # 1 5 ddd NA # 2 5 eee NA # 3 6 fff 1 # 4 6 ffg 1