Counting combinations without destroying type

I wonder whether someone has an idea for how to count combinations like the following in a better way than I've thought of.

> library(lubridate)
> df <- data.frame(x=sample(now()+hours(1:3), 100, TRUE), y=sample(1:4, 100, TRUE))
> with(df, as.data.frame(table(x, y)))
                     x y Freq
1  2012-06-15 00:10:18 1    5
2  2012-06-15 01:10:18 1    9
3  2012-06-15 02:10:18 1    8
4  2012-06-15 00:10:18 2    9
5  2012-06-15 01:10:18 2   10
6  2012-06-15 02:10:18 2   12
7  2012-06-15 00:10:18 3    7
8  2012-06-15 01:10:18 3    9
9  2012-06-15 02:10:18 3    6
10 2012-06-15 00:10:18 4    5
11 2012-06-15 01:10:18 4   14
12 2012-06-15 02:10:18 4    6

I like that format, but running x and y through table() converts them to factors. They could sit quite happily in the final output as their original types, but getting them back is awkward. Currently I fix all the types manually afterward, which is messy: I have to re-set the time zone, look up the percent-codes for the default date format, and so on.

It seems like an efficient solution would involve hashing the objects, or otherwise mapping integers to the unique values of x and y so we can use tabulate(), then mapping back.
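
A minimal base-R sketch of that idea, for illustration only: match() maps each value to an integer code, tabulate() counts the combined codes, and the codes are then used to index back into the unique values, so nothing is ever coerced to a factor. The variable names (ux, uy, ix, iy, cell, counts), the fixed start time, and the seed are assumptions made for the example, not part of the original data.

```r
# Sketch: count (x, y) combinations via integer codes and tabulate(),
# then map the codes back to the original values so their classes survive.
set.seed(1)  # assumed seed, only so the example is reproducible
start <- as.POSIXct("2012-06-15 00:10:18", tz = "UTC")  # assumed start time
df <- data.frame(
  x = sample(start + 3600 * (0:2), 100, TRUE),
  y = sample(1:4, 100, TRUE)
)

ux <- sort(unique(df$x))   # unique values keep their POSIXct class
uy <- sort(unique(df$y))
ix <- match(df$x, ux)      # integer code for each x
iy <- match(df$y, uy)      # integer code for each y

# Combine the two codes into one cell index (x varies fastest)
cell <- ix + (iy - 1L) * length(ux)
freq <- tabulate(cell, nbins = length(ux) * length(uy))

# Map the cell indices back to the original values
counts <- data.frame(
  x = ux[rep(seq_along(ux), times = length(uy))],
  y = uy[rep(seq_along(uy), each = length(ux))],
  Freq = freq
)
str(counts)
```

Combinations that never occur get a count of 0, as with table(), because nbins covers every cell of the grid.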

Ideas?

Answers


Here's a data.table version that preserves the column classes:

library(data.table)

dt <- data.table(df, key=c("x", "y"))
dt[, .N, by=key(dt)]
#                       x y  N
#  1: 2012-06-14 18:10:22 1  8
#  2: 2012-06-14 18:10:22 2 10
#  3: 2012-06-14 18:10:22 3  8
#  4: 2012-06-14 18:10:22 4  8
#  5: 2012-06-14 19:10:22 1  6
#  6: 2012-06-14 19:10:22 2  8
#  7: 2012-06-14 19:10:22 3  6
#  8: 2012-06-14 19:10:22 4  6
#  9: 2012-06-14 20:10:22 1 15
# 10: 2012-06-14 20:10:22 2  5
# 11: 2012-06-14 20:10:22 3 12
# 12: 2012-06-14 20:10:22 4  8

str(dt[, .N, by=key(dt)])
# Classes ‘data.table’ and 'data.frame':  12 obs. of  3 variables:
#  $ x: POSIXct, format: "2012-06-14 18:10:22" "2012-06-14 18:10:22" ...
#  $ y: int  1 2 3 4 1 2 3 4 1 2 ...
#  $ N: int  8 10 8 8 6 8 6 6 15 5 ...

Edit in response to follow-up question

To count the appearances of every possible combination of the observed values of x and y (including combinations that don't occur in the data), you can do something like the following:

dt <- dt[1:30, ]  # Make subset of dt in which some combinations don't appear

ii <- do.call("CJ", lapply(dt, unique))  # CJ() is similar to expand.grid()
# In data.table 1.9.4 and later, by=.EACHI is needed to get one count per row of ii;
# without it, dt[ii, .N] returns a single total.
dt[ii, .N, by=.EACHI]
#                      x y N
# 1: 2012-06-14 22:53:05 1 8
# 2: 2012-06-14 22:53:05 2 7
# 3: 2012-06-14 22:53:05 3 9
# 4: 2012-06-14 22:53:05 4 5
# 5: 2012-06-14 23:53:05 1 1
# 6: 2012-06-14 23:53:05 2 0
# 7: 2012-06-14 23:53:05 3 0
# 8: 2012-06-14 23:53:05 4 0

You can use ddply from plyr:

library(plyr)

ddply(df, .(x, y), summarize, Freq = length(y))

If you want it arranged by y, then x:

ddply(df, .(y, x), summarize, Freq = length(y))

or, if you want to keep the x, y column order while sorting the rows by y:

arrange(ddply(df, .(x, y), summarize, Freq = length(y)), y)
