R: Aggregate character strings with c

I have a data frame with two columns: one is strings, the other one is integers.

> rnames = sapply(1:20, FUN=function(x) paste("item", x, sep="."))
> x <- sample(c(1:5), 20, replace = TRUE)
> df <- data.frame(x, rnames)
> df
   x  rnames
1  5  item.1
2  3  item.2
3  5  item.3
4  3  item.4
5  1  item.5
6  3  item.6
7  4  item.7
8  5  item.8
9  4  item.9
10 5 item.10
11 5 item.11
12 2 item.12
13 2 item.13
14 1 item.14
15 3 item.15
16 4 item.16
17 5 item.17
18 4 item.18
19 1 item.19
20 1 item.20

I'm trying to aggregate the strings into lists or vectors of strings (characters) with the 'c' or the 'list' function, but I'm getting weird results:

> aggregate(rnames ~ x, df, c)
  x             rnames
1 1      16, 6, 11, 13
2 2               4, 5
3 3      12, 15, 17, 7
4 4      18, 20, 8, 10
5 5 1, 14, 19, 2, 3, 9

When I use 'paste' instead of 'c', I can see that the aggregate is working correctly - but the result is not what I'm looking for.

> aggregate(rnames ~ x, df, paste)
  x                                            rnames
1 1                 item.5, item.14, item.19, item.20
2 2                                  item.12, item.13
3 3                   item.2, item.4, item.6, item.15
4 4                  item.7, item.9, item.16, item.18
5 5 item.1, item.3, item.8, item.10, item.11, item.17

What I'm looking for is for every aggregated group to be presented as a vector or a list (hence the use of c), as opposed to the single string I'm getting with 'paste'. Something along the lines of the following (which in reality doesn't work):

> aggregate(rnames ~ x, df, c)
  x                                            rnames
1 1                 item.5, item.14, item.19, item.20
2 2                                  item.12, item.13
3 3                   item.2, item.4, item.6, item.15
4 4                  item.7, item.9, item.16, item.18
5 5 item.1, item.3, item.8, item.10, item.11, item.17

Any help would be appreciated.

Answers


You fell into the usual data.frame trap: your character column is not a character column, it is a factor column! Hence the numbers instead of the characters in your result:

> rnames = sapply(1:20, FUN=function(x) paste("item", x, sep="."))
> x <- sample(c(1:5), 20, replace = TRUE)
> df <- data.frame(x, rnames)
> str(df)
'data.frame':   20 obs. of  2 variables:
 $ x     : int  2 5 5 5 5 4 3 3 2 4 ...
 $ rnames: Factor w/ 20 levels "item.1","item.10",..: 1 12 14 15 16 17 18 19 20 2 ...
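
To see where the numbers come from, here is a minimal sketch that builds the factor explicitly with factor(), so it behaves the same whether or not your R version converts strings automatically (automatic conversion was the data.frame default before R 4.0.0). The name f is only illustrative and does not appear in the question:

```r
# Same names as in the question, built with a vectorised paste().
rnames <- paste("item", 1:20, sep = ".")

# Explicit factor; 'f' is an illustrative name, not from the question.
f <- factor(rnames)

# Levels are sorted as strings, so "item.10" comes before "item.2".
head(levels(f), 3)

# The integer codes are positions in levels(f). Code that drops the
# factor class works on these codes, which is what showed up in the output.
head(as.integer(f), 3)

# Converting back recovers the original strings.
head(as.character(f), 3)
```

So the numbers in the aggregate output are level positions, not item numbers.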

To prevent the conversion to factors, use the argument stringsAsFactors=FALSE in your call to data.frame:

> df <- data.frame(x, rnames,stringsAsFactors=FALSE)
> str(df)
'data.frame':   20 obs. of  2 variables:
 $ x     : int  5 5 3 5 5 3 2 5 1 5 ...
 $ rnames: chr  "item.1" "item.2" "item.3" "item.4" ...
> aggregate(rnames ~ x, df, c)
  x                                                                              rnames
1 1                                                            item.9, item.13, item.17
2 2                                                                              item.7
3 3                                                             item.3, item.6, item.19
4 4                                                           item.12, item.15, item.16
5 5 item.1, item.2, item.4, item.5, item.8, item.10, item.11, item.14, item.18, item.20
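
Although the printed result looks like comma-separated text, the rnames column here should be a list column holding one character vector per group, as long as the groups differ in size (when every group has the same size, aggregate simplifies the result to a matrix instead). A small sketch with fixed data, so the groups are predictable; df_fixed and res are illustrative names:

```r
# Fixed data with groups of size 2, 1 and 3 (illustrative, not the
# random sample from the question).
df_fixed <- data.frame(
  x = c(1, 1, 2, 3, 3, 3),
  rnames = paste("item", 1:6, sep = "."),
  stringsAsFactors = FALSE
)

res <- aggregate(rnames ~ x, df_fixed, c)

# The aggregated column is a list, one element per value of x.
is.list(res$rnames)

# Each element is an ordinary character vector.
res$rnames[[3]]
length(res$rnames[[1]])
```

Each group can then be pulled out with [[ ]] like any other list element.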

Another way to avoid the conversion to factor is the function I():

> df <- data.frame(x, I(rnames))
> str(df)
'data.frame':   20 obs. of  2 variables:
 $ x     : int  3 5 4 5 4 5 3 3 1 1 ...
 $ rnames:Class 'AsIs'  chr [1:20] "item.1" "item.2" "item.3" "item.4" ...

Excerpt from ?I:

In function data.frame. Protecting an object by enclosing it in I() in a call to data.frame inhibits the conversion of character vectors to factors and the dropping of names, and ensures that matrices are inserted as single columns. I can also be used to protect objects which are to be added to a data frame, or converted to a data frame via as.data.frame.

It achieves this by prepending the class "AsIs" to the object's classes. Class "AsIs" has a few of its own methods, including for [, as.data.frame, print and format.


I'm not sure exactly what it is that you are looking for... so perhaps some reference output would be good to give us an idea of what we are aiming at?

But, since your last bit of code seems to be close to what you are after, maybe a solution like the following would work:

> library(plyr)
> ddply(df, .(x), summarize, rnames = paste(rnames, collapse = "|"))
  x                                         rnames
1 1                         item.9|item.11|item.20
2 2                  item.1|item.2|item.15|item.16
3 3                                  item.7|item.8
4 4           item.4|item.5|item.6|item.12|item.13
5 5 item.3|item.10|item.14|item.17|item.18|item.19

You can vary how the individual elements are stuck together by changing the collapse argument to paste().
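
If you would rather not load plyr, roughly the same result can be had from base R, because aggregate passes extra arguments on to the function it applies. This is a sketch on made-up data, assuming rnames is a character column; df_small and res are illustrative names:

```r
# Small fixed data set (illustrative, not the sample from the question).
df_small <- data.frame(
  x = c(1, 1, 2, 3, 3, 3),
  rnames = paste("item", 1:6, sep = "."),
  stringsAsFactors = FALSE
)

# 'collapse' is forwarded to paste(), so each group becomes one string.
res <- aggregate(rnames ~ x, df_small, paste, collapse = "|")
res$rnames
```

Here res$rnames is a plain character vector with one collapsed string per group.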

Alternatively, if you just want each of the groups as a vector, then you could use this:

> df$rnames = as.character(df$rnames)
> L = dlply(df, .(x), function(df) {df$rnames})
> L
$`1`
[1] "item.9"  "item.11" "item.20"

$`2`
[1] "item.1"  "item.2"  "item.15" "item.16"

$`3`
[1] "item.7" "item.8"

$`4`
[1] "item.4"  "item.5"  "item.6"  "item.12" "item.13"

$`5`
[1] "item.3"  "item.10" "item.14" "item.17" "item.18" "item.19"

attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
  x
1 1
2 2
3 3
4 4
5 5

This gives you a list of vectors, which is what you were after. And each group can be indexed out of the resulting list:

> L[[1]]
[1] "item.9"  "item.11" "item.20"
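
A base R alternative to dlply, in case plyr is not available, is split(), which also returns a named list of vectors but without the extra split_type and split_labels attributes. A minimal sketch on made-up data; df_demo and groups are illustrative names:

```r
# Small fixed data set (illustrative, not the sample from the question).
df_demo <- data.frame(
  x = c(1, 1, 2, 3, 3, 3),
  rnames = paste("item", 1:6, sep = "."),
  stringsAsFactors = FALSE
)

# as.character() guards against rnames being a factor.
groups <- split(as.character(df_demo$rnames), df_demo$x)

# The list is named by the values of x.
names(groups)

# Index a group by name or by position.
groups[["3"]]
groups[[1]]
```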
