Why does changing a column name take an extremely long time with a large data.frame?
I have a data.frame in R with 19 million rows and 90 columns. I have plenty of spare RAM and CPU cycles. It seems that changing a single column name in this data frame is a very intense operation for R.
    system.time(colnames(my.df) <- "foo")
       user  system elapsed
     356.88   16.54  373.39
Why is this so? Does every row store the column name somehow? Is this creating an entirely new data frame? It seems this operation should complete in negligible time. I don't see anything obvious in the R manual entry.
I'm running build 7600 of R (64bit) on Windows 7, and in my current workspace, setting colnames on a small data.frame takes '0' time according to system.time().
Edit: I'm aware of the possibility of using data.table, and, honestly, I can wait 5 minutes for the rename to complete whilst I go get some tea. What I'm interested in is what is happening and why?
As several commenters have mentioned, renaming data frame columns is slow, because (depending on how you do it) it makes between 1 and 4 copies of the entire data.frame. Here, from data.table's ?setkey help page, is the nicest way of demonstrating this behavior that I've seen:
    DF = data.frame(a=1:2,b=3:4)       # base data.frame to demo copies
    try(tracemem(DF))                  # try() for non-Windows where R is
                                       # faster without memory profiling
    colnames(DF) <- "A"                # 4 copies of entire object
    names(DF) <- "A"                   # 3 copies of entire object
    names(DF) <- c("A", "b")           # 1 copy of entire object
    `names<-`(DF,c("A","b"))           # 1 copy of entire object
    x=`names<-`(DF,c("A","b"))         # still 1 copy (so not print method)
    # What if DF is large, say 10GB in RAM. Copy 10GB just to change a column name?
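To watch the copying on something closer to a realistic size, here is a minimal sketch along the same lines. The object name `big` and the row count are arbitrary choices for illustration, and `tracemem()` only works if your R build has memory profiling enabled, hence the `capabilities("profmem")` guard. How many `tracemem[...]` lines you see depends on your R version and on how the assignment is written, so treat the counts quoted above as a snapshot rather than a guarantee.

```r
# Illustrative data.frame; the size is arbitrary, just big enough to notice
n <- 1e6
big <- data.frame(a = seq_len(n), b = rnorm(n), c = runif(n))

# tracemem() needs an R build with memory profiling support
if (capabilities("profmem")) invisible(tracemem(big))

# Every "tracemem[... -> ...]" line printed here marks one duplication
names(big)[1] <- "A"
colnames(big)[2] <- "B"

# Stop tracing and confirm that only the names changed
if (capabilities("profmem")) untracemem(big)
names(big)
```

Wrapping either assignment in `system.time()` gives a rough idea of how the cost grows as `n` does.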
To start understanding why things are done this way, you'll probably need to delve into some of the related discussions on R-devel. Here are a couple of threads: "R-devel: speeding up perception" and "R-devel: Confused about NAMES".
My impressionistic reading of those threads is that:
At least one copy is made so that modifications to it can be 'tried out' before overwriting the original. Thus, if something is wrong with the value-to-be-reassigned, [<-.data.frame or names<- can 'back out' and deliver an error message without having done any damage to the original object.
Several members of R-core aren't completely satisfied with how things are working right now. Several folks explain that in some cases "R loses track"; Luke Tierney indicates that he's tried some modifications relating to this copying in the past "in a few cases and always had to back off"; and Simon Urbanek hints that "there may be some things coming up, too".
(As I said, though, that's just impressionistic: I'm simply not able to follow a full conversation about the details of R's internals!)
Also relevant, in case you haven't seen it, here's how something like names(z)[3] <- "c2" "really" works:

    # From ?names<-
    z <- "names<-"(z, "[<-"(names(z), 3, "c2"))
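If it helps to see that expansion in action, the following small sketch checks that the shorthand and the spelled-out form produce the same object. The list `z` is just an illustrative value, and `z1` and `z2` are names made up for the comparison.

```r
# Illustrative object with three named elements
z <- list(a = 1, b = "c", c = 1:3)

# The usual shorthand
z1 <- z
names(z1)[3] <- "c2"

# What that shorthand expands to: extract the names, replace element 3,
# then assign the modified names vector back onto the whole object
z2 <- z
z2 <- "names<-"(z2, "[<-"(names(z2), 3, "c2"))

identical(z1, z2)
names(z2)
```

The point to notice is that the replacement function `names<-` receives and returns the whole object, which is part of why a "small" rename can touch so much memory.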
Note: Much of this answer comes from Matthew Dowle's answer to this other question. (I thought it was worth placing it here, and giving it some more exposure, since it's so relevant to your own question).
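For completeness, here is a rough sketch of the by-reference alternative that data.table offers. It assumes the data.table package is installed; `DT`, `before` and `after` are illustrative names. `setnames()` changes the names in place, and `address()` is used only to check whether the object moved in memory.

```r
library(data.table)

# Small illustrative table; the same calls apply to much larger ones
DT <- data.table(a = 1:2, b = 3:4)

before <- address(DT)      # memory address before renaming
setnames(DT, "a", "A")     # rename column "a" to "A" by reference
after <- address(DT)       # memory address after renaming

names(DT)
identical(before, after)   # compares the two addresses
```

This doesn't explain why base R copies, but it does show that the copy isn't inherent to the operation itself.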