How to deal with spaces in column names?
I know it is preferred if variable names do not have spaces in them. I have a situation where I need publication-quality charts, so axes and legends need properly formatted labels, i.e. with spaces. So, for example, in development I might have variables called Pct.On.OAC and Age.Group, but in my final plot I need "% on OAC" and "Age Group" to appear:
'data.frame': 22 obs. of 3 variables:
 $ % on OAC           : Factor w/ 11 levels "0","0.1-9.9",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Age Group          : Factor w/ 2 levels "Aged 80 and over",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Number of Practices: int 47 5 33 98 287 543 516 222 67 14 ...
When I plot these:
ggplot(dt.m, aes(x=`% on OAC`,y=`Number of Practices`, fill=`Age Group`)) + geom_bar()
there is no problem. But when I add a facet:
ggplot(dt.m, aes(x=`% on OAC`,y=`Number of Practices`, fill=`Age Group`)) + geom_bar() + facet_grid(`Age Group`~ .)
I get:
Error in `[.data.frame`(base, names(rows)) : undefined columns selected
If I change Age Group to Age.Group then it works fine, but as I said, I don't want the dot to appear in the legend title.
So my questions are:
- Is there a workaround for the problem with the facet?
- Is there a better general approach to dealing with spaces (and other characters) in variable names when I want the final plot to include them? I suppose I can manually override them, but that seems like a lot of faffing around.
This is a "bug" in the package ggplot2 that comes from the fact that the function as.data.frame() in the internal ggplot2 function quoted_df converts the names to syntactically valid names. These syntactically valid names cannot be found in the original dataframe, hence the error.
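The renaming can be reproduced in base R without ggplot2. This is only a sketch of the mechanism, not the package's actual internals: facet_vars and converted are made-up names, and dt.m here is a two-row stand-in for the data in the question.

```r
# A list with a non-syntactic element name, similar to a quoted facet variable.
facet_vars <- list(`Age Group` = c("Aged 80 and over", "Under 80"))

# as.data.frame() checks names by default, so the space becomes a dot.
converted <- as.data.frame(facet_vars)
names(converted)
# [1] "Age.Group"

# Stand-in for the original data, which keeps the name with a space.
dt.m <- data.frame(`Age Group` = c("Aged 80 and over", "Under 80"),
                   check.names = FALSE)

# Looking up the converted name in the original data fails
# with the same message as in the question.
result <- tryCatch(dt.m[, names(converted)],
                   error = function(e) conditionMessage(e))
result
# [1] "undefined columns selected"
```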
To remind you:
a syntactically valid name consists of letters, numbers and the dot or underline characters, and starts with a letter or the dot (but the dot cannot be followed by a number)
There's a reason for that. There's also a reason why ggplot2 allows you to set labels using labs(), e.g. with the following dummy dataset with valid names:
X <- data.frame(
  PonOAC = rep(c('a','b','c','d'), 2),
  AgeGroup = rep(c("over 80",'under 80'), each=4),
  NumberofPractices = rpois(8, 70)
)
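You can use labs() at the end to make this code work:
# stat = "identity" plots the y values as given; current ggplot2 versions
# reject a y aesthetic with the default counting stat
ggplot(X, aes(x=PonOAC, y=NumberofPractices, fill=AgeGroup)) +
  geom_bar(stat = "identity") +
  facet_grid(AgeGroup ~ .) +
  labs(x="% on OAC", y="Number of Practices", fill="Age Group")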
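If retyping the labels for every plot feels like too much faffing, one option is to keep them in a single named vector and look them up by column name. This is a sketch under assumptions, not part of the original answer: pretty and plot_labels are hypothetical names, and geom_col() is used in place of geom_bar() because it draws the y values as given.

```r
# Display labels keyed by the syntactic column names (hypothetical lookup).
pretty <- c(
  PonOAC            = "% on OAC",
  AgeGroup          = "Age Group",
  NumberofPractices = "Number of Practices"
)

set.seed(1)  # only so the dummy counts are reproducible
X <- data.frame(
  PonOAC            = rep(c("a", "b", "c", "d"), 2),
  AgeGroup          = rep(c("over 80", "under 80"), each = 4),
  NumberofPractices = rpois(8, 70)
)

# Look the labels up once instead of retyping them per plot.
plot_labels <- list(
  x    = pretty[["PonOAC"]],
  y    = pretty[["NumberofPractices"]],
  fill = pretty[["AgeGroup"]]
)

# Build the plot only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(X, aes(x = PonOAC, y = NumberofPractices, fill = AgeGroup)) +
    geom_col() +                  # bars use the y values as given
    facet_grid(AgeGroup ~ .) +
    do.call(labs, plot_labels)    # same as labs(x = ..., y = ..., fill = ...)
}
```

Call print(p) to draw the plot. The column names stay syntactic throughout, so faceting works, and the spaces only appear in the labels.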
You asked "Is there a better general approach to dealing with the problem of spaces (and other characters) in variable names" and yes there are a few:
- Just don't use them, as things will break, as you experienced here
- Use the make.names() function to create safe names; this is used by R too to create identifiers (e.g. by replacing spaces and other invalid characters with dots)
- If you must, protect the unsafe identifiers with backticks.
Example for the last two points:
R> myvec <- list("foo"=3.14, "some bar"=2.22)
R> myvec$'some bar' * 2
[1] 4.44
R> make.names(names(myvec))
[1] "foo"      "some.bar"
R>
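The same two points applied to the column names from the question, as a rough sketch. The data frame dt and its values are made up for illustration; only the three names come from the question.

```r
# The display names from the question.
pretty <- c("% on OAC", "Age Group", "Number of Practices")

# make.names() replaces invalid characters with dots and prepends "X"
# when the first character is not allowed at the start of a name.
safe <- make.names(pretty)
safe
# "X..on.OAC" "Age.Group" "Number.of.Practices"

# Made-up data; names<- does not check names, so the spaces survive.
dt <- data.frame(a = c("0", "0.1-9.9"),
                 b = c("Aged 80 and over", "Under 80"),
                 c = c(47L, 5L))
names(dt) <- pretty

# Non-syntactic names need backticks with $, or a string with [[ ]].
dt$`Age Group`
dt[["Number of Practices"]]
```

Keeping both vectors around means you can work with safe during development and switch to pretty only for the final output.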
library("data.table", lib.loc = "~/R/win-library/3.5")
names(inv01)
# [1] "INV_YEAR" "TREE_NO"  "DBH 2019" "HT 2019"
# backticks let the columns with spaces be used inside [ ] and renamed
inv01tmp <- inv01[, list(DBH = `DBH 2019`, HT = `HT 2019`)]