How to deal with spaces in column names?

I know it is preferred if variable names do not have spaces in them. I have a situation where I need publication-quality charts, so axes and legends need to have properly formatted labels, ie with spaces. So, for example, in development I might have variables called "Pct.On.OAC" and Age.Group, but in my final plot I need "% on OAC" and "Age Group" to appear:

'data.frame':   22 obs. of  3 variables:
 $ % on OAC           : Factor w/ 11 levels "0","0.1-9.9",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Age Group          : Factor w/ 2 levels "Aged 80 and over",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Number of Practices: int  47 5 33 98 287 543 516 222 67 14 ...

But when I try to plot these:

ggplot(dt.m, aes(x=`% on OAC`,y=`Number of Practices`, fill=`Age Group`)) +
    geom_bar()
)

no problem with that. But when I add a facet:

ggplot(dt.m, aes(x=`% on OAC`,y=`Number of Practices`, fill=`Age Group`)) +
    geom_bar() +
    facet_grid(`Age Group`~ .) 

I get Error in[.data.frame(base, names(rows)) : undefined columns selected

If I change Age Group to Age.Group then it works fine, but as I said, I don't want the dot to appear in the title legend.

So my questions are:

  1. Is there a workaround for the problem with the facet ?
  2. Is there a better general approach to dealing with the problem of spaces (and other characters) in variable names when I want the final plot to include them ? I suppose I can manually overide them, but that seems like a lot of faffing around.

Answers


This is a "bug" in the package ggplot2 that comes from the fact that the function as.data.frame() in the internal ggplot2 function quoted_df converts the names to syntactically valid names. These syntactically valid names cannot be found in the original dataframe, hence the error.

To remind you :

syntactically valid names consists of letters, numbers and the dot or underline characters, and start with a letter or the dot (but the dot cannot be followed by a number)

There's a reason for that. There's also a reason why ggplot allows you to set labels using labs, eg using the following dummy dataset with valid names:

X <-data.frame(
  PonOAC = rep(c('a','b','c','d'),2),
  AgeGroup = rep(c("over 80",'under 80'),each=4),
  NumberofPractices = rpois(8,70)
  ) 

You can use labs at the end to make this code work

ggplot(X, aes(x=PonOAC,y=NumberofPractices, fill=AgeGroup)) +
  geom_bar() +
  facet_grid(AgeGroup~ .) + 
  labs(x="% on OAC", y="Number of Practices",fill = "Age Group")

To produce


You asked "Is there a better general approach to dealing with the problem of spaces (and other characters) in variable names" and yes there are a few:

  • Just don't use them as things will break as you experienced here
  • Use the make.names() function to create safe names; this is used by R too to create identifiers (eg by using underscores for spaces etc)
  • If you must, protect the unsafe identifiers with backticks.

Example for the last two points:

R> myvec <- list("foo"=3.14, "some bar"=2.22)
R> myvec$'some bar' * 2
[1] 4.44
R> make.names(names(myvec))
[1] "foo"      "some.bar"
R> 

library("data.table", lib.loc = "~/R/win-library/3.5")

names(inv01)

[1] "INV_YEAR"  "TREE_NO"   "DBH 2019"  "HT 2019" 

inv01tmp<-inv01[,list(DBH=`DBH 2019`,HT=`HT 2019`)]

enter image description here


Need Your Help

java.util.stream with ResultSet

java jdbc lambda java-stream jooq

I have few tables with big amount of data (about 100 million records). So I can't store this data in memory but I would like to stream this result set using java.util.stream class and pass this str...

Multi-Core and Concurrency - Languages, Libraries and Development Techniques

programming-languages concurrency functional-programming multicore

The CPU architecture landscape has changed, multiple cores is a trend that will change how we have to develop software. I've done multi-threaded development in C, C++ and Java, I've done multi-proc...