Replace values in a column based on a frequency count using R

I have a dataset with multiple columns. Many of these columns are factors with more than 32 levels, so to run a Random Forest (for example), I want to replace values in each column based on their frequency count.

One of the columns looks like this:

$ country: Factor w/ 92 levels "China","India","USA",..: 30 39 39 20 89 30 16 21 30 30 ...

What I would like to do is only retain the top N (where N is a value between 5 and 20) countries, and replace the remaining values with "Other". I know how to calculate the frequency of the values using the table function, but I can't seem to find a solution for replacing values on the basis of such a rule. How can this be done?

Answers


Some example data:

set.seed(1)
x <- factor(sample(1:5,100,prob=c(1,3,4,2,5),replace=TRUE))
table(x)
# 1  2  3  4  5 
# 4 26 30 13 27 

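Replace all the levels other than the top 3 (levels 2, 3, and 5) with "Other". rank() ranks the counts in ascending order, so with 5 levels the 2 least frequent ones have a rank below 3: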

levels(x)[rank(table(x)) < 3] <- "Other"

table(x)
#Other     2     3     5 
#   17    26    30    27
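
The threshold above is tied to this example (5 levels, keep 3). For the general "keep the top N" case from the question, one possible sketch is to rank the negated counts so that rank 1 is the most frequent level, then relabel every level whose rank is greater than N. The helper name keep_top_n and the toy country vector are hypothetical, made up for illustration, and ties are broken by level order, which is only one of several reasonable choices:

```r
# Hypothetical helper: keep the n most frequent levels of a factor
# and lump all remaining levels into a single "Other" level.
keep_top_n <- function(f, n, other = "Other") {
  f <- as.factor(f)
  counts <- table(f)
  # Rank by descending count; ties.method = "first" breaks ties by level order
  r <- rank(-counts, ties.method = "first")
  levels(f)[r > n] <- other
  f
}

# Toy data (made up): 5 countries with different frequencies
country <- factor(c(rep("China", 5), rep("India", 4), rep("USA", 3), "Peru", "Chile"))
top3 <- keep_top_n(country, 3)
table(top3)
```

For a data frame column, the same call would be df$country <- keep_top_n(df$country, 10), where df is a placeholder for your own dataset. If adding a package is acceptable, forcats::fct_lump_n() covers the same use case.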
