Filling in missing values (blanks) in a data table, per category, backwards and forwards
I am working with a large data set of billing records for my clinical practice over 11 years. Quite a few of the rows are missing the referring physician. Using a few simple rules I could fill them in quite easily, but I do not know how to implement this with data.table in R. I know that there are things such as na.locf in the zoo package and the self rolling join in the data.table package. The examples that I have seen are too simplistic and do not help me.
Here is some fictitious data to orient you (as a dput ASCII text representation)
structure(list(patient.first.name = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("John", "Kathy", "Timothy"), class = "factor"), patient.last.name = structure(c(3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("Jones", "Martinez", "Squeal"), class = "factor"), medical.record.nr = c(4563455, 4563455, 4563455, 4563455, 4563455, 2663775, 2663775, 2663775, 2663775, 2663775, 3330956, 3330956, 3330956, 3330956), date.of.service = c(39087, 39112, 39112, 39130, 39228, 39234, 39244, 39244, 39262, 39360, 39184, 39194, 39198, 39216), procedure.code = c(44750, 38995, 40125, 44720, 44729, 44750, 38995, 40125, 44720, 44729, 44750, 44729, 44729, 44729), diagnosis.code.1 = c(456.87, 456.87, 456.87, 456.87, 456.87, 521.37, 521.37, 521.37, 521.37, 356.36, 456.87, 456.87, 456.87, 456.87), diagnosis.code.2 = c(413, 413, 413, 413, 413, 532.23, NA, NA, NA, NA, NA, NA, NA, NA), referring.doctor.first = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, NA, NA, NA, 1L, 1L, NA), .Label = c("Abe", "Mark"), class = "factor"), referring.doctor.last = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, NA, NA, NA, 1L, 1L, NA), .Label = c("Newstead", "Wydell"), class = "factor"), referring.docotor.zip = c(15209, 15209, 15209, 15209, 15209, 15222, 15222, 15222, NA, NA, NA, 15209, 15209, NA), some.other.stuff = structure(c(1L, 1L, 1L, NA, 3L, NA, NA, 4L, NA, 6L, NA, 2L, 5L, NA), .Label = c("alkjkdkdio", "cheerios", "ddddd", "dddddd", "dogs", "lkjljkkkkk"), class = "factor")), .Names = c("patient.first.name", "patient.last.name", "medical.record.nr", "date.of.service", "procedure.code", "diagnosis.code.1", "diagnosis.code.2", "referring.doctor.first", "referring.doctor.last", "referring.docotor.zip", "some.other.stuff" ), row.names = c(NA, 14L), class = "data.frame")
The obvious solution is to use some sort of last observation carried forward (LOCF) algorithm on referring.doctor.last and referring.doctor.first. However, it must stop when it gets to a new patient. In other words, the LOCF must only be applied within one patient, who is identified by the combination of patient.first.name, patient.last.name and medical.record.nr. Also note that some patients are missing the referring doctor on their very first visit, which means that some observations have to be carried backwards. To complicate matters, some patients change primary care physicians, so there may be one referring doctor earlier on and another one later on. The algorithm therefore needs to be aware of the date order of the rows with missing values.
In zoo's na.locf I do not see an easy way to group the LOCF per patient. The rolling join examples that I have seen would not work here because I cannot simply take out the rows with the missing referring.doctor information, since I would then lose date.of.service, procedure.code, etc. I would love your help in learning how R can fill in my missing data.
@MatthewDowle has provided us with a wonderful starting point and here we will take it to its conclusion.
In a nutshell, use zoo's na.locf. The problem is not amenable to rolling joins.
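library(data.table)
library(zoo)
setDT(bill)  # convert the data.frame to a data.table by reference
# Carry the last observation forward within each patient
bill[, referring.doctor.last := na.locf(referring.doctor.last, na.rm = FALSE), by = list(patient.last.name, patient.first.name, medical.record.nr)]
# Then carry the next observation backward to fill any leading NAs
bill[, referring.doctor.last := na.locf(referring.doctor.last, na.rm = FALSE, fromLast = TRUE), by = list(patient.last.name, patient.first.name, medical.record.nr)]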
Then do something similar for referring.doctor.first
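Rather than repeating the two calls for every column, the same fill can be applied to several columns in one pass with .SDcols. The following is only a minimal sketch: the toy table is made up and merely stands in for bill, and it groups by medical.record.nr alone on the assumption that the record number identifies a patient.

```r
library(data.table)
library(zoo)

# Made-up toy table standing in for `bill` (not the data from the question)
bill <- data.table(
  medical.record.nr      = c(1, 1, 1, 2, 2, 2),
  date.of.service        = c(10, 20, 30, 10, 20, 30),
  referring.doctor.first = c(NA, "Abe", NA, "Mark", NA, NA),
  referring.doctor.last  = c(NA, "Newstead", NA, "Wydell", NA, NA)
)

doc_cols <- c("referring.doctor.first", "referring.doctor.last")

# For each column in doc_cols: fill forward, then fill backward, per patient
bill[, (doc_cols) := lapply(.SD, function(x) {
  x <- na.locf(x, na.rm = FALSE)
  na.locf(x, na.rm = FALSE, fromLast = TRUE)
}), by = medical.record.nr, .SDcols = doc_cols]

print(bill)
```

Any other column that should follow the referring doctor could be added to doc_cols in the same way.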
A few pointers:
The by statement ensures that the last observation carried forward is restricted to the same patient so that the carrying does not "bleed" into the next patient on the list.
One must use the na.rm=FALSE argument. If one does not then a patient who is missing information for a referring physician on their very first visit will have the NA removed and the vector of new values (existing + carried forward) will be one element short of the number of rows. The shortened vector is recycled and everything gets shifted up and the last row gets the first element of the vector as it is recycled. In other words, a big mess. And worst of all you will only see it sometimes.
Use fromLast=TRUE to run through the column a second time. That fills in any NA that preceded the first recorded value: with fromLast=TRUE, zoo carries the next observation backward (NOCB) instead of carrying the last observation forward (LOCF). Happiness: you have now filled in the missing data in a way that is correct for most circumstances.
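The effect of na.rm and fromLast described in the pointers above is easiest to see on a plain vector. This is a small illustration with made-up values, not taken from the billing data:

```r
library(zoo)

x <- c(NA, "Newstead", NA, "Wydell", NA)

# Default na.rm = TRUE drops the leading NA, so the result is one element short
short <- na.locf(x)
length(short)  # 4 rather than 5

# na.rm = FALSE keeps the leading NA and preserves the length
fwd <- na.locf(x, na.rm = FALSE)
print(fwd)

# A second pass with fromLast = TRUE fills the leading NA from the next value
both <- na.locf(fwd, na.rm = FALSE, fromLast = TRUE)
print(both)
```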
You can pass multiple := per line, e.g. DT[,`:=`(new=1L,new2=2L,...)]
A more concise example would have been easier to answer. For example you've included quite a few columns that appear to be redundant. Does it really need to be by first name and last name, or can we use the patient number?
Since you already have NAs in the data that you wish to fill, this isn't really a job for roll in data.table. A rolling join is more for when your data has no NA but you have another time series (for example) that joins to positions in between the data. (One efficiency advantage there is the very fact that you don't create NAs first and then have to fill them in a second step.) Or, in other words, in your question you have just one dataset; you aren't joining two.
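For contrast, here is roughly what the two-dataset situation looks like. This is a hedged sketch with invented tables (referrals and visits are hypothetical and are not part of the question): each visit picks up the most recent referral at or before its date.

```r
library(data.table)

# Hypothetical table: when each referring doctor took over a patient
referrals <- data.table(
  patient = c(1, 1, 2),
  date    = c(100, 200, 100),
  doctor  = c("Newstead", "Wydell", "Newstead")
)

# Hypothetical table: visits, which carry no doctor column at all
visits <- data.table(
  patient = c(1, 1, 1, 2),
  date    = c(120, 200, 350, 150)
)

# For each visit, roll the latest referral at or before the visit date forward
res <- referrals[visits, on = .(patient, date), roll = TRUE]
print(res)
```

No NA is ever created here, which is the efficiency point made above. The question has only one table, so this pattern does not apply directly.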
So you do need na.locf as @Joshua suggested. I'm not aware of a function that fills NA forward and then the first value backwards, though.
In data.table, to use na.locf by group it's just :
require(data.table)
require(zoo)
DT[, doctor := na.locf(doctor, na.rm = FALSE), by = patient]  # na.rm = FALSE keeps leading NAs so the result matches the group length
which has the efficiency advantages of fast aggregation and update by reference. You would have to write a new small function on top of na.locf to roll the first non NA backwards.
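Such a small function might look like the sketch below. The name fill_both_ways is made up for illustration, and DT is a toy table rather than the real data.

```r
library(data.table)
library(zoo)

# Hypothetical helper: carry forward first, then roll the first non-NA backward
fill_both_ways <- function(x) {
  x <- na.locf(x, na.rm = FALSE)
  na.locf(x, na.rm = FALSE, fromLast = TRUE)
}

# Toy data: both patients are missing the doctor on their first row
DT <- data.table(
  patient = c("a", "a", "a", "b", "b"),
  doctor  = c(NA, "Newstead", NA, NA, "Wydell")
)

DT[, doctor := fill_both_ways(doctor), by = patient]
print(DT)
```

Because the forward pass runs first, a value is only carried backward where no earlier observation exists for that patient.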
Ensure the data is sorted by patient then date, first. Then the above will cope with changes in doctor over time, since by maintains the order of rows within each group.
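As a rough illustration of why the sort matters, the toy rows below are deliberately out of date order and the patient changes doctor partway through. The column names are assumptions chosen to match the short example above.

```r
library(data.table)
library(zoo)

# Toy data for one patient, rows not in date order
DT <- data.table(
  patient = c("a", "a", "a", "a"),
  date    = c(300, 100, 400, 200),
  doctor  = c("Wydell", "Newstead", NA, NA)
)

# Sort by patient, then date, by reference
setorder(DT, patient, date)

# Each NA now takes the doctor from the most recent earlier visit
DT[, doctor := na.locf(doctor, na.rm = FALSE), by = patient]
print(DT)
```

Without the setorder() call, the NA on the visit at date 200 would have been filled with the later doctor instead of the earlier one.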
Hope that gives you some hints.