Insert rows for missing dates/times

I am new to R but have turned to it to solve a problem with a large data set I am trying to process. Currently I have a 4 columns of data (Y values) set against minute-interval timestamps (month/day/year hour:min) (X values) as below:

    timestamp          tr            tt         sr         st  
1   9/1/01 0:00   1.018269e+02   -312.8622   -1959.393   4959.828  
2   9/1/01 0:01   1.023567e+02   -313.0002   -1957.755   4958.935  
3   9/1/01 0:02   1.018857e+02   -313.9406   -1956.799   4959.938  
4   9/1/01 0:03   1.025463e+02   -310.9261   -1957.347   4961.095  
5   9/1/01 0:04   1.010228e+02   -311.5469   -1957.786   4959.078

The problem I have is that some timestamp values are missing - e.g. there may be a gap between 9/1/01 0:13 and 9/1/01 0:27 and such gaps are irregular through the data set. I need to put several of these series into the same database and because the missing values are different for each series, the dates do not currently align on each row.

I would like to generate rows for these missing timestamps and fill the Y columns with blank values (no data, not zero), so that I have a continuous time series.

I'm honestly not quite sure where to start (not really used R before so learning as I go along!) but any help would be much appreciated. I have thus far installed chron and zoo, since it seems they might be useful.

Thanks!

Answers


I think the easiest thing ist to set Date first as already described, convert to zoo, and then just set a merge:

df$timestamp<-as.POSIXct(df$timestamp,format="%m/%d/%y %H:%M")

df1.zoo<-zoo(df[,-1],df[,1]) #set date to Index

df2 <- merge(df1.zoo,zoo(,seq(start(df1.zoo),end(df1.zoo),by="min")), all=TRUE)

Start and end are given from your df1 (original data) and you are setting by - e.g min - as you need for your example. all=TRUE sets all missing values at the missing dates to NAs.


This is an old question, but I just wanted to post a dplyr way of handling this, as I came across this post while searching for an answer to a similar problem. I find it more intuitive and easier on the eyes than the zoo approach.

library(dplyr)

ts <- seq.POSIXt(as.POSIXct("2001-09-01 0:00",'%m/%d/%y %H:%M'), as.POSIXct("2001-09-01 0:07",'%m/%d/%y %H:%M'), by="min")

ts <- seq.POSIXt(as.POSIXlt("2001-09-01 0:00"), as.POSIXlt("2001-09-01 0:07"), by="min")
ts <- format.POSIXct(ts,'%m/%d/%y %H:%M')

df <- data.frame(timestamp=ts)

data_with_missing_times <- full_join(df,original_data)

   timestamp     tr tt sr st
1 09/01/01 00:00 15 15 78 42
2 09/01/01 00:01 20 64 98 87
3 09/01/01 00:02 31 84 23 35
4 09/01/01 00:03 21 63 54 20
5 09/01/01 00:04 15 23 36 15
6 09/01/01 00:05 NA NA NA NA
7 09/01/01 00:06 NA NA NA NA
8 09/01/01 00:07 NA NA NA NA

Also using dplyr, this makes it easier to do something like change all those missing values to something else, which came in handy for me when plotting in ggplot.

data_with_missing_times %>% group_by(timestamp) %>% mutate_each(funs(ifelse(is.na(.),0,.)))

   timestamp     tr tt sr st
1 09/01/01 00:00 15 15 78 42
2 09/01/01 00:01 20 64 98 87
3 09/01/01 00:02 31 84 23 35
4 09/01/01 00:03 21 63 54 20
5 09/01/01 00:04 15 23 36 15
6 09/01/01 00:05  0  0  0  0
7 09/01/01 00:06  0  0  0  0
8 09/01/01 00:07  0  0  0  0

Date padding is implemented in the padr package in R. If you store your data frame, with your date-time variable stored as POSIXct or POSIXlt. All you need to do is:

library(padr)
pad(df_name)

See vignette("padr") or this blog post for its working.


# some made-up data
originaldf <- data.frame(timestamp=c("9/1/01 0:00","9/1/01 0:01","9/1/01 0:03","9/1/01 0:04"),
    tr = rnorm(4,0,1),
    tt = rnorm(4,0,1))

originaldf$minAsPOSIX <- as.POSIXct(originaldf$timestamp, format="%m/%d/%y %H:%M", tz="GMT")

# Generate vector of all minutes
ndays <- 1 # number of days to generate
minAsNumeric <- 60*60*24*243 + seq(0,60*60*24*ndays,by=60)

# convert those minutes to POSIX
minAsPOSIX <- as.POSIXct(minAsNumeric, origin="2001-01-01", tz="GMT")

# new df
newdf <- merge(data.frame(minAsPOSIX),originaldf,all.x=TRUE, by="minAsPOSIX")

In case you want to substitute the NA values acquired by any method mentioned above with zeroes, you can do this:

df[is.na(df)] <- 0

(I orginally wanted to comment this on Ibollar's answer but I lack the necessary reputation, thus I posted as an answer)


df1.zoo <- zoo(df1[,-1], as.POSIXlt(df1[,1], format = "%Y-%m-%d %H:%M:%S")) #set date to Index: Notice that column 1 is Timestamp type and is named as "TS"

full.frame.zoo <- zoo(NA, seq(start(df1.zoo), end(df1.zoo), by="min")) # zoo object
full.frame.df  <- data.frame(TS = as.POSIXlt(index(full.frame.zoo), format = "%Y-%m-%d %H:%M:%S")) # conver zoo object to data frame

full.vancouver <- merge(full.frame.df, df1, all = TRUE) # merge

I was looking for something similar where instead of filling out missing timestamps my data was in months and days. So I wanted to generate a sequence of months that would cater for leap years et cetera. I used lubridate:

date <- df$timestamp[1]
date_list <- c(date)
while (date < df$timestamp[nrow(df)]){
    date <- date %m+% months(1) 
    date_list <- c(date_list,date)
}
date_list <- format(as.Date(date_list),"%Y-%m-%d")
df_1 <- data.frame(months=date_list, stringsAsFactors = F)

This will give me a list of dates in incremental months. Then I join

df_with_missing_months <- full_join(df_1,df)

There are some advances in handling time series data in R, e.g. the tsibble package added such time series manipulations in tidy way:

library(tsibble)
library(lubridate)

ts <- lubridate::dmy_hm(c("9/1/01 0:00","9/1/01 0:01","9/1/01 0:03","9/1/01 0:27"))
originaldf <- tsibble(timestamp = ts,
                      tr        = rnorm(4,0,1),
                      tt        = rnorm(4,0,1),
                      index     = timestamp)

originaldf %>% 
  fill_gaps()

I think this can accomplished by using complete in tidyr package.

library(tidyverse)
df <- df %>%
      complete(timestamp = seq.POSIXt(min(timestamp), max(timestamp), by = "minute"), 
               tr, tt, sr,st)

you can also initialize your start date and end date instead of using min(timestamp) and max(timestamp).


Need Your Help

How does GitHub change the URL but not the reload?

javascript html

Hey, I have noticed that when browsing a GitHub repository it uses AJAX to load each folder / file.

What is the simplest method of inter-process communication between 2 C# processes?

c# ipc

I want communicate between a parent and child process both written in C#. It should be asynchronous, event driven. I does not want run a thread in every process that handle the very rare communicat...