Parse dates when YYYYMMDD and HH are in separate columns using pandas in Python
I have a simple question related with csv files and parsing datetime.
I have a csv file that look like this:
YYYYMMDD, HH, X 20110101, 1, 10 20110101, 2, 20 20110101, 3, 30
I would like to read it using pandas (read_csv) and have it in a dataframe indexed by the datetime. So far I've tried to implement the following:
import pandas as pnd pnd.read_csv("..\\file.csv", parse_dates = True, index_col = [0,1])
and the result I get is:
X YYYYMMDD HH 2011-01-01 2012-07-01 10 2012-07-02 20 2012-07-03 30
As you see the parse_dates in converting the HH into a different date.
Is there a simple and efficient way to combine properly the column "YYYYMMDD" with the column "HH" in order to have something like this? :
X Datetime 2011-01-01 01:00:00 10 2011-01-01 02:00:00 20 2011-01-01 03:00:00 30
Thanks in advance for the help.
If you pass a list to index_col, it means you want to create a hierarchical index out of the columns in the list.
In addition, the parse_dates keyword can be set to either True or a list/dict. If True, then it tries to parse individual columns as dates, otherwise it combines columns to parse a single date column.
In summary, what you want to do is:
from datetime import datetime import pandas as pd parse = lambda x: datetime.strptime(x, '%Y%m%d %H') pd.read_csv("..\\file.csv", parse_dates = [['YYYYMMDD', 'HH']], index_col = 0, date_parser=parse)
I am doing this all the time, so I tested different ways for speed. The fastest I found is the following, approx. 3 times faster than Chang She's solution, at least in my case, when taking the total time of file parsing and date parsing into account:
First, parse the data file using pd.read_csv withOUT parsing dates. I find that it is slowing down the file-reading quite a lot. Make sure that the columns of the CSV file are now columns in the dataframe df. Then:
format = "%Y%m%d %H" times = pd.to_datetime(df.YYYYMMDD + ' ' + df.HH, format=format) df.set_index(times, inplace=True) # and maybe for cleanup df = df.drop(['YYYYMMDD','HH'], axis=1)