How to improve the runtime performance of a file-reading program
I'm currently trying to read 150 million lines (from a data file with bio-sequencing information) using Python. It's reading at about 20,000 lines per second, which works out to roughly two hours per file, and I have to read through 20 of these files. Given that Python is a very high-level language, would it be faster to use Java to read the files instead, or is the time difference not significant enough to warrant switching to another language?
The current code I'm using is:
```python
lines_hashed = 0
with open(CUR_FILE) as f:
    for line in f:
        cpg = line.split("\t")
        cpg_dict[cpg] = ....data....
        print lines_hashed
        lines_hashed += 1
```
The print statement is there only as a sanity check that the program hasn't stalled anywhere. I'm assuming it is also slowing the program down. Is there a way to monitor progress without printing on every line?
- Printing to the screen is expensive compared to disk reads. If you must check performance as you go along, only print something out every 1000 lines or more.
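  A minimal sketch of that idea, assuming a tab-separated file like the asker's (the file path, `REPORT_EVERY` interval, and the choice of first field as the dict key are all placeholders, since the real key and `....data....` payload aren't shown):

  ```python
  import sys
  import time

  REPORT_EVERY = 1_000_000  # print progress once per million lines (assumption)

  def hash_lines(path):
      cpg_dict = {}
      start = time.time()
      with open(path) as f:
          # enumerate avoids maintaining a manual counter variable
          for lines_hashed, line in enumerate(f, start=1):
              fields = line.rstrip("\n").split("\t")
              cpg_dict[fields[0]] = fields[1:]  # placeholder for ....data....
              if lines_hashed % REPORT_EVERY == 0:
                  rate = lines_hashed / (time.time() - start)
                  print("%d lines (%.0f/s)" % (lines_hashed, rate), file=sys.stderr)
      return cpg_dict
  ```

  Printing to `stderr` also keeps the progress output separate from any real results on `stdout`, so it can be silenced with a shell redirect instead of a code change.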
- As for using other languages, almost all languages call the OS to do the real work anyway.