How to improve runtime performance of reading file program

I'm currently trying to read 150 million lines (from a data file with bio-sequencing information) using Python. Currently, it's reading at 20,000 lines per second which would take about an hour and a half. I have to read through 20 of these files. Given that Python is a very high level language, would it be better to use Java to read the files instead or is the time difference not significant enough to warrant switching to another language?

The current code I'm using is:

lines_hashed = 0
with open(CUR_FILE) as f:
    for line in f:
       cpg = line.split("\t")
       cpg_dict[cpg[0]] = ....data....
       print lines_hashed
       lined_hashed += 1

The print statement is there only as a sanity that the program didn't stall anywhere. I'm assuming this is also slowing down the running time. Is there a way to check this without the print statement?

Thanks.

Answers


  1. Printing to the screen is expensive compared to disk reads. If you must check performance as you go along, only print something out every 1000 lines or more.
  2. As for using other languages, almost all languages call the OS to do the real work anyway.

Need Your Help

why chrome shows cookies for all domains in resources tab

google-chrome google-chrome-devtools

I've opened Chrome's web developer tools (F12) and navigated to Resources tab. I then selected Cookies tab and clicked on cookies specific to my domain inreado.git.local. However the view showed co...

How to display a LinkedBinaryTree

java linked-list binary-tree tostring treenode

I need to understand the best way to display my linkedBinaryTree. right noe the driver is passing in integers as elements to each node of the tree, For the toString i have tried the following snipp...