Python File System Bottleneck, how can I fix this?
So basically I have a folder that looks like this:
```
MyFolder\
    data_1.txt
    data_2.txt
    data_3.txt
    ...
    data_very_large_number.txt
```
I want to process each of the files. My plan was to run 10 instances of a script that each process 1/10th of the files.
So basically, I did the following:
```
python script.py 1
python script.py 2
...
python script.py 10
```
But I'm noticing that only the first instance of script.py is actually processing anything at all. Only after the first instance finishes does the second one start. I am guessing that this is a file-system bottleneck.
Does anyone have an idea how to tackle this issue with Python?
This isn't a file-system bottleneck: without a trailing `&`, the shell waits for each command to exit before starting the next one, so your ten instances run one after another. There are many ways to run these scripts in parallel, but if you want to keep starting them manually from the command line, background each one with `&`:

```
python script.py 1 &
python script.py 2 &
...
python script.py 10 &
```

and so on.
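If you'd rather launch and supervise the instances from Python itself, here is a minimal sketch using `subprocess.Popen`, which starts all the children before waiting on any of them (it assumes `script.py` takes the slice index as its only argument, as in your example):

```python
import subprocess
import sys

def launch_parallel(cmd_prefix, indices):
    """Start one subprocess per index without waiting in between,
    then wait for all of them and return their exit codes."""
    procs = [subprocess.Popen(cmd_prefix + [str(i)]) for i in indices]
    return [p.wait() for p in procs]

if __name__ == "__main__":
    # Launch the 10 instances in parallel and block until all have exited.
    codes = launch_parallel([sys.executable, "script.py"], range(1, 11))
```

Because every `Popen` call returns immediately, all ten processes run concurrently; the `wait()` loop afterwards plays the role of the shell's `wait` builtin.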
If the whole set of files fits into system memory, you can get a significant performance improvement by staging them on a ramdisk:
To create a ramdisk, simply do:
```
# mkfs -q /dev/ram1 8192
# mkdir -p /ramcache
# mount /dev/ram1 /ramcache
```
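Once the ramdisk is mounted, the files have to be copied onto it before processing. A minimal sketch of that staging step (the `/ramcache` mount point matches the commands above; the helper name `stage_files` is my own):

```python
import glob
import os
import shutil

def stage_files(src_glob, dest_dir):
    """Copy every file matching src_glob into dest_dir (e.g. the ramdisk)
    and return the new paths, so processing reads from RAM, not disk."""
    staged = []
    for src in glob.glob(src_glob):
        dst = os.path.join(dest_dir, os.path.basename(src))
        shutil.copy2(src, dst)
        staged.append(dst)
    return staged

if __name__ == "__main__":
    # Assumes /ramcache is the ramdisk mounted with the commands above.
    files = stage_files(r"MyFolder\data_*.txt", "/ramcache")
```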
Alternatively, you can do all the work in one process with a pool of threads pulling file names from a queue:

```python
import glob
import queue
import threading

q = queue.Queue()
for file in glob.glob(r"MyFolder\data_*.txt"):
    q.put(file)  # Queue has put(), not add()

class DoStuff(threading.Thread):
    def __init__(self, q):
        self.q = q
        super().__init__()

    def run(self):
        while True:
            try:
                file = self.q.get_nowait()
            except queue.Empty:  # the exception lives in the queue module
                return           # queue drained -- end this thread
            # DO STUFF WITH YOUR FILE
            # DO STUFF WITH YOUR FILE

threads = [DoStuff(q) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for the workers; daemon threads would be killed at interpreter exit
```
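One caveat: if the per-file work is CPU-bound rather than I/O-bound, threads in a single Python process won't run it in parallel because of the GIL. In that case a process pool is the usual alternative; here is a minimal sketch using `concurrent.futures.ProcessPoolExecutor`, where `process_one` is a hypothetical stand-in for the real per-file work:

```python
import glob
from concurrent.futures import ProcessPoolExecutor

def process_one(path):
    # Hypothetical stand-in for the real per-file work;
    # here it just returns the length of the path string.
    return len(path)

def process_all(paths, workers=10):
    """Fan the paths out across `workers` worker processes
    and collect the results in input order."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_one, paths))

if __name__ == "__main__":
    files = glob.glob(r"MyFolder\data_*.txt")
    results = process_all(files)
```

`pool.map` preserves input order and the `with` block waits for all workers to finish, so there is no need for manual joins.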