Looking for an efficient way to check for file existence on Windows with files on a SAN
I have a large set of files located across a series of directories on a Windows 2003 server, with upwards of a million files in each directory. The server uses iSCSI to connect to an EqualLogic SAN.
I have an application that needs to determine if a set of files exists - the application needs to check for the existence of up to a million files per directory.
I have tried a variety of techniques and scripting languages, including Perl, VBScript, and DOS batch files, and I cannot get beyond about 250 file checks per second. At that rate it takes over 50 minutes to check for 800,000 files. I tried multithreading a Perl program to check for multiple files at a time, but that did not help.
I have also tried listing all of the files in the directory using dir, ls, and find (via Cygwin), and it takes many minutes before any file names are output at all. That isn't a great approach anyway, because the directory contains far more files than I actually need to check for.
Is there a way I can force Windows to do a "read ahead" on the directory and get the files into a cache?
Is there a better way to approach this kind of problem?
I would probably avoid any interpreted language such as VBScript et al. for precisely the reasons you've described - they're just not going to perform well in a scenario where throughput matters.
Now, as a formal caveat for my suggestion, I'm assuming that the set of prospective files (the search target) remains relatively stable over the time the application runs, so that the risk of a false positive existence check - caused by the file set changing after the scan started - is minimal.
It's not elegant, but I would at least suggest exploring a Win32 (not .NET) console-type app that recursively dumps the directory tree into a memory-mapped file, then searches that file for your required pattern. That limits the disk access to the single pass required to accumulate the results, and moves the searching to the presumably (much) faster memory-backed file. Now, I may be underestimating the size and/or complexity of your file set, but that's what I would offer as a starting point.
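The idea can be sketched in Python rather than Win32 for brevity - the helper names `build_listing` and `name_exists` are mine, not part of the suggestion. One pass accumulates the names into a flat file; each subsequent lookup searches a memory map of that file instead of touching the directory again:

```python
import mmap
import os

def build_listing(root, listing_path):
    """Walk the tree once, writing every file name (one per line) to a flat file."""
    with open(listing_path, "wb") as out:
        for _dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                out.write(name.encode("utf-8") + b"\n")

def name_exists(listing_path, name):
    """Check a name against the listing via a memory map - no further directory I/O."""
    needle = name.encode("utf-8") + b"\n"
    with open(listing_path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Match at the start of the file or right after a newline,
        # so "a.txt" does not falsely match inside "data.txt".
        return mm[:len(needle)] == needle or mm.find(b"\n" + needle) != -1
```

A real Win32 version would use CreateFileMapping/MapViewOfFile for the map, but the shape of the solution is the same: pay the enumeration cost once, then do all the matching against memory.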
I recommend a Win32 app over a .NET app to avoid the overhead of the framework runtime, but the obvious caveats about a non-managed app apply.
Hope that's helpful, or at least stirs the pot for you a bit. Good luck.
When you check each file individually you're limited by the latency of the request and response. It's doubtful you can find a way to speed that up unless you use asynchronous requests and run many simultaneously, but that approach will put a strain on the file system.
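As a rough sketch of that overlapped approach - a Python thread pool standing in for true asynchronous I/O, and the function name `check_many` being mine:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def check_many(paths, max_workers=32):
    """Run many existence checks concurrently so the per-request latencies
    overlap instead of adding up; each check still costs one full lookup."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(os.path.exists, paths)))
```

With a high-latency link to the SAN this can raise throughput considerably, but as noted above it multiplies the concurrent load on the file system rather than reducing the total work.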
While getting a full directory listing seems like overkill, it's likely to be the fastest method unless your search list is much smaller (say 100 times smaller) than the full directory.
Each individual check requires the operating system to read through the directory until it finds (or fails to find) the file you're asking for. In other words, each check reads on average more than half of the contents of the directory, so reading the complete directory once will almost certainly be much more efficient.
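Taking that linear-scan model at face value, the arithmetic for the numbers in the question works out as follows (a back-of-envelope sketch; `expected_entry_reads` is just an illustrative helper):

```python
def expected_entry_reads(dir_size, checks):
    """Under a linear-scan model, each individual lookup reads about half
    the directory on average; one full enumeration reads it exactly once."""
    individual = checks * dir_size // 2
    full_listing = dir_size
    return individual, full_listing

# 800,000 checks against a 1,000,000-entry directory:
ind, full = expected_entry_reads(1_000_000, 800_000)
```

That comes out hundreds of thousands of times more entry reads for the individual checks than for one full enumeration, which is why the complete listing wins even though it feels like overkill.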
However, you shouldn't do this by spawning out to another program. Use FindFirstFile/FindNextFile or a .NET equivalent. You can check each file against your list as you find it - you might want to organize your list first, put it in a b-tree or something.
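A minimal sketch of that single-pass approach - in Python for brevity, where os.scandir is implemented on top of FindFirstFile/FindNextFile on Windows; a hash set stands in for the suggested b-tree, since membership testing is the only operation needed:

```python
import os

def find_existing(directory, wanted_names):
    """Enumerate the directory once and test each entry against a set of
    target names, instead of issuing one directory lookup per target."""
    wanted = set(wanted_names)   # O(1) membership test per entry
    found = set()
    with os.scandir(directory) as it:
        for entry in it:
            if entry.name in wanted:
                found.add(entry.name)
    return found
```

A native version with FindFirstFile/FindNextFile follows the same structure: one enumeration loop, one cheap membership test per entry.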
You might want to try GetFileInformationByHandleEx with the FileIdBothDirectoryInfo option instead of FindFirstFile/FindNextFile to see which is faster.