How do I efficiently crossmatch two ASCII catalogs?
I have two ASCII text files with columnated data. The first column of both files is a 'name' that is consistent across both files. One file has some 6000 rows, the other only has 800. Without doing a for line in file.readlines(): approach - e.g.,
    with open('big_file.txt') as catalogue:
        with open('small_file.txt') as targets:
            for tline in targets.readlines()[2:]:
                name = tline.split()[0]
                for cline in catalogue.readlines()[8:]:
                    if name == cline.split()[0]:
                        print(cline)
                        catalogue.seek(0)
                        break
is there an efficient way to return only the rows (or lines) from the larger file that also appear in the smaller file (using the 'name' as the check)?
It's okay if it is one row at a time, say for a file.write(matching_line). The idea is to create a third file containing all the info from the large file, but only for the objects that are in the small file.
for line in file.readlines() is not inherently bad. What's bad is the nested loops you have there. You can use a set to keep track of and check all the names in the smaller file:
    s = set()
    for line in targets:
        # the name is the first whitespace-separated column
        s.add(line.split()[0])
Then, just loop through the bigger file and check if the name is in s:
    for line in catalogue:
        # compare only the first column (the name) against the set
        if line.split()[0] in s:
            print(line)
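Putting the two loops together, here is a minimal sketch that writes the matching rows to a third file, as the question asks. The function name `crossmatch`, its parameters, and the output path are my own placeholders; the default header counts (8 lines in the big file, 2 in the small one) are only inferred from the `[8:]` and `[2:]` slices in the question, so adjust them to your files.

```python
from itertools import islice


def crossmatch(big_path, small_path, out_path, big_skip=8, small_skip=2):
    """Copy rows of big_path whose first column also appears in small_path.

    big_skip / small_skip are the number of header lines to ignore
    (assumed from the slices used in the question).
    Returns the number of rows written.
    """
    # Collect the names (first column) of the smaller file into a set.
    with open(small_path) as targets:
        names = {
            line.split()[0]
            for line in islice(targets, small_skip, None)
            if line.strip()  # ignore blank lines
        }

    matched = 0
    # Single pass over the bigger file; each set lookup is O(1) on average.
    with open(big_path) as catalogue, open(out_path, "w") as out:
        for line in islice(catalogue, big_skip, None):
            fields = line.split()
            if fields and fields[0] in names:
                out.write(line)  # line still carries its own newline
                matched += 1
    return matched
```

Calling something like `crossmatch('big_file.txt', 'small_file.txt', 'matched.txt')` reads each file exactly once, so the work is roughly 6000 + 800 lines instead of up to 6000 × 800 comparisons with the nested loops.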