Search and replace, numbering the matched strings consecutively
I have a huge text file (18 GB) divided into articles; each article has a heading like this:
<text id="1403" year="" url_source="http://www.adobe.de" error="0.008696">
The problem is that each article should have a different id, but some ids are repeated. What I need to do is find the ids throughout the file and renumber them consecutively, starting from 1. I've been looking around but haven't found a suitable solution, probably because of my lack of knowledge. I would appreciate your suggestions.
Assuming id is always the first attribute of every text tag, in Perl:
perl -M5.010 -wpi.bak -e'our $article; s/<text id="\K[0-9]+/++$article/ge' hugetextfile
Note that -i.bak renames your file by appending .bak, then reads through that copy and writes the result under the original name, so you need 18 GB of free space.
In Python: if it is a valid XML file, you can use an XML parser such as ElementTree.
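A minimal sketch of the ElementTree route, assuming the articles sit under a single root element (the <corpus> name in the usage note is hypothetical, not from the question). It parses the whole document in memory, so for an 18 GB file it only illustrates the idea; a streaming approach such as ET.iterparse would be needed in practice.

```python
import xml.etree.ElementTree as ET

def renumber_xml(xml_string):
    """Return the document with every <text> element's id set to 1, 2, 3, ..."""
    root = ET.fromstring(xml_string)
    # iter('text') walks all <text> elements in document order.
    for new_id, elem in enumerate(root.iter('text'), start=1):
        elem.set('id', str(new_id))
    return ET.tostring(root, encoding='unicode')
```

For instance, calling renumber_xml on '<corpus><text id="1403">a</text><text id="1403">b</text></corpus>' should give the two articles the ids 1 and 2.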
Otherwise, iterate over the input file and write to an output file:
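new_id = 1
with open('in_file', 'r') as in_f, open('out_file', 'w') as out_f:
    for line in in_f:
        if line.startswith('<text'):
            # Split on spaces: fields[0] is '<text', fields[1] is the id attribute.
            # This assumes id is followed by at least one more attribute.
            fields = line.split(' ')
            fields[1] = 'id="' + str(new_id) + '"'
            line = ' '.join(fields)
            new_id += 1
        # Write every line, changed or not.
        out_f.write(line)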
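Note that this assumes that each <text ... tag starts at the beginning of a line and that id is its first attribute, followed by a space and further attributes as in the example heading. If this is not the case, you will have to modify it a little.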
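If the tags can appear anywhere in a line, one possible modification is a regular expression, much like the Perl one-liner. The following is only a sketch: it still assumes id is the first attribute of every text tag, and the function names and file paths are made up for illustration.

```python
import re

# Matches the numeric id of a <text ...> tag; assumes id is the first attribute.
ID_PATTERN = re.compile(r'(<text id=")[0-9]+')

def renumber_lines(lines):
    """Yield each line with its <text id="..."> numbers replaced by 1, 2, 3, ..."""
    counter = 0

    def next_id(match):
        # Called once per matched tag; keeps the '<text id="' prefix.
        nonlocal counter
        counter += 1
        return match.group(1) + str(counter)

    for line in lines:
        yield ID_PATTERN.sub(next_id, line)

def renumber_file(in_path, out_path):
    # Streams line by line, so memory use stays small even for a huge file.
    # The default text encoding is used; pass encoding=... to open() if needed.
    with open(in_path, 'r') as in_f, open(out_path, 'w') as out_f:
        out_f.writelines(renumber_lines(in_f))
```

Something like renumber_file('in_file', 'out_file') would then write the renumbered copy. As with the loop above, the output is a second file, so roughly as much free disk space as the input is needed.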