search and replace enumerating found strings

I have a huge text file (18 GB) divided into articles; each article has a heading like this:

<text id="1403" year="" url_source="http://www.adobe.de" error="0.008696">

The problem is that each article should have a different id, but some ids are repeated. What I need to do is find the ids throughout the file and renumber them consecutively, starting from 1. I've been looking around but haven't found a suitable solution, probably because of my lack of knowledge. I would appreciate your suggestions.

Answers


Assuming id is always the first attribute of every text tag, in Perl:

perl -M5.010 -wpi.bak -e'our $article; s/<text id="\K[0-9]+/++$article/ge' hugetextfile

Note that -i.bak renames your original file to hugetextfile.bak, reads from that, and writes the result under the original name, so you need another 18 GB of free disk space.


In Python: if the file is valid XML, you can use an XML parser such as ElementTree.

Otherwise, iterate over the input file and write to an output file:

new_id = 1
with open('in_file', 'r') as in_f, open('out_file', 'w') as out_f:
    for line in in_f:
        # Only heading lines are rewritten; everything else is copied unchanged.
        if line.startswith('<text '):
            # Assumes id is the first attribute, i.e. the second space-separated token.
            parts = line.split(' ')
            parts[1] = 'id="' + str(new_id) + '"'
            line = ' '.join(parts)
            new_id += 1
        out_f.write(line)

Note that this assumes that each <text ... tag starts at the beginning of the line. If this is not the case, you have to modify it a little.
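
If the tags can appear anywhere in a line, one possible modification is to use a regular expression instead of splitting on spaces. The following is only a sketch under some assumptions: id is still the first attribute of every text tag, a tag is never split across two lines, and the file is UTF-8 encoded. The names renumber and renumber_file are made up for this example.

```python
import re

# Captures the text before and after the numeric id so only the number changes.
# Assumption: id is the first attribute of the tag.
TAG_ID = re.compile(r'(<text id=")[0-9]+(")')


def renumber(lines, start=1):
    """Yield each line with every <text id="..."> renumbered consecutively."""
    counter = start - 1

    def next_id(match):
        nonlocal counter
        counter += 1
        return f'{match.group(1)}{counter}{match.group(2)}'

    for line in lines:
        # sub() calls next_id once per match, so several tags on one line work too.
        yield TAG_ID.sub(next_id, line)


def renumber_file(in_path, out_path):
    # Streams line by line, so the 18 GB file is never held in memory.
    # Assumption: the file is UTF-8; change the encoding if it is not.
    with open(in_path, 'r', encoding='utf-8') as in_f, \
         open(out_path, 'w', encoding='utf-8') as out_f:
        out_f.writelines(renumber(in_f))
```

You would call it as renumber_file('in_file', 'out_file'). As with the other solutions, the output is a second full copy of the file, so you need the same amount of free disk space again.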

