Why do I use up so much memory when I read a file into memory in Perl?

I have a text file that is 310MB in size (uncompressed). When I use PerlIO::gzip to open the file and uncompress it into memory, it easily fills 2GB of RAM before perl runs out of memory.

The file is opened as below:

open FOO, "<:gzip", "file.gz" or die $!;
my @lines = <FOO>;

Obviously, this is a super convenient way to open gzipped files in perl, but it takes a ridiculous amount of space! My next step is to uncompress the file to disk, read its lines into @lines, operate on @lines, and compress it back. Does anyone have any idea why over 7 times as much memory is consumed when opening a zipped file this way? Does anyone have an alternate idea as to how I can uncompress this gzipped file into memory without it taking a ridiculous amount of memory?

Answers


When you do:

my @lines = <FOO>;

you are creating an array with as many elements as there are lines in the file. At roughly 100 characters per line, that's over 3 million array entries. There is overhead associated with each array entry, which means the memory footprint will be much larger than just the uncompressed size of the file.

You can avoid slurping and process the file line by line instead. To demonstrate, here is a 328MB test file:

C:\Temp> dir file
2010/10/04  09:18 PM       328,000,000 file
C:\Temp> dir file.gz
2010/10/04  09:19 PM         1,112,975 file.gz

And, indeed,

#!/usr/bin/perl

use strict; use warnings;
use autodie;
use PerlIO::gzip;

open my $foo, '<:gzip', 'file.gz';

while ( my $line = <$foo> ) {
    print ".";    # do per-line work here; only the current line is in memory
}

has no problems.

To get an idea of the memory overhead, note:

#!/usr/bin/perl

use strict; use warnings;
use Devel::Size qw( total_size );

my $x = 'x' x 100;   # a single 100-character scalar
my @x = ('x' x 100); # a one-element array holding the same string

printf "Scalar: %d\n", total_size( \$x );
printf "Array:  %d\n", total_size( \@x );

Output:

Scalar: 136
Array:  256
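
That per-entry overhead compounds across millions of lines. To see it at scale, you can run the same measurement on a larger array; a minimal sketch (the exact figures depend on your perl build and Devel::Size version):

#!/usr/bin/perl

use strict; use warnings;
use Devel::Size qw( total_size );

# 10,000 copies of a 100-character string: 1,000,000 bytes of raw data.
my @many = ('x' x 100) x 10_000;

printf "Raw data:  %d bytes\n", 100 * 10_000;
printf "Footprint: %d bytes\n", total_size( \@many );

The reported footprint will be noticeably larger than the raw megabyte of data, and that multiplier applied to millions of lines is where your memory goes.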

You're reading all the content of the file into the @lines array. Of course that pulls all the uncompressed content into memory. What you want instead is to read from the handle line by line, keeping only one line at a time in memory:

open my $fh, '<:gzip', 'file.gz' or die $!;
while (my $line = <$fh>) {
    # process $line here
}
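
If you also want to write the modified lines back out compressed (your "compress it back" step), PerlIO::gzip can write through a :gzip layer as well, so the whole round trip stays streaming. A minimal sketch, with placeholder filenames:

#!/usr/bin/perl

use strict; use warnings;
use PerlIO::gzip;

open my $in,  '<:gzip', 'file.gz'     or die "file.gz: $!";
open my $out, '>:gzip', 'new.file.gz' or die "new.file.gz: $!";

while (my $line = <$in>) {
    # ... transform $line here ...
    print {$out} $line;
}

close $in;
close $out;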

With files this big, I see only one solution: use command-line tools to uncompress the file, do your manipulation in Perl, then use the external tools again to compress it back :)
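
You can even keep the external tools streaming by opening pipes to gzip, so the uncompressed data never has to sit on disk or in memory all at once. A rough sketch, assuming gzip is on your PATH and the filenames are placeholders:

#!/usr/bin/perl

use strict; use warnings;

# Read from a decompressing pipe and write through a compressing pipe.
open my $in,  '-|', 'gzip -dc file.gz'      or die "gzip -dc: $!";
open my $out, '|-', 'gzip -c > new.file.gz' or die "gzip -c: $!";

while (my $line = <$in>) {
    # ... transform $line here ...
    print {$out} $line;
}

close $in  or die "closing reader: $!";
close $out or die "closing writer: $!";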

