Fixing invalid memory alloc request at PostgreSQL 9.2.9

I've encountered a problem querying some of my tables recently. When I try to select data I get an ERROR telling: ERROR: invalid memory alloc request size 4294967293. This generally indicates data corruption. There is a nice and precise technique of how to delete corrupted rows described here: https://confluence.atlassian.com/jirakb/invalid-memory-alloc-request-size-440107132.html But, since I have lots of corrupted tables, this method is too slow. So, I've found a nice function which returns the last successful ctid here: http://blog.dob.sk/2012/05/19/fixing-pg_dump-invalid-memory-alloc-request-size/

Looking for corrupted row is a bit faster when using it, but not fast enough. I slightly modified it to store all "last successful ctid" in a different table and now it looks like this:

CREATE OR REPLACE FUNCTION
find_bad_row(tableName TEXT)
RETURNS void
as $find_bad_row$
DECLARE
result tid;
curs REFCURSOR;
row1 RECORD;
row2 RECORD;
tabName TEXT;
count BIGINT := 0;
BEGIN
DROP TABLE IF EXISTS bad_rows_tbl;
CREATE TABLE bad_rows_tbl (id varchar(255), offs BIGINT);
SELECT reverse(split_part(reverse($1), '.', 1)) INTO tabName;

OPEN curs FOR EXECUTE 'SELECT ctid FROM ' || tableName;

count := 1;

FETCH curs INTO row1;

WHILE row1.ctid IS NOT NULL LOOP
    BEGIN
    result = row1.ctid;

    count := count + 1;
    FETCH curs INTO row1;

    EXECUTE 'SELECT (each(hstore(' || tabName || '))).* FROM '
    || tableName || ' WHERE ctid = $1' INTO row2
    USING row1.ctid;

    IF count % 100000 = 0 THEN
    RAISE NOTICE 'rows processed: %', count;
    END IF;
    EXCEPTION
    WHEN SQLSTATE 'XX000' THEN
        RAISE NOTICE 'LAST CTID: %', result;
        EXECUTE 'INSERT INTO bad_rows_tbl VALUES(' || result || ',' || count || ')';
    END;

END LOOP;

CLOSE curs;

END
$find_bad_row$
LANGUAGE plpgsql;

I'm quite new to plpgsql, so I'm stuck with the following question: how to query not pre-unsuccessful ctid, but the exact unsuccessful one (or calculate the next one from pre-unsuccessful) so I could insert it into bad_rows_tbl and use as an argument for a DELETE statement further?

Hope for some help...

UPD: a function I ended up

CREATE OR REPLACE FUNCTION
find_bad_row(tableName TEXT)
RETURNS tid[]
as $find_bad_row$
DECLARE
result tid;
curs REFCURSOR;
row1 RECORD;
row2 RECORD;
tabName TEXT;
youNeedMe BOOLEAN = false;
count BIGINT := 0;
arrIter BIGINT := 0;
arr tid[];
BEGIN
CREATE TABLE bad_rows_tbl (id varchar(255), offs BIGINT);
SELECT reverse(split_part(reverse($1), '.', 1)) INTO tabName;

OPEN curs FOR EXECUTE 'SELECT ctid FROM ' || tableName;

count := 1;

FETCH curs INTO row1;

WHILE row1.ctid IS NOT NULL LOOP
    BEGIN
    result = row1.ctid;
    count := count + 1;

    IF youNeedMe THEN
        arr[arrIter] = result;
        arrIter := arrIter + 1;     
        RAISE NOTICE 'ADDING CTID: %', result;
        youNeedMe = FALSE;
    END IF;

    FETCH curs INTO row1;

    EXECUTE 'SELECT (each(hstore(' || tabName || '))).* FROM '
    || tableName || ' WHERE ctid = $1' INTO row2
    USING row1.ctid;

    IF count % 100000 = 0 THEN
        RAISE NOTICE 'rows processed: %', count;
    END IF;
    EXCEPTION
    WHEN SQLSTATE 'XX000' THEN
        RAISE NOTICE 'LAST GOOD CTID: %', result;
        youNeedMe = TRUE;
    END;

END LOOP;

CLOSE curs;
RETURN arr;
END
$find_bad_row$
LANGUAGE plpgsql;

Answers


This is supplemental to the function given in the question and answers next steps after the db is dumpable.

Your next steps should be:

  1. dumpall and restore on a physically different system. The reason being at this point we don't know what caused this and chances are not too bad that it might be hardware.

  2. You need to take the old system down and run hardware diagnostics on it, looking for problems. You really want to find out what happened so you don't run into it again. Of particular interest:

    • Double check ECC RAM and MCE logs
    • Look at all RAID arrays and their battery backups
    • CPUs and PSUs
    • If it were me I would also look at environmental variables such as AC in and datacenter temperature.
  3. Go over your backup strategy. In particular look at PITR (and related utility pgbarman). Make sure you can recover from a similar situation in the future if you run into it.

Data corruption doesn't just happen. In rare cases it can be caused by bugs in PostgreSQL, but in most cases it is due to your hardware or due to custom code you have running in the back-end. Narrowing down the cause and ensuring recoverability are critical going forward.

Assuming you aren't running custom C code in your database, most likely your data corruption is due to something on the hardware


Need Your Help

MKMapView Not rendring on Cocoa Application

objective-c macos cocoa mkmapview

I am trying to use MKMapView but the it won't render the tiles on view.

MVC change ModelBinder unit test

unit-testing data-annotations

to extend my validation I have created my own model binder based on following article: