slow extraction of json array elements in postgresql
I have a json array and I want to expand each element to a new table. With the new json functions in postgresql 9.3, I expected this to be the best method:
create table p as select json_array_elements(json) foo from g
To my amazement, a lateral expansion is lot faster:
create table p as select json->x foo from g join lateral (select generate_series(0,json_array_length(g.json)-1) x ) xxx on true
Which is the problem with the first approach?
EDIT: a test case can be built for 20000 rows as
create table g as select (select json_agg(random()) json from generate_series(0, (r1*4)::int)) from (select random() r1 from generate_series(1,20000)) aux;
Over a SSD storage, it takes 3 seconds against 0.2 seconds with lateral. For 40000 rows, the time increases to 12 seconds, while the lateral method just grows near linearly.
The test case is certainly conclusive, and perf top -p $the_backend_pid helps show why:
96.92% postgres [.] MemoryContextReset 0.15% [kernel] [k] cpuacct_account_field 0.09% [kernel] [k] update_cfs_rq_blocked_load 0.09% postgres [.] AllocSetAlloc 0.09% libc-2.17.so [.] __memcpy_ssse3_back 0.07% postgres [.] parse_array 0.07% [kernel] [k] trigger_load_balance 0.07% [kernel] [k] rcu_check_callbacks 0.06% [kernel] [k] apic_timer_interrupt 0.05% [kernel] [k] do_timer 0.05% [kernel] [k] update_cfs_shares 0.05% libc-2.17.so [.] malloc
It's spending a huge amount of time in MemoryContextReset. Especially given that the profile above was noted down at the 47 billion events (approx) mark.
Backtraces are always like:
#0 0x000000000072dd7d in MemoryContextReset (context=0x2a02dc90) at mcxt.c:130 #1 0x000000000072dd90 in MemoryContextResetChildren (context=<optimized out>) at mcxt.c:155 #2 MemoryContextReset (context=0x1651220) at mcxt.c:131 #3 0x00000000005817f9 in ExecScan (node=node@entry=0x164e1a0, accessMtd=accessMtd@entry=0x592040 <SeqNext>, recheckMtd=recheckMtd@entry=0x592030 <SeqRecheck>) at execScan.c:155
with varying locations within MemoryContextReset, usually at a branch.
Runtime was 836904.371, vs 903.202 for the lateral join at 200k input rows (10x your test).
So I'd say you've certainly found a performance problem that needs attention.
Update: here's a patch that will apply against git master, or against 9.3. It's pretty easy to grab the source package / srpm and rebuild it if you're using deb/rpm packages of PostgreSQL, no need to switch to unpackaged just to apply a patch.