How To Compute the Cost of Querying an Additional Table Using LEFT JOIN?
I've been tasked with estimating the marginal cost of joining an additional table in reporting queries written in SQL against a PostgreSQL database. I was given a test program to measure the performance of the query. There appears to be no statistically significant difference between run times whether or not the joined table is populated. How can I write a better test to illustrate the difference in query times between the two scenarios?
In each case, the SQL is the same, joining table A with table B. The only difference is whether or not table B contains any data.
Table A has these columns:
    Column    |            Type
--------------+-----------------------------
 sid          | bigint
 cluster      | text
 sn           | text
 tag_id       | integer
 src_ip       | text
 dst_ip       | text
 dst_port     | integer
 protocol     | text
 src_intf     | text
 dst_intf     | text
 disp         | smallint
 rcvd_bytes   | bigint
 sent_bytes   | bigint
 duration     | integer
 count        | integer
 start_time   | timestamp without time zone
 policy_id    | text
 src_user     | text
 dst_domain   | text
 app_id       | text
 signature_id | text
 deny_type_id | text
 reputation   | text
 wb_cat_id    | text
 alarm_name   | text
 virus        | text
 sender       | text
 recipients   | text
 host         | text
 dlp_rule_id  | text
 spam_type    | text
 spam_action  | text
Table B has these columns:
    Column    |            Type
--------------+-----------------------------
 appliance_id | integer
 ip           | inet
 fqdn         | text
 resolve_time | timestamp without time zone
 expire_time  | timestamp without time zone
The two tables are joined by an IPv4 address of type inet.
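For completeness, the tables can be reproduced with DDL along these lines, which I put together from the column listings above; the actual constraints, defaults, and indexes are not shown here because I don't know them:

-- Table A: one row per logged event; src_ip is stored as text.
CREATE TABLE a (
    sid          bigint,
    cluster      text,
    sn           text,
    tag_id       integer,
    src_ip       text,
    dst_ip       text,
    dst_port     integer,
    protocol     text,
    src_intf     text,
    dst_intf     text,
    disp         smallint,
    rcvd_bytes   bigint,
    sent_bytes   bigint,
    duration     integer,
    count        integer,
    start_time   timestamp without time zone,
    policy_id    text,
    src_user     text,
    dst_domain   text,
    app_id       text,
    signature_id text,
    deny_type_id text,
    reputation   text,
    wb_cat_id    text,
    alarm_name   text,
    virus        text,
    sender       text,
    recipients   text,
    host         text,
    dlp_rule_id  text,
    spam_type    text,
    spam_action  text
);

-- Table B: DNS resolution cache keyed by IP address (inet).
CREATE TABLE b (
    appliance_id integer,
    ip           inet,
    fqdn         text,
    resolve_time timestamp without time zone,
    expire_time  timestamp without time zone
);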
Test data populates tables A and B with just over 500 rows of data each.
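For reference, comparable test data could be generated with generate_series along the following lines; the addresses, timestamps, and which columns get filled in are my own illustrative choices, not the actual test data:

-- ~500 rows in table A; only the columns the join touches are populated.
INSERT INTO a (sid, src_ip, start_time)
SELECT g,
       '10.0.' || (g / 256) || '.' || (g % 256),
       now() - (g || ' minutes')::interval
FROM generate_series(1, 500) AS g;

-- ~500 matching rows in table B with currently valid resolve/expire windows.
INSERT INTO b (appliance_id, ip, fqdn, resolve_time, expire_time)
SELECT g,
       ('10.0.' || (g / 256) || '.' || (g % 256))::inet,
       'host' || g || '.example.com',
       now() - interval '1 hour',
       now() + interval '1 hour'
FROM generate_series(1, 500) AS g;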
With table B populated, the test program that runs the query takes, on average, 18.216 seconds of total time; the standard deviation across runs is 1.143 seconds.
With table B empty, the test program takes, on average, 18.523 seconds of total time; the standard deviation across runs is 1.928 seconds.
The sample size in each case is six runs. I suspect I'll need a much larger sample size, but I'm not certain what an appropriate size would be.
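As a back-of-the-envelope estimate, a standard two-sample power calculation (assuming roughly normal run times, a significance level of 0.05, 80% power, and using the larger of the two standard deviations) suggests how many runs per scenario would be needed to detect a difference as small as the one observed:

n \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2\,\sigma^2}{\delta^2}
  = \frac{2\,(1.96 + 0.84)^2 \times 1.928^2}{(18.523 - 18.216)^2}
  \approx 620 \text{ runs per scenario}

So with a difference of only ~0.3 seconds against a ~1.9 second standard deviation, six runs is far too few to distinguish the two cases.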
I wrote a simple query using a LEFT JOIN:
SELECT *
FROM a
LEFT JOIN b
    ON a.src_ip::inet = b.ip
   AND b.resolve_time IS NOT NULL
   AND b.resolve_time <= now()
   AND b.expire_time IS NOT NULL
   AND now() < b.expire_time
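To see how much of the total runtime the query itself accounts for, I assume I can run it under EXPLAIN (ANALYZE, BUFFERS) in each scenario and compare the reported execution times, keeping in mind that the instrumentation adds some overhead of its own:

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM a
LEFT JOIN b
    ON a.src_ip::inet = b.ip
   AND b.resolve_time IS NOT NULL
   AND b.resolve_time <= now()
   AND b.expire_time IS NOT NULL
   AND now() < b.expire_time;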
I then ran 10 tests of 100 iterations (queries) each and timed each test run.
Here are the results:
Averaging the run times and taking the ratio of the average with Table B populated to the average with Table B empty, I computed an overhead of 21.38% for the query with Table B populated over the query with it empty.
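To tighten the measurement further, I'm considering timing the iterations entirely on the server, so that client and network overhead drop out. A rough sketch of what I have in mind, using a PL/pgSQL DO block and clock_timestamp(); note that I've substituted count(*) for SELECT * to avoid transferring rows, which could itself change the plan, and that running all iterations in one session may benefit more from caching than the original test program did:

DO $$
DECLARE
    t0 timestamptz;
    t1 timestamptz;
BEGIN
    t0 := clock_timestamp();
    -- Run the join 100 times back to back, discarding the result rows.
    FOR i IN 1..100 LOOP
        PERFORM count(*)
        FROM a
        LEFT JOIN b
            ON a.src_ip::inet = b.ip
           AND b.resolve_time IS NOT NULL
           AND b.resolve_time <= now()
           AND b.expire_time IS NOT NULL
           AND now() < b.expire_time;
    END LOOP;
    t1 := clock_timestamp();
    RAISE NOTICE 'total: %, per iteration: %', t1 - t0, (t1 - t0) / 100;
END
$$;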