improving performance in sql with multiple tables
I've got two tables: Log(id,user,action,date) and ActionTypes(action,type). Given an action A0, and a type T0, I would like to count for each user, how many times she has used each other action Ai right after A0, but skipping the actions of Log which are not of type T0. So for example:
id user action date ---------------------------------------- 1 mary start 2012-07-16 08:00:00 2 mary open 2012-07-16 09:00:00 3 john start 2012-07-16 09:00:00 4 mary play 2012-07-16 10:00:00 5 john open 2012-07-16 10:30:00 6 mary start 2012-07-16 11:00:00 7 mary jump 2012-07-16 12:00:00 8 mary close 2012-07-16 13:00:00 9 mary delete 2012-07-16 14:00:00 10 mary start 2012-07-16 15:00:00 11 mary open 2012-07-16 16:00:00
action type -------------- start 0 open 1 play 1 jump 2 close 1 delete 1
So, given the action 'start' and the type '1' the answer will be:
user action ntimes ------------------------ mary open 2 mary close 1 john open 1
My attempt is
SELECT b.user,b.action, count(*) FROM log a, log b WHERE a.action='start' AND b.date>a.date AND a.user=b.user AND 1=(select type from ActionTypes where action=b.action) AND not exists (SELECT c.action FROM log c where c.user=a.user AND c.date>a.date and c.date<b.date and 1=(select type from ActionTypes where action=c.action)) GROUP BY b.user,b.action
Our Log table has about 1 million tuples and the query works, but it is too slow. We're using SQLServer. Any hints on how to make it faster? Thanks
Could you please give this query a try? It uses exists to test if previous chronological record is of requested type. I believe that it would be faster than self-join. I have put a demo @ Sql Fiddle.
select log.[user], log.action, count(*) ntimes from log inner join actiontype t on log.action = t.action where t.type = 1 and exists (select * from (select top 1 t1.type from log l1 inner join actiontype t1 on l1.action = t1.action where l1.[user] = log.[user] and l1.date < log.date and t1.type in (0, 1) order by l1.date desc ) prevEntry where prevEntry.type = 0 ) group by log.[user], log.action
I don't understand why is mary \ close in result list. Previous record is jump which is of type 2, and should not be skipped over to get to start.
After borrowing @Nikola Markovinović's setup, I came up with the following solution:
WITH ranked AS ( SELECT L1.[user], L2.action, rnk = ROW_NUMBER() OVER (PARTITION BY L1.id ORDER BY L2.date) FROM Log L1 INNER JOIN Log L2 ON L2.[user] = L1.[user] AND L2.date > L1.date INNER JOIN ActionType at ON L2.action = at.action WHERE L1.action = @Action AND at.type = @Type ) SELECT [user], action, ntimes = COUNT(*) FROM ranked WHERE rnk = 1 GROUP BY [user], action ;
Basically, this query selects from the Log table all users' records that have the specified action, then joins that subset back to Log to retrieve all actions of the specified type that follow those in the first subset, ranking them in the ascending order of date along the way (using the ROW_NUMBER() function). Then the query retrieves only rows with the rankings of 1, groups them by user and action and counts rows in the groups.
You can see (and play with) a working example at SQL Fiddle.
Your action queries and all of the relation fields it is much faster to be an integer and not a string.
The only way to do your queries faster is to change the structure of the database. The relations has to be indexed and has to be integers and not strings. For example something like this:
id user action date ---------------------------------------- 1 mary 1 2012-07-16 08:00:00 2 mary 2 2012-07-16 09:00:00 3 john 3 2012-07-16 09:00:00 4 mary 1 2012-07-16 10:00:00 5 john 3 2012-07-16 10:30:00 6 mary 4 2012-07-16 11:00:00 7 mary 5 2012-07-16 12:00:00 8 mary 6 2012-07-16 13:00:00 9 mary 1 2012-07-16 14:00:00 10 mary 3 2012-07-16 15:00:00 11 mary 1 2012-07-16 16:00:00
will solve your problem.
In addition if you have from 1-9 action types you can have the action to tinyint and also if you add an id and tinyint with primary key will definetely do your queries easier (with simple joins) and will also your database it will be more flexible for future changes . For example you can have:
id action type -------------- 1 start 0 2 open 1 3 play 1 4 jump 2 5 close 1 6 delete 1
Where id is a primary key and the "action" at "Log" table has a foreign key to this id.
I think the main problem is that you don't have indexes and foreign keys relations.
I sightly disagree with statements:
...is much faster to be an integer and not a string
That's not exactly true, once column action is indexed there is little difference between integers or strings.
...only way to do your queries faster is to change the structure of the database
In this case query could be optimized in number of ways:
- Avoid filtering on joined data set (Log x ActionTypes) and try to do filtering earlier(in example below filtering happens in inner sub select).
- Avoid repetitive filtering condition(where). Even though sql server internally will optimize this queriers duplication usually sign that you're doing calculation few times and most of the time you could find solution in which you could put condition only once(in example below you could put where condition before group by ).
- Your best friend is 'SQL Query Analyzer(Optimizer)'. Its built-in tool in Sql Server Manager Studio and it will show you sql query execution cost taking into account data volume. It's really good tool and helps to find bottlenecks in queries.
Here is simplified query which will produce result you need(it's written&tested on Oracle since it's been a while since I was working with ms sql server ):
select "user", action, count(*) from action_log where action not in ( --exclusion criteria select action_type."action"from action_type where action_type."type" = 1 ) group by "user", action