Get most similar rows and order them by similarity - performance improvement

I have an items table with a structure similar to this:

id
user_id
feature_1 
feature_2
feature_3
...
feature_20

Most of the feature_ fields are numbers; 3-4 of them contain text.

Now, for a given item, I need to find the items that are most similar (matching fields contribute some weight to a score) and order them by similarity.

I can do something like this:

select (IF (feature_1 = 'xxx1', 100, 0) +  
        IF (feature_2 = 'xxx2', 100, 0) + 
        IF (feature_3 = 'xxx3', 100, 0) + 
        IF (feature_4 = 'xxx4', 1, 0) + 
        ...  + 
        IF (feature_20 = 'xxx20', 1, 0)) 
        AS score, id from `items` where `id` <> 'yyy' 
        group by `id` having `score` > '0' order by `score` desc;

In place of xxx I of course put the actual value of that field for the item I want to compare, and in place of yyy I put the id of the item I'm comparing (I don't want to include it in the result). For each field I can specify the weight to use for similarity (here 100 for the first three and 1 for the rest).
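For reference, the xxx/yyy placeholders can be replaced by bound parameters so the query is built once per reference item. Here is a minimal sketch using Python and SQLite as a stand-in for MySQL (CASE WHEN replaces MySQL's IF(), the table is trimmed to 3 features, and all data is invented):

```python
import sqlite3

# In-memory stand-in for the items table, trimmed to 3 feature columns
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, feature_1 TEXT, feature_2 TEXT, feature_3 TEXT);
INSERT INTO items VALUES (1, 'a', 'b', 'c'), (2, 'a', 'b', 'x'), (3, 'z', 'b', 'c');
""")

# Per-feature weights (100 for the heavy features, 1 for the light one)
weights = {"feature_1": 100, "feature_2": 100, "feature_3": 1}

# Values of the reference item replace the hard-coded 'xxx' literals
ref_id = 1
ref = conn.execute(
    "SELECT feature_1, feature_2, feature_3 FROM items WHERE id = ?", (ref_id,)
).fetchone()
ref_values = dict(zip(weights, ref))

# Build the weighted-score expression with bound parameters
score_expr = " + ".join(
    f"(CASE WHEN {col} = ? THEN {w} ELSE 0 END)" for col, w in weights.items()
)
sql = (
    f"SELECT * FROM (SELECT id, {score_expr} AS score FROM items WHERE id <> ?) "
    "WHERE score > 0 ORDER BY score DESC"
)
rows = conn.execute(sql, (*ref_values.values(), ref_id)).fetchall()
```

The score expression is still built per reference item, but the values travel as bound parameters rather than string-interpolated literals.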

The exact same technique was used in Getting most similar rows in MySQL table and order them by similarity.

Now comes the performance. I've generated a table with about 100,000 items. Finding similar items for one item takes about 0.4 seconds. Even if I could lower the number of feature_ fields included in the comparison (and I probably won't be allowed to do this), it would still take about 0.16-0.2 seconds for such a set.

And now it gets even worse. I need to find similar items for all items that belong to one user. Let's assume a user has 100 items. I need to fetch them all from the DB, run 100 queries like the one above, then sort everything by score and remove duplicates (in PHP, but that's not a problem), and then fetch the full records again for display (the final result will of course be paginated).

So:

  • I will need to run more than 100 queries to achieve that (I don't know if it's possible to run such a query without explicitly putting values in the xxx places)
  • it will take 100 x 0.4 seconds = 40 seconds to achieve that

Questions:

  • is it possible to improve the above query (use indexes or rebuild it) to make it run much faster?
  • is it possible to rebuild the query to get similar items not for one item but for many items (all items of one user)?

I should also add that not all items have all feature fields filled (they are nullable), so when I look for similar items for an item that has, for example, feature_15 set to NULL, I don't want to include feature_15 in the score at all, because its value is unknown for this item.
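One way to honour this rule is to build the score expression dynamically and simply omit every feature that is NULL on the reference item. A minimal sketch (SQLite via Python standing in for MySQL, CASE WHEN replacing IF(), invented data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, feature_1 TEXT, feature_2 TEXT, feature_3 TEXT);
INSERT INTO items VALUES (1, 'a', NULL, 'c'), (2, 'a', 'b', 'c'), (3, 'x', 'b', 'c');
""")

weights = {"feature_1": 100, "feature_2": 100, "feature_3": 1}
ref = conn.execute(
    "SELECT feature_1, feature_2, feature_3 FROM items WHERE id = 1"
).fetchone()
ref_values = dict(zip(weights, ref))

# Skip features that are NULL on the reference item: they contribute
# nothing to the score because their value is unknown
terms, params = [], []
for col, value in ref_values.items():
    if value is not None:
        terms.append(f"(CASE WHEN {col} = ? THEN {weights[col]} ELSE 0 END)")
        params.append(value)

sql = ("SELECT * FROM (SELECT id, " + " + ".join(terms) + " AS score "
       "FROM items WHERE id <> 1) WHERE score > 0 ORDER BY score DESC")
rows = conn.execute(sql, params).fetchall()
```

Here feature_2 is NULL for item 1, so item 2's matching feature_2 value adds nothing to its score.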

EDIT

I've created the structure suggested by @pala (DB structure below). Now I have 25 records in the features table and 2,138,959 (yes, over 2 million) records in the feature_watch table.

When I run an example query:

select if2.watch_id, sum(f.weight) AS `sum` from feature_watch if1 
    inner join feature_watch if2 on if1.feature_id = if2.feature_id 
      and if1.feature_value = if2.feature_value 
      and if1.watch_id <> if2.watch_id 
     inner join features f on if2.feature_id = f.id 
     where if1.watch_id = 71 group by if2.watch_id ORDER BY sum DESC

it now takes between 1 and 2 seconds to get the same result. Did I miss something here?

CREATE TABLE IF NOT EXISTS `features` (
`id` int(10) unsigned NOT NULL,
  `name` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
  `weight` tinyint(3) unsigned NOT NULL,
  `created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `updated_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00'
) ENGINE=InnoDB AUTO_INCREMENT=26 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

CREATE TABLE IF NOT EXISTS `feature_watch` (
`id` int(10) unsigned NOT NULL,
  `feature_id` int(10) unsigned NOT NULL,
  `watch_id` int(10) unsigned NOT NULL,
  `user_id` int(10) unsigned NOT NULL,
  `feature_value` varchar(150) COLLATE utf8_unicode_ci DEFAULT NULL
) ENGINE=InnoDB AUTO_INCREMENT=2142999 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

ALTER TABLE `features`
 ADD PRIMARY KEY (`id`), ADD UNIQUE KEY `features_name_unique` (`name`), ADD KEY `weight` (`weight`);

ALTER TABLE `feature_watch`
 ADD PRIMARY KEY (`id`), ADD KEY `feature_watch_user_id_foreign` (`user_id`), ADD KEY `feature_id` (`feature_id`,`feature_value`), ADD KEY `watch_id` (`watch_id`);

ALTER TABLE `features`
MODIFY `id` int(10) unsigned NOT NULL AUTO_INCREMENT,AUTO_INCREMENT=26;

ALTER TABLE `feature_watch`
MODIFY `id` int(10) unsigned NOT NULL AUTO_INCREMENT,AUTO_INCREMENT=2142999;

ALTER TABLE `feature_watch`
ADD CONSTRAINT `feature_watch_feature_id_foreign` FOREIGN KEY (`feature_id`) REFERENCES `features` (`id`),
ADD CONSTRAINT `feature_watch_user_id_foreign` FOREIGN KEY (`user_id`) REFERENCES `users` (`id`) ON DELETE CASCADE,
ADD CONSTRAINT `feature_watch_watch_id_foreign` FOREIGN KEY (`watch_id`) REFERENCES `watches` (`id`) ON DELETE CASCADE;

EDIT2

For the following query:

select if2.watch_id, sum(f.weight) AS `sum` from feature_watch if1 
    inner join feature_watch if2 on if1.feature_id = if2.feature_id 
      and if1.feature_value = if2.feature_value 
      and if1.watch_id <> if2.watch_id 
     inner join features f on if2.feature_id = f.id 
     where if1.watch_id = 71 
       and if2.`user_id` in (select `id` from `users` where `is_private` = '0') 
       and if2.`user_id` <> '1' 
     group by if2.watch_id ORDER BY sum DESC

EXPLAIN gives:

id  select_type     table   type    possible_keys   key     key_len     ref     rows    Extra   
1   SIMPLE  if1     ref     watch_id,compound,feature_id    watch_id    4   const   22  Using where; Using temporary; Using filesort
1   SIMPLE  f   eq_ref  PRIMARY     PRIMARY     4   watches10.if1.feature_id    1   NULL
1   SIMPLE  if2     ref     watch_id,compound,feature_id,user_id    compound    457     watches10.if1.feature_id,watches10.if1.feature_val...   441     Using where; Using index
1   SIMPLE  users   eq_ref  PRIMARY     PRIMARY     4   watches10.if2.user_id   1   Using where

The above query executes in over 0.5 seconds, and if I run it for more than one record id (putting, for example, 10 record ids instead of just 71), it executes about that many times slower (about 5 seconds for 10 ids).
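One way to avoid running the per-item query N times would be to filter if1.watch_id with an IN list and group by both the source and the candidate id, so a single pass scores every requested item. A minimal sketch of the idea (SQLite via Python standing in for MySQL; the real feature_watch table has more columns, and the sample data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE features (id INTEGER PRIMARY KEY, weight INTEGER);
CREATE TABLE feature_watch (feature_id INTEGER, watch_id INTEGER, feature_value TEXT);
CREATE INDEX fw_idx ON feature_watch (feature_id, feature_value, watch_id);

INSERT INTO features VALUES (1, 100), (2, 1);
INSERT INTO feature_watch VALUES
  (1, 71, 'steel'), (2, 71, 'black'),
  (1, 72, 'steel'), (2, 72, 'white'),
  (1, 73, 'gold'),  (2, 73, 'black');
""")

# Scores for several source watches in one pass: group by (source, candidate)
rows = conn.execute("""
SELECT if1.watch_id, if2.watch_id, SUM(f.weight) AS score
  FROM feature_watch if1
  JOIN feature_watch if2
    ON if1.feature_id = if2.feature_id
   AND if1.feature_value = if2.feature_value
   AND if1.watch_id <> if2.watch_id
  JOIN features f ON if2.feature_id = f.id
 WHERE if1.watch_id IN (71, 72)
 GROUP BY if1.watch_id, if2.watch_id
 ORDER BY if1.watch_id, score DESC
""").fetchall()
```

Each result row pairs a source watch with a candidate and its total score, so the PHP side only needs one merge-and-paginate step instead of 100 queries.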

Answers


I would suggest you reorganise your table structure into something similar to the following:

create table items (id integer primary key auto_increment);

create table features (
  id integer primary key auto_increment,
  feature_name varchar(25),
  feature_weight integer
);

create table item_features (  
  item_id integer,
  feature_id integer,  
  feature_value varchar(25)
);

This would allow you to run a relatively simple query to calculate similarity based on features, by summing their weights.

select if2.item_id, sum(f.feature_weight)
  from item_features if1
    inner join item_features if2
      on if1.feature_id = if2.feature_id
        and if1.feature_value = if2.feature_value
        and if1.item_id <> if2.item_id
    inner join features f
      on if2.feature_id = f.id
   where if1.item_id = 1
   group by if2.item_id

There is a demo of this here: http://sqlfiddle.com/#!9/613970/4

I know it doesn't match the table definition in the question - but repeated values like that in a table are a path to the dark side. Normalisation really does make life easier.

With an index on item_features(feature_id, feature_value), and another on features(feature_name), the query should be quite fast.
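To illustrate, here is a minimal runnable sketch of this normalised schema and the similarity query, using SQLite via Python's sqlite3 module with invented sample data (feature names and values are made up for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE features (id INTEGER PRIMARY KEY, feature_name TEXT, feature_weight INTEGER);
CREATE TABLE item_features (item_id INTEGER, feature_id INTEGER, feature_value TEXT);
CREATE INDEX if_idx ON item_features (feature_id, feature_value);

INSERT INTO features VALUES (1, 'color', 100), (2, 'size', 1);
INSERT INTO item_features VALUES
  (1, 1, 'red'),  (1, 2, 'L'),
  (2, 1, 'red'),  (2, 2, 'M'),
  (3, 1, 'blue'), (3, 2, 'L');
""")

# Self-join: every shared (feature_id, feature_value) pair between item 1
# and another item contributes that feature's weight to the score
rows = conn.execute("""
SELECT if2.item_id, SUM(f.feature_weight) AS score
  FROM item_features if1
  JOIN item_features if2
    ON if1.feature_id = if2.feature_id
   AND if1.feature_value = if2.feature_value
   AND if1.item_id <> if2.item_id
  JOIN features f ON if2.feature_id = f.id
 WHERE if1.item_id = 1
 GROUP BY if2.item_id
 ORDER BY score DESC
""").fetchall()
```

Item 2 shares the heavy 'color' feature with item 1 (score 100), while item 3 only shares 'size' (score 1). A NULL feature_value never joins, so unknown features drop out of the score automatically.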


Here is my understanding of what you want. Please tell me whether or not I guessed it correctly. SQLFiddle

There are many items that belong to several users as determined by user_id. In this example we have 3 users:

CREATE TABLE items (
id int, 
`user_id` int, `f1` int, `f2` int, `f3` int,
primary key(id),
key(user_id));

INSERT INTO items
    (id, `user_id`, `f1`, `f2`, `f3`)
VALUES
    (1, 1, 2, 22, 30),
    (2, 1, 1, 21, 40),
    (3, 1, 9, 25, 50),
    (4, 2, 1, 21, 30),
    (5, 2, 1, 22, 40),
    (6, 2, 2, 22, 35),
    (7, 3, 9, 22, 31),
    (8, 3, 8, 20, 55),
    (9, 3, 7, 20, 55),
    (10, 3, 5, 26, 30)
;

user_id is a parameter of the query. For a given user_id you want to find all items that belong to this user; then, for each found item, you want to calculate a score that defines a "distance" between this item and every other item (not just this user's items, but each and every other item). And then you want to show all rows of the result ordered by the score - not just the single most similar item, but all of them.

The score of a pair of items is calculated from the feature values of those two items. There is no constant set of feature values that is compared against all items; each pair of items may have its own score.

When the score is calculated, each feature has a weight. These weights are predefined and constant (they do not depend on the item). Let's use these constants in this example:

weight for f1 is 1
weight for f2 is 3
weight for f3 is 5

Here is one way to get the result in one query (for user_id=1):

SELECT *
FROM
  (
    SELECT
      UserItems.id AS UserItemID
      ,AllItems.id AS AllItemID
      ,IF(AllItems.f1 = UserItems.f1, 1, 0)+
      IF(AllItems.f2 = UserItems.f2, 3, 0)+
      IF(AllItems.f3 = UserItems.f3, 5, 0) AS Score
    FROM
      (
        SELECT id, f1, f2, f3
        FROM items
        WHERE items.user_id = 1
      ) AS UserItems
      CROSS JOIN
      (
        SELECT id, f1, f2, f3
        FROM items
      ) AS AllItems
  ) AS Scores
WHERE
  UserItemID <> AllItemID
  AND Score > 0
ORDER BY UserItemID, Score desc

Result set

| UserItemID | AllItemID | Score |
|------------|-----------|-------|
|          1 |        10 |     5 |
|          1 |         4 |     5 |
|          1 |         6 |     4 |
|          1 |         5 |     3 |
|          1 |         7 |     3 |
|          2 |         5 |     6 |
|          2 |         4 |     4 |
|          3 |         7 |     1 |

If this is really what you want, I'm afraid there is no magic way to make it work fast. For each of the user's items you need to compare it to every other item to calculate the score. So, if there are N rows in the items table and M items for a given user, you have to calculate the score N*M times. Then you have to filter out zero scores and sort the result. You can't avoid reading the whole items table M times.

Only if there is some external knowledge about the data could you maybe "cheat" somehow and avoid reading the whole items table every time.

For example, if you know that the distribution of values of feature K is very uneven - say, 99% of values are X and 1% are something else - it may be possible to use this knowledge to reduce the amount of calculation.

Another example: if the items somehow cluster together (in the sense of your metric/distance/score), and you can pre-calculate these clusters, then instead of reading through the whole items table every time you could read only the small subset of items that belong to the same cluster, using appropriate indexes.

