Multi-field, multi-word, match without query_string

I would like to be able to match a multi word search against multiple fields where every word searched is contained in any of the fields, any combination. The catch is I would like to avoid using query_string.

curl -X POST "http://localhost:9200/index/document/1" -d '{"id":1,"firstname":"john","middlename":"clark","lastname":"smith"}'
curl -X POST "http://localhost:9200/index/document/2" -d '{"id":2,"firstname":"john","middlename":"paladini","lastname":"miranda"}'

I would like the search for 'John Smith' to match only document 1. The following query does what I need but I would rather avoid using query_string in case the user passes "OR", "AND" and any of the other advanced params.

curl -X GET 'http://localhost:9200/index/_search?per_page=10&pretty' -d '{
  "query": {
    "query_string": {
      "query": "john smith",
      "default_operator": "AND",
      "fields": [
        "firstname",
        "lastname",
        "middlename"
      ]
    }
  }
}'

Answers


What you are looking for is the multi-match query, but it doesn't perform in quite the way you would like.

Compare the output of validate for multi_match vs query_string.

multi_match (with operator and) will make sure that ALL terms exist in at least one field:

curl -XGET 'http://127.0.0.1:9200/_validate/query?pretty=1&explain=true'  -d '
{
   "multi_match" : {
      "operator" : "and",
      "fields" : [
         "firstname",
         "lastname"
      ],
      "query" : "john smith"
   }
}
'

# {
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 1,
#       "total" : 1
#    },
#    "explanations" : [
#       {
#          "index" : "test",
#          "explanation" : "((+lastname:john +lastname:smith) | (+firstname:john +firstname:smith))",
#          "valid" : true
#       }
#    ],
#    "valid" : true
# }

While query_string (with default_operator AND) will check that EACH term exists in at least one field:

curl -XGET 'http://127.0.0.1:9200/_validate/query?pretty=1&explain=true'  -d '
{
   "query_string" : {
      "fields" : [
         "firstname",
         "lastname"
      ],
      "query" : "john smith",
      "default_operator" : "AND"
   }
}
'

# {
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 1,
#       "total" : 1
#    },
#    "explanations" : [
#       {
#          "index" : "test",
#          "explanation" : "+(firstname:john | lastname:john) +(firstname:smith | lastname:smith)",
#          "valid" : true
#       }
#    ],
#    "valid" : true
# }

So you have a few choices to achieve what you are after:

  1. Preparse the search terms, to remove things like wildcards, etc, before using the query_string

  2. Preparse the search terms to extract each word, then generate a multi_match query per word

  3. Use index_name in your mapping for the name fields to index their data into a single field, which you can then use for search. (like your own custom all field):

As follows:

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1'  -d '
{
   "mappings" : {
      "test" : {
         "properties" : {
            "firstname" : {
               "index_name" : "name",
               "type" : "string"
            },
            "lastname" : {
               "index_name" : "name",
               "type" : "string"
            }
         }
      }
   }
}
'

curl -XPOST 'http://127.0.0.1:9200/test/test?pretty=1'  -d '
{
   "firstname" : "john",
   "lastname" : "smith"
}
'

curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1'  -d '
{
   "query" : {
      "match" : {
         "name" : {
            "operator" : "and",
            "query" : "john smith"
         }
      }
   }
}
'

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "firstname" : "john",
#                "lastname" : "smith"
#             },
#             "_score" : 0.2712221,
#             "_index" : "test",
#             "_id" : "VJFU_RWbRNaeHF9wNM8fRA",
#             "_type" : "test"
#          }
#       ],
#       "max_score" : 0.2712221,
#       "total" : 1
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 33
# }

Note however, that firstname and lastname are no longer searchable independently. The data for both fields has been indexed into name.

You could use multi-fields with the path parameter to make them searchable both independently and together, as follows:

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1'  -d '
{
   "mappings" : {
      "test" : {
         "properties" : {
            "firstname" : {
               "fields" : {
                  "firstname" : {
                     "type" : "string"
                  },
                  "any_name" : {
                     "type" : "string"
                  }
               },
               "path" : "just_name",
               "type" : "multi_field"
            },
            "lastname" : {
               "fields" : {
                  "any_name" : {
                     "type" : "string"
                  },
                  "lastname" : {
                     "type" : "string"
                  }
               },
               "path" : "just_name",
               "type" : "multi_field"
            }
         }
      }
   }
}
'

curl -XPOST 'http://127.0.0.1:9200/test/test?pretty=1'  -d '
{
   "firstname" : "john",
   "lastname" : "smith"
}
'

Searching the any_name field works:

curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1'  -d '
{
   "query" : {
      "match" : {
         "any_name" : {
            "operator" : "and",
            "query" : "john smith"
         }
      }
   }
}
'

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "firstname" : "john",
#                "lastname" : "smith"
#             },
#             "_score" : 0.2712221,
#             "_index" : "test",
#             "_id" : "Xf9qqKt0TpCuyLWioNh-iQ",
#             "_type" : "test"
#          }
#       ],
#       "max_score" : 0.2712221,
#       "total" : 1
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 11
# }

Searching firstname for john AND smith doesn't work:

curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1'  -d '
{
   "query" : {
      "match" : {
         "firstname" : {
            "operator" : "and",
            "query" : "john smith"
         }
      }
   }
}
'

# {
#    "hits" : {
#       "hits" : [],
#       "max_score" : null,
#       "total" : 0
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 2
# }

But searching firstname for just john works correctly:

curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1'  -d '
{
   "query" : {
      "match" : {
         "firstname" : {
            "operator" : "and",
            "query" : "john"
         }
      }
   }
}
'

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "firstname" : "john",
#                "lastname" : "smith"
#             },
#             "_score" : 0.30685282,
#             "_index" : "test",
#             "_id" : "Xf9qqKt0TpCuyLWioNh-iQ",
#             "_type" : "test"
#          }
#       ],
#       "max_score" : 0.30685282,
#       "total" : 1
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 3
# }

I would rather avoid using query_string in case the user passes "OR", "AND" and any of the other advanced params.

In my experience, escaping the special characters with backslash is a simple and effective solution. The list can be found in the documentation http://lucene.apache.org/core/4_5_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description, plus AND/OR/NOT/TO.


I think "match" query is what you are looking for:

"The match family of queries does not go through a “query parsing” process. It does not support field name prefixes, wildcard characters, or other “advance” features. For this reason, chances of it failing are very small / non existent, and it provides an excellent behavior when it comes to just analyze and run that text as a query behavior (which is usually what a text search box does)"

http://www.elasticsearch.org/guide/reference/query-dsl/match-query.html


Nowadays you can use cross_fields type in multi_match

GET /_validate/query?explain
{
    "query": {
        "multi_match": {
            "query":       "peter smith",
            "type":        "cross_fields", 
            "operator":    "and",
            "fields":      [ "firstname", "lastname", "middlename" ]
        }
    }
}

Cross-fields take a term-centric approach. It treats all of the fields as one big field, and looks for each term in any field.

One thing to note though is that if you want it to work optimally, all fields analyzed should have the same analyzer (standard, english, etc.):

For the cross_fields query type to work optimally, all fields should have the same analyzer. Fields that share an analyzer are grouped together as blended fields.

If you include fields with a different analysis chain, they will be added to the query in the same way as for best_fields. For instance, if we added the title field to the preceding query (assuming it uses a different analyzer), the explanation would be as follows:

(+title:peter +title:smith) ( +blended("peter", fields: [first_name, last_name]) +blended("smith", fields: [first_name, last_name]) )


Need Your Help

HTML scraping using lxml and requests gives a unicode error

python html unicode web-scraping lxml

I'm trying to use HTML scraper like the one provided here. It works fine for the example they provided. However, when I try using it with my webpage, I receive this error - Unicode strings with enc...