Elasticsearch Edge NGram tokenizer higher score when word begins with n-gram
Suppose there is the following mapping with Edge NGram Tokenizer:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete_analyzer": {
          "tokenizer": "autocomplete_tokenizer",
          "filter": [
            "standard"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "whitespace"
        }
      },
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "symbol"
          ]
        }
      }
    }
  },
  "mappings": {
    "tag": {
      "properties": {
        "id": {
          "type": "long"
        },
        "name": {
          "type": "text",
          "analyzer": "autocomplete_analyzer",
          "search_analyzer": "autocomplete_search"
        }
      }
    }
  }
}
And the following documents are indexed:
POST /tag/tag/_bulk
{"index":{}}
{"name" : "HITS FIND SOME"}
{"index":{}}
{"name" : "TRENDING HI"}
{"index":{}}
{"name" : "HITS OTHER"}
Then searching
{
  "query": {
    "match": {
      "name": {
        "query": "HI"
      }
    }
  }
}
yields all three documents with the same score, or TRENDING HI with a score higher than one of the others.
How can this be configured so that entries that actually start with the searched n-gram get a higher score? In this case, HITS FIND SOME and HITS OTHER should have a higher score than TRENDING HI; at the same time, TRENDING HI should still be in the results.
A highlighter is also used, so the solution shouldn't break it.
The highlighter used in the query is:
"highlight": {
  "pre_tags": [
    "<"
  ],
  "post_tags": [
    ">"
  ],
  "fields": {
    "name": {}
  }
}
Using this with match_phrase_prefix breaks the highlighting: searching only for H yields <H><I><T><S> FIND SOME.
elasticsearch search n-gram
edited 14 hours ago
asked 2 days ago
m3th0dman
2 Answers
In this particular case you could add a match_phrase_prefix clause to your query, which does a prefix match on the last term of the query text:
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "name": "HI"
          }
        },
        {
          "match_phrase_prefix": {
            "name": "HI"
          }
        }
      ]
    }
  }
}
The match clause will match on all three results, but the match_phrase_prefix won't match on TRENDING HI. As a result, you'll get all three items in the results, but TRENDING HI will appear with a lower score.
Quoting the docs:
"The match_phrase_prefix query is a poor-man’s autocomplete [...] For better solutions for search-as-you-type see the completion suggester and Index-Time Search-as-You-Type."
On a side note, if you're introducing that bool query, you'll probably want to look at the minimum_should_match option, depending on the results you want.
edited yesterday
answered 2 days ago
AdrienF
But I need TRENDING HI as a result; just with a lower score.
– m3th0dman
yesterday
@m3th0dman the overall results are a combination of the matching results for each term, so TRENDING HI will appear in the results, and it will appear with a lower score. Edited the answer to make this clearer.
– AdrienF
yesterday
Thank you for your answer!
– m3th0dman
19 hours ago
Unfortunately this messes up the highlighter.
– m3th0dman
16 hours ago
@m3th0dman that's a new element. Could you give some more details on how you're doing the highlighting, and what you mean exactly by it being "messed up"?
– AdrienF
16 hours ago
You need to understand how Elasticsearch/Lucene analyzes your data and calculates the search score.
1. Analyze API
https://www.elastic.co/guide/en/elasticsearch/reference/current/_testing_analyzers.html
This will show you which tokens Elasticsearch stores. In your case, for TRENDING HI:
T / TR / TRE / ... / TRENDING / H / HI
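The token list above can be approximated in plain Python to see why all three documents match HI. This is only a rough simulation of the tokenizer settings from the question (min_gram 1, max_gram 10, words split on anything that is not a letter or symbol), not the real Elasticsearch analyzer; the function name edge_ngrams is made up for this sketch.

```python
import unicodedata


def edge_ngrams(text, min_gram=1, max_gram=10):
    """Approximate an edge n-gram tokenizer with token_chars = [letter, symbol]."""

    def keep(ch):
        # Unicode categories starting with L are letters, with S are symbols.
        return unicodedata.category(ch)[0] in ("L", "S")

    tokens = []
    word = ""
    # The trailing space acts as a sentinel that flushes the last word.
    for ch in text + " ":
        if keep(ch):
            word += ch
            continue
        if word:
            longest = min(max_gram, len(word))
            tokens.extend(word[:n] for n in range(min_gram, longest + 1))
        word = ""
    return tokens


for name in ["HITS FIND SOME", "TRENDING HI", "HITS OTHER"]:
    tokens = edge_ngrams(name)
    print(name, "->", " / ".join(tokens), "| contains HI:", "HI" in tokens)
```

Under these assumptions the token HI is produced for every document, whether it comes from a word at the start of the name or from a later one, so a plain match on HI has nothing that distinguishes the two cases.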
2. Score
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html
The bool query is often used to build complex queries for a particular use case. Use must to filter documents, then should to score. A common approach is to use different analyzers on the same field: with the fields keyword in the mapping (multi-fields), you can analyze the same field in different ways.
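As an illustration of the multi-field idea mentioned above, here is a hedged sketch of what such a mapping could look like, written as a Python dict so it can be printed as JSON. The sub-field name words and its whitespace analyzer are assumptions made for this example; they are not part of the original mapping, and the helper field_paths is hypothetical.

```python
import json

# Hypothetical multi-field mapping: "name" keeps the edge n-gram analyzers from
# the question, and the made-up sub-field "name.words" indexes whole words only.
name_mapping = {
    "type": "text",
    "analyzer": "autocomplete_analyzer",
    "search_analyzer": "autocomplete_search",
    "fields": {
        "words": {"type": "text", "analyzer": "whitespace"},
    },
}


def field_paths(field, mapping):
    """List the paths a property with sub-fields can be queried under."""
    return [field] + [f"{field}.{sub}" for sub in mapping.get("fields", {})]


print(json.dumps({"properties": {"name": name_mapping}}, indent=2))
print(field_paths("name", name_mapping))
```

A should clause could then target the differently analyzed sub-field while the must clause keeps using name.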
3. Don't mess up the highlight
According to the docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html#specify-highlight-query
You can add an extra query that is used only for highlighting:
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "HI"
          }
        }
      ],
      "should": [
        {
          "prefix": {
            "name": "HI"
          }
        }
      ]
    }
  },
  "highlight": {
    "pre_tags": [
      "<"
    ],
    "post_tags": [
      ">"
    ],
    "fields": {
      "name": {
        "highlight_query": {
          "match": {
            "name": "HI"
          }
        }
      }
    }
  }
}
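Since the request above repeats the search text in three places, it may help to build the body programmatically so the scoring query and the highlight_query cannot drift apart. A minimal sketch, assuming the body is later sent by whatever client you use; build_search_body is a hypothetical helper, not part of any Elasticsearch library.

```python
import json


def build_search_body(text, field="name"):
    """Build the request body above for an arbitrary search text."""
    match = {"match": {field: text}}
    return {
        "query": {
            "bool": {
                # must decides which documents are returned
                "must": [match],
                # should only adds to the score of documents it matches
                "should": [{"prefix": {field: text}}],
            }
        },
        "highlight": {
            "pre_tags": ["<"],
            "post_tags": [">"],
            # Highlight with the plain match query, not the whole bool query.
            "fields": {field: {"highlight_query": match}},
        },
    }


print(json.dumps(build_search_body("HI"), indent=2))
```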
edited 13 hours ago
answered 14 hours ago
Thomas Decaux