Elasticsearch Edge NGram tokenizer higher score when word begins with n-gram

Suppose there is the following mapping with Edge NGram Tokenizer:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete_analyzer": {
          "tokenizer": "autocomplete_tokenizer",
          "filter": [
            "standard"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "whitespace"
        }
      },
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "symbol"
          ]
        }
      }
    }
  },
  "mappings": {
    "tag": {
      "properties": {
        "id": {
          "type": "long"
        },
        "name": {
          "type": "text",
          "analyzer": "autocomplete_analyzer",
          "search_analyzer": "autocomplete_search"
        }
      }
    }
  }
}

And the following documents are indexed:

POST /tag/tag/_bulk
{"index":{}}
{"name" : "HITS FIND SOME"}
{"index":{}}
{"name" : "TRENDING HI"}
{"index":{}}
{"name" : "HITS OTHER"}

Then searching

{
  "query": {
    "match": {
      "name": {
        "query": "HI"
      }
    }
  }
}

yields all three documents with the same score, or TRENDING HI with a score higher than one of the others.

How can it be configured to give a higher score to the entries that actually start with the searched n-gram? In this case, HITS FIND SOME and HITS OTHER should score higher than TRENDING HI; at the same time, TRENDING HI should still be in the results.

A highlighter is also used, so the given solution shouldn't break it.

The highlighter used in the query is:

  "highlight": {
    "pre_tags": [
      "<"
    ],
    "post_tags": [
      ">"
    ],
    "fields": {
      "name": {}
    }
  }

Using this with match_phrase_prefix messes up the highlighting, yielding <H><I><T><S> FIND SOME when searching only for H.

edited 14 hours ago

asked 2 days ago

m3th0dman

This question has an open bounty worth +100
reputation from m3th0dman ending in 6 days.

This question has not received enough attention.

Expecting a solution to the given issue without messing up the highlighter.

Tags: elasticsearch, search, n-gram

2 Answers

In this particular case you could add a match_phrase_prefix clause to your query, which does a prefix match on the last term in the text:

{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "name": "HI"
          }
        },
        {
          "match_phrase_prefix": {
            "name": "HI"
          }
        }
      ]
    }
  }
}

The match clause will match all three documents, but the match_phrase_prefix clause won't match TRENDING HI. As a result, you'll get all three items in the results, but TRENDING HI will appear with a lower score.

Quoting the docs:

The match_phrase_prefix query is a poor-man’s autocomplete[...] For better solutions for search-as-you-type see the completion suggester and Index-Time Search-as-You-Type.

On a side note, if you're introducing that bool query, you'll probably want to look at the minimum_should_match option, depending on the results you want.

edited yesterday

answered 2 days ago

AdrienF

But I need TRENDING HI as a result; just with a lower score.
– m3th0dman
yesterday

@m3th0dman the overall results are a combination of matching results for each term, so TRENDING HI will appear in the results, and it will appear with a lower score. Edited the answer to make this clearer.
– AdrienF
yesterday

Thank you for your answer!
– m3th0dman
19 hours ago

Unfortunately this messes up the highlighter.
– m3th0dman
16 hours ago

@m3th0dman that's a new element. Could you give some more details on how you're doing the highlighting, and what you mean exactly by it being "messed up"?
– AdrienF
16 hours ago

You need to understand how Elasticsearch/Lucene analyzes your data and calculates the search score.

1. Analyze API

https://www.elastic.co/guide/en/elasticsearch/reference/current/_testing_analyzers.html The Analyze API shows you which tokens Elasticsearch will store; in your case, for TRENDING HI:

T / TR / TRE / ... / TRENDING / H / HI
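As a rough illustration of that token list, here is a minimal Python sketch that approximates what the edge_ngram tokenizer from the question's mapping produces. It is not the Lucene implementation: the function name edge_ngrams is made up for this example, and "letter" and "symbol" are approximated with str.isalpha and Unicode symbol categories.

```python
import unicodedata


def edge_ngrams(text, min_gram=1, max_gram=10):
    """Approximate the edge_ngram tokenizer: split on characters that are
    neither letters nor symbols, then emit the leading n-grams of each word."""

    def keep(ch):
        # Rough stand-in for token_chars ["letter", "symbol"] (assumption).
        return ch.isalpha() or unicodedata.category(ch).startswith("S")

    tokens, word = [], ""
    for ch in text + " ":  # the trailing space flushes the last word
        if keep(ch):
            word += ch
            continue
        for n in range(min_gram, min(len(word), max_gram) + 1):
            tokens.append(word[:n])
        word = ""
    return tokens


tokens = edge_ngrams("TRENDING HI")
print(" / ".join(tokens))
```

Since HI is an indexed token for TRENDING HI as well as for both HITS documents, a plain match on HI finds all three.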

2. Score

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html

The bool query is often used to build complex queries for a particular use case. Use must to filter documents, then should to score them. A common use case is to apply different analyzers to the same field (by using the fields keyword in the mapping, you can analyze the same field in several ways).
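The division of labour between must and should can be pictured with a toy model in Python. This is only a sketch: real scores come from Lucene's similarity function, the constants here are arbitrary, and the two predicates are hypothetical stand-ins for whatever clauses the query actually contains.

```python
def bool_score(name, query):
    """Toy model of a bool query: `must` decides whether the document matches
    at all, `should` only adds to the score of documents that already match."""
    # Stand-in for the must clause: some word of the name begins with the query.
    must = any(word.startswith(query) for word in name.split())
    if not must:
        return None  # the document is not returned at all
    # Hypothetical boosting clause: the whole name begins with the query.
    should = name.startswith(query)
    return 1.0 + (1.0 if should else 0.0)


docs = ["HITS FIND SOME", "TRENDING HI", "HITS OTHER"]
ranked = sorted(docs, key=lambda d: bool_score(d, "HI"), reverse=True)
print(ranked)
```

In this model TRENDING HI still matches through must, but it ranks last because it gets no bonus from should.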

3. Don't mess up the highlighting

According to the docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html#specify-highlight-query

You can specify a separate query that is used only for highlighting:

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name": "HI"
          }
        }
      ],
      "should": [
        {
          "prefix": {
            "name": "HI"
          }
        }
      ]
    }
  },
  "highlight": {
    "pre_tags": [
      "<"
    ],
    "post_tags": [
      ">"
    ],
    "fields": {
      "name": {
        "highlight_query": {
          "match": {
            "name": "HI"
          }
        }
      }
    }
  }
}
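If the request body is built in application code, one way to keep the match clause and the highlight_query in sync is to generate both from the same value. Below is a minimal Python sketch that only builds the JSON body shown above. The helper name build_search_body is hypothetical, and sending the body to Elasticsearch is left out.

```python
import json


def build_search_body(text, field="name"):
    """Return a request body shaped like the one above for the given text."""
    # The same match clause drives both matching and highlighting, so the
    # scoring-only prefix clause never influences which fragments are marked.
    match = {"match": {field: text}}
    return {
        "query": {
            "bool": {
                "must": [match],
                "should": [{"prefix": {field: text}}],
            }
        },
        "highlight": {
            "pre_tags": ["<"],
            "post_tags": [">"],
            "fields": {field: {"highlight_query": match}},
        },
    }


body = build_search_body("HI")
print(json.dumps(body, indent=2))
```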

edited 13 hours ago

answered 14 hours ago

Thomas Decaux
