Question
ElasticSearch 5.x introduced some (breaking) changes to the Suggester API (Documentation). The most notable change is the following:
Completion suggester is document-oriented
Suggestions are aware of the document they belong to. Now, associated documents (_source) are returned as part of completion suggestions.
In short, all completion queries return all matching documents instead of just matched words. And herein lies the problem - duplication of autocompleted words if they occur in more than one document.
Let's say we have this simple mapping:
{
"my-index": {
"mappings": {
"users": {
"properties": {
"firstName": {
"type": "text"
},
"lastName": {
"type": "text"
},
"suggest": {
"type": "completion",
"analyzer": "simple"
}
}
}
}
}
}
With a few test documents:
{
"_index": "my-index",
"_type": "users",
"_id": "1",
"_source": {
"firstName": "John",
"lastName": "Doe",
"suggest": [
{
"input": [
"John",
"Doe"
]
}
]
}
},
{
"_index": "my-index",
"_type": "users",
"_id": "2",
"_source": {
"firstName": "John",
"lastName": "Smith",
"suggest": [
{
"input": [
"John",
"Smith"
]
}
]
}
}
And a by-the-book query:
POST /my-index/_suggest?pretty
{
"my-suggest" : {
"text" : "joh",
"completion" : {
"field" : "suggest"
}
}
}
Which yields the following results:
{
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"my-suggest": [
{
"text": "joh",
"offset": 0,
"length": 3,
"options": [
{
"text": "John",
"_index": "my-index",
"_type": "users",
"_id": "1",
"_score": 1,
"_source": {
"firstName": "John",
"lastName": "Doe",
"suggest": [
{
"input": [
"John",
"Doe"
]
}
]
}
},
{
"text": "John",
"_index": "my-index",
"_type": "users",
"_id": "2",
"_score": 1,
"_source": {
"firstName": "John",
"lastName": "Smith",
"suggest": [
{
"input": [
"John",
"Smith"
]
}
]
}
}
]
}
]
}
In short, for a completion suggestion for the text "joh", two (2) documents were returned, one for each John, and both had the same value of the text property.
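As a stopgap, the duplicate options can be collapsed on the client side. The following is only a minimal sketch, assuming the suggest response has already been parsed into a Python dict shaped like the one above; the function name unique_suggestion_texts is made up for illustration. It does not help when the first few options returned are all duplicates of each other, since the distinct words further down are never fetched.

```python
from typing import Any, Dict, List


def unique_suggestion_texts(response: Dict[str, Any], suggester: str) -> List[str]:
    """Collect distinct option texts for one suggester, keeping first-seen order."""
    seen = set()
    texts = []
    for entry in response.get(suggester, []):
        for option in entry.get("options", []):
            text = option["text"]
            if text not in seen:
                seen.add(text)
                texts.append(text)
    return texts


# Trimmed-down copy of the response shown above (only the fields used here).
response = {
    "my-suggest": [
        {
            "text": "joh",
            "offset": 0,
            "length": 3,
            "options": [
                {"text": "John", "_id": "1"},
                {"text": "John", "_id": "2"},
            ],
        }
    ]
}

print(unique_suggestion_texts(response, "my-suggest"))  # ['John']
```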
However, I would like to receive one (1) word. Something simple like this:
{
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"my-suggest": [
{
"text": "joh",
"offset": 0,
"length": 3,
"options": [
"John"
]
}
]
}
Question: how to implement a word-based completion suggester. There is no need to return any document related data, since I don't need it at this point.
Is the "Completion Suggester" even appropriate for my scenario? Or should I use a completely different approach?
EDIT: As many of you pointed out, an additional completion-only index would be a viable solution. However, I can see multiple issues with this approach:
- Keeping the new index in sync.
- Auto-completing subsequent words would probably be global, instead of narrowed down. For example, say you have the following words in the additional index: "John", "Doe", "David", "Smith". When querying for "John D", the result for the incomplete word should be "Doe" and not "Doe", "David".
To overcome the second point, only indexing single words wouldn't be enough, since you would also need to map all words to documents in order to properly narrow down auto-completing subsequent words. And with this, you actually have the same problem as querying the original index. Therefore, the additional index doesn't make sense anymore.
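To make the second point concrete, here is a small in-memory sketch (plain Python, no Elasticsearch involved) contrasting the two behaviours. The two documents "John Doe" and "David Smith" are an assumption chosen so that they yield exactly the four words from the example; both function names are hypothetical.

```python
from typing import List

# Assumed documents that produce the words "John", "Doe", "David", "Smith".
DOCUMENTS = ["John Doe", "David Smith"]


def complete_global(typed: str, documents: List[str]) -> List[str]:
    """Complete the last typed word against every indexed word (words-only index)."""
    last = typed.split()[-1].lower()
    words = {word for doc in documents for word in doc.split()}
    return sorted(word for word in words if word.lower().startswith(last))


def complete_narrowed(typed: str, documents: List[str]) -> List[str]:
    """Complete the last typed word only within documents that contain the earlier words."""
    *earlier, last = typed.lower().split()
    results = set()
    for doc in documents:
        words = doc.split()
        lowered = [word.lower() for word in words]
        if all(word in lowered for word in earlier):
            results.update(
                word
                for word in words
                if word.lower().startswith(last) and word.lower() not in earlier
            )
    return sorted(results)


print(complete_global("John D", DOCUMENTS))    # ['David', 'Doe']
print(complete_narrowed("John D", DOCUMENTS))  # ['Doe']
```

The narrowed variant needs to know which words belong to which document, which is exactly the word-to-document mapping described above.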
Recommended answer
As hinted at in the comment, another way of achieving this without getting duplicate documents is to create a sub-field containing edge n-grams of the field you want to complete on (in the mapping below, a dedicated autocomplete field holding the full name). First you define your mapping like this:
PUT my-index
{
"settings": {
"analysis": {
"analyzer": {
"completion_analyzer": {
"type": "custom",
"filter": [
"lowercase",
"completion_filter"
],
"tokenizer": "keyword"
}
},
"filter": {
"completion_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 24
}
}
}
},
"mappings": {
"users": {
"properties": {
"autocomplete": {
"type": "text",
"fields": {
"raw": {
"type": "keyword"
},
"completion": {
"type": "text",
"analyzer": "completion_analyzer",
"search_analyzer": "standard"
}
}
},
"firstName": {
"type": "text"
},
"lastName": {
"type": "text"
}
}
}
}
}
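To see what ends up in autocomplete.completion, the analyzer chain above (keyword tokenizer, lowercase filter, edge_ngram filter with min_gram 1 and max_gram 24) can be approximated in a few lines of Python. This is only an illustrative re-implementation, not Elasticsearch code, and the function name is made up.

```python
from typing import List


def completion_analyzer_tokens(value: str, min_gram: int = 1, max_gram: int = 24) -> List[str]:
    """Approximate completion_analyzer: one lowercased token, expanded into its prefixes."""
    # The keyword tokenizer keeps the whole value as a single token;
    # the lowercase filter then lowercases it.
    token = value.lower()
    # The edge_ngram filter emits every prefix from min_gram up to max_gram characters.
    longest = min(len(token), max_gram)
    return [token[:size] for size in range(min_gram, longest + 1)]


tokens = completion_analyzer_tokens("John Doe")
print(tokens)
# ['j', 'jo', 'joh', 'john', 'john ', 'john d', 'john do', 'john doe']
```

Because every lowercased prefix of the full value is indexed as its own term, an exact term lookup on what the user has typed so far is enough to find the document. Term queries are not analyzed, so the typed text should be lowercased by the client before it is sent.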
Then you index a few documents:
POST my-index/users/_bulk
{"index":{}}
{ "firstName": "John", "lastName": "Doe", "autocomplete": "John Doe"}
{"index":{}}
{ "firstName": "John", "lastName": "Deere", "autocomplete": "John Deere" }
{"index":{}}
{ "firstName": "Johnny", "lastName": "Cash", "autocomplete": "Johnny Cash" }
Then you can query for john d and get one result for John Doe and another one for John Deere:
POST my-index/users/_search
{
"size": 0,
"query": {
"term": {
"autocomplete.completion": "john d"
}
},
"aggs": {
"suggestions": {
"terms": {
"field": "autocomplete.raw"
}
}
}
}
Results:
{
"aggregations": {
"suggestions": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "John Doe",
"doc_count": 1
},
{
"key": "John Deere",
"doc_count": 1
}
]
}
}
}
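On the client side, the request body and the response handling can be wrapped in two small helpers. This is a sketch under assumptions: both function names and the max_suggestions parameter are hypothetical, the field names come from the mapping above, and sending the request to the cluster is left out.

```python
from typing import Any, Dict, List


def build_autocomplete_query(typed: str, max_suggestions: int = 10) -> Dict[str, Any]:
    """Build the search body: a term query on the edge n-grams plus a terms aggregation."""
    return {
        "size": 0,  # no hits needed, only the aggregation
        "query": {
            # Term queries are not analyzed, so lowercase the input here.
            "term": {"autocomplete.completion": typed.lower()}
        },
        "aggs": {
            "suggestions": {
                "terms": {"field": "autocomplete.raw", "size": max_suggestions}
            }
        },
    }


def suggestions_from_response(response: Dict[str, Any]) -> List[str]:
    """Pull the distinct suggestion strings out of the aggregation buckets."""
    buckets = response.get("aggregations", {}).get("suggestions", {}).get("buckets", [])
    return [bucket["key"] for bucket in buckets]


body = build_autocomplete_query("John D")
print(body["query"])  # {'term': {'autocomplete.completion': 'john d'}}

# Response copied from the result shown above.
response = {
    "aggregations": {
        "suggestions": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "John Doe", "doc_count": 1},
                {"key": "John Deere", "doc_count": 1},
            ],
        }
    }
}
print(suggestions_from_response(response))  # ['John Doe', 'John Deere']
```

Since the terms aggregation groups by autocomplete.raw, several documents with the same full name collapse into one bucket, which is what removes the duplicates.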
UPDATE (June 25th, 2019):
ES 7.2 introduced a new data type called search_as_you_type that allows this kind of behavior natively. Read more at: https://www.elastic.co/guide/en/elasticsearch/reference/7.2/search-as-you-type.html
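For reference, a rough sketch of what the 7.2 variant might look like, written as Python dicts that serialize to the request bodies. It assumes the same autocomplete field name as above and the typeless mappings of ES 7.x; the multi_match query of type bool_prefix over the field and its _2gram and _3gram sub-fields follows the pattern in the linked documentation. Treat it as a starting point rather than a tested setup. Note that this returns matching documents, so identical names would still need to be collapsed separately.

```python
import json
from typing import Any, Dict

# Index body for ES 7.2+: one search_as_you_type field (field name assumed).
index_body: Dict[str, Any] = {
    "mappings": {
        "properties": {
            "autocomplete": {"type": "search_as_you_type"}
        }
    }
}


def build_search_as_you_type_query(typed: str) -> Dict[str, Any]:
    """Build a bool_prefix multi_match over the field and its shingle sub-fields."""
    return {
        "query": {
            "multi_match": {
                "query": typed,
                "type": "bool_prefix",
                "fields": [
                    "autocomplete",
                    "autocomplete._2gram",
                    "autocomplete._3gram",
                ],
            }
        }
    }


print(json.dumps(index_body, indent=2))
print(json.dumps(build_search_as_you_type_query("john d"), indent=2))
```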