A Summary of Elasticsearch Analyzer Types
Definition
An analyzer is the Elasticsearch component dedicated to text analysis. It consists of three parts:
Character Filters: preprocess the raw text, e.g. stripping HTML tags
Tokenizer: splits the text into tokens according to a set of rules
Token Filters: post-process the emitted tokens, e.g. removing stop words
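As an illustration (not Elasticsearch code), the three stages can be sketched in Python; the regex rules and the stop-word list here are simplified stand-ins:

```python
import re

def analyze(text):
    # Character filter stage: preprocess the raw text (here: strip HTML tags)
    text = re.sub(r"<[^>]+>", "", text)
    # Tokenizer stage: split the text into tokens (here: on non-word characters)
    tokens = [t for t in re.split(r"\W+", text) if t]
    # Token filter stage: post-process tokens (here: lowercase, then drop stop words)
    stop_words = {"a", "an", "the", "is"}
    return [t.lower() for t in tokens if t.lower() not in stop_words]

print(analyze("<b>The</b> goal is beautiful"))  # ['goal', 'beautiful']
```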
Analyzer types
StandardAnalyzer
This is the default analyzer. It splits text on word boundaries and lowercases each token; its stop-word filter is disabled by default.
Usage:
GET /_analyze
{
"analyzer": "standard",
"text": "It`s a good day commander. Let`s do it for 2 times!"
}
The result is shown below; note that all uppercase letters have been converted to lowercase:
{
"tokens" : [
{
"token" : "it",
"start_offset" : 0,
"end_offset" : 2,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "s",
"start_offset" : 3,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "a",
"start_offset" : 5,
"end_offset" : 6,
"type" : "<ALPHANUM>",
"position" : 2
},
{
"token" : "good",
"start_offset" : 7,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "day",
"start_offset" : 12,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "commander",
"start_offset" : 16,
"end_offset" : 25,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "let",
"start_offset" : 27,
"end_offset" : 30,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "s",
"start_offset" : 31,
"end_offset" : 32,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "do",
"start_offset" : 33,
"end_offset" : 35,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "it",
"start_offset" : 36,
"end_offset" : 38,
"type" : "<ALPHANUM>",
"position" : 9
},
{
"token" : "for",
"start_offset" : 39,
"end_offset" : 42,
"type" : "<ALPHANUM>",
"position" : 10
},
{
"token" : "2",
"start_offset" : 43,
"end_offset" : 44,
"type" : "<NUM>",
"position" : 11
},
{
"token" : "times",
"start_offset" : 45,
"end_offset" : 50,
"type" : "<ALPHANUM>",
"position" : 12
}
]
}
SimpleAnalyzer
Splits on non-letter characters and discards them; the letters are likewise lowercased.
For example:
GET /_analyze
{
"analyzer": "simple",
"text": "It`s a good day commander. Let`s do it for 2 times!"
}
The output is shown below; besides the lowercasing, every non-letter character has been dropped (including the digit 2):
{
"tokens" : [
{
"token" : "it",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "s",
"start_offset" : 3,
"end_offset" : 4,
"type" : "word",
"position" : 1
},
{
"token" : "a",
"start_offset" : 5,
"end_offset" : 6,
"type" : "word",
"position" : 2
},
{
"token" : "good",
"start_offset" : 7,
"end_offset" : 11,
"type" : "word",
"position" : 3
},
{
"token" : "day",
"start_offset" : 12,
"end_offset" : 15,
"type" : "word",
"position" : 4
},
{
"token" : "commander",
"start_offset" : 16,
"end_offset" : 25,
"type" : "word",
"position" : 5
},
{
"token" : "let",
"start_offset" : 27,
"end_offset" : 30,
"type" : "word",
"position" : 6
},
{
"token" : "s",
"start_offset" : 31,
"end_offset" : 32,
"type" : "word",
"position" : 7
},
{
"token" : "do",
"start_offset" : 33,
"end_offset" : 35,
"type" : "word",
"position" : 8
},
{
"token" : "it",
"start_offset" : 36,
"end_offset" : 38,
"type" : "word",
"position" : 9
},
{
"token" : "for",
"start_offset" : 39,
"end_offset" : 42,
"type" : "word",
"position" : 10
},
{
"token" : "times",
"start_offset" : 45,
"end_offset" : 50,
"type" : "word",
"position" : 11
}
]
}
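This behavior can be approximated in Python (the real analyzer works on Unicode letters; [A-Za-z] is a simplification for this input):

```python
import re

def simple_analyze(text):
    # Split on runs of non-letter characters, drop them, and lowercase the rest
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]

print(simple_analyze("It`s a good day commander. Let`s do it for 2 times!"))
# ['it', 's', 'a', 'good', 'day', 'commander', 'let', 's', 'do', 'it', 'for', 'times']
```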
WhitespaceAnalyzer
Splits tokens on whitespace only. For example:
GET /_analyze
{
"analyzer": "whitespace",
"text": "It`s a good day commander. Let`s do it for 2 times!"
}
The output is shown below; note that It`s, Let`s and so on are kept intact:
{
"tokens" : [
{
"token" : "It`s",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
},
{
"token" : "a",
"start_offset" : 5,
"end_offset" : 6,
"type" : "word",
"position" : 1
},
{
"token" : "good",
"start_offset" : 7,
"end_offset" : 11,
"type" : "word",
"position" : 2
},
{
"token" : "day",
"start_offset" : 12,
"end_offset" : 15,
"type" : "word",
"position" : 3
},
{
"token" : "commander.",
"start_offset" : 16,
"end_offset" : 26,
"type" : "word",
"position" : 4
},
{
"token" : "Let`s",
"start_offset" : 27,
"end_offset" : 32,
"type" : "word",
"position" : 5
},
{
"token" : "do",
"start_offset" : 33,
"end_offset" : 35,
"type" : "word",
"position" : 6
},
{
"token" : "it",
"start_offset" : 36,
"end_offset" : 38,
"type" : "word",
"position" : 7
},
{
"token" : "for",
"start_offset" : 39,
"end_offset" : 42,
"type" : "word",
"position" : 8
},
{
"token" : "2",
"start_offset" : 43,
"end_offset" : 44,
"type" : "word",
"position" : 9
},
{
"token" : "times!",
"start_offset" : 45,
"end_offset" : 51,
"type" : "word",
"position" : 10
}
]
}
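This is the simplest behavior to reproduce: a plain split on whitespace, with case and punctuation untouched:

```python
text = "It`s a good day commander. Let`s do it for 2 times!"
# The whitespace analyzer only splits on whitespace; case and punctuation survive
tokens = text.split()
print(tokens)
# ['It`s', 'a', 'good', 'day', 'commander.', 'Let`s', 'do', 'it', 'for', '2', 'times!']
```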
StopAnalyzer
Compared with SimpleAnalyzer, it adds a stop token filter, which removes common stop words such as the, a and is. For example:
GET /_analyze
{
"analyzer": "stop",
"text": "It`s a good day commander. Let`s do it for 2 times!"
}
The output is shown below; the stop words are gone:
{
"tokens" : [
{
"token" : "s",
"start_offset" : 3,
"end_offset" : 4,
"type" : "word",
"position" : 1
},
{
"token" : "good",
"start_offset" : 7,
"end_offset" : 11,
"type" : "word",
"position" : 3
},
{
"token" : "day",
"start_offset" : 12,
"end_offset" : 15,
"type" : "word",
"position" : 4
},
{
"token" : "commander",
"start_offset" : 16,
"end_offset" : 25,
"type" : "word",
"position" : 5
},
{
"token" : "let",
"start_offset" : 27,
"end_offset" : 30,
"type" : "word",
"position" : 6
},
{
"token" : "s",
"start_offset" : 31,
"end_offset" : 32,
"type" : "word",
"position" : 7
},
{
"token" : "do",
"start_offset" : 33,
"end_offset" : 35,
"type" : "word",
"position" : 8
},
{
"token" : "times",
"start_offset" : 45,
"end_offset" : 50,
"type" : "word",
"position" : 11
}
]
}
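A Python approximation; the stop set below is Lucene's default English list reproduced from memory, so treat it as indicative rather than authoritative:

```python
import re

STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
              "if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
              "such", "that", "the", "their", "then", "there", "these",
              "they", "this", "to", "was", "will", "with"}

def stop_analyze(text):
    # Same tokenization as the simple analyzer, followed by a stop-word filter
    tokens = [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]
    return [t for t in tokens if t not in STOP_WORDS]

print(stop_analyze("It`s a good day commander. Let`s do it for 2 times!"))
# ['s', 'good', 'day', 'commander', 'let', 's', 'do', 'times']
```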
Keyword Analyzer
Performs no tokenization; the whole input is emitted as a single term. For example:
GET /_analyze
{
"analyzer": "keyword",
"text": "It`s a good day commander. Let`s do it for 2 times!"
}
The output:
{
"tokens" : [
{
"token" : "It`s a good day commander. Let`s do it for 2 times!",
"start_offset" : 0,
"end_offset" : 51,
"type" : "word",
"position" : 0
}
]
}
Pattern Analyzer
Tokenizes with a regular expression, \W+ by default, i.e. it splits on runs of non-word characters. For example:
GET /_analyze
{
"analyzer": "pattern",
"text": "It`s a good day commander. Let`s do it for 2 times!"
}
The tokens here are the same as with the standard analyzer (only the type field differs):
{
"tokens" : [
{
"token" : "it",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
},
{
"token" : "s",
"start_offset" : 3,
"end_offset" : 4,
"type" : "word",
"position" : 1
},
{
"token" : "a",
"start_offset" : 5,
"end_offset" : 6,
"type" : "word",
"position" : 2
},
{
"token" : "good",
"start_offset" : 7,
"end_offset" : 11,
"type" : "word",
"position" : 3
},
{
"token" : "day",
"start_offset" : 12,
"end_offset" : 15,
"type" : "word",
"position" : 4
},
{
"token" : "commander",
"start_offset" : 16,
"end_offset" : 25,
"type" : "word",
"position" : 5
},
{
"token" : "let",
"start_offset" : 27,
"end_offset" : 30,
"type" : "word",
"position" : 6
},
{
"token" : "s",
"start_offset" : 31,
"end_offset" : 32,
"type" : "word",
"position" : 7
},
{
"token" : "do",
"start_offset" : 33,
"end_offset" : 35,
"type" : "word",
"position" : 8
},
{
"token" : "it",
"start_offset" : 36,
"end_offset" : 38,
"type" : "word",
"position" : 9
},
{
"token" : "for",
"start_offset" : 39,
"end_offset" : 42,
"type" : "word",
"position" : 10
},
{
"token" : "2",
"start_offset" : 43,
"end_offset" : 44,
"type" : "word",
"position" : 11
},
{
"token" : "times",
"start_offset" : 45,
"end_offset" : 50,
"type" : "word",
"position" : 12
}
]
}
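In Python terms, the default behavior amounts to re.split with \W+ plus lowercasing:

```python
import re

def pattern_analyze(text, pattern=r"\W+"):
    # Split on the regex (default: runs of non-word characters) and lowercase
    return [t.lower() for t in re.split(pattern, text) if t]

print(pattern_analyze("It`s a good day commander. Let`s do it for 2 times!"))
# ['it', 's', 'a', 'good', 'day', 'commander', 'let', 's', 'do', 'it', 'for', '2', 'times']
```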
Language Analyzers
Elasticsearch can also analyze text with language-specific analyzers:
GET /_analyze
{
"analyzer": "english",
"text": "It`s a good day commander. Let`s do it for 2 times!"
}
The output is shown below; stop words are filtered out, and the remaining tokens are stemmed (day → dai, commander → command, times → time):
{
"tokens" : [
{
"token" : "s",
"start_offset" : 3,
"end_offset" : 4,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "good",
"start_offset" : 7,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 3
},
{
"token" : "dai",
"start_offset" : 12,
"end_offset" : 15,
"type" : "<ALPHANUM>",
"position" : 4
},
{
"token" : "command",
"start_offset" : 16,
"end_offset" : 25,
"type" : "<ALPHANUM>",
"position" : 5
},
{
"token" : "let",
"start_offset" : 27,
"end_offset" : 30,
"type" : "<ALPHANUM>",
"position" : 6
},
{
"token" : "s",
"start_offset" : 31,
"end_offset" : 32,
"type" : "<ALPHANUM>",
"position" : 7
},
{
"token" : "do",
"start_offset" : 33,
"end_offset" : 35,
"type" : "<ALPHANUM>",
"position" : 8
},
{
"token" : "2",
"start_offset" : 43,
"end_offset" : 44,
"type" : "<NUM>",
"position" : 11
},
{
"token" : "time",
"start_offset" : 45,
"end_offset" : 50,
"type" : "<ALPHANUM>",
"position" : 12
}
]
}
ICU Analyzer
This analyzer segments Unicode text, including Chinese. It ships as the analysis-icu plugin, which must be installed first:
[es@localhost bin]$ ./elasticsearch-plugin install analysis-icu
-> Installing analysis-icu
-> Downloading analysis-icu from elastic
[=================================================] 100%
-> Installed analysis-icu
Then restart Elasticsearch and run a test:
GET /_analyze
{
"analyzer": "icu_analyzer",
"text": "这个进球真是漂亮!"
}
The output is shown below; notice that "进球" (goal) was not kept together as a single word:
{
"tokens" : [
{
"token" : "这个",
"start_offset" : 0,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 0
},
{
"token" : "进",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<IDEOGRAPHIC>",
"position" : 1
},
{
"token" : "球",
"start_offset" : 3,
"end_offset" : 4,
"type" : "<IDEOGRAPHIC>",
"position" : 2
},
{
"token" : "真是",
"start_offset" : 4,
"end_offset" : 6,
"type" : "<IDEOGRAPHIC>",
"position" : 3
},
{
"token" : "漂亮",
"start_offset" : 6,
"end_offset" : 8,
"type" : "<IDEOGRAPHIC>",
"position" : 4
}
]
}
Configuring a custom Analyzer
A custom analyzer is built by combining a Character Filter, a Tokenizer and a Token Filter.
The built-in character filters are HTML strip, Mapping and Pattern replace, used for removing HTML tags, replacing strings, and regex-based replacement respectively.
The built-in tokenizers include whitespace, standard, uax_url_email, pattern, keyword and path_hierarchy; you can also write a Java plugin that implements your own tokenizer.
The built-in token filters include lowercase, stop and synonym.
tokenizer + character_filter
An example combining a tokenizer with a character filter (the HTML tags are stripped, and the keyword tokenizer emits what remains as a single token):
POST _analyze
{
"tokenizer": "keyword",
"char_filter": ["html_strip"],
"text": "<b>aaa</b>"
}
The same combination again, this time adding a mapping rule to the character filter (every - is replaced with _ before tokenization):
POST _analyze
{
"tokenizer": "standard",
"char_filter": [
{
"type": "mapping",
"mappings": ["- => _"]
}],
"text": "1-2, d-4"
}
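The effect can be checked in Python; the replace models the mapping char filter, and the \w+ regex is a rough stand-in for the standard tokenizer (which keeps underscores inside tokens):

```python
import re

text = "1-2, d-4"
# Mapping char filter: rewrite characters before tokenization
mapped = text.replace("-", "_")
# Rough stand-in for the standard tokenizer
tokens = re.findall(r"\w+", mapped)
print(tokens)  # ['1_2', 'd_4']
```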
Regex replacement
A pattern_replace example; $1 refers to the content of the first capture group, here www.baidu.com:
POST _analyze
{
"tokenizer": "standard",
"char_filter": [{
"type": "pattern_replace",
"pattern": "http://(.*)",
"replacement": "$1"
}],
"text": "http://www.baidu.com"
}
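Python's re module writes \1 where Elasticsearch writes $1; the equivalent replacement looks like this:

```python
import re

url = "http://www.baidu.com"
# Group 1 captures everything after the scheme; the replacement keeps only that
stripped = re.sub(r"http://(.*)", r"\1", url)
print(stripped)  # www.baidu.com
```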
Path hierarchy tokenizer
The path_hierarchy tokenizer treats the input /home/szc/a/b/c/e as a path and emits one token per level of the directory hierarchy:
POST _analyze
{
"tokenizer": "path_hierarchy",
"text": "/home/szc/a/b/c/e"
}
The output:
{
"tokens" : [
{
"token" : "/home",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "/home/szc",
"start_offset" : 0,
"end_offset" : 9,
"type" : "word",
"position" : 0
},
{
"token" : "/home/szc/a",
"start_offset" : 0,
"end_offset" : 11,
"type" : "word",
"position" : 0
},
{
"token" : "/home/szc/a/b",
"start_offset" : 0,
"end_offset" : 13,
"type" : "word",
"position" : 0
},
{
"token" : "/home/szc/a/b/c",
"start_offset" : 0,
"end_offset" : 15,
"type" : "word",
"position" : 0
},
{
"token" : "/home/szc/a/b/c/e",
"start_offset" : 0,
"end_offset" : 17,
"type" : "word",
"position" : 0
}
]
}
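The token stream can be reproduced with a few lines of Python (this sketch only handles absolute paths with /; the real tokenizer's delimiter and other options are configurable):

```python
def path_hierarchy(text, delimiter="/"):
    # Emit one token per level, each token extending the previous one
    tokens, current = [], ""
    for part in text.split(delimiter)[1:]:  # skip the empty leading element
        current += delimiter + part
        tokens.append(current)
    return tokens

print(path_hierarchy("/home/szc/a/b/c/e"))
# ['/home', '/home/szc', '/home/szc/a', '/home/szc/a/b', '/home/szc/a/b/c', '/home/szc/a/b/c/e']
```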
Combining token filters
The example below applies two token filters at once, lowercasing the tokens and removing stop words:
POST _analyze
{
"tokenizer": "whitespace",
"filter": ["lowercase", "stop"],
"text": "The boys in China are playing soccer!"
}
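A sketch of the two filters applied in order over whitespace tokens (the stop set here is a small illustrative subset of the English list):

```python
text = "The boys in China are playing soccer!"
stop_words = {"the", "in", "are"}           # illustrative subset of the English stop list
tokens = [t.lower() for t in text.split()]  # whitespace tokenizer + lowercase filter
tokens = [t for t in tokens if t not in stop_words]  # stop filter
print(tokens)  # ['boys', 'china', 'playing', 'soccer!']
```

Note that soccer! keeps its exclamation mark: the whitespace tokenizer never strips punctuation, only the character between tokens.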
Putting it all together
The request below uses a combined custom analyzer named my_analyzer, whose char_filter, tokenizer and filter are all custom-defined; it must first be defined in the settings of the index (here my_index) before it can be used:
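The settings that define my_analyzer are not shown in the original post. A plausible definition consistent with the output further down (the names emoticons, punctuation and english_stop, the mapping rules, and the tokenizer pattern are all assumptions reconstructed from that result) would be:

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["emoticons"],
          "tokenizer": "punctuation",
          "filter": ["lowercase", "english_stop"]
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [":) => happy", ":( => sad"]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}
```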
POST my_index/_analyze
{
"analyzer": "my_analyzer",
"text": "I`m a :) guy, and you ?"
}
The output is shown below: the emoticon was first replaced (yielding happy), the text was then split by the configured pattern, and finally the stop words were removed:
{
"tokens" : [
{
"token" : "i`m",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "happy",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 2
},
{
"token" : "guy",
"start_offset" : 9,
"end_offset" : 12,
"type" : "word",
"position" : 3
},
{
"token" : "you",
"start_offset" : 18,
"end_offset" : 21,
"type" : "word",
"position" : 5
}
]
}
That covers the analyzers in Elasticsearch 7.
Source: https://blog.csdn.net/qq_45151158/article/details/122706175