
A Summary of ElasticSearch Analyzer Types

Author: 互联网

Definition
An Analyzer is the component in ES dedicated to text analysis. It consists of three parts:

Character Filters: preprocess the raw text, e.g. stripping HTML tags
Tokenizer: split the text into terms according to a set of rules
Token Filter: post-process the emitted terms, e.g. removing stop words
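The three-stage pipeline above can be sketched in plain Python. This is a simplified illustration of the flow, not the actual Lucene implementation; the regex, the lowercase step, and the tiny stop list are all assumptions for demonstration:

```python
import re

def analyze(text):
    # 1. Character filter: strip anything that looks like an HTML tag
    text = re.sub(r"<[^>]+>", "", text)
    # 2. Tokenizer: split the filtered text into terms on non-word characters
    tokens = [t for t in re.split(r"\W+", text) if t]
    # 3. Token filter: lowercase each term and drop stop words
    stop = {"a", "an", "the", "is"}
    return [t.lower() for t in tokens if t.lower() not in stop]

print(analyze("<b>The Quick</b> brown fox"))  # ['quick', 'brown', 'fox']
```

Each built-in analyzer described below is essentially a fixed combination of these three stages.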
Analyzer Types


StandardAnalyzer

This is the default analyzer. It splits text on word boundaries, lowercases all letters, and leaves stop-word filtering disabled by default.
Usage:

GET /_analyze
{
  "analyzer": "standard",
  "text": "It`s a good day commander. Let`s do it for 2 times!"
}


The result is as follows; note that every uppercase letter has been lowercased:

{
  "tokens" : [
    {
      "token" : "it",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "s",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "good",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "day",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "commander",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "let",
      "start_offset" : 27,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "s",
      "start_offset" : 31,
      "end_offset" : 32,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "do",
      "start_offset" : 33,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "it",
      "start_offset" : 36,
      "end_offset" : 38,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "for",
      "start_offset" : 39,
      "end_offset" : 42,
      "type" : "<ALPHANUM>",
      "position" : 10
    },
    {
      "token" : "2",
      "start_offset" : 43,
      "end_offset" : 44,
      "type" : "<NUM>",
      "position" : 11
    },
    {
      "token" : "times",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 12
    }
  ]
}


SimpleAnalyzer


Splits on any non-letter character; the non-letters themselves are discarded, and the remaining letters are lowercased.
For example:

GET /_analyze
{
  "analyzer": "simple",
  "text": "It`s a good day commander. Let`s do it for 2 times!"
}


The output shows that, besides lowercasing, every non-letter token has been dropped (including the digit 2):

{
  "tokens" : [
    {
      "token" : "it",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "s",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "good",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "day",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "commander",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "let",
      "start_offset" : 27,
      "end_offset" : 30,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "s",
      "start_offset" : 31,
      "end_offset" : 32,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "do",
      "start_offset" : 33,
      "end_offset" : 35,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "it",
      "start_offset" : 36,
      "end_offset" : 38,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "for",
      "start_offset" : 39,
      "end_offset" : 42,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "times",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "word",
      "position" : 11
    }
  ]
}
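The simple analyzer's behavior — split on non-letters, lowercase, discard everything else — can be approximated in a few lines of Python (a sketch, not the Lucene implementation):

```python
import re

def simple_analyze(text):
    # Split on any run of non-letter characters and lowercase the rest;
    # digits and punctuation never survive as tokens.
    return [t.lower() for t in re.split(r"[^a-zA-Z]+", text) if t]

print(simple_analyze("It`s a good day commander. Let`s do it for 2 times!"))
```

Running this on the sample sentence yields the same token stream as the ES output above: the digit 2 is gone and everything is lowercased.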



WhitespaceAnalyzer


Splits terms on whitespace only. For example:

GET /_analyze
{
  "analyzer": "whitespace",
  "text": "It`s a good day commander. Let`s do it for 2 times!"
}


In the output, tokens such as It`s and Let`s are kept intact, punctuation included:

{
  "tokens" : [
    {
      "token" : "It`s",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "a",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "good",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "day",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "commander.",
      "start_offset" : 16,
      "end_offset" : 26,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "Let`s",
      "start_offset" : 27,
      "end_offset" : 32,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "do",
      "start_offset" : 33,
      "end_offset" : 35,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "it",
      "start_offset" : 36,
      "end_offset" : 38,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "for",
      "start_offset" : 39,
      "end_offset" : 42,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "2",
      "start_offset" : 43,
      "end_offset" : 44,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "times!",
      "start_offset" : 45,
      "end_offset" : 51,
      "type" : "word",
      "position" : 10
    }
  ]
}



StopAnalyzer


Compared with SimpleAnalyzer, this adds a stop token filter, which removes stop words such as the, a, and is. For example:

GET /_analyze
{
  "analyzer": "stop",
  "text": "It`s a good day commander. Let`s do it for 2 times!"
}



The output is now missing the stop words:

{
  "tokens" : [
    {
      "token" : "s",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "good",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "day",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "commander",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "let",
      "start_offset" : 27,
      "end_offset" : 30,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "s",
      "start_offset" : 31,
      "end_offset" : 32,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "do",
      "start_offset" : 33,
      "end_offset" : 35,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "times",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "word",
      "position" : 11
    }
  ]
}
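The stop filter's effect can be sketched by adding a stop list on top of the simple-analyzer logic. The stop list below mirrors Lucene's default English stop set, but treat it as an assumption of this sketch rather than a guaranteed match:

```python
import re

STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "but", "by",
              "for", "if", "in", "into", "is", "it", "no", "not", "of",
              "on", "or", "such", "that", "the", "their", "then",
              "there", "these", "they", "this", "to", "was", "will", "with"}

def stop_analyze(text):
    # Same tokenization as the simple analyzer, then drop stop words.
    tokens = [t.lower() for t in re.split(r"[^a-zA-Z]+", text) if t]
    return [t for t in tokens if t not in STOP_WORDS]

print(stop_analyze("It`s a good day commander. Let`s do it for 2 times!"))
```

On the sample sentence this produces the same surviving tokens as the ES output above (s, good, day, commander, let, s, do, times).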



Keyword Analyzer


Performs no tokenization at all; the entire input is emitted as a single term. For example:

GET /_analyze
{
  "analyzer": "keyword",
  "text": "It`s a good day commander. Let`s do it for 2 times!"
}



Output:

{
  "tokens" : [
    {
      "token" : "It`s a good day commander. Let`s do it for 2 times!",
      "start_offset" : 0,
      "end_offset" : 51,
      "type" : "word",
      "position" : 0
    }
  ]
}


Pattern Analyzer


Tokenizes using a regular expression; the default pattern is \W+, i.e. split on runs of non-word characters. For example:

GET /_analyze
{
  "analyzer": "pattern",
  "text": "It`s a good day commander. Let`s do it for 2 times!"
}



The tokens here are the same as with the standard analyzer (only the type field differs):

{
  "tokens" : [
    {
      "token" : "it",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "s",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "a",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "good",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "day",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "commander",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "let",
      "start_offset" : 27,
      "end_offset" : 30,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "s",
      "start_offset" : 31,
      "end_offset" : 32,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "do",
      "start_offset" : 33,
      "end_offset" : 35,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "it",
      "start_offset" : 36,
      "end_offset" : 38,
      "type" : "word",
      "position" : 9
    },
    {
      "token" : "for",
      "start_offset" : 39,
      "end_offset" : 42,
      "type" : "word",
      "position" : 10
    },
    {
      "token" : "2",
      "start_offset" : 43,
      "end_offset" : 44,
      "type" : "word",
      "position" : 11
    },
    {
      "token" : "times",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "word",
      "position" : 12
    }
  ]
}
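The default pattern analyzer can be approximated as a \W+ split followed by lowercasing (a sketch; the real analyzer is configurable and Lucene-backed):

```python
import re

def pattern_analyze(text, pattern=r"\W+"):
    # Split on the regex (non-word characters by default) and lowercase.
    # Note that \W keeps digits, so "2" survives, unlike the simple analyzer.
    return [t.lower() for t in re.split(pattern, text) if t]

print(pattern_analyze("It`s a good day commander. Let`s do it for 2 times!"))
```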



LanguageAnalyzer


ES can also analyze text according to a specific language:

GET /_analyze
{
  "analyzer": "english",
  "text": "It`s a good day commander. Let`s do it for 2 times!"
}



The output below shows that stop words are removed and the remaining terms are stemmed (day becomes dai, commander becomes command, times becomes time):

{
  "tokens" : [
    {
      "token" : "s",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "good",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "dai",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "command",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "let",
      "start_offset" : 27,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "s",
      "start_offset" : 31,
      "end_offset" : 32,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "do",
      "start_offset" : 33,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "2",
      "start_offset" : 43,
      "end_offset" : 44,
      "type" : "<NUM>",
      "position" : 11
    },
    {
      "token" : "time",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 12
    }
  ]
}


ICU-Analyzer


This analyzer targets Chinese text segmentation and must be installed first:

[es@localhost bin]$ ./elasticsearch-plugin install analysis-icu

-> Installing analysis-icu
-> Downloading analysis-icu from elastic
[=================================================] 100%   
-> Installed analysis-icu


Then restart ES and run a test:

GET /_analyze
{
  "analyzer": "icu_analyzer",
  "text": "这个进球真是漂亮!"
}


The output shows that "进球" was not segmented as a single word:

{
  "tokens" : [
    {
      "token" : "这个",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "进",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "球",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "真是",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "漂亮",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    }
  ]
}


Configuring a Custom Analyzer


A custom Analyzer is built by combining a Character Filter, a Tokenizer, and a Token Filter.
Built-in character filters include HTML strip, Mapping, and Pattern replace, which strip HTML tags, substitute strings, and perform regex-based replacement respectively.
Built-in tokenizers include whitespace, standard, uax_url_email, pattern, keyword, and path_hierarchy; you can also write a Java plugin to implement your own tokenizer.
Built-in token filters include lowercase, stop, and synonym.

tokenizer+character_filter


A tokenizer combined with a character filter looks like this:

POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<b>aaa</b>"
}
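The effect of the html_strip character filter can be simulated with a simple regex. Real HTML stripping also handles entities and comments; this is only a sketch:

```python
import re

def html_strip(text):
    # Remove anything that looks like an HTML tag before tokenization.
    return re.sub(r"<[^>]+>", "", text)

print(html_strip("<b>aaa</b>"))  # aaa
```

With the keyword tokenizer afterwards, the stripped text "aaa" is emitted as a single token.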



The same tokenizer-plus-character-filter combination, but with a mapping rule added to the character filter:

POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["- => _"]
    }
  ],
  "text": "1-2, d-4"
}
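A mapping character filter is essentially a sequence of string substitutions applied before tokenization; the "- => _" rule above can be simulated like this (a sketch of the behavior, not the ES implementation):

```python
def mapping_filter(text, mappings):
    # Apply each "from => to" rule in order, as the mapping char filter does.
    for rule in mappings:
        src, dst = [s.strip() for s in rule.split("=>")]
        text = text.replace(src, dst)
    return text

print(mapping_filter("1-2, d-4", ["- => _"]))  # 1_2, d_4
```

Because "_" counts as a word character, the standard tokenizer then keeps 1_2 and d_4 as single tokens instead of splitting them at the hyphen.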


Regex Replacement


A regex example follows; $1 refers to the contents of the first capture group, which here is www.baidu.com:

POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [{
    "type": "pattern_replace",
    "pattern": "http://(.*)",
    "replacement": "$1"
  }],
  "text": "http://www.baidu.com"
}
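pattern_replace is a regex substitution applied to the text before tokenization; in Python it corresponds to re.sub, with \1 playing the role of Elasticsearch's $1 (a sketch of the behavior only):

```python
import re

def pattern_replace(text, pattern, replacement):
    # Regex replacement applied before tokenization; \1 references
    # the first capture group, like $1 in the ES char filter config.
    return re.sub(pattern, replacement, text)

print(pattern_replace("http://www.baidu.com", r"http://(.*)", r"\1"))
```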



Path Hierarchy Tokenizer


The path hierarchy tokenizer treats the input /home/szc/a/b/c/e as a filesystem path and emits one token per directory level:

POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/home/szc/a/b/c/e"
}


The output is:

{
  "tokens" : [
    {
      "token" : "/home",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/home/szc",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/home/szc/a",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/home/szc/a/b",
      "start_offset" : 0,
      "end_offset" : 13,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/home/szc/a/b/c",
      "start_offset" : 0,
      "end_offset" : 15,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "/home/szc/a/b/c/e",
      "start_offset" : 0,
      "end_offset" : 17,
      "type" : "word",
      "position" : 0
    }
  ]
}
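The path_hierarchy tokenizer emits every prefix of the path, which is easy to reproduce in Python (this sketch assumes a leading delimiter and the default "/" separator):

```python
def path_hierarchy(path, delimiter="/"):
    # Emit one token per path prefix: /home, /home/szc, /home/szc/a, ...
    parts = [p for p in path.split(delimiter) if p]
    return [delimiter + delimiter.join(parts[:i + 1]) for i in range(len(parts))]

print(path_hierarchy("/home/szc/a/b/c/e"))
```

Note that all emitted tokens share position 0, since they are alternative representations of the same input rather than consecutive words.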


Combining Filters


Token filters can be chained; here lowercasing and stop-word removal are applied together:

POST _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop"],
  "text": "The boys in China are playing soccer!"
}
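The filter chain runs each token filter in the order listed. A whitespace split followed by lowercase and stop can be sketched as composed functions (the stop list here is a small assumed subset, not Lucene's exact default):

```python
STOP = {"the", "in", "are", "a", "an", "is", "and", "to"}

def whitespace_tokenize(text):
    return text.split()

def lowercase(tokens):
    return [t.lower() for t in tokens]

def stop(tokens):
    return [t for t in tokens if t not in STOP]

# Filters run left to right, exactly as listed in the "filter" array.
tokens = stop(lowercase(whitespace_tokenize("The boys in China are playing soccer!")))
print(tokens)  # ['boys', 'china', 'playing', 'soccer!']
```

Because the whitespace tokenizer never touches punctuation, "soccer!" keeps its exclamation mark.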



Putting It All Together


A composite custom Analyzer named my_analyzer combines custom char_filter, tokenizer, and filter components, all declared in the index settings (the original definition is not reproduced in this article).
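Judging from the output shown further below — the emoticon replaced by a word, splitting on spaces and punctuation, stop words removed — a settings block consistent with that behavior might look like the following sketch. The mapping rules, tokenizer pattern, and component names here are assumptions, not necessarily the author's original definition:

```json
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["emoticons"],
          "tokenizer": "punctuation",
          "filter": ["lowercase", "english_stop"]
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [":) => happy", ":( => sad"]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}
```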


Once an index carrying that definition exists, the analyzer can be invoked against it:

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "I`m a :) guy, and you ?"
}


In the output, the emoticon has first been replaced (:) becomes happy), the text has then been split by the custom pattern, and finally the stop words have been removed:

{
  "tokens" : [
    {
      "token" : "i`m",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "happy",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "guy",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "you",
      "start_offset" : 18,
      "end_offset" : 21,
      "type" : "word",
      "position" : 5
    }
  ]
}


That concludes this overview of the analyzers in ElasticSearch 7.

Source: https://blog.csdn.net/qq_45151158/article/details/122706175