
Document Normalization


一、Normalization

normalization: the normalization step runs after tokenization and includes operations such as lowercasing, removing stop words (e.g. is, an), and collapsing singular/plural variants of a word.

Each analyzer has its own normalization strategy; the comparison below sketches the difference.
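
As a minimal sketch (the sample text is arbitrary, not from the original post), the two _analyze requests below can be compared in Kibana Dev Tools: the standard analyzer only lowercases the tokens, while the built-in english analyzer additionally removes English stop words and stems the remaining words.

GET _analyze
{
  "analyzer": "standard",
  "text": "The Teachers are teaching"
}

GET _analyze
{
  "analyzer": "english",
  "text": "The Teachers are teaching"
}

The first request should return the lowercased tokens the, teachers, are, teaching; the second should return roughly teacher and teach, with the stop words dropped.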

 

二、Char Filter

char filter: character filters operate on the raw character stream before tokenization.

1、HTML Strip

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["a"]
        }
      }
    }
  }
}

The escaped_tags property lists the HTML tags that should be kept rather than stripped.
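
To see the filter in action (a hypothetical test string, not from the original post), run the custom analyzer through the _analyze API; the <p> tags are stripped while the escaped <a> tags are preserved:

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I'm so <a>happy</a>!</p>"
}

Because the tokenizer is keyword, the result is a single token containing "I'm so <a>happy</a>!", with line breaks where the block-level <p> tag used to be.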

2、Mapping

PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "滚 => *",
            "垃 => *",
            "圾 => *"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      }
    }
  }
}
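
A hypothetical request to check the mapping (the sample text is illustrative only): any character listed in mappings is replaced by * before tokenization.

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "垃圾,滚"
}

With the keyword tokenizer this returns a single token in which 垃, 圾 and 滚 have all been replaced by *.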

3、Pattern Replace (regex-based replacement)

PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d{3})\\d{4}(\\d{4})",
          "replacement": "$1****$2"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      }
    }
  }
}
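
The pattern masks the middle of an 11-digit phone number: the first capture group keeps the leading 3 digits, the second keeps the trailing 4, and the 4 digits in between are replaced by ****. A hypothetical number shows the effect:

GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "13812345678"
}

The expected single token is 138****5678.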

三、Tokenizer

The tokenizer's main job is to split the text into tokens; the default tokenizer is standard.
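
A minimal sketch (arbitrary sample text): the standard tokenizer can be tested directly with _analyze, without defining an index. Note that the tokenizer only splits the text; lowercasing and other normalization come from token filters.

GET _analyze
{
  "tokenizer": "standard",
  "text": "The QUICK brown fox"
}

This returns the tokens The, QUICK, brown, fox with their original case preserved.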

 
