Getting Started with Elasticsearch
Author: Internet
1. Terminology
In Elasticsearch, the act of storing a document is called indexing. Compared with a traditional relational database, the Elasticsearch analogy is:
Relational DB -> Databases -> Tables -> Rows -> Columns
Elasticsearch -> Indices -> Types -> Documents -> Fields
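In other words, Elasticsearch contains multiple indices (databases), each index can contain multiple types (tables), each type contains multiple documents (rows), and each document has multiple fields (columns).
2. Write and Search Operations
Writing data
Let's look at an example.
We write one document to ES: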
curl -XPOST https://es_endpoint/corporation/employee/1 -d '
{
"first_name" : "John",
"last_name" : "Smith",
"age" : 25,
"about" : "I love to go rock climbing",
"interests": [ "sports", "music" ]
}' -H 'Content-Type: application/json'
Here es_endpoint is the Elasticsearch endpoint, corporation is the index, employee is the type, and 1 is the id.
Once the data is in ES, we can retrieve it with the GET method, for example:
curl -XGET https://es_endpoint/corporation/employee/1
{"_index":"corporation",
"_type":"employee",
"_id":"1",
"_version":3,
"_seq_no":2,
"_primary_term":1,
"found":true,
"_source":
{
"first_name" : "John",
"last_name" : "Smith",
"age" : 25,
"about" : "I love to go rock climbing",
"interests": [ "sports", "music" ]
}}
Elasticsearch is operated through HTTP methods: GET retrieves a document, POST or PUT writes (or updates) a document, DELETE deletes a document, and HEAD checks whether a document exists.
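The mapping between operations and HTTP methods can be sketched in Python. This is only an illustration: the operation names, the build_request helper, and the ES_ENDPOINT placeholder are assumptions made for this example, and the request is built but never sent.

```python
import json
import urllib.request

# Hypothetical placeholder, mirroring the es_endpoint used in the curl examples.
ES_ENDPOINT = "https://es_endpoint"

# Operation name (made up for this sketch) -> HTTP method from the paragraph above.
METHODS = {
    "index": "PUT",      # write or update a document (POST works as well)
    "get": "GET",        # retrieve a document
    "delete": "DELETE",  # delete a document
    "exists": "HEAD",    # check whether a document exists
}


def build_request(operation, index, doc_type, doc_id, document=None):
    """Build (but do not send) the HTTP request for one document operation."""
    url = f"{ES_ENDPOINT}/{index}/{doc_type}/{doc_id}"
    data = None
    headers = {}
    if document is not None:
        # Documents travel as a JSON body, like the -d argument of curl.
        data = json.dumps(document).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        url, data=data, headers=headers, method=METHODS[operation]
    )


req = build_request("exists", "corporation", "employee", 1)
print(req.get_method(), req.full_url)
# HEAD https://es_endpoint/corporation/employee/1
```

Passing the returned object to urllib.request.urlopen would actually send it, provided ES_ENDPOINT is replaced with a reachable cluster address.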
Retrieving data
GET can fetch a single document by its id. If you need to search documents instead, replace the id with _search:
curl -XGET https://es_endpoint/corporation/employee/_search
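Searching data
This matches every document of type employee (by default only the first 10 hits are returned). To search with a condition, use:
curl -XGET 'https://es_endpoint/corporation/employee/_search?q=first_name:Jane'
The result is: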
{"took":5,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":0.2876821,"hits":[{"_index":"corporation","_type":"employee","_id":"2","_score":0.2876821,"_source":
{
"first_name" : "Jane",
"last_name" : "Smith",
"age" : 32,
"about" : "I like to collect rock albums",
"interests": [ "music" ]
}}]}}
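When the q parameter is built programmatically, the field:value pair should be URL-encoded. A minimal sketch, assuming the same es_endpoint placeholder as the curl examples; uri_search_url is a helper name invented here:

```python
from urllib.parse import urlencode

# Hypothetical placeholder, mirroring the es_endpoint used in the curl examples.
ES_ENDPOINT = "https://es_endpoint"


def uri_search_url(index, doc_type, field, value):
    """Return the _search URL for a simple field:value query-string search."""
    # urlencode percent-encodes reserved characters such as ':' and spaces.
    query = urlencode({"q": f"{field}:{value}"})
    return f"{ES_ENDPOINT}/{index}/{doc_type}/_search?{query}"


print(uri_search_url("corporation", "employee", "first_name", "Jane"))
# https://es_endpoint/corporation/employee/_search?q=first_name%3AJane
```

The colon appears as %3A in the output; the server decodes it back to first_name:Jane.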
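DSL search
The queries above only suit simple scenarios. Elasticsearch provides a richer and more flexible query language, the query DSL (Domain Specific Language). Requests are expressed as JSON; for example, the simple query above can be rewritten as: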
curl -XGET https://es_endpoint/corporation/employee/_search -d '
{
"query" : {
"match" : {
"first_name" : "Jane"
}
}
} ' -H 'Content-Type: application/json'
The result is:
{"took":3,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":0.2876821,"hits":[{"_index":"corporation","_type":"employee","_id":"2","_score":0.2876821,"_source":
{
"first_name" : "Jane",
"last_name" : "Smith",
"age" : 32,
"about" : "I like to collect rock albums",
"interests": [ "music" ]
}}]}}
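Because a DSL request is plain JSON, it can be assembled as an ordinary dictionary and serialized. The following sketch only builds the body shown above; match_query is a helper name assumed for this example.

```python
import json


def match_query(field, text):
    """Return the request body for a single-field match query."""
    return {"query": {"match": {field: text}}}


body = match_query("first_name", "Jane")
# json.dumps produces the string that would be passed to curl with -d.
print(json.dumps(body))
# {"query": {"match": {"first_name": "Jane"}}}
```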
More complex searches
We add a filter to the query that keeps only employees older than 30:
curl -XGET https://es_endpoint/corporation/employee/_search -d '
{
"query" : {
"bool" : {
"filter" : {
"range" : {
"age" : { "gt" : 30 }
}
},
"must" : {
"match" : {
"last_name" : "smith"
}
}
}
}
} ' -H 'Content-Type: application/json'
Here we use a filter to keep only the documents whose age is greater than 30, and then match the documents whose last_name is smith.
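To make the semantics concrete, the same two conditions can be applied to the sample employees in plain Python. This is a local simulation, not how Elasticsearch executes the query; bool_search and the in-memory EMPLOYEES list are assumptions made for the example.

```python
# The two employees that appear in the search results of this article.
EMPLOYEES = [
    {"first_name": "John", "last_name": "Smith", "age": 25},
    {"first_name": "Jane", "last_name": "Smith", "age": 32},
]


def bool_search(docs, age_gt, last_name):
    """Apply a range filter (age > age_gt), then a last_name match."""
    # filter clause: a yes/no test that keeps documents with age > age_gt.
    kept = [d for d in docs if d["age"] > age_gt]
    # must clause: analyzed text is lowercased by the default analyzer,
    # so "smith" matches "Smith"; lower() imitates that here.
    return [d for d in kept if d["last_name"].lower() == last_name.lower()]


result = bool_search(EMPLOYEES, 30, "smith")
print([d["first_name"] for d in result])
# ['Jane']
```

John is a Smith too, but at 25 he does not pass the range filter, so only Jane remains.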
Full-text search
In a full-text search, we can search the text of any field in the documents, for example:
curl https://es_endpoint/corporation/employee/_search -d '
{
"query" : {
"match" : {
"about" : "rock climbing"
}
}
} ' -H 'Content-Type: application/json'
The result is:
{"took":9,
"timed_out":false,
"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},
"hits":{"total":{"value":2,"relation":"eq"},
"max_score":0.5753642,
"hits":[
{"_index":"corporation",
"_type":"employee",
"_id":"1",
"_score":0.5753642,
"_source":
{
"first_name" : "John",
"last_name" : "Smith",
"age" : 25,
"about" : "I love to go rock climbing",
"interests": [ "sports", "music" ]
}},
{"_index":"corporation",
"_type":"employee",
"_id":"2",
"_score":0.2876821,
"_source":
{
"first_name" : "Jane",
"last_name" : "Smith",
"age" : 32,
"about" : "I like to collect rock albums",
"interests": [ "music" ]
}}]}}
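Both returned documents have a _score field, which indicates how relevant the document is to the query. Documents are returned in descending order of relevance. Our search terms were rock climbing, yet the second document, which contains only rock, is also returned, with a lower relevance than the first.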
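Why the second document is still returned can be illustrated with a toy ranking that counts how many query terms each about field contains. This is a deliberate simplification: the real _score comes from Elasticsearch's relevance model, not from a term count, and tokenize and match_any are names invented for this sketch.

```python
import re

# The about fields of documents 1 and 2 from the result above.
ABOUT = {
    "1": "I love to go rock climbing",
    "2": "I like to collect rock albums",
}


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"\w+", text.lower())


def match_any(docs, query):
    """Return (id, matched term count) pairs, best match first."""
    terms = set(tokenize(query))
    scored = []
    for doc_id, text in docs.items():
        overlap = len(terms & set(tokenize(text)))
        if overlap:  # one matching term is enough to be a hit
            scored.append((doc_id, overlap))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


print(match_any(ABOUT, "rock climbing"))
# [('1', 2), ('2', 1)]
```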
Phrase search
The search above matched rock climbing loosely, term by term. To match the phrase exactly, change match to match_phrase, for example:
curl -XGET https://es_endpoint/corporation/employee/_search -d '
{
"query" : {
"match_phrase" : {
"about" : "rock climbing"
}
}
} ' -H 'Content-Type: application/json'
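The difference from match is that all terms must appear, adjacent and in the same order. A local sketch of that rule, under the assumption that simple lowercase word splitting is close enough to the analyzer for this data; contains_phrase is a made-up helper:

```python
import re

# The about fields of documents 1 and 2 from the earlier results.
ABOUT = {
    "1": "I love to go rock climbing",
    "2": "I like to collect rock albums",
}


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"\w+", text.lower())


def contains_phrase(text, phrase):
    """True if the words of phrase occur consecutively, in order, in text."""
    words, target = tokenize(text), tokenize(phrase)
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


hits = [doc_id for doc_id, text in ABOUT.items()
        if contains_phrase(text, "rock climbing")]
print(hits)
# ['1']
```

Document 2 contains rock but not the full phrase, so it no longer matches.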
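Highlighted search
Many applications need to highlight the keywords matched by a search so that the matches are easy to see. Elasticsearch provides highlighting directly: just add a highlight parameter to the request, for example: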
curl -XGET https://search-tangaws-5grg7m53kinfqf2mip6oq6woqm.cn-north-1.es.amazonaws.com.cn/corporation/employee/_search -d '
{
"query" : {
"match_phrase" : {
"about" : "rock climbing"
}
},
"highlight": {
"fields" : {
"about" : {}
}
}
}' -H 'Content-Type: application/json'
The result is:
{"took":44,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":0.5753642,"hits":[{"_index":"corporation","_type":"employee","_id":"1","_score":0.5753642,"_source":
{
"first_name" : "John",
"last_name" : "Smith",
"age" : 25,
"about" : "I love to go rock climbing",
"interests": [ "sports", "music" ]
},"highlight":{"about":["I love to go <em>rock</em> <em>climbing</em>"]}}]}}
The result contains a new field, "highlight", which holds the matching text from about and wraps each matched word in <em></em>.
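The shape of the highlighted fragment can be reproduced locally with a regular expression. This only imitates the output format; it is not how Elasticsearch implements highlighting, and the highlight function below is a hypothetical helper.

```python
import re


def highlight(text, terms, pre="<em>", post="</em>"):
    """Wrap every whole-word occurrence of the given terms in pre/post tags."""
    alternatives = "|".join(re.escape(t) for t in terms)
    pattern = re.compile(rf"\b({alternatives})\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"{pre}{m.group(0)}{post}", text)


print(highlight("I love to go rock climbing", ["rock", "climbing"]))
# I love to go <em>rock</em> <em>climbing</em>
```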
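3. Aggregations
In data analysis scenarios we need to compute statistics over documents. Elasticsearch provides a feature called aggregations, which lets us generate sophisticated statistics over the data. It is similar to GROUP BY in SQL but more powerful.
For example, to find the most popular interests among all employees: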
curl -XGET https://search-tangaws-5grg7m53kinfqf2mip6oq6woqm.cn-north-1.es.amazonaws.com.cn/corporation/employee/_search -d '
{
"aggs": {
"all_interests": {
"terms": { "field": "interests.keyword" }
}
}
}' -H 'Content-Type: application/json'
The response is:
… the earlier part of the response is omitted; we look only at the aggregation results:
"aggregations": {
"all_interests": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [{
"key": "music",
"doc_count": 2
}, {
"key": "forestry",
"doc_count": 1
}, {
"key": "sports",
"doc_count": 1
}]
}
}
We can see that two employees are interested in music, while forestry and sports each have only one interested employee.
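The bucket counts can be checked by hand with a small Python equivalent of the terms aggregation. This is a sketch under assumptions: the interest lists are taken from the documents and buckets shown in this article, terms_aggregation is a name invented here, and ties are simply ordered alphabetically, which happens to match the response above.

```python
from collections import Counter

# One interests list per employee document.
INTERESTS = [
    ["sports", "music"],  # employee 1
    ["music"],            # employee 2
    ["forestry"],         # a third employee, as implied by the buckets above
]


def terms_aggregation(interest_lists):
    """Count, for each distinct value, how many documents contain it."""
    counts = Counter()
    for interests in interest_lists:
        counts.update(set(interests))  # a document counts once per value
    # Highest doc_count first; ties ordered alphabetically by key.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered]


for bucket in terms_aggregation(INTERESTS):
    print(bucket)
# {'key': 'music', 'doc_count': 2}
# {'key': 'forestry', 'doc_count': 1}
# {'key': 'sports', 'doc_count': 1}
```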
References:
https://www.elastic.co/guide/en/elasticsearch/guide/master/index.html
Source: https://www.cnblogs.com/zackstang/p/12021845.html