Advanced, Part 19: In-Depth Search Techniques: Combining match with Proximity Matching to Balance Recall and Precision

Author: Internet

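Definition of recall

Say you search for java spark and there are 100 docs in total: how many of those docs the search can return as results is the recall. Strictly speaking, recall is the fraction of the relevant docs that the search actually returns.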

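Definition of precision

Say you search for java spark: precision here means whether the docs that contain java spark, or in which java and spark sit very close together, can be ranked as near the top as possible.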
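The two definitions can be made concrete with a little arithmetic. The sketch below uses the standard information-retrieval formulas (recall = relevant docs returned / all relevant docs, precision = relevant docs returned / all docs returned), which are somewhat stricter than the ranking-oriented description above. The doc IDs and function names are made up for illustration.

```python
def recall(relevant, returned):
    """Fraction of the relevant docs that the search returned."""
    if not relevant:
        return 0.0
    return len(relevant & returned) / len(relevant)


def precision(relevant, returned):
    """Fraction of the returned docs that are actually relevant."""
    if not returned:
        return 0.0
    return len(relevant & returned) / len(returned)


# Hypothetical doc IDs: four docs are relevant, the search returned three docs,
# two of which are relevant.
relevant_ids = {1, 2, 3, 4}
returned_ids = {1, 2, 9}

print(recall(relevant_ids, returned_ids))     # 0.5
print(precision(relevant_ids, returned_ids))  # 2 of 3 returned docs are relevant
```

Returning more docs can only keep recall the same or raise it, but it tends to lower precision, which is the tension the rest of the article deals with.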

 

Using a match_phrase phrase search directly means a doc matches only if every term appears in the doc field and the terms lie within the distance allowed by slop.

match_phrase, that is, proximity match, requires a doc to contain all of the terms before it can be returned as a result; a doc that is missing even one term is not returned.

java spark --> hello world java --> not returned

java spark --> hello world, java spark --> returned
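The rule in the two examples above can be sketched in a few lines of Python. This is a simplified model, not Elasticsearch's implementation: it tokenizes on word characters, assumes the query terms are distinct, and estimates the slop a match needs as the spread of (position in doc minus position in query) across the terms. The function names are hypothetical.

```python
import re
from itertools import product


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def min_slop(query, text):
    """Smallest slop under which the phrase would match, or None if a term is missing."""
    tokens = tokenize(text)
    positions = []
    for term in tokenize(query):
        hits = [i for i, tok in enumerate(tokens) if tok == term]
        if not hits:
            return None  # one missing term rules the doc out entirely
        positions.append(hits)
    best = None
    for combo in product(*positions):
        # How far each term sits from where an exact phrase would put it.
        shifts = [pos - offset for offset, pos in enumerate(combo)]
        needed = max(shifts) - min(shifts)
        if best is None or needed < best:
            best = needed
    return best


def phrase_matches(query, text, slop=0):
    needed = min_slop(query, text)
    return needed is not None and needed <= slop


print(phrase_matches("java spark", "hello world java"))         # False
print(phrase_matches("java spark", "hello world, java spark"))  # True
print(min_slop("java spark", "java is not spark"))              # 2
```

Under this model a doc can fail in two ways: a term is absent, or the terms are further apart than slop allows. Either way the doc drops out of the results, which is why recall suffers.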

 

So with proximity matching alone, recall is relatively low and precision is very high.

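Sometimes, though, we want a doc to be returned when it matches only some of the terms, which raises recall. At the same time we still want match_phrase's ability to boost the score by distance, so that the closer the terms are to each other, the higher the score and the earlier the doc is returned.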

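In other words, recall comes first: for java spark, docs containing only java are returned, docs containing only spark are returned, and docs containing both are returned. Precision is still served: among the docs that contain both java and spark, the closer the two terms are, the nearer the top the doc ranks.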

This can be achieved by using bool to combine a match query with a match_phrase query.
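To see why the combination behaves this way, here is a toy ranking in Python. It is only a sketch under loud assumptions: the scoring is invented for illustration (one point per query term present, plus a proximity bonus of 1 / (1 + slop needed)), it is not Elasticsearch's relevance formula, and the docs and function names are hypothetical.

```python
import re
from itertools import product


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def match_score(query, text):
    """Toy stand-in for match: one point per distinct query term found in the text."""
    present = set(tokenize(text))
    return sum(1 for term in set(tokenize(query)) if term in present)


def proximity_bonus(query, text, slop):
    """Toy stand-in for match_phrase: bonus only if all terms match within slop."""
    tokens = tokenize(text)
    positions = [[i for i, tok in enumerate(tokens) if tok == term]
                 for term in tokenize(query)]
    if not positions or not all(positions):
        return 0.0  # a missing term means the phrase clause contributes nothing
    needed = min(
        max(p - k for k, p in enumerate(combo)) - min(p - k for k, p in enumerate(combo))
        for combo in product(*positions)
    )
    return 1.0 / (1 + needed) if needed <= slop else 0.0


def rank(query, docs, slop=50):
    scored = []
    for doc_id, text in docs.items():
        base = match_score(query, text)
        if base == 0:
            continue  # the "must" part: at least one term has to match
        # The "should" part: optional, but adds to the score when it matches.
        scored.append((doc_id, base + proximity_bonus(query, text, slop)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


docs = {
    "a": "java blog",
    "b": "spark tutorial",
    "c": "java is slower than spark",
    "d": "learning java spark",
    "e": "hello world",
}
ranked = rank("java spark", docs)
print(ranked)
```

Docs a and b are still returned (recall), doc e is not, and doc d outranks doc c because its two terms are adjacent (precision in the article's sense).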

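Example

The must clause (match) returns docs that contain java, or spark, or both. Docs containing both rank higher, but match cannot tell how far apart java and spark are, so a doc in which the two terms sit right next to each other is not guaranteed to come first.

The should clause (match_phrase) contributes its own relevance score to any doc in which java spark matches within the slop; the closer java and spark are, the higher that score.

GET /forum/article/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "title": {
            "query": "java spark"
          }
        }
      },
      "should": {
        "match_phrase": {
          "title": {
            "query": "java spark",
            "slop": 50
          }
        }
      }
    }
  }
}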
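If the same request has to be issued for different fields or search strings, the body can be assembled programmatically. The following is a minimal sketch that only builds and prints the JSON body; the helper name build_bool_query is hypothetical, and sending the request to a cluster is left out.

```python
import json


def build_bool_query(field, text, slop):
    """Build a bool query: match in must for recall, match_phrase in should for ranking."""
    return {
        "query": {
            "bool": {
                # Any doc containing at least one of the terms is a hit.
                "must": {"match": {field: {"query": text}}},
                # Docs where the terms match within slop get an extra score.
                "should": {"match_phrase": {field: {"query": text, "slop": slop}}},
            }
        }
    }


body = build_bool_query("title", "java spark", 50)
print(json.dumps(body, indent=2))
```

The printed JSON has the same structure as the request body above and can be pasted after a GET /forum/article/_search line.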

 

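Experiment 1: match query alone

match on its own guarantees recall but not precision.

GET /forum/article/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "java spark"
          }
        }
      ]
    }
  }
}

Result: doc 2, which contains only java, comes first with 0.68640786, slightly ahead of doc 5 (0.68324494), even though doc 5 contains java and spark side by side.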

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.68640786,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_score": 0.68640786,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": [
            "java"
          ],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith",
          "author_last_name": "Williams"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 0.68324494,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith"
        }
      }
    ]
  }
}

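Experiment 2: ensuring both recall and precision

GET /forum/article/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "java spark"
          }
        }
      ],
      "should": [
        {
          "match_phrase": {
            "content": {
              "query": "java spark",
              "slop": 50
            }
          }
        }
      ]
    }
  }
}

Result: doc 5 now ranks first with 1.258609, because the match_phrase clause in should adds to its score; doc 2 still matches through must alone and keeps its score of 0.68640786.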

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1.258609,
    "hits": [
      {
        "_index": "forum",
        "_type": "article",
        "_id": "5",
        "_score": 1.258609,
        "_source": {
          "articleID": "DHJK-B-1395-#Ky5",
          "userID": 3,
          "hidden": false,
          "postDate": "2017-03-01",
          "tag": [
            "elasticsearch"
          ],
          "tag_cnt": 1,
          "view_cnt": 10,
          "title": "this is spark blog",
          "content": "spark is best big data solution based on scala ,an programming language similar to java spark",
          "sub_title": "haha, hello world",
          "author_first_name": "Tonny",
          "author_last_name": "Peter Smith"
        }
      },
      {
        "_index": "forum",
        "_type": "article",
        "_id": "2",
        "_score": 0.68640786,
        "_source": {
          "articleID": "KDKE-B-9947-#kL5",
          "userID": 1,
          "hidden": false,
          "postDate": "2017-01-02",
          "tag": [
            "java"
          ],
          "tag_cnt": 1,
          "view_cnt": 50,
          "title": "this is java blog",
          "content": "i think java is the best programming language",
          "sub_title": "learned a lot of course",
          "author_first_name": "Smith",
          "author_last_name": "Williams"
        }
      }
    ]
  }
}
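The effect of the should clause can be read straight off the two result sets. The short Python sketch below takes the _score values reported in the two experiments and works out how much each doc gained; it assumes, as a simplification, that a bool query's score is the sum of the scores of its matching clauses. The variable names are made up.

```python
# _score values copied from the two experiments above, keyed by _id.
match_only = {"2": 0.68640786, "5": 0.68324494}
match_plus_phrase = {"2": 0.68640786, "5": 1.258609}

# Score gained by each doc once the match_phrase clause is added to should.
gain = {
    doc_id: round(match_plus_phrase[doc_id] - match_only[doc_id], 8)
    for doc_id in match_only
}

ranking_before = sorted(match_only, key=match_only.get, reverse=True)
ranking_after = sorted(match_plus_phrase, key=match_plus_phrase.get, reverse=True)

print(gain)
print(ranking_before, ranking_after)
```

Doc 2 gains nothing because it lacks spark and so cannot match the phrase, while doc 5 gains roughly 0.575, which is enough to move it from second place to first.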

 

Source: https://blog.csdn.net/qq_35524586/article/details/88426861