java - How to tokenize only certain words in Lucene

Author: Internet

I'm using Lucene for a project and I need a custom analyzer.

The code is:

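import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class MyCommentAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents( String fieldName, Reader reader ) {

        Tokenizer source = new StandardTokenizer( Version.LUCENE_48, reader );
        TokenStream filter = new StandardFilter( Version.LUCENE_48, source );

        // Removes the default English stop words from the token stream.
        filter = new StopFilter( Version.LUCENE_48, filter, StandardAnalyzer.STOP_WORDS_SET );

        return new TokenStreamComponents( source, filter );
    }

}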

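I have this set up, but now I'm stuck. What I need is a filter that lets only certain words through. It is the reverse of using stop words: instead of removing the terms that appear in a word list, it should keep only the terms that appear in the word list, like a prebuilt dictionary.
So StopFilter does not serve the purpose, and none of the filters that Lucene provides seems to fit.
I think I need to write my own filter, but I don't know how.

Any suggestions?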

Solution:

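You should use StopFilter as your starting point, so read the source.

Most of StopFilter's source consists of convenience methods for building the stop set. You can safely ignore all of that (unless you want to keep it for building a keep set).

Cut all of that out, and StopFilter boils down to: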

public final class StopFilter extends FilteringTokenFilter {

    private final CharArraySet stopWords;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public StopFilter(Version matchVersion, TokenStream in, CharArraySet stopWords) {
        super(matchVersion, in);
        this.stopWords = stopWords;
    }

    @Override
    protected boolean accept() {
        return !stopWords.contains(termAtt.buffer(), 0, termAtt.length());
    }
}

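FilteringTokenFilter is a very simple class. The key is the accept method: it is called for the current term, and if it returns true the term is added to the output stream; if it returns false, the current term is discarded.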
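To make the accept contract concrete, here is a minimal plain-Java sketch of the same pattern. It does not use Lucene at all: AcceptDemo, SimpleFilteringStream and MinLengthStream are hypothetical names invented for this illustration, and the real FilteringTokenFilter works on token attributes rather than on strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class AcceptDemo {

    // Hypothetical stand-in for FilteringTokenFilter: the base class owns the
    // loop over incoming terms, and subclasses only decide what to keep.
    abstract static class SimpleFilteringStream {
        private final Iterator<String> input;
        protected String term; // the current term, visible to accept()

        SimpleFilteringStream(List<String> tokens) {
            this.input = tokens.iterator();
        }

        // Return true to emit the current term, false to drop it.
        protected abstract boolean accept();

        // Advance to the next accepted term; false once the input is exhausted.
        final boolean incrementToken() {
            while (input.hasNext()) {
                term = input.next();
                if (accept()) {
                    return true;
                }
            }
            return false;
        }

        // Convenience helper: drain the stream into a list.
        final List<String> collect() {
            List<String> out = new ArrayList<>();
            while (incrementToken()) {
                out.add(term);
            }
            return out;
        }
    }

    // Example subclass: keeps only terms of at least minLength characters.
    static final class MinLengthStream extends SimpleFilteringStream {
        private final int minLength;

        MinLengthStream(List<String> tokens, int minLength) {
            super(tokens);
            this.minLength = minLength;
        }

        @Override
        protected boolean accept() {
            return term.length() >= minLength;
        }
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a", "custom", "of", "analyzer");
        System.out.println(new MinLengthStream(tokens, 3).collect());
    }
}
```

The point of the design is that a subclass never touches the iteration logic; it supplies one boolean decision per term.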

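So the only thing you really need to change in StopFilter is to delete a single character (the ! negation), so that accept returns the opposite of what it returns now. It also wouldn't hurt to change a few names here and there.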

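import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.analysis.util.FilteringTokenFilter;
import org.apache.lucene.util.Version;

public final class KeepOnlyFilter extends FilteringTokenFilter {

    private final CharArraySet keepWords;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public KeepOnlyFilter(Version matchVersion, TokenStream in, CharArraySet keepWords) {
        super(matchVersion, in);
        this.keepWords = keepWords;
    }

    // Accept a term only if it is in the keep set (StopFilter's test without the negation).
    @Override
    protected boolean accept() {
        return keepWords.contains(termAtt.buffer(), 0, termAtt.length());
    }
}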
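A quick way to see the effect of that one-character change is to run both tests over the same tokens. The sketch below is plain Java, with java.util.Set standing in for CharArraySet; the class name KeepVersusStop, its two methods, and the sample words are all made up for illustration and are not part of Lucene.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class KeepVersusStop {

    // Stop-style: drop every token that is found in the word set.
    static List<String> stopFilter(List<String> tokens, Set<String> words) {
        return tokens.stream()
                .filter(t -> !words.contains(t))
                .collect(Collectors.toList());
    }

    // Keep-style: the same test without the negation, so only listed tokens survive.
    static List<String> keepOnlyFilter(List<String> tokens, Set<String> words) {
        return tokens.stream()
                .filter(t -> words.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<String> dictionary = new HashSet<>(Arrays.asList("lucene", "analyzer"));
        List<String> tokens = Arrays.asList("my", "lucene", "analyzer", "works");

        System.out.println(stopFilter(tokens, dictionary));     // [my, works]
        System.out.println(keepOnlyFilter(tokens, dictionary)); // [lucene, analyzer]
    }
}
```

In the analyzer from the question, the equivalent step would presumably be to replace the StopFilter line with new KeepOnlyFilter( Version.LUCENE_48, filter, keepWords ), where keepWords is a CharArraySet built from your own dictionary.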

Tags: java, dictionary, tokenize, lucene
Source: https://codeday.me/bug/20191009/1880180.html