具有词干提取功能的标准分析器

本文介绍了具有词干提取功能的标准分析器的处理方法，对大家解决问题具有一定的参考价值

问题描述

有没有办法将 PorterStemFilter 集成到 Lucene 中的 StandardAnalyzer 中，或者我必须复制/粘贴 StandardAnalyzers 源代码，然后添加过滤器，因为 StandardAnalyzer 被定义为最终类.有没有更聪明的办法?

Is there a way to integrate PorterStemFilter into StandardAnalyzer in Lucene, or do I have to copy/paste StandardAnalyzers source code, and add the filter, since StandardAnalyzer is defined as final class. Is there any smarter way?

另外，如果我不想考虑数字，我该如何实现?

Also, if I would like not to consider numbers, how can I achieve that?

谢谢

推荐答案

如果你想用这个组合进行英文文本分析，那么你应该使用Lucene的EnglishAnalyzer.否则，您可以创建一个扩展 AnalyzerWraper 的新 Analyzer，如下所示.

If you want to use this combination for English text analysis, then you should use Lucene's EnglishAnalyzer. Otherwise, you could create a new Analyzer that extends the AnalyzerWraper as shown below.

import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.AnalyzerWrapper;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.TypeTokenFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;


public class PorterAnalyzer extends AnalyzerWrapper {

  private Analyzer baseAnalyzer;

  public PorterAnalyzer(Analyzer baseAnalyzer) {
      this.baseAnalyzer = baseAnalyzer;
  }

  @Override
  public void close() {
      baseAnalyzer.close();
      super.close();
  }

  @Override
  protected Analyzer getWrappedAnalyzer(String fieldName)
  {
      return baseAnalyzer;
  }

  @Override
  protected TokenStreamComponents wrapComponents(String fieldName, TokenStreamComponents components)
  {
      TokenStream ts = components.getTokenStream();
      Set<String> filteredTypes = new HashSet<>();
      filteredTypes.add("<NUM>");
      TypeTokenFilter numberFilter = new TypeTokenFilter(Version.LUCENE_46,ts, filteredTypes);

      PorterStemFilter porterStem = new PorterStemFilter(numberFilter);
      return new TokenStreamComponents(components.getTokenizer(), porterStem);
  }

  public static void main(String[] args) throws IOException
  {

      //Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
      PorterAnalyzer analyzer = new PorterAnalyzer(new StandardAnalyzer(Version.LUCENE_46));
      String text = "This is a testing example. It should tests the Porter stemmer version 111";

      TokenStream ts = analyzer.tokenStream("fieldName", new StringReader(text));
      ts.reset();

      while (ts.incrementToken()){
          CharTermAttribute ca = ts.getAttribute(CharTermAttribute.class);

          System.out.println(ca.toString());
      }
      analyzer.close();
  }

}

上面的代码是基于这个lucene 论坛主题的.主要工作由 wrapComponents 方法实现.您首先从包装的分析器中获取 TokenStream 对象，然后您应该应用类型过滤器来忽略数字标记.最后，您应用了搬运工词干过滤器.我希望很清楚.

The code above is based on this lucene forum thread's. The main work is implemented by the wrapComponents method. You first get the TokenStream object from the wrapped analyzer, you then shoud apply a type filter to ignore numerical tokens. Lastly, you apply the porter stemmer filter. I hope it is clear.

这篇关于具有词干提取功能的标准分析器的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，WP2

具有词干提取功能的标准分析器

问题描述

推荐答案

admin_action_{$_REQUEST[‘action’]}

admin_footer-{$GLOBALS[‘hook_suffix’]}

customize_save_{$this->id_data[‘base’]}

customize_value_{$this->id_data[‘base’]}

get_comment_author_url

network_admin_edit_{$_GET[‘action’]}

network_sites_updated_message_{$_GET[‘updated’]}

pre_wp_is_site_initialized

WordPress 的SEO 教学：如何在网站中加入关键字（Meta Keywords）与Meta 描述（Meta Description）？

谷歌的SEO是什么