Problem description
Is there a way to integrate PorterStemFilter into StandardAnalyzer in Lucene, or do I have to copy and paste StandardAnalyzer's source code and add the filter myself, since StandardAnalyzer is declared as a final class? Is there a smarter way?
Also, if I do not want numbers to be considered, how can I achieve that?
Thanks
Recommended answer
If you want this combination for English text analysis, you should use Lucene's EnglishAnalyzer, which already applies Porter stemming. Otherwise, you can create a new Analyzer that extends AnalyzerWrapper, as shown below.
    import java.io.IOException;
    import java.io.StringReader;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.AnalyzerWrapper;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.TypeTokenFilter;
    import org.apache.lucene.analysis.en.PorterStemFilter;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public class PorterAnalyzer extends AnalyzerWrapper {

        // The analyzer whose token stream gets the extra filters.
        private Analyzer baseAnalyzer;

        public PorterAnalyzer(Analyzer baseAnalyzer) {
            this.baseAnalyzer = baseAnalyzer;
        }

        @Override
        public void close() {
            baseAnalyzer.close();
            super.close();
        }

        @Override
        protected Analyzer getWrappedAnalyzer(String fieldName) {
            return baseAnalyzer;
        }

        @Override
        protected TokenStreamComponents wrapComponents(String fieldName, TokenStreamComponents components) {
            TokenStream ts = components.getTokenStream();

            // Drop tokens that the tokenizer tagged as numbers.
            Set<String> filteredTypes = new HashSet<>();
            filteredTypes.add("<NUM>");
            TypeTokenFilter numberFilter = new TypeTokenFilter(Version.LUCENE_46, ts, filteredTypes);

            // Stem whatever is left.
            PorterStemFilter porterStem = new PorterStemFilter(numberFilter);
            return new TokenStreamComponents(components.getTokenizer(), porterStem);
        }

        public static void main(String[] args) throws IOException {
            // For comparison, the unwrapped analyzer:
            // Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
            PorterAnalyzer analyzer = new PorterAnalyzer(new StandardAnalyzer(Version.LUCENE_46));
            String text = "This is a testing example. It should tests the Porter stemmer version 111";

            TokenStream ts = analyzer.tokenStream("fieldName", new StringReader(text));
            // Fetch the term attribute once, before consuming the stream.
            CharTermAttribute ca = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(ca.toString());
            }
            // Finish and release the stream before closing the analyzer.
            ts.end();
            ts.close();
            analyzer.close();
        }
    }
The code above is based on a Lucene forum thread. The main work happens in the wrapComponents method: you first get the TokenStream from the wrapped analyzer, then apply a type filter to drop numeric tokens (those the tokenizer tags with the type <NUM>), and finally apply the Porter stem filter. I hope that is clear.
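To make the order of the chain easier to see without a Lucene dependency, here is a minimal plain-Java sketch of the same three stages: tokenize, drop numeric tokens, then stem. Everything in it is an illustration and not Lucene API. The class FilterChainSketch and its methods are hypothetical names, the tokenizer is a simple regex split that does no stop-word removal, and naiveStem is a deliberately crude suffix stripper standing in for the Porter algorithm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only: mimics the filter order, not Lucene's classes.
public class FilterChainSketch {

    // Stage 1: lowercase and split on anything that is not a letter or digit
    // (a stand-in for the token stream produced by the wrapped analyzer).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stage 2: drop tokens made up only of digits
    // (a stand-in for the type filter on "<NUM>").
    static List<String> dropNumbers(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!t.matches("[0-9]+")) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Stage 3: crude suffix stripping. This is NOT the Porter algorithm;
    // it only shows where stemming sits in the chain.
    static String naiveStem(String token) {
        if (token.length() > 5 && token.endsWith("ing")) {
            return token.substring(0, token.length() - 3);
        }
        if (token.length() > 3 && token.endsWith("s") && !token.endsWith("ss")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // The full chain: numbers are removed before stemming, as in wrapComponents.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String t : dropNumbers(tokenize(text))) {
            out.add(naiveStem(t));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [it, test, the, stemmer, version]
        System.out.println(analyze("It tests the stemmer version 111"));
    }
}
```

As in the Lucene version, the number filter runs before the stemmer, so the stemmer never sees the tokens that are going to be discarded.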