1

我想在 lucene 项目中使用 WikipediaTokenizer - http://lucene.apache.org/java/3_0_2/api/contrib-wikipedia/org/apache/lucene/wikipedia/analysis/WikipediaTokenizer.html但我从未使用过 lucene。我只想将维基百科字符串转换为令牌列表。但是,我看到这个类中只有四种方法可用,end、incrementToken、reset、reset(reader)。有人可以给我举个例子来使用它。

谢谢你。

4

3 回答 3

3

在 Lucene 3.0 中,移除了 next() 方法。现在您应该使用 incrementToken 来遍历令牌,当您到达输入流的末尾时它会返回 false。要获取每个令牌,您应该使用AttributeSource类的方法。根据您要获取的属性(术语、类型、有效负载等),您需要使用 addAttribute 方法将相应属性的类类型添加到标记器。

以下部分代码示例来自 WikipediaTokenizer 的测试类,如果您下载 Lucene 的源代码,您可以找到它。

...
WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(test));
int count = 0;
int numItalics = 0;
int numBoldItalics = 0;
int numCategory = 0;
int numCitation = 0;
TermAttribute termAtt = tf.addAttribute(TermAttribute.class);
TypeAttribute typeAtt = tf.addAttribute(TypeAttribute.class);

while (tf.incrementToken()) {
  String tokText = termAtt.term();
  //System.out.println("Text: " + tokText + " Type: " + token.type());
  String expectedType = (String) tcm.get(tokText);
  assertTrue("expectedType is null and it shouldn't be for: " + tf.toString(), expectedType != null);
  assertTrue(typeAtt.type() + " is not equal to " + expectedType + " for " + tf.toString(), typeAtt.type().equals(expectedType) == true);
  count++;
  if (typeAtt.type().equals(WikipediaTokenizer.ITALICS)  == true){
    numItalics++;
  } else if (typeAtt.type().equals(WikipediaTokenizer.BOLD_ITALICS)  == true){
    numBoldItalics++;
  } else if (typeAtt.type().equals(WikipediaTokenizer.CATEGORY)  == true){
    numCategory++;
  }
  else if (typeAtt.type().equals(WikipediaTokenizer.CITATION)  == true){
    numCitation++;
  }
}
...
于 2010-10-13T15:07:37.890 回答
1

WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(test));

令牌令牌 = 新令牌();

令牌 = tf.next(令牌);

http://www.javadocexamples.com/java_source/org/apache/lucene/wikipedia/analysis/WikipediaTokenizerTest.java.html

问候

于 2010-10-13T14:25:54.043 回答
0

公共类 WikipediaTokenizerTest { 静态 Logger logger = Logger.getLogger(WikipediaTokenizerTest.class); protected static final String LINK_PHRASES = "单击 [[再次链接此处]] 单击 [ http://lucene.apache.org再次单击此处] [[Category:abcd]]";

public WikipediaTokenizer testSimple() throws Exception {
    String text = "This is a [[Category:foo]]";
    return new WikipediaTokenizer(new StringReader(text));
}
public static void main(String[] args){
    WikipediaTokenizerTest wtt = new WikipediaTokenizerTest();

    try {
        WikipediaTokenizer x = wtt.testSimple();

        logger.info(x.hasAttributes());

        Token token = new Token();
        int count = 0;
        int numItalics = 0;
        int numBoldItalics = 0;
        int numCategory = 0;
        int numCitation = 0;

        while (x.incrementToken() == true) {
            logger.info("seen something");
        }

    } catch(Exception e){
        logger.error("Exception while tokenizing Wiki Text: " + e.getMessage());
    }


}
于 2014-05-30T08:06:16.193 回答