lucene - Lucene: how to preserve whitespaces etc when tokenizing stream?

Question

I am trying to perform a "translation" of sorts on a stream of text. More specifically, I need to tokenize the input stream, look up every term in a specialized dictionary, and output the corresponding "translation" of each token. However, I also want to preserve all the original whitespace, stopwords, etc. from the input, so that the output is formatted the same way as the input instead of ending up as a bare stream of translations. So if my input is

Term1: Term2 Stopword! Term3 Term4

then I want the output to look like

Term1': Term2' Stopword! Term3' Term4'

(where Termi' is translation of Termi) instead of simply

Term1' Term2' Term3' Term4'

Currently I am doing the following:

PatternAnalyzer pa = new PatternAnalyzer(Version.LUCENE_31,
                             PatternAnalyzer.WHITESPACE_PATTERN,
                             false, 
                             WordlistLoader.getWordSet(new File(stopWordFilePath)));
TokenStream ts = pa.tokenStream(null, in);
CharTermAttribute charTermAttribute = ts.getAttribute(CharTermAttribute.class);

while (ts.incrementToken()) { // loop over tokens
     String termIn = charTermAttribute.toString(); 
     ...
}

but this, of course, loses all the whitespace etc. How can I modify this so that I can re-insert it into the output? Thanks much!

============ UPDATE!

I tried splitting the original stream into "words" and "non-words". It seems to work fine. Not sure whether it's the most efficient way, though:

public ArrayList<Token> splitToWords(String sIn) {
    if (sIn == null || sIn.length() == 0) {
        return null;
    }

    char[] c = sIn.toCharArray();
    ArrayList<Token> list = new ArrayList<Token>();
    int tokenStart = 0;
    boolean curIsLetter = Character.isLetter(c[tokenStart]);
    for (int pos = tokenStart + 1; pos < c.length; pos++) {
        boolean newIsLetter = Character.isLetter(c[pos]);
        if (newIsLetter == curIsLetter) {
            continue; // still inside the same run
        }
        // The character class changed: emit the run that just ended.
        TokenType type = curIsLetter ? TokenType.WORD : TokenType.NONWORD;
        list.add(new Token(new String(c, tokenStart, pos - tokenStart), type));
        tokenStart = pos;
        curIsLetter = newIsLetter;
    }
    // Emit the final run.
    TokenType type = curIsLetter ? TokenType.WORD : TokenType.NONWORD;
    list.add(new Token(new String(c, tokenStart, c.length - tokenStart), type));

    return list;
}
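
For comparison, here is a minimal sketch of the same word / non-word split done with a regular expression, followed by the dictionary lookup and reassembly. It is not the code above: the class name `WordSplitTranslator`, the methods `split` and `translate`, and the sample dictionary are all hypothetical, and it uses plain strings instead of the `Token` and `TokenType` types. Like the method above, it treats digits as non-word characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
class WordSplitTranslator {
    // A run of letters (a "word") or a run of anything else (a "non-word").
    private static final Pattern RUN = Pattern.compile("\\p{L}+|\\P{L}+");

    /** Splits text into alternating word / non-word runs; never returns null. */
    static List<String> split(String text) {
        List<String> runs = new ArrayList<>();
        if (text == null) {
            return runs;
        }
        Matcher m = RUN.matcher(text);
        while (m.find()) {
            runs.add(m.group());
        }
        return runs;
    }

    /** Translates word runs found in the dictionary; copies everything else unchanged. */
    static String translate(String text, Map<String, String> dictionary) {
        StringBuilder out = new StringBuilder();
        for (String run : split(text)) {
            boolean isWord = Character.isLetter(run.codePointAt(0));
            out.append(isWord ? dictionary.getOrDefault(run, run) : run);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample dictionary, made up for illustration.
        Map<String, String> dictionary = Map.of("cat", "chat", "dog", "chien");
        System.out.println(translate("cat:  the dog!", dictionary));
        // prints: chat:  the chien!
    }
}
```

Because every character of the input ends up in exactly one run, concatenating the runs always reproduces the original spacing and punctuation.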

score 0 · Accepted Answer

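Well, it doesn't really lose the whitespace; you still have your original text :)

So I think you should use OffsetAttribute, which gives you the startOffset() and endOffset() of each term in your original text. This is what Lucene uses, for example, to highlight snippets of search results from the original text.

I wrote up a quick test (using EnglishAnalyzer) to demonstrate. The input is:

Just a test of some ideas. Let's see if it works.

The output is: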

just a test of some idea. let see if it work.

// Imports needed by the test below (Lucene 3.5 package names):
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.Version;

// just for example purposes, not necessarily the most performant.
public void testString() throws Exception {
  String input = "Just a test of some ideas. Let's see if it works.";
  EnglishAnalyzer analyzer = new EnglishAnalyzer(Version.LUCENE_35);
  StringBuilder output = new StringBuilder(input);
  // in some cases, the analyzer will make terms longer or shorter.
  // because of this we must track how much we have adjusted the text so far
  // so that the offsets returned will still work for us via replace()
  int delta = 0;

  TokenStream ts = analyzer.tokenStream("bogus", new StringReader(input));
  CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
  OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
  ts.reset();
  while (ts.incrementToken()) {
    String term = termAtt.toString();
    int start = offsetAtt.startOffset();
    int end = offsetAtt.endOffset();
    // offsets refer to the original input, so shift them by the accumulated delta
    output.replace(delta + start, delta + end, term);
    delta += (term.length() - (end - start));
  }
  ts.end();
  ts.close();

  System.out.println(output.toString());
}
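
Applied to the dictionary lookup from the question, the same offset idea can also be written without the delta bookkeeping, by building the output left to right and copying the untouched text between tokens. The following is only a sketch under assumptions: it uses a `java.util.regex.Matcher` as a stand-in tokenizer so that it runs without Lucene, with `m.start()` and `m.end()` playing the role of `offsetAtt.startOffset()` and `offsetAtt.endOffset()`. The class name `OffsetTranslator`, the token pattern, and the sample dictionary are hypothetical.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene.
class OffsetTranslator {
    // Stand-in tokenizer (an assumption): each run of letters or digits is one token.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");

    static String translate(String input, Map<String, String> dictionary) {
        StringBuilder out = new StringBuilder();
        int copiedUpTo = 0; // end offset of the last token already written
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // Copy the untouched text between the previous token and this one.
            out.append(input, copiedUpTo, m.start());
            String term = m.group();
            // Terms missing from the dictionary (e.g. stopwords) are kept as they are.
            out.append(dictionary.getOrDefault(term, term));
            copiedUpTo = m.end();
        }
        // Copy whatever follows the last token.
        out.append(input, copiedUpTo, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample dictionary, made up to match the example in the question.
        Map<String, String> dictionary = Map.of(
                "Term1", "Term1'", "Term2", "Term2'",
                "Term3", "Term3'", "Term4", "Term4'");
        System.out.println(translate("Term1: Term2 Stopword! Term3 Term4", dictionary));
        // prints: Term1': Term2' Stopword! Term3' Term4'
    }
}
```

With a Lucene TokenStream the loop body would presumably stay the same, with the offsets taken from OffsetAttribute and the term from CharTermAttribute. Tokens that the analyzer drops, such as stopwords, never appear in the stream, so their text is carried over as part of the copied gaps.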
