I am trying to use StringToWordVector with a WordTokenizer. Here is my code:

StringToWordVector filter = new StringToWordVector();

//Tokenizer option (letter only)
String tokenizerOption[] = new String[2];
tokenizerOption[0] = "-tokenizer";
tokenizerOption[1] = "weka.core.tokenizers.WordTokenizer -delimiters \r\t\n .,;:\'\"()?!-><#$%&*+/@^_=[]{}|\\`~0123456789";
filter.setOptions(tokenizerOption);
filter.setInputFormat(data);

I then save the filtered instances to an ARFF file. This is the ARFF I get:

@attribute '\n' numeric
@attribute ' ' numeric
@attribute ' a ' numeric

As you can see, \n and the space character are not being treated as delimiters. How can I get the tokenizer to include them?

1 Answer

I found the answer; see the code below:

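import weka.core.tokenizers.WordTokenizer;
import weka.filters.unsupervised.attribute.StringToWordVector;

//Make a filter
StringToWordVector filter = new StringToWordVector();

//Make a tokenizer and set the delimiters directly, instead of passing them in an option string
WordTokenizer wt = new WordTokenizer();
String delimiters = " \r\t\n.,;:\'\"()?!-><#$\\%&*+/@^_=[]{}|`~0123456789";
wt.setDelimiters(delimiters);
filter.setTokenizer(wt);

//Inform filter about dataset (data is the Instances object from the question)
filter.setInputFormat(data);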
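
To see what the delimiter set does without pulling in Weka, here is a minimal sketch using only the JDK. It assumes that WordTokenizer splits text the same way `java.util.StringTokenizer` does, which is worth checking against your Weka version. The class name `DelimiterDemo` and the helper `tokenize` are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class DelimiterDemo {
    // Same delimiter set as in the answer: whitespace, punctuation, and digits.
    static final String DELIMITERS = " \r\t\n.,;:'\"()?!-><#$\\%&*+/@^_=[]{}|`~0123456789";

    // Hypothetical helper: splits text on any single delimiter character.
    // StringTokenizer never returns empty tokens.
    static List<String> tokenize(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // With whitespace in the delimiter set, only the words survive.
        System.out.println(tokenize("a cat,\n2 dogs!", DELIMITERS)); // prints [a, cat, dogs]
    }
}
```

If the space and newline characters are missing from the delimiter set, they end up inside the tokens, which matches the ' ' and '\n' attributes in the ARFF from the question.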
Answered 2013-04-06T16:53:19.400