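I want to remove the following from my tweet data:
Anything containing @ (e.g. @nike)
Anything that starts with ://
In my Scala script I have stop words, but they have to match the output exactly. Is there a way to add stop words such as @* or ://* that cover every possible word I want to remove?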
val source = CSVFile("output.csv")
val tokenizer = {
  SimpleEnglishTokenizer() ~>            // tokenize on space and punctuation
  WordsAndNumbersOnlyFilter() ~>         // ignore non-words and non-numbers
  CaseFolder() ~>                        // lowercase everything
  MinimumLengthFilter(3)                 // take terms with >=3 characters
}
val text = {
  source ~>                              // read from the source file
  Column(1) ~>                           // select column containing text
  TokenizeWith(tokenizer) ~>             // tokenize with tokenizer above
  TermCounter() ~>                       // collect counts (needed below)
  TermMinimumDocumentCountFilter(30) ~>  // filter terms in <30 docs
  TermStopListFilter(List("a", "and", "I", "but", "what")) ~> // stopword list
  TermDynamicStopListFilter(10) ~>       // filter out 10 most common terms
  DocumentMinimumLengthFilter(5)         // take only docs with >=5 terms
}
The tokenizer doesn't seem to recognize these non-alphabetic characters, yet it filters out # without any problem. Thanks for your help!
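For reference, the effect I'm after can be sketched in plain Scala with regular expressions applied to the raw tweet text before it reaches the tokenizer. This is only an illustration under assumptions: `TweetCleaner` and `clean` are hypothetical names that are not part of the toolbox, and I'm assuming that the `://` case means whole URL-like tokens such as http://t.co/abc. How to hook this into the pipeline above is not shown here.

```scala
// Hypothetical helper (not part of the toolbox): strips unwanted tokens
// from raw tweet text before tokenization.
object TweetCleaner {
  // Any whitespace-delimited token containing "://" (assumed to be a URL)
  private val url = """\S*://\S*""".r
  // Any whitespace-delimited token containing '@' (e.g. @nike)
  private val mention = """\S*@\S*""".r

  def clean(text: String): String = {
    val noUrls = url.replaceAllIn(text, " ")
    val noMentions = mention.replaceAllIn(noUrls, " ")
    // Collapse the gaps left behind by the removed tokens
    noMentions.trim.replaceAll("""\s+""", " ")
  }
}
```

With this sketch, `TweetCleaner.clean("Love my new shoes @nike http://t.co/abc #run")` leaves the hashtag in place and drops the mention and the link. Note that the `@` pattern also removes e-mail addresses, since it matches any token containing `@`.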