
I want to remove the following occurrences from my tweet data:

anything with an @ (e.g. @nike)

anything beginning with ://

In my Scala script I have stop words, but they must match the output exactly. Is there a way to add stop words such as @* or ://* to cover every possible word I want to remove?

val source = CSVFile("output.csv")

val tokenizer = {
  SimpleEnglishTokenizer() ~>            // tokenize on space and punctuation
  WordsAndNumbersOnlyFilter() ~>         // ignore non-words and non-numbers
  CaseFolder() ~>                        // lowercase everything
  MinimumLengthFilter(3)                 // keep terms with >= 3 characters
}

val text = {
  source ~>                              // read from the source file
  Column(1) ~>                           // select the column containing text
  TokenizeWith(tokenizer) ~>             // tokenize with the tokenizer above
  TermCounter() ~>                       // collect counts (needed below)
  TermMinimumDocumentCountFilter(30) ~>  // filter out terms appearing in < 30 docs
  TermStopListFilter(List("a", "and", "I", "but", "what")) ~> // stop-word list
  TermDynamicStopListFilter(10) ~>       // filter out the 10 most common terms
  DocumentMinimumLengthFilter(5)         // keep only docs with >= 5 terms
}

The tokenizer doesn't seem to recognize these non-alphabetic characters, yet it filters out # without any problem. Thanks for your help!
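To make this concrete, here is roughly what I mean by @* and ://* when written as ordinary regular expressions (just an illustration in plain Scala, not part of the pipeline above; the exact patterns are only a guess at what I need):

import scala.util.matching.Regex

val mention: Regex = """@\w+""".r        // anything with an @, e.g. @nike
val url: Regex     = """\S*://\S*""".r   // anything containing ://

val tokens = List("@nike", "http://t.co/abc", "just", "do", "it")
val kept   = tokens.filterNot(t => mention.findFirstIn(t).isDefined || url.findFirstIn(t).isDefined)
// kept == List("just", "do", "it")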


1 Answer


Since I've never worked with stanford-nlp, I'm still missing many of the details here, but this is what I could figure out.

I found some source code in a forked scalanlp repository that defines TermStopListFilter:

/**
 * Filters out terms from the given list.
 * 
 * @author dramage
 */
case class TermStopListFilter[ID:Manifest](stops : List[String])
extends Stage[LazyIterable[Item[ID,Iterable[String]]],LazyIterable[Item[ID,Iterable[String]]]] {
  override def apply(parcel : Parcel[LazyIterable[Item[ID,Iterable[String]]]]) : Parcel[LazyIterable[Item[ID,Iterable[String]]]] = {
    val newMeta = {
      if (parcel.meta.contains[TermCounts]) {
        parcel.meta + parcel.meta[TermCounts].filterIndex(term => !stops.contains(term)) + TermStopList(stops)
      } else {
        parcel.meta + this;
      }
    }

    Parcel(parcel.history + this, newMeta,
      parcel.data.map((doc : Item[ID,Iterable[String]]) => (doc.map(_.filter(term => !stops.contains(term))))));
  }

  override def toString =
    "TermStopListFilter("+stops+")";
}

In that code I see

if (parcel.meta.contains[TermCounts]) {
  parcel.meta + 
  parcel.meta[TermCounts].filterIndex(term => !stops.contains(term)) +
  TermStopList(stops)
}

It looks like the TermCounts object obtained from the meta data filters the terms it contains by matching them against the elements of stops using contains.
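In other words, the stock filter only does exact, literal string matching, so a wildcard-style entry such as "@*" in the stops list can never match a term like "@nike". A minimal illustration of that limitation in plain Scala:

val stops = List("a", "and", "@*")   // "@*" is just a literal string here, not a pattern
List("@nike", "and", "run").filter(term => !stops.contains(term))
// => List("@nike", "run")  -- "@nike" survives because it is not literally equal to "@*"

That is why simply adding "@*" to the list passed to TermStopListFilter in your pipeline has no effect.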

To filter with a more general expression, it should be enough to implement a new version of TermStopListFilter that uses a regular expression instead, e.g.:

import scala.util.matching.Regex

/**
 * Filters out terms that match the supplied regular expression.
 */
case class TermStopListFilter[ID:Manifest](regex: String)
extends Stage[LazyIterable[Item[ID,Iterable[String]]],LazyIterable[Item[ID,Iterable[String]]]] {
  override def apply(parcel : Parcel[LazyIterable[Item[ID,Iterable[String]]]]) : Parcel[LazyIterable[Item[ID,Iterable[String]]]] = {

    // extract the pattern from the regular expression string
    val pat = regex.r.pattern

    val newMeta = {
      if (parcel.meta.contains[TermCounts]) {
        // keep only the terms that do NOT match the pattern
        parcel.meta + parcel.meta[TermCounts].filterIndex(term => !pat.matcher(term).matches) // should a stop list be added here too??
      } else {
        parcel.meta + this; // is this still correct?
      }
    }

    Parcel(parcel.history + this, newMeta,
      parcel.data.map((doc : Item[ID,Iterable[String]]) => (doc.map(_.filter(term => !pat.matcher(term).matches)))));
  }

  override def toString =
    "TermStopListFilter("+regex+")";
}
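If that compiles against your version of the library, it should slot into the original pipeline in place of the literal stop list (it reuses the TermStopListFilter name, so only one of the two versions can be in scope at a time). Something along these lines ought to cover both the @-mentions and the :// URLs, though I haven't been able to test it, and the pattern itself is only a suggestion:

val text = {
  source ~>
  Column(1) ~>
  TokenizeWith(tokenizer) ~>
  TermCounter() ~>
  TermMinimumDocumentCountFilter(30) ~>
  TermStopListFilter("""@\w+|\S*://\S*""") ~>   // drop whole terms that are @mentions or contain ://
  TermDynamicStopListFilter(10) ~>
  DocumentMinimumLengthFilter(5)
}

Since matches requires the whole term to match, the alternation has to describe the entire token, not just the @ or :// part.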
answered 2013-01-11T10:45:58.450