Objective
I want to count the number of times the word "love" appears in a document, but only if it isn't preceded by the word "not"; e.g. "I love films" would count as one occurrence, whereas "I do not love films" would not count as an occurrence.
Question
How would I go about doing this with the tm package?
Code
Below is some self-contained code that I would like to modify to do the above.
require(tm)
# text vector
my.docs <- c(" I love the Red Hot Chilli Peppers! They are the most lovely people in the world.",
"I do not love the Red Hot Chilli Peppers but I do not hate them either. I think they are OK.\n",
"I hate the `Red Hot Chilli Peppers`!")
# convert to a data.frame (recent versions of tm expect doc_id/text columns)
my.docs.df <- data.frame(doc_id = c("positiveText", "neutralText", "negativeText"),
                         text = my.docs, stringsAsFactors = FALSE)
# convert to a corpus (VCorpus, so custom tokenizers can be used later on)
my.corpus <- VCorpus(DataframeSource(my.docs.df))
# some standard preprocessing; base functions such as tolower must be
# wrapped in content_transformer() in recent versions of tm
my.corpus <- tm_map(my.corpus, stripWhitespace)
my.corpus <- tm_map(my.corpus, content_transformer(tolower))
my.corpus <- tm_map(my.corpus, removePunctuation)
my.corpus <- tm_map(my.corpus, removeWords, stopwords("english"))
my.corpus <- tm_map(my.corpus, stemDocument)
my.corpus <- tm_map(my.corpus, removeNumbers)
# construct the dictionary (a plain character vector of terms)
my.dictionary.terms <- tolower(c("love", "Hate"))
# construct the term-document matrix, restricted to the dictionary terms
my.tdm <- TermDocumentMatrix(my.corpus, control = list(dictionary = my.dictionary.terms))
inspect(my.tdm)
# Terms  positiveText neutralText negativeText
# hate              0           1            1
# love              2           1            0
More Info
I am trying to reproduce the dictionary rules functionality of the commercial package WordStat. It is able to make use of dictionary rules, i.e.
"hierarchical content analysis dictionaries or taxonomies composed of words, word patterns, phrases as well as proximity rules (such as NEAR, AFTER, BEFORE) for achieving precise measurement of concepts"
I also noticed this interesting SO question: Open source rule-based pattern matching / information extraction frameworks?
Update 1: Based on @Ben's comment and post I ended up with this (although it differs slightly at the end, it is strongly inspired by his answer, so full credit to him).
require(data.table)
require(RWeka)
# 1- and 2-gram tokeniser function (min = 1, max = 2 yields both)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
# get all 1-gram and 2-gram word counts
tdm <- TermDocumentMatrix(my.corpus, control = list(tokenize = BigramTokenizer))
# convert to data.table
dt <- as.data.table(as.data.frame(as.matrix(tdm)), keep.rownames=TRUE)
setkey(dt, rn)
# attempt at extracting but includes overlaps i.e. words counted twice
dt[like(rn, "love")]
# rn positiveText neutralText negativeText
# 1: i love 1 0 0
# 2: love 2 1 0
# 3: love peopl 1 0 0
# 4: love the 1 1 0
# 5: most love 1 0 0
# 6: not love 0 1 0
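As an aside, since RWeka requires Java: here is a rough, lightly-hedged sketch of an equivalent 1-/2-gram tokeniser built only on the NLP package that tm itself depends on (UniBigramTokenizer is just a name I made up, and I assume the tokeniser receives a document that NLP::words() accepts, as in the tm FAQ's bigram example):
require(NLP)
# emit every single word plus every pair of adjacent words; can be passed
# as the tokenize control option in place of BigramTokenizer above
UniBigramTokenizer <- function(x) {
  ws <- words(x)
  c(ws, vapply(ngrams(ws, 2L), paste, character(1), collapse = " "))
}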
I then figured I would need to do some row subsetting and row subtraction, which would lead to something like
dt1 <- dt["love"]
# rn positiveText neutralText negativeText
#1: love 2 1 0
dt2 <- dt[like(rn, "love") & like(rn, "not")]
# rn positiveText neutralText negativeText
#1: not love 0 1 0
# somehow do something like
# DT = dt1 - dt2
# but I can't work out how to code that; the required output would be
# rn positiveText neutralText negativeText
#1: love 2 0 0
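One possibility I have sketched for the dt1 - dt2 step (not sure it is the idiomatic data.table way; it assumes the three document-count columns line up between the two tables, and sums over dt2 in case there are several negated n-gram rows):
# subtract the summed "not love" counts from the "love" counts, column by column
num.cols <- setdiff(names(dt1), "rn")
DT <- copy(dt1)
for (j in num.cols)
  set(DT, j = j, value = DT[[j]] - sum(dt2[[j]]))
DT
# rn positiveText neutralText negativeText
#1: love 2 0 0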
I can't work out how to get that last line using data.table, but this approach would be akin to WordStat's 'NOT NEAR' dictionary function, i.e. in this case only count the word "love" if it doesn't appear within 1 word directly before or directly after the word "not".
If we were to use an m-gram tokeniser, then it would be like saying we only count the word "love" if it does not appear within (m-1) words on either side of the word "not".
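To make the 'NOT NEAR' idea concrete, here is a rough sketch that bypasses the term-document matrix and counts on raw word positions instead (count_not_near is a made-up helper; note it works on unstemmed text, so "lovely" is not folded into "love" as it is by stemDocument above):
# count occurrences of `target` that have no `neg` within `k` word positions
count_not_near <- function(text, target = "love", neg = "not", k = 1) {
  ws <- tolower(unlist(strsplit(text, "[^[:alnum:]']+")))
  ws <- ws[nzchar(ws)]                      # drop empty tokens
  hits <- which(ws == target)
  negs <- which(ws == neg)
  if (!length(negs)) return(length(hits))
  # keep a hit only if every negation word is more than k positions away
  sum(vapply(hits, function(i) all(abs(negs - i) > k), logical(1)))
}
sapply(my.docs, count_not_near, k = 1, USE.NAMES = FALSE)
# [1] 1 0 0
For an m-gram tokeniser the equivalent setting would be k = m - 1.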
Other approaches are most welcome!