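I posted a very similar question before, but I need to change the conditions. I have a data.frame that contains multiple entries per id. The columns are "no", "article" and "class" ("p" = positive, "n" = negative, "x" = neutral). It looks like this: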
no <- c(3, 3, 5, 5, 5, 24, 24, 35, 35, 41, 41, 41)
article <- c("earnings went up.", "earnings went up.", "massive layoff.", "they moved their offices.", "Mr. X joined the company.", "class action filed.", "accident in warehouse.", "blabla one.", "blabla two.", "blabla three.", "blabla four.", "blabla five.")
class <- c("p","p","n","x","x","n","n","x","p","p","n","p")
mydf <- data.frame(no, article, class)
mydf
# no article class
# 1 3 earnings went up. p
# 2 3 earnings went up. p
# 3 5 massive layoff. n
# 4 5 they moved their offices. x
# 5 5 Mr. X joined the company. x
# 6 24 class action filed. n
# 7 24 accident in warehouse. n
# 8 35 blabla one. x
# 9 35 blabla two. p
# 10 41 blabla three. p
# 11 41 blabla four. n
# 12 41 blabla five. p
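I want to get rid of the multiple entries. The articles of multiple entries should be merged, but only if the articles are not identical! Then I want to assign the most frequent class, not counting "x". "x" stands for neutral, so if an id has "x" and "p", I still want "p" to be assigned. Likewise, "x" and "n" should give "n". The same goes for other multiple entries. If "p" and "n" occur equally often, "x" should be assigned.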
# examples:
# "p", "x" --> "p"
# "p", "n" --> "x"
# "x", "n", "x" --> "n"
# "p", "n", "p" --> "p"
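The rule in the examples above could be sketched in base R as follows. This is only a minimal sketch: `pick_class` is a hypothetical helper name, and it assumes that a group containing nothing but "x" should stay "x".

```r
# pick_class is a hypothetical helper, not part of qdap or base R.
pick_class <- function(x) {
  # Neutral "x" entries never vote; only "p" and "n" are counted.
  n_pos <- sum(x == "p")
  n_neg <- sum(x == "n")
  if (n_pos > n_neg) {
    "p"
  } else if (n_neg > n_pos) {
    "n"
  } else {
    "x"  # tie between "p" and "n" (assumption: also when only "x" occurs)
  }
}

pick_class(c("p", "x"))       # "p"
pick_class(c("p", "n"))       # "x"
pick_class(c("x", "n", "x"))  # "n"
pick_class(c("p", "n", "p"))  # "p"
```

Applied per id, for example with `tapply(mydf$class, mydf$no, pick_class)`, this would give one class per "no".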
# the resulting data.frame should look like this:
# no article class
# 1 3 earnings went up. p
# 2 5 massive layoff. they moved their offices. Mr. X joined the company. n
# 3 24 class action filed. accident in warehouse. n
# 4 35 blabla one. blabla two. p
# 5 41 blabla three. blabla four. blabla five. p
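The merging step on its own might look like the sketch below, which uses base R's `aggregate` together with `unique` so that identical articles within one "no" are kept only once. It is an illustration under that reading of "only if the articles are not identical", not a tested replacement for the qdap approach.

```r
no <- c(3, 3, 5, 5, 5, 24, 24, 35, 35, 41, 41, 41)
article <- c("earnings went up.", "earnings went up.", "massive layoff.",
             "they moved their offices.", "Mr. X joined the company.",
             "class action filed.", "accident in warehouse.", "blabla one.",
             "blabla two.", "blabla three.", "blabla four.", "blabla five.")
mydf <- data.frame(no, article, stringsAsFactors = FALSE)

# For each "no": drop repeated article texts, then join the rest with a space.
merged <- aggregate(article ~ no, data = mydf,
                    FUN = function(a) paste(unique(a), collapse = " "))
merged
```

For "no" 3 this keeps a single "earnings went up.", while the distinct articles of the other ids are concatenated in their original order. The class column could be derived per group in the same way, for example with `tapply`.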
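In my old question, the articles were merged even if they were identical, and the most frequent class was assigned ("x", "n" and "p" were treated the same). If there was no single most frequent class, "x" was assigned. The approach that worked was: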
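library(qdap)
# Merge the article texts per "no" (identical articles are merged as well)
df2 <- with(mydf, sentCombine(article, no))
# Look up the most frequent class for each "no"; a tie for the maximum gives "x"
df2$class <- df2$no %l% vect2df(c(tapply(mydf[, 3], mydf[, 1], function(x) {
  tab <- table(x)
  ifelse(sum(tab %in% max(tab)) > 1, "x", names(tab)[max(tab) == tab])
})))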
I tried to adapt this code, but I know too little about writing functions and about qdap to really understand it.