r - Merge duplicates in R and assign the value with the highest frequency (except neutral!)

Question

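I posted a very similar question before, but I need to change the conditions. I have a data.frame with many multiple entries. The columns are "no", "article", and "class" ("p" = positive, "n" = negative, "x" = neutral). It looks like this: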

no <- c(3, 3, 5, 5, 5, 24, 24, 35, 35, 41, 41, 41)
article <- c("earnings went up.", "earnings went up.", "massive layoff.", "they moved their offices.", "Mr. X joined the company.", "class action filed.", "accident in warehouse.", "blabla one.", "blabla two.", "blabla three.", "blabla four.", "blabla five.")
class <- c("p","p","n","x","x","n","n","x","p","p","n","p")

mydf <- data.frame(no, article, class)
mydf

#    no                   article class
# 1   3         earnings went up.     p
# 2   3         earnings went up.     p
# 3   5           massive layoff.     n
# 4   5 they moved their offices.     x
# 5   5 Mr. X joined the company.     x
# 6  24       class action filed.     n
# 7  24    accident in warehouse.     n
# 8  35               blabla one.     x
# 9  35               blabla two.     p
# 10 41             blabla three.     p
# 11 41              blabla four.     n
# 12 41              blabla five.     p

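I want to get rid of the multiple entries. The articles of rows that share the same "no" should be merged, but only if the articles are not identical! Then I want to assign the class with the highest frequency, not counting "x". "x" means neutral, so if the duplicates are "x" and "p", I still want "p" to be assigned. If they are "x" and "n", "n" should be assigned. The same applies to the other multiple entries. If "p" and "n" occur equally often, "x" should be assigned.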

# examples:
# "p", "x"      --> "p"
# "p", "n"      --> "x" 
# "x", "n", "x" --> "n" 
# "p", "n", "p" --> "p"  

# the resulting data.frame should look like this:

#    no                                                            article  class
# 1   3                                                   earnings went up.     p
# 2   5 massive layoff. they moved their offices. Mr. X joined the company.     n
# 3  24                          class action filed. accident in warehouse.     n
# 4  35                                             blabla one. blabla two.     p
# 5  41                             blabla three. blabla four. blabla five.     p
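The class rule described above can be sketched as a small standalone function. This is only an illustration under stated assumptions: `resolve_class` is a hypothetical name that does not appear in the original post, and the checks simply restate the four examples listed in the question.

```r
# Hypothetical helper (not from the original post): pick the majority class
# among "n" and "p", ignoring "x"; fall back to "x" on a tie or when only
# "x" occurs.
resolve_class <- function(x) {
  x <- as.character(x)  # accept factor or character input
  # Restricting the levels to "n" and "p" turns "x" into NA,
  # and table() drops NA by default.
  counts <- table(factor(x, levels = c("n", "p")))
  if (counts[["n"]] > counts[["p"]]) return("n")
  if (counts[["p"]] > counts[["n"]]) return("p")
  "x"
}

# The four examples from the question, written as checks.
stopifnot(
  identical(resolve_class(c("p", "x")), "p"),
  identical(resolve_class(c("p", "n")), "x"),
  identical(resolve_class(c("x", "n", "x")), "n"),
  identical(resolve_class(c("p", "n", "p")), "p")
)
```

A function of this shape could be passed to `tapply()` or any other grouping tool in place of the frequency-table logic of the old approach.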

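In my old question, the articles were merged even if they were identical, and the class with the highest frequency was assigned ("x", "n", and "p" were treated the same). If there was no single highest frequency, "x" was assigned. The approach that worked there was: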

library(qdap)

# merge the article text per "no"
df2 <- with(mydf, sentCombine(article, no))

# look up the class per "no": the most frequent class, or "x" on a tie
df2$class <- df2$no %l% vect2df(c(tapply(mydf[, 3], mydf[, 1], function(x){
  tab <- table(x)
  ifelse(sum(tab %in% max(tab)) > 1, "x", names(tab)[max(tab) == tab])
})))

I tried to change this code, but I know too little about writing functions and about qdap to really understand it.

score 1 · Accepted Answer

How about this, using dplyr:

library(dplyr) # for aggregation

getclass <- function(class) {
  n.n <- length(class[class == "n"])   # number of negatives
  n.p <- length(class[class == "p"])   # number of positives
  ret <- "x"                           # return "x", unless
  if (n.n > n.p) ret <- "n"            # there are more n's than p's (return "n")
  if (n.n < n.p) ret <- "p"            # or more p's than n's (return "p")
  return(ret)
}

# %>% replaces the old %.% operator, which current dplyr no longer provides
group_by(mydf, no) %>%
  summarise(article = paste0(unique(article), collapse = " "), class = getclass(class))

Source: local data frame [5 x 3]

  no                                                             article class
1  3                                                   earnings went up.     p
2  5 massive layoff. they moved their offices. Mr. X joined the company.     n
3 24                          class action filed. accident in warehouse.     n
4 35                                             blabla one. blabla two.     p
5 41                             blabla three. blabla four. blabla five.     p
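If dplyr is not available, the same grouping can be sketched in base R. This is a minimal sketch under stated assumptions, not part of the original answer: `majority_class`, `groups`, and `merged` are hypothetical names, and the helper maps the sign of the difference between the "p" and "n" counts onto `c("n", "x", "p")` instead of using `if` statements.

```r
# Data as given in the question.
mydf <- data.frame(
  no = c(3, 3, 5, 5, 5, 24, 24, 35, 35, 41, 41, 41),
  article = c("earnings went up.", "earnings went up.", "massive layoff.",
              "they moved their offices.", "Mr. X joined the company.",
              "class action filed.", "accident in warehouse.", "blabla one.",
              "blabla two.", "blabla three.", "blabla four.", "blabla five."),
  class = c("p", "p", "n", "x", "x", "n", "n", "x", "p", "p", "n", "p"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: sign(#p - #n) is -1, 0 or 1, so adding 2 indexes
# into c("n", "x", "p"); "x" entries never enter either count.
majority_class <- function(cls) {
  c("n", "x", "p")[sign(sum(cls == "p") - sum(cls == "n")) + 2]
}

# One data frame per distinct "no", in ascending numeric order.
groups <- split(mydf, mydf$no)

merged <- data.frame(
  no = as.numeric(names(groups)),
  # paste only the distinct articles of each group
  article = unname(vapply(groups, function(g) {
    paste(unique(g$article), collapse = " ")
  }, character(1))),
  class = unname(vapply(groups, function(g) majority_class(g$class),
                        character(1))),
  stringsAsFactors = FALSE
)

merged
```

`split()` followed by `vapply()` keeps the per-group logic explicit; `aggregate()` or `tapply()` would be reasonable alternatives.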
