
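I am working with unstructured data (text). I want to tag the data with some keywords and keyword combinations.

I am unable to tag the data with a combination of words. I want to know where "fraud" and "missold" both occur.

I tried the qdap package. I was able to tag the two words with an OR condition, but not with an AND condition.

Below is the code I am using: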

library(qdap)

df <- read.csv(file.choose(), header = TRUE)

#### cleaning of text
df$Comment <- strip(df$Comment)  ## remove capitalization and punctuation

df$Comment <- clean(df$Comment)
df$Comment <- scrubber(df$Comment)
df$Comment <- qprep(df$Comment)

df$Comment <- replace_abbreviation(df$Comment)

terms <- list(
    " fraud ",
    " refund ", " cheat ", " cancellation ", "missold", "delay",
    combo1 = qcv(fraud, missold))

df2 <- with(df, termco(df$Comment, df$Comment, terms))[["raw"]]  ### tagging of data with key words
df3 <- merge(df, df2, by = "Comment")

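I am using complaints data from insurance companies. The variables I have are:

  1. Date of complaint
  2. Brand the complaint is against
  3. Comment (complaint)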

1 Answer


Based on your sample xlsx:

library(xlsx)
df <- read.xlsx(file="sample output.xlsx", sheetIndex=1)
library(tm)
terms <- stemDocument(c("fraud","refund","cheat", "cancellation", "misselling", "delay"))
mat <- DocumentTermMatrix(x=Corpus(VectorSource(df$Comment)), 
                          control=list(removePunctuation = TRUE,
                                       dictionary = terms, 
                                       stemming = TRUE,
                                       weighting = weightBin))
df2 <- as.data.frame(as.matrix(mat)) 
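# weightBin gives 0/1 per stem, so combo == 2 means both "fraud" and
# "missel" occur in the same comment (the AND condition)
df2 <- transform(df2, combo = fraud + missel)
df2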
#   cancel cheat delay fraud missel refund combo
# 1      1     0     0     1      1      0     2
# 2      1     0     0     1      1      0     2
# 3      0     0     0     1      1      0     2
df3 <- cbind(df, df2)
df3
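
If you only need a TRUE/FALSE flag for the AND condition, a minimal base R sketch with grepl may be enough. The sample comments, the regular expressions and the new column names (fraud_and_missold, fraud_or_missold) are assumptions made up for illustration; only the Comment column comes from the question.

```r
# Hypothetical sample data; only the column name Comment is from the question.
df <- data.frame(
  Comment = c("Fraud case, the policy was missold to me",
              "Refund delayed for months",
              "This is a fraud"),
  stringsAsFactors = FALSE
)

# Lower-case once so that matching ignores capitalization.
txt <- tolower(df$Comment)

# One logical vector per keyword; \\b restricts the match to whole words.
# The second pattern (an assumption) accepts missold, mis-sold and misselling.
has_fraud   <- grepl("\\bfraud\\b", txt, perl = TRUE)
has_missold <- grepl("\\bmis-?(sold|selling)\\b", txt, perl = TRUE)

# AND condition: both keywords occur in the same comment.
df$fraud_and_missold <- has_fraud & has_missold

# OR condition, for comparison.
df$fraud_or_missold <- has_fraud | has_missold

df
```

Compared with the document-term matrix above, this skips stemming, so every spelling variant you care about has to be written into the pattern yourself.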
answered 2014-06-03T19:53:35.937