
I am trying to clean my text corpus using the tm package in R, but I keep getting this error:

no applicable method for 'removePunctuation' applied to an object of class "data.frame"

My data consists of chat logs read in from a text file, and looks like this in R:

     V1
1   In the process
2   Sorry I had to step away for a moment.
3   I am getting an error page that says QB is currently unavailable.
4   That link gives me the same error message.

I use:

tdm <- TermDocumentMatrix(text,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))

But I get this error:

Error in UseMethod("TermDocumentMatrix", x) : 
  no applicable method for 'TermDocumentMatrix' applied to an object of class "data.frame"

It seems I shouldn't be passing a data frame to the function, but how else can I do this?

Thanks


3 Answers


As @Martin Bel pointed out, qdap version 1.1.0 can do this as well. I've added a bit of support to qdap to make it more compatible with the tm package, including a tdm function that works well here:

First read in your data (I added colons):

library(qdap)
dat <- read.transcript(text="ID    V1
1   In the process
2   Sorry I had to step away for a moment.
3   I am getting an error page that says QB is currently unavailable.
4   That link gives me the same error message.", header=TRUE, sep="   ")

# To make a term document matrix:

tdm(dat$V1, id(dat), stopwords=tm::stopwords("en"))

# To do the same thing with the tm package:

TermDocumentMatrix(Corpus(VectorSource(dat[, 1])),
    control = list(
        removePunctuation = TRUE,
        stopwords = TRUE
    )
)
answered 2013-11-12T01:04:23.803

You're very close. The quickest way is to make a corpus object using DataframeSource and then build the term-document matrix from that. Using your example:

Let's enter the data...

Text <- readLines(n=4)
In the process
Sorry I had to step away for a moment.
I am getting an error page that says QB is currently unavailable.
That link gives me the same error message.

df <- data.frame(V1 = Text, stringsAsFactors = FALSE)

Now convert the data frame to a term-document matrix...

require(tm)
mycorpus <- Corpus(DataframeSource(df))
tdm <- TermDocumentMatrix(mycorpus, control = list(removePunctuation = TRUE, stopwords = TRUE))

Now inspect the output...

inspect(tdm)
   A term-document matrix (14 terms, 4 documents)

Non-/sparse entries: 15/41
Sparsity           : 73%
Maximal term length: 11 
Weighting          : term frequency (tf)

             Docs
Terms         1 2 3 4
  away        0 1 0 0
  currently   0 0 1 0
  error       0 0 1 1
  getting     0 0 1 0
  gives       0 0 0 1
  link        0 0 0 1
  message     0 0 0 1
  moment      0 1 0 0
  page        0 0 1 0
  process     1 0 0 0
  says        0 0 1 0
  sorry       0 1 0 0
  step        0 1 0 0
  unavailable 0 0 1 0
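
A note for newer tm releases: as far as I know, from version 0.7 on DataframeSource expects the data frame to have a doc_id column first and a text column second, so a single-column data frame like the one above may be rejected there. The following is only a minimal sketch under that assumption; the doc_id values are made up for illustration and VCorpus is used in place of Corpus.

```r
library(tm)

# Sample chat lines copied from the question
texts <- c("In the process",
           "Sorry I had to step away for a moment.",
           "I am getting an error page that says QB is currently unavailable.",
           "That link gives me the same error message.")

# Assumption: newer tm versions want columns named doc_id and text, in that order
df <- data.frame(doc_id = as.character(seq_along(texts)),
                 text = texts,
                 stringsAsFactors = FALSE)

# Build the corpus from the data frame, then the term-document matrix
mycorpus <- VCorpus(DataframeSource(df))
tdm <- TermDocumentMatrix(mycorpus,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))
```

Calling inspect(tdm) afterwards should show one column per chat line, as in the output above.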
answered 2013-11-12T05:08:33.703

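You just need to extract the text from the data frame with text[, 1] and build a corpus from it, which is the kind of input TermDocumentMatrix is designed for:

tdm <- TermDocumentMatrix(Corpus(VectorSource(text[, 1])),
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))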
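
To see why the column has to be pulled out at all, here is a quick base R check of the classes involved. The data frame is rebuilt by hand from the sample in the question, which is an assumption about how the file was actually read.

```r
# Rebuild the question's data frame (assumed shape: one character column, V1)
text <- data.frame(
  V1 = c("In the process",
         "Sorry I had to step away for a moment.",
         "I am getting an error page that says QB is currently unavailable.",
         "That link gives me the same error message."),
  stringsAsFactors = FALSE
)

class(text)                     # "data.frame": no method, hence the error
class(text[, 1])                # "character": a plain vector of documents
class(text[, 1, drop = FALSE])  # "data.frame": drop = FALSE keeps the frame
```

The character vector returned by text[, 1] is what VectorSource takes as its input.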
answered 2013-11-12T00:09:56.040