r - Cleaning text with the tm package in R

Question

I am trying to clean my text corpus with the tm package in R, but I keep getting this error:

no applicable method for 'removePunctuation' applied to an object of class "data.frame"

My data consists of chat logs read in from a text file, and it looks like this in R:

     V1
1   In the process
2   Sorry I had to step away for a moment.
3   I am getting an error page that says QB is currently unavailable.
4   That link gives me the same error message.

I am using:

tdm <- TermDocumentMatrix(text,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))

But I get this error:

Error in UseMethod("TermDocumentMatrix", x) : 
  no applicable method for 'TermDocumentMatrix' applied to an object of class "data.frame"

It seems I shouldn't be passing a data frame to the function, but what else can I do?

Thanks

score 1 · Accepted Answer

As @Martin Bel pointed out, qdap version 1.1.0 can do this as well. I've added some support to qdap to make it more compatible with the tm package, including a tdm function that works well here:

First read in your data (I added colons):

library(qdap)
dat <- read.transcript(text="ID    V1
1   In the process
2   Sorry I had to step away for a moment.
3   I am getting an error page that says QB is currently unavailable.
4   That link gives me the same error message.", header=TRUE, sep="   ")

# To make a term document matrix:

tdm(dat$V1, id(dat), stopwords=tm::stopwords("en"))

# To do the same thing with the tm package:

TermDocumentMatrix(Corpus(VectorSource(dat[, 1])),
    control = list(
        removePunctuation = TRUE,
        stopwords = TRUE
    )
)

score 1 · Accepted Answer

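You are very close. The quickest way is to make a corpus object with DataframeSource and then build a term-document matrix from it. Using your example:

Let's enter the data...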

Text <- readLines(n=4)
In the process
Sorry I had to step away for a moment.
I am getting an error page that says QB is currently unavailable.
That link gives me the same error message.

df <- data.frame(V1 = Text, stringsAsFactors = FALSE)

Now convert the data frame to a term-document matrix...

require(tm)
mycorpus <- Corpus(DataframeSource(df))
tdm <- TermDocumentMatrix(mycorpus, control = list(removePunctuation = TRUE, stopwords = TRUE))

Now inspect the output...

inspect(tdm)
   A term-document matrix (14 terms, 4 documents)

Non-/sparse entries: 15/41
Sparsity           : 73%
Maximal term length: 11 
Weighting          : term frequency (tf)

             Docs
Terms         1 2 3 4
  away        0 1 0 0
  currently   0 0 1 0
  error       0 0 1 1
  getting     0 0 1 0
  gives       0 0 0 1
  link        0 0 0 1
  message     0 0 0 1
  moment      0 1 0 0
  page        0 0 1 0
  process     1 0 0 0
  says        0 0 1 0
  sorry       0 1 0 0
  step        0 1 0 0
  unavailable 0 0 1 0
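
A note for newer tm releases: as far as I know, since version 0.7-2 DataframeSource only accepts a data frame whose first two columns are named doc_id and text, so the single-column df above may be rejected there. Below is a minimal sketch of the same steps under that assumption. The names chat_lines, df_new, mycorpus_new and tdm_new are illustrative and not from the original post.

```r
library(tm)

# The four chat lines from the question.
chat_lines <- c(
  "In the process",
  "Sorry I had to step away for a moment.",
  "I am getting an error page that says QB is currently unavailable.",
  "That link gives me the same error message."
)

# Assumption: this tm version wants the columns `doc_id` and `text`, in that order.
df_new <- data.frame(
  doc_id = as.character(seq_along(chat_lines)),
  text = chat_lines,
  stringsAsFactors = FALSE
)

# Build the corpus from the data frame, then the term-document matrix.
mycorpus_new <- VCorpus(DataframeSource(df_new))
tdm_new <- TermDocumentMatrix(
  mycorpus_new,
  control = list(removePunctuation = TRUE, stopwords = TRUE)
)
```

If the installed tm is older and still accepts a one-column data frame, the original code above should work unchanged.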

score -1 · Accepted Answer

You just need to pull the text out of the data frame with text[,1]:

tdm <- TermDocumentMatrix(text[,1],
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))
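
One caveat: text[,1] is a plain character vector, not a corpus. The error in the question suggests this tm version has no fallback method for non-corpus input, so the call above will likely fail in the same "no applicable method" way. Below is a minimal sketch that wraps the extracted column in a corpus first. The data frame is a stand-in for the asker's text object, and corpus_from_column is an illustrative name.

```r
library(tm)

# Stand-in for the asker's data frame, with one column V1 of chat lines.
text <- data.frame(
  V1 = c(
    "In the process",
    "Sorry I had to step away for a moment.",
    "I am getting an error page that says QB is currently unavailable.",
    "That link gives me the same error message."
  ),
  stringsAsFactors = FALSE
)

# Extracting the column gives a character vector, one element per chat line.
chat_vector <- text[, 1]

# VectorSource treats each element as one document; VCorpus builds the corpus.
corpus_from_column <- VCorpus(VectorSource(chat_vector))

tdm <- TermDocumentMatrix(
  corpus_from_column,
  control = list(removePunctuation = TRUE, stopwords = TRUE)
)
```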
