r - tm_map(gsub...) fails to replace words

Question

# Loading required libraries


# Set up logistics such as reading in data and setting up corpus

```{r}

# Relative path points to the local folder
folder.path="../data/InauguralSpeeches/"

# get the list of file names
speeches=list.files(path = folder.path, pattern = "*.txt")

# Truncate file names so it is only showing "FirstLast-Term"
prez.out=substr(speeches, 6, nchar(speeches)-4)

# Create a vector of NAs, one entry per speech
length.speeches=rep(NA, length(speeches))

# Create a corpus
ff.all<-Corpus(DirSource(folder.path))
```

# Clean the data

```{r}

# Use tm_map to collapse whitespace to a single space, convert to lower case, and remove stop words, empty strings and punctuation.
ff.all<-tm_map(ff.all, stripWhitespace)
ff.all<-tm_map(ff.all, content_transformer(tolower))
ff.all<-tm_map(ff.all, removeWords, stopwords("english"))
ff.all<-tm_map(ff.all, removeWords, c("can", "may", "upon", "shall", "will", "must", ""))

# Problem line
ff.all<-tm_map(ff.all, gsub, pattern = "free", replacement = "freedom")

ff.all<-tm_map(ff.all, removeWords, character(0))
ff.all<-tm_map(ff.all, removePunctuation)

# tdm.all =  a Term Document Matrix
tdm.all<-TermDocumentMatrix(ff.all)
```

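So I am trying to replace similar words with a single root word, for example replacing "free" with "freedom" in a text mining project.

I then picked up this line from a YouTube tutorial: ff.all<-tm_map(ff.all, gsub, pattern = "free", replacement = "freedom"). Without this line, the code runs.

After adding it, RStudio gives the error "Error: inherits(doc, "TextDocument") is not TRUE" when it reaches this line: tdm.all<-TermDocumentMatrix(ff.all)

I think this should be a relatively simple problem, but I could not find a solution on Stack Overflow.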

score 1 · Accepted Answer

Using tm's built-in data set crude, I was able to fix your problem by wrapping gsub in a content_transformer call like this:

ff.all<-tm_map(ff.all, content_transformer(function(x) gsub(x, pattern = "free", replacement = "freedom")))

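In my experience, tm_map does some odd things with the objects returned by custom functions. So although your original line runs, tm_map does not hand back a true corpus afterwards, and that is what causes the error. Most likely this is because gsub returns a plain character vector, so the documents no longer inherit from TextDocument, which is what the error message reports.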
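To make the class issue concrete, here is a minimal sketch, assuming a tm version (0.6 or later) in which documents are PlainTextDocument objects. The toy document and the name swap_free are illustrative and not part of the original code.

```r
library(tm)

# A toy document standing in for one speech (illustrative text)
doc <- PlainTextDocument("free trade and free markets")

# Calling gsub directly coerces the document to a character vector,
# so the result no longer carries the TextDocument class
raw <- gsub("free", "freedom", doc)
print(inherits(raw, "TextDocument"))

# content_transformer applies the function to the content only and
# returns the same kind of document object it was given
swap_free <- content_transformer(function(x) gsub("free", "freedom", x))
out <- swap_free(doc)
print(inherits(out, "TextDocument"))
print(content(out))
```

The first check should come back FALSE and the second TRUE, which matches the failure seen later in TermDocumentMatrix.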

As a side note:

This line does not appear to do anything: ff.all<-tm_map(ff.all, removeWords, character(0))

The same goes for the "" in ff.all<-tm_map(ff.all, removeWords, c("can", "may", "upon", "shall", "will", "must", ""))

My full example

library(tm)
data(crude)
ff.all <- crude

ff.all<-tm_map(ff.all, stripWhitespace)
ff.all<-tm_map(ff.all, content_transformer(tolower))
ff.all<-tm_map(ff.all, removeWords, stopwords("english"))
ff.all<-tm_map(ff.all, removeWords, c("can", "may", "upon", "shall", "will", "must", ""))

ff.all<-tm_map(ff.all, content_transformer(function(x) gsub(x, pattern = "free", replacement = "freedom")))

ff.all<-tm_map(ff.all, removeWords, character(0))
ff.all<-tm_map(ff.all, removePunctuation)

# tdm.all =  a Term Document Matrix
tdm.all<-TermDocumentMatrix(ff.all)
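One caveat about the pattern itself: gsub matches substrings, so "free" is also replaced inside words such as "freedom" or "carefree". Below is a minimal base R sketch, assuming whole-word replacement is what is wanted. The sample sentences are made up for illustration.

```r
# Made-up sample text; the second and third entries contain "free" inside longer words
x <- c("free trade", "freedom of speech", "a free and carefree people")

# Substring match: also rewrites words that merely contain "free"
loose <- gsub("free", "freedom", x)

# \\b anchors the match at word boundaries, so only the standalone word changes
strict <- gsub("\\bfree\\b", "freedom", x, perl = TRUE)

print(loose)
print(strict)
```

If that behaviour is what you need, the word-boundary pattern can be used in the content_transformer call in place of "free".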
