# Loading required libraries


# Set up logistics such as reading in data and setting up corpus

```{r}

# Relative path points to the local folder
folder.path="../data/InauguralSpeeches/"

# get the list of file names
speeches=list.files(path = folder.path, pattern = "*.txt")

# Truncate file names so it is only showing "FirstLast-Term"
prez.out=substr(speeches, 6, nchar(speeches)-4)

# Create a vector of NAs with one entry per speech
length.speeches=rep(NA, length(speeches))

# Create a corpus
ff.all<-Corpus(DirSource(folder.path))
```

# Clean the data

```{r}

# Use tm_map to collapse runs of whitespace to a single space, convert to lower case, and remove stop words, empty strings and punctuation.
ff.all<-tm_map(ff.all, stripWhitespace)
ff.all<-tm_map(ff.all, content_transformer(tolower))
ff.all<-tm_map(ff.all, removeWords, stopwords("english"))
ff.all<-tm_map(ff.all, removeWords, c("can", "may", "upon", "shall", "will",     "must", ""))

# Problem line
ff.all<-tm_map(ff.all, gsub, pattern = "free", replacement = "freedom")

ff.all<-tm_map(ff.all, removeWords, character(0))
ff.all<-tm_map(ff.all, removePunctuation)

# tdm.all =  a Term Document Matrix
tdm.all<-TermDocumentMatrix(ff.all)
```

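So I am trying to replace similar words with a single root word. For example, replacing "free" with "freedom" in a text mining project.

I then learned this line from a YouTube tutorial: `ff.all<-tm_map(ff.all, gsub, pattern = "free", replacement = "freedom")`. Without this line, the code runs.

With this line added, RStudio gives the error `Error: inherits(doc, "TextDocument") is not TRUE` when it executes this line: `tdm.all<-TermDocumentMatrix(ff.all)`

I think this should be a relatively simple problem, but I couldn't find a solution on Stack Overflow.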


1 Answer


Using tm's built-in data set `crude`, I was able to fix your problem by wrapping the `gsub` call in `content_transformer`, like this:

```r
ff.all<-tm_map(ff.all, content_transformer(function(x) gsub(x, pattern = "free", replacement = "freedom")))
```

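In my experience, `tm_map` does some odd things with the return object of custom functions. So while your original line runs, `tm_map` does not quite return a true "corpus" (its documents no longer pass the `inherits(doc, "TextDocument")` check), and that is what causes the error.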

As a side note:

This line does not appear to do anything: `ff.all<-tm_map(ff.all, removeWords, character(0))`

The same goes for the `""` in `ff.all<-tm_map(ff.all, removeWords, c("can", "may", "upon", "shall", "will", "must", ""))`
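One more caveat, about the pattern itself rather than about tm: a plain `"free"` pattern also matches inside longer words, so an existing "freedom" would become "freedomdom". The following is a minimal sketch in base R, assuming whole-word replacement is what is wanted; `replace_free` is a hypothetical helper name, not part of tm.

```r
# Hypothetical helper (not part of tm): replace only the whole word "free".
# \\b marks a word boundary, so "freedom" and "freely" are left untouched.
replace_free <- function(x) {
  gsub("\\bfree\\b", "freedom", x, perl = TRUE)
}

txt <- "a free people value freedom"

# Plain pattern: also rewrites the "free" inside "freedom"
naive <- gsub("free", "freedom", txt)

# Word-boundary pattern: only the standalone word is replaced
bounded <- replace_free(txt)

print(naive)
print(bounded)
```

If this behaviour is what you are after, the helper could presumably be passed to `tm_map` in the same way as above, as `content_transformer(replace_free)`.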

My full example:

```r
library(tm)
data(crude)
ff.all <- crude

ff.all<-tm_map(ff.all, stripWhitespace)
ff.all<-tm_map(ff.all, content_transformer(tolower))
ff.all<-tm_map(ff.all, removeWords, stopwords("english"))
ff.all<-tm_map(ff.all, removeWords, c("can", "may", "upon", "shall", "will", "must", ""))

ff.all<-tm_map(ff.all, content_transformer(function(x) gsub(x, pattern = "free", replacement = "freedom")))

ff.all<-tm_map(ff.all, removeWords, character(0))
ff.all<-tm_map(ff.all, removePunctuation)

# tdm.all =  a Term Document Matrix
tdm.all<-TermDocumentMatrix(ff.all)
```
Answered 2017-01-30T16:24:36.843