r - Latent text analysis (lsa package) using whole documents in R

Question

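I have code that successfully performs latent text analysis on short citations using the lsa package in R (see below). However, I would rather use this method on text from larger documents. Copying and pasting the entire contents into each citation slot is highly inefficient: it works, but it takes ages to run. Is there any way to import each "citation" (in this case, a document) directly from a database or data frame? If so, what format should it be in? Documents in .txt format are automatically split into paragraphs when imported into R, and I am not sure whether that is compatible with the analysis performed by the lsa package.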

# Load requisite packages
library(tm)
library(ggplot2)
library(lsa)

# Include citations (THIS IS WHERE I WOULD NEED HELP)
text <- c(
  "To Mr. Ken Lay, I’m writing to urge you to donate the millions of dollars you made from selling Enron stock before the company declared bankruptcy.",
  "while you netted well over a $100 million, many of Enron's employees were financially devastated when the company declared bankruptcy and their retirement plans were wiped out",
  "you sold $101 million worth of Enron stock while aggressively urging the company’s employees to keep buying it",
  "This is a reminder of Enron’s Email retention policy. The Email retention policy provides as follows . . .",
  "Furthermore, it is against policy to store Email outside of your Outlook Mailbox and/or your Public Folders. Please do not copy Email onto floppy disks, zip disks, CDs or the network.",
  "Based on our receipt of various subpoenas, we will be preserving your past and future email. Please be prudent in the circulation of email relating to your work and activities.",
  "We have recognized over $550 million of fair value gains on stocks via our swaps with Raptor.",
  "The Raptor accounting treatment looks questionable. a. Enron booked a $500 million gain from equity derivatives from a related party.",
  "In the third quarter we have a $250 million problem with Raptor 3 if we don’t “enhance” the capital structure of Raptor 3 to commit more ENE shares.")
view <- factor(rep(c("view 1", "view 2", "view 3"), each = 3))
df <- data.frame(text, view, stringsAsFactors = FALSE)

# Prepare corpus
corpus <- Corpus(VectorSource(df$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument, language = "english")
corpus # check corpus


# Compute a term-document matrix that contains the occurrences of terms in each document
td.mat <- as.matrix(TermDocumentMatrix(corpus))
# Compute the Euclidean distance between each pair of documents (columns of td.mat)
dist.mat <- dist(t(td.mat))
dist.mat  # check distance matrix

# Use classical multidimensional scaling (MDS) to project the document distances onto two dimensions
fit <- cmdscale(dist.mat, eig = TRUE, k = 2)
points <- data.frame(x = fit$points[, 1], y = fit$points[, 2], view = df$view, label = row.names(df))
ggplot(points, aes(x = x, y = y)) +
  geom_point(aes(color = view)) +
  geom_text(aes(y = y - 0.2, label = label))
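
One possible way to avoid pasting whole documents into the script is to read each .txt file from a folder and collapse its lines into a single string, so that every file becomes exactly one element of the text vector. The sketch below uses only base R. The folder name `lsa_docs`, the two sample files, and the helper `read_whole_document` are hypothetical and exist only to make the example runnable; it is a minimal sketch under those assumptions, not a definitive solution.

```r
# Hypothetical setup: write two small multi-paragraph files to a temporary
# folder so the sketch runs on its own. In practice, doc_dir would point to
# the folder that holds your own .txt documents.
doc_dir <- file.path(tempdir(), "lsa_docs")
dir.create(doc_dir, showWarnings = FALSE)
writeLines(c("Enron booked a gain from equity derivatives.", "",
             "The Raptor accounting treatment looks questionable."),
           file.path(doc_dir, "doc1.txt"))
writeLines(c("This is a reminder of the Email retention policy.", "",
             "Please do not copy Email onto floppy disks."),
           file.path(doc_dir, "doc2.txt"))

# Read one file and collapse all of its non-empty lines into a single string,
# so paragraph breaks do not split the document into several elements.
read_whole_document <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  paste(lines[nzchar(lines)], collapse = " ")
}

# Collect every .txt file in the folder (list.files returns them sorted by name)
files <- list.files(doc_dir, pattern = "\\.txt$", full.names = TRUE)
text <- vapply(files, read_whole_document, character(1), USE.NAMES = FALSE)

# One row per document; this data frame can stand in for the hand-typed one
df <- data.frame(doc = basename(files), text = text, stringsAsFactors = FALSE)
```

With a data frame built this way, `Corpus(VectorSource(df$text))` treats each file as a single document, because `VectorSource` interprets every element of the vector as one document. The `view` labels used for coloring the plot are not stored in the files, so they would still need to be supplied separately, for example as an extra column added to `df`.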
