
I've learned a lot from the answers on this site (thank you!), and it's finally time to ask a question of my own.

I am using R (the tm and lsa packages) to create, clean, and condense a corpus of about 15,000 text documents, and then run LSA (latent semantic analysis) on it. I am doing this in R 3.0.0 under Mac OS X 10.6.

For efficiency (and to cope with having too little RAM), I have been trying to use either the "PCorpus" option in tm (a backend database supported by the "filehash" package), or the newer "tm.plugin.dc" option for so-called "distributed" corpus processing. But I don't really understand how either of them works under the hood.

An apparent bug when using DCorpus with tm_map (not relevant right now) led me to do some of the preprocessing work with the PCorpus option instead. This takes several hours, so I use R CMD BATCH to run a script such as:

# load corpus from predefined directory path,
# and create backend database to support processing:
bigCcorp = PCorpus(bigCdir, readerControl = list(load=FALSE), dbControl = list(useDb = TRUE, dbName = "bigCdb", dbType = "DB1"))

# converting to lower case:
bigCcorp = tm_map(bigCcorp, tolower)

# removing stopwords:
stoppedCcorp = tm_map(bigCcorp, removeWords, stoplist)

Now, suppose my script crashes shortly after this point, or I simply forget to export the corpus in some other form, and then I restart R. The database is still there on my hard drive, full of nicely tidied-up data. Surely I can reload it into a new R session and carry on with the corpus processing, instead of starting all over again?

It feels like a noodle question... but no amount of dbInit() or dbLoad() or variations on the "PCorpus()" function seems to work. Does anyone know the correct incantation?
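To be concrete, this is the kind of reconnection attempt I mean (run in a fresh session, using the database name from the script above):

```r
library(tm)
library(filehash)

# reopen the on-disk database left behind by the batch run:
db <- dbInit("bigCdb", type = "DB1")
dbList(db)   # the keys for the processed documents are all still there...

# ...but how do I turn `db` back into a PCorpus that tm_map() will accept?
```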

I have scoured all the relevant documentation, and every paper and web forum I could find, and drawn a complete blank: nobody seems to have done this. Or have I missed something?


1 Answer


The original question was from 2013. Meanwhile, in Feb 2015, a duplicate, or at least a similar, question was answered:

How to reconnect to the PCorpus in the R tm package?. The answer in that post is essential, although pretty minimalist, so I'll try to augment it here.

Here are some things I discovered while working on a similar problem:

Note that the dbInit() function is not part of the tm package; it comes from the filehash package.

First you need to install the filehash package, which the tm documentation only "Suggests". This means it is not a hard dependency of tm and will not be installed or attached automatically.
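So, to spell out the setup (package names as on CRAN):

```r
# filehash is only "Suggested" by tm, so it has to be installed and
# attached explicitly before a database-backed PCorpus will work:
install.packages(c("tm", "filehash", "filehashSQLite"))

library(tm)
library(filehashSQLite)   # also attaches filehash, which it depends on
```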

Supposedly, you can also use the filehashSQLite package with library("filehashSQLite") instead of library("filehash"); both packages have the same interface and work seamlessly together, thanks to their object-oriented design. So also install "filehashSQLite" (edit 2016: some functions, such as tm::content_transformer(), are not implemented for filehashSQLite).
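To illustrate that shared interface, here is a toy example of my own (the backend names "demo_db1" and "demo_sqlite" are made up; they become filenames in the working directory):

```r
library(filehash)
library(filehashSQLite)

# the same generic calls work against either backend:
for (backend in c("DB1", "SQLite")) {
        name <- paste0("demo_", tolower(backend))
        dbCreate(name, type = backend)
        db <- dbInit(name, type = backend)
        dbInsert(db, "greeting", "hi there")
        print(dbFetch(db, "greeting"))
}
```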

Then this works:

library(tm)
library(filehashSQLite)   # also attaches filehash

# this string becomes the filename, and must not contain dots.
# Example: "mydata.sqlite" is not permitted.
s <- "sqldb_pcorpus_mydata" # replace "mydata" with something more descriptive

if(! file.exists(s)){
        # csv is a data frame of 900 documents, 18 cols/features
        pc <- PCorpus(DataframeSource(csv), readerControl = list(language = "en"), dbControl = list(dbName = s, dbType = "SQLite"))
        db <- dbInit(s, "SQLite")
        set.seed(234)
        # add another record, just to show we can.
        # key = "test", value = "hi there"
        dbInsert(db, "test", "hi there")
} else {
        db <- dbInit(s, "SQLite")
        pc <- dbLoad(db)
}



show(pc)
# <<PCorpus>>
# Metadata:  corpus specific: 0, document level (indexed): 0
# Content:  documents: 900
dbFetch(db, "test")
# remove it
rm(db)
rm(pc)

#reload it
db <- dbInit(s, "SQLite")
pc <- dbLoad(db) 

# the corpus entries are now accessible, but not loaded into memory.
# now 900 documents are bound via "Active Bindings", created by makeActiveBinding() from the base package
show(pc)
# [1]   "1"    "2"    "3"    "4"    "5"    "6"    "7"    "8"    "9"    "10"
# ...
# [891] "891"  "892"  "893"  "894"  "895"  "896"  "897"  "898"  "899"  "900"
# [901] "test"

dbFetch(db, "900")
# <<PlainTextDocument>>
#         Metadata:  7
# Content:  chars: 33

dbFetch(db, "test")
#[1] "hi there"

This is what the database backend looks like. You can see that the documents from the data frame have been serialized somehow inside the SQLite table:

(screenshot of the SQLite table)
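If you want to take that peek yourself, the backend file is an ordinary SQLite database, so RSQLite/DBI can open it directly (the table and column names are whatever filehashSQLite chose, so list them rather than assuming):

```r
library(DBI)
library(RSQLite)

# open the backend file created above:
con <- dbConnect(SQLite(), "sqldb_pcorpus_mydata")
dbListTables(con)   # shows the table filehashSQLite created
# then e.g.: dbGetQuery(con, "SELECT * FROM <that table> LIMIT 3")
dbDisconnect(con)
```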

This is what my RStudio IDE shows me: (screenshot of the RStudio environment pane)

Answered 2015-04-09T08:17:57.437