r - Transferring a large MongoDB collection to a data.frame in R with rmongodb and plyr

Question

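I get some strange results when I try to transfer data from MongoDB into an R data frame with the rmongodb and plyr packages on large collections. I took this code from various GitHub pages and forums on the subject and adapted it for my purposes: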

## load the both packages
library(rmongodb)
library(plyr)
## connect to MongoDB
mongo <- mongo.create(host="localhost")
# [1] TRUE
## get the list of the databases
mongo.get.databases(mongo)
# list of databases (with mydatabase)
## get the list of the collections of mydatabase
mongo.get.collections(mongo, db = "mydatabase")
# list of all the collections of my database
## Verify the size of mycollection
DBNS = "mycollection"
mongo.count(mongo, ns = DBNS)
# [1] 845923 documents inside "my collection"
## transform mycollection (in BSON MongoDB format) to a data frame (adapted for R)
export = data.frame(stringAsFactors = FALSE)
cursor = mongo.find(mongo, DBNS)
i = 1
while(mongo.cursor.next(cursor))
{
  tmp = mongo.bson.to.list(mongo.cursor.value(cursor))
  tmp.df = as.data.frame(t(unlist(tmp)), stringAsFactors = FALSE)
  export = rbind.fill(export, tmp.df)
  i = i + 1
}
## show the size of the database "export"
dim(export)
# [1] 20585 23
## check more information on the database "export"
str(export)
# 'data.frame': 20585 obs. of 23 variables
# etc…

The transfer did not work properly: there is a huge discrepancy between the 845923 documents found in "mycollection" in MongoDB and the 20585 observations in R.

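I may have got something wrong in the code above. I am not sure that i = 1 and i = i + 1 are useful in this loop, since I have no specific value to append (they probably come from code that used an rmongodb query). I also find "t(unlist(tmp))" odd: where does the t come from?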
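Note on t(unlist(tmp)): t is base R's transpose function. The following is a minimal sketch of what the expression does to a single document. The list `doc` is a made-up stand-in for the output of mongo.bson.to.list, so no MongoDB connection is assumed, and the field names are hypothetical.

```r
# Hypothetical document, standing in for
# mongo.bson.to.list(mongo.cursor.value(cursor))
doc <- list(name = "a", value = 1, tags = list(x = "p", y = "q"))

# unlist() flattens the nested list into a single named vector;
# nested fields get dotted names such as "tags.x"
flat <- unlist(doc)

# t() transposes that vector into a 1-row matrix, one column per field
row <- t(flat)

# as.data.frame() turns the 1-row matrix into a one-row data frame
doc.df <- as.data.frame(row, stringsAsFactors = FALSE)
str(doc.df)
```

Because unlist() coerces all values to a common type, the numeric field `value` ends up as a character column in this sketch.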

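The problem is that, for large collections (more than a few thousand documents), I get large discrepancies between the size of the collection in MongoDB and the size of the data frame in R. My PC has plenty of RAM, and R seems to run fine during the process (no freeze, no crash; it takes time, but that is expected given the many conversions from BSON to list to data frame).

I have already transferred a MongoDB collection of 36100 documents to R for data analysis without any problem.

So I am not sure where the problem lies.

Thanks in advance for any help on this topic.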

score 1 · Accepted Answer

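I would say none of this is needed. You can do it in the following simple way, which requires the R package named "rmongodb". The package requires the latest version and is not available in earlier versions. This package handles MongoDB. There are other packages as well, such as "RMongo".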

Installing rmongodb in R

install.packages("rmongodb")

Converting large data from MongoDB to a data frame in R

library(rmongodb)
mongo <- mongo.create()  # create a connection to MongoDB on localhost
mongo.is.connected(mongo)  # check whether MongoDB is connected
mongo.get.databases(mongo)  # show all databases present in MongoDB
mongo.get.database.collections(mongo, "mydb")  # display all collections present in database mydb
data <- mongo.find.all(mongo, "mydb.collection", data.frame = TRUE)  # this alone suffices: it converts the entire result list into a data frame in R
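If the cursor loop from the question is kept instead, one possible variant is to collect the one-row data frames in a list and bind them once at the end, rather than calling rbind.fill on a growing data frame at every iteration. The sketch below is only an illustration under stated assumptions: `docs` is a made-up list standing in for the values read from the cursor, and the field names are hypothetical. It does not by itself explain the missing rows, but comparing the final row count with the number of documents read makes a mismatch visible.

```r
library(plyr)

# Hypothetical documents, standing in for what mongo.bson.to.list()
# returns for each cursor value; the fields differ between documents.
docs <- list(
  list(id = 1, name = "a"),
  list(id = 2, city = "Paris"),
  list(id = 3, name = "c", city = "Lyon")
)

# Convert each document to a one-row data frame and keep it in a list
rows <- lapply(docs, function(doc) {
  as.data.frame(t(unlist(doc)), stringsAsFactors = FALSE)
})

# Bind everything in one call; rbind.fill() pads missing columns with NA
export <- rbind.fill(rows)

# The row count should match the number of documents read
stopifnot(nrow(export) == length(docs))
```

With a live cursor, the same idea would mean appending each one-row data frame to `rows` inside the while loop and calling rbind.fill(rows) once after the loop ends.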
