r - How to merge two large data.frames in R using the ff package?

Question

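I have two very large .csv files, call them CSV.1 and CSV.2 (CSV.1 is about 1.4 GB and CSV.2 is about 790 MB), and I want to combine them with a FULL OUTER join on the common field "Id". The fields in the CSV files are of mixed types: some are entirely numeric, others are strings. CSV.1 has about 190 columns and 1.6 million records; CSV.2 has about 40 columns and 570,000 records.

Initially, I wrote and ran the following code: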

library(plyr)  # provides join()
first_csv <- read.csv("CSV.1")
second_csv <- read.csv("CSV.2")
joined_csv <- join(first_csv, second_csv, by="Id", type="full")

However, this returned the typical "Your RAM is fully taxed" error. So I tried the following:

# Install and invoke the ff package
install.packages("ff")
library(ff)
library(plyr)

# Read in the data
first_csv <- read.csv("CSV.1")
second_csv <- read.csv("CSV.2")

# Convert dataframes to ffdf's, while freeing up memory
first_csv_ff <- as.ffdf(first_csv)
rm(first_csv)
gc()
second_csv_ff <- as.ffdf(second_csv)
rm(second_csv)
gc()

# Attempt to join the two ffdf's by "Id"
joined_csv <- join(first_csv_ff, second_csv_ff, by="Id", type="full")

R throws the following error:

Error in as.hi.integer(x, maxindex = maxindex, dim = dim, vw = vw, pack = pack) : 
NAs in as.hi.integer

I also tried "<- ffdf()" instead of as.ffdf, but had no joy there either.

Any help would be greatly appreciated!

score 1 · Accepted Answer

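You can use merge on ffdf objects (the merge method for ffdf is supplied by the ffbase package, which builds on ff). For reference: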

FULL Outer join ~ merge(x = df1, y = df2, ...., all = TRUE)

With your data, this should work:

merge(first_csv_ff, second_csv_ff, by="Id", all=TRUE)
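
To illustrate what all = TRUE asks for, here is a minimal base R sketch on two tiny in-memory data frames (df1, df2 and their contents are made up for illustration). It shows full outer join semantics only. Whether the ffdf method accepts all = TRUE depends on the installed package: as far as I know, ffbase documents its merge method for ffdf objects as supporting inner and left outer joins only, so it is worth checking ?merge.ffdf before relying on the call above.

```r
# Two tiny data frames standing in for the large CSV files (made-up data)
df1 <- data.frame(Id = c(1, 2, 3), a = c("x", "y", "z"), stringsAsFactors = FALSE)
df2 <- data.frame(Id = c(2, 3, 4), b = c(10, 20, 30))

# all = TRUE keeps unmatched rows from both sides (full outer join);
# columns missing on one side are filled with NA
full <- merge(df1, df2, by = "Id", all = TRUE)
print(full)
```

Id 1 exists only in df1 and Id 4 only in df2, so both appear in the result with NA in the columns coming from the other table.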

score 1 · Accepted Answer

The ffbase package provides basic statistical functions for the ff package.

install.packages("ffbase")
library(ffbase)
# Now perform the merge (ffdf1 and ffdf2 are ffdf objects sharing a "key" column)
merge(ffdf1, ffdf2, by="key")
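
A minimal end-to-end sketch of this approach, assuming ff and ffbase are installed. The temporary files, column names and values are made up for illustration. read.csv.ffdf (from ff) reads a file in chunks straight into an ffdf, which avoids loading the whole table into RAM with read.csv first. The merge shown is the default inner join, as in the call above.

```r
library(ff)
library(ffbase)

# Hypothetical small files standing in for CSV.1 and CSV.2
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
write.csv(data.frame(Id = 1:3, a = c(1.5, 2.5, 3.5)), f1, row.names = FALSE)
write.csv(data.frame(Id = 2:4, b = c(10, 20, 30)), f2, row.names = FALSE)

# Read directly into on-disk ffdf objects.
# Note: ff has no character vmode, so string columns would need to be
# stored as factors; this example uses numeric columns only.
first_csv_ff  <- read.csv.ffdf(file = f1)
second_csv_ff <- read.csv.ffdf(file = f2)

# Inner join on the shared key column
joined <- merge(first_csv_ff, second_csv_ff, by = "Id")

# Pull the (small) result back into memory to inspect it
result <- as.data.frame(joined)
print(result)
```

With real data the result would normally stay an ffdf; converting it with as.data.frame is only reasonable here because the example is tiny.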
