
I am using ff with R because I have a huge dataset (about 16 GB) to work with. As a test case, I read about 1M records from the file and wrote them out as an ff database.

system.time(te3 <- read.csv.ffdf(file="testdata.csv", sep = ",", header=TRUE, first.rows=10000, next.rows=50000, colClasses=c("numeric","numeric","numeric","numeric")))

I have uploaded the resulting file (te3) here: http://bit.ly/1c8pXqt

I am trying to do a simple calculation to create a new variable:

ffdfwith(te3, {odfips <- ofips*100000 + dfips})

I get the following error (there are no missing records), which puzzles me:

Error in if (by < 1) stop("'by' must be > 0") : missing value where TRUE/FALSE needed
In addition: Warning message: In chunk.default(from = 1L, to = 1000000L, by = 2293760000, maxindex = 1000000L) : NAs introduced by coercion

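Any insights would be much appreciated. Also, related to ff: is it possible to use standard R packages such as MCMC with an ff database (I need to use the inverse gamma function)?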

TIA,

Krishnan


1 Answer


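Adding an extra variable to an ffdf is a basic operation, and there are several ways to reach the same goal; see below. I downloaded your zip file from http://bit.ly/1c8pXqt and unpacked it.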

require(ffbase)
load.ffdf(dir="/home/janw/Desktop/stackoverflow/ffdb")

## ffdfwith and with both evaluate the expression chunk by chunk
te3$odfips <- ffdfwith(te3, ofips*100000 + dfips)
te3$odfips <- with(te3, ofips*100000 + dfips)
## It is better to restrict the ffdf to the columns used in the expression;
## otherwise the other columns are loaded into RAM as well, which is unnecessary.
## Restricting the columns speeds things up.
te3$odfips <- ffdfwith(te3[c("ofips","dfips")], ofips*100000 + dfips)
te3$odfips <- with(te3[c("ofips","dfips")], ofips*100000 + dfips)
## ffdfwith looks at options("ffbatchbytes") and works out how many rows of your ffdf
## fit in one batch without exceeding options("ffbatchbytes"), and hence RAM.
## So this variable is created in chunks.
## If you want to specify the chunk size yourself, you can e.g. pass the by argument
## to with, which is passed on to ?chunk. E.g. below, this variable is created
## in chunks of 100000 records.
te3$odfips <- with(te3[c("ofips","dfips")], ofips*100000 + dfips, by = 100000)

## As the Ops * and + are implemented in ffbase for ff vectors you can also do this:
te3$odfips <- te3$ofips * 100000 + te3$dfips
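To make "chunkwise" concrete, here is a minimal base-R sketch of the idea. It is an illustration only, not how ff implements it: plain numeric vectors stand in for the ff columns, and the sample FIPS values and `chunk_size` are made up.

```r
# Hypothetical sample data: plain numeric vectors standing in for ff columns.
ofips <- c(1001, 6037, 17031, 36061, 48201)
dfips <- c(6037, 1001, 48201, 17031, 36061)

n <- length(ofips)
chunk_size <- 2        # illustrative value; ff derives its own batch size
odfips <- numeric(n)   # preallocate the result vector

# Process the rows in consecutive blocks of at most chunk_size rows,
# so only one block needs to be held in memory at a time.
for (start in seq(1, n, by = chunk_size)) {
  idx <- start:min(start + chunk_size - 1, n)
  odfips[idx] <- ofips[idx] * 100000 + dfips[idx]
}

print(format(odfips, scientific = FALSE))
```

The result is the same as evaluating the expression on the whole vectors at once; chunking only limits how much data is in RAM at any moment.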

It is not clear to me why you get this error. Perhaps you have set options("ffbatchbytes") to a very low number? I do not get this error.
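One detail in the error message may help narrow this down. The warning reports `by = 2293760000`, which is larger than the biggest value an R integer can hold. The base-R sketch below shows that coercing such a value to integer gives NA, and that testing NA inside `if` produces the same kind of error. Whether this is what happens inside `chunk.default` is an assumption; the sketch uses base R only and does not call ff.

```r
# Value of `by` copied from the warning message in the question.
by <- 2293760000

# R integers are 32-bit, so the largest representable value is 2147483647.
exceeds_int_range <- by > .Machine$integer.max

# Coercing an out-of-range double to integer yields NA (normally with a warning).
by_int <- suppressWarnings(as.integer(by))

# Testing NA inside if() raises an error instead of returning TRUE or FALSE.
msg <- tryCatch(
  if (by_int < 1) stop("'by' must be > 0"),
  error = function(e) conditionMessage(e)
)

print(exceeds_int_range)
print(by_int)
print(msg)
```

If the batch size is derived from options("ffbatchbytes"), as described in the comments above, an unusually large value of that option could produce a `by` like this one. Passing an explicit `by`, as in the `with(..., by = 100000)` call above, would avoid relying on the computed value.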

The question about MCMC is too vague to answer.

Answered 2014-02-27T09:41:32.810