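I am trying to work with larger numbers, over 2^32. Although I am also using data.table and fread, I do not think the problem is related to them: I can turn the symptom on and off without changing anything about data.table or fread. The symptom is that I get a reported mean of 4.1e-302 when I expect a positive exponent, somewhere between 1e+3 and 1e+17.
The problem appears consistently when I use the bit64 package and its integer64-related functions. Things work for me with "regular sized data and R", so I must be expressing something incorrectly for this package. See my code and data below.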
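I am on a MacBook Pro, 16GB, i7 (up to date).
I have restarted my R session and cleared the workspace, but the problem persists.
Please advise; I appreciate any input. I assume it has to do with how I am using the bit64 library.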
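Links I have looked at include the bit64 documentation and a question with similar symptoms that was caused by a fread() memory leak, which I believe I have ruled out.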
Here is my input data:
var1,var2,var3,var4,var5,var6,expected_row_mean,expected_row_stddev
1000 ,993 ,987 ,1005 ,986 ,1003 ,996 ,8
100000 ,101040 ,97901 ,100318 ,96914 ,97451 ,98937 ,1722
10000000 ,9972997 ,9602778 ,9160554 ,8843583 ,8688500 ,9378069 ,565637
1000000000 ,1013849241 ,973896894 ,990440721 ,1030267777 ,1032689982 ,1006857436 ,23096234
100000000000 ,103171209097 ,103660949260 ,102360301140 ,103662297222 ,106399064194 ,103208970152 ,2078732545
10000000000000 ,9557954451905 ,9241065464713 ,9357562691674 ,9376495364909 ,9014072235909 ,9424525034852 ,334034298683
1000000000000000 ,985333546044881 ,994067361457872 ,1034392968759970 ,1057553099903410 ,1018695335152490 ,1015007051886440 ,27363415718203
100000000000000000 ,98733768902499600 ,103316759127969000 ,108062824583319000 ,111332326225036000 ,108671041505404000 ,105019453390705000 ,5100048567944390
My code, using this sample data:
# file: problem_bit64.R
# OBJECTIVE: Using larger numbers, I want to calculate a row mean and row standard deviation
# ERROR: I don't know what I am doing wrong to get such errors, seems bit64 related
# PRIORITY: BLOCKED (do this in Python instead?)
# reported Sat 9/24/2016 by Greg
# sample data:
# each row is 100 times larger on average, for 8 rows, starting with 1,000
# for the vars within a row, there is 10% uniform random variation. B2 = ROUND(A2+A2*0.1*(RAND()-0.5),0)
# Install development version of data.table --> for fwrite()
install.packages("data.table", repos = "https://Rdatatable.github.io/data.table", type = "source")
require(data.table)
require(bit64)
.Machine$integer.max # 2147483647 Is this an issue ?
.Machine$double.xmax # 1.797693e+308 I assume not
# -------------------------------------------------------------------
# ---- read in and basic info that works
csv_in <- "problem_bit64.csv"
dt <- fread( csv_in )
dim(dt) # 8 8 (8 rows, 8 columns in the sample data above)
lapply(dt, class) # "integer64" for all 8
names(dt) # "var1" "var2" "var3" "var4" "var5" "var6" "expected_row_mean" "expected_row_stddev"
dtin <- dt[, 1:6, with=FALSE] # just save the 6 input columns
...and now the problems start:
# -------------------------------------------------------------------
# ---- CALCULATION PROBLEMS START HERE
# ---- for each row, I want to calculate the mean and standard deviation
a <- apply(dtin, 1, mean.integer64); a # get 8 values like 4.9e-321
b <- apply(dtin, 2, mean.integer64); b # get 6 values like 8.0e-308
# ---- try secondary variations that do not work
c <- apply(dtin, 1, mean); c # get 8 values like 4.9e-321
c <- apply(dtin, 1, mean.integer64); c # same result
c <- apply(dtin, 1, function(x) mean(x)); c # same
c <- apply(dtin, 1, function(x) sum(x)/length(x)); c # same results as mean(x)
##### I don't see any sd.integer64 # FEATURE REQUEST, Z-TRANSFORM IS COMMON
c <- apply(dtin, 1, function(x) sd(x)); c # unrealistic values - see expected
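As a minimal sketch of what I suspect is going on (an assumption on my part, not something I have confirmed): apply() coerces its input through as.matrix(), and if the integer64 class is lost along the way, the underlying 64-bit patterns would be read as ordinary doubles. The vector `x` below is a hypothetical stand-in for one row of my data, and `stripped` is just an illustrative name.

```r
library(bit64)

# Hypothetical stand-in for one row of the large data (values above 2^32)
x <- as.integer64(c("100000000000", "103171209097", "103660949260"))

# Dropping the class exposes the raw 64-bit patterns reinterpreted as doubles;
# for integers of this size they show up as denormal numbers close to zero
stripped <- unclass(x)
print(stripped)

# Converting explicitly to double keeps the magnitudes
# (exact only up to 2^53, so very large values may be rounded)
d <- as.double(x)
print(mean(d))
print(sd(d))
```

If this is indeed the cause, the tiny values would come from the class being dropped rather than from mean() itself.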
Regular-sized R on regular data, still using fread() to read the data into a data.table - WORKS
# -------------------------------------------------------------------
# ---- delete big numbers, and try regular stuff - WHICH WORKS
dtin2 <- dtin[ 1:3, ] # just up to about 10 million (SAME DATA, SAME FREAD, SAME DATA.TABLE)
dtin2[ , var1 := as.integer(var1) ] # I know there are fancier ways to do this
dtin2[ , var2 := as.integer(var2) ] # but I want things to work before getting fancy.
dtin2[ , var3 := as.integer(var3) ]
dtin2[ , var4 := as.integer(var4) ]
dtin2[ , var5 := as.integer(var5) ]
dtin2[ , var6 := as.integer(var6) ]
lapply( dtin2, class ) # validation
c <- apply(dtin2, 1, mean); c # get 3 row values AS EXPECTED (matching expected columns)
c <- apply(dtin2, 1, function(x) mean(x)); c # CORRECT
c <- apply(dtin2, 1, function(x) sum(x)/length(x)); c # same results as mean(x)
c <- apply(dtin2, 1, sd); c # get 3 row values AS EXPECTED (matching expected columns)
c <- apply(dtin2, 1, function(x) sd(x)); c # CORRECT
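For comparison, here is a sketch of a possible workaround for the large rows, assuming that some loss of precision above 2^53 is acceptable for a mean and standard deviation. It converts the integer64 columns to double in one step before calling apply(). The table `dt_demo` and the vector `cols` are hypothetical names, and only three columns and two rows of the sample data are reproduced.

```r
library(data.table)
library(bit64)

# Hypothetical two-row sample with integer64 columns (values taken from the data above)
dt_demo <- data.table(
  var1 = as.integer64(c("1000", "100000000000")),
  var2 = as.integer64(c("993",  "103171209097")),
  var3 = as.integer64(c("987",  "103660949260"))
)

# Convert every column to double in one step instead of one := per column
cols <- names(dt_demo)
dt_demo[, (cols) := lapply(.SD, as.double), .SDcols = cols]

# apply() now sees ordinary doubles, so mean() and sd() behave as usual
row_mean <- apply(dt_demo, 1, mean)
row_sd   <- apply(dt_demo, 1, sd)
print(row_mean)
print(row_sd)
```

This gives up exact 64-bit integer arithmetic, so it is only a stopgap for summary statistics, not a replacement for integer64 where exact values matter.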