r - How do I exclude NAs? (fitdist function)

Question

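I have a 100x2 data frame DFN. Running fitdist on the column DFN$LRet gives the error message "the function mle failed to estimate the parameters, with the error code 100". I think the reason is that the last row contains an NA. So I ran fitdist excluding the NA, and now I get the error "data must be a numeric vector of length greater than 1". Any ideas on how to solve this? Thank you very much.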

DFN <- structure(list(LRet = c(0.0011, 0, -0.0026, 0, -0.0015, 0.0038, 3e-04, -0.0021, 4e-04, -0.001, 0, 0.0019, -6e-04, -8e-04, -5e-04, -8e-04, 3e-04, -5e-04, -0.0026, 0.0014, 7e-04, 0, -2e-04, 0.0011, -0.0025, 0.0042, 0.0022, -0.0017, -0.0058, 1e-04, 2e-04, 8e-04, -9e-04, -0.0014, -0.0014, -0.001, -0.0032, -0.0015, 6e-04, -8e-04, 0.001, -0.0014, -0.0017, -8e-04, -0.001, 0.0011, 0.0013, -0.001, 5e-04, 9e-04, -8e-04, -0.0025, 0.0027, 6e-04, 2e-04, -6e-04, 9e-04, -3e-04, -7e-04, 3e-04, 0, 2e-04, -6e-04, 1e-04, -1e-04, -7e-04, -8e-04, 7e-04, -1e-04, -7e-04, 7e-04, 8e-04, -8e-04, 8e-04, 0.0058, -1e-04, -5e-04, 0.0027, -0.0012, 7e-04, 7e-04, 0, 3e-04, -1e-04, 2e-04, -2e-04, -0.0013, -1e-04, 1e-04, -0.0011, 0.0013, 2e-04, -3e-04, -7e-04, 0, 0.0015, 1e-04, 3e-04, -0.0012, NA), LRetPct = c("0.11%", "0.00%", "-0.26%", "0.00%", "-0.15%", "0.38%", "0.03%", "-0.21%", "0.04%", "-0.10%", "0.00%", "0.19%", "-0.06%", "-0.08%", "-0.05%", "-0.08%", "0.03%", "-0.05%", "-0.26%", "0.14%", "0.07%", "0.00%", "-0.02%", "0.11%", "-0.25%", "0.42%", "0.22%", "-0.17%", "-0.58%", "0.01%", "0.02%", "0.08%", "-0.09%", "-0.14%", "-0.14%", "-0.10%", "-0.32%", "-0.15%", "0.06%", "-0.08%", "0.10%", "-0.14%", "-0.17%", "-0.08%", "-0.10%", "0.11%", "0.13%", "-0.10%", "0.05%", "0.09%", "-0.08%", "-0.25%", "0.27%", "0.06%", "0.02%", "-0.06%", "0.09%", "-0.03%", "-0.07%", "0.03%", "0.00%", "0.02%", "-0.06%", "0.01%", "-0.01%", "-0.07%", "-0.08%", "0.07%", "-0.01%", "-0.07%", "0.07%", "0.08%", "-0.08%", "0.08%", "0.58%", "-0.01%", "-0.05%", "0.27%", "-0.12%", "0.07%", "0.07%", "0.00%", "0.03%", "-0.01%", "0.02%", "-0.02%", "-0.13%", "-0.01%", "0.01%", "-0.11%", "0.13%", "0.02%", "-0.03%", "-0.07%", "0.00%", "0.15%", "0.01%", "0.03%", "-0.12%", " NA%")), .Names = c("LRet", "LRetPct"), class = "data.frame", row.names = 901:1000)

library(fitdistrplus)

#Following gives error code 100
f1 <- fitdist(DFN$LRet,"norm") 

#Following gives error code 100
f1 <- fitdist(DFN$LRet,"norm", na.rm=T)

#Following gives error "data must be a numeric vector of length greater than 1"
f1 <- fitdist(na.exclude(DFN$LRet),"norm")
#Same result using na.omit

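Note that the code above works if the last row, which contains the NA, is removed. I would rather not have to remove the last row before running fitdist if that can be avoided.

Edit/update: Removing the last row with the NA did resolve the problem, but I can no longer reproduce this consistently (after removing the last row the code ran successfully several times, but not always). I am trying to understand why. I tried 25x2, 100x2, and 300x2 data frames, as well as plain vectors, with similar results. I had thought the size of the data frame or vector might be part of the problem, hence the trials with different sizes.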

score 6 · Accepted Answer

Debugging through fitdist shows this check:

 if (!(is.vector(data) & is.numeric(data) & length(data) > 1)) 
    stop("data must be a numeric vector of length greater than 1")

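Looking at ?is.vector:

'is.vector' returns 'TRUE' if 'x' is a vector of the specified mode having no attributes other than names.

na.exclude and its relatives (na.omit, etc.) store information about the excluded values as an attribute, so is.vector() becomes FALSE...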
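To make the attribute issue concrete, here is a minimal base-R sketch. The three-element vector x is made up for illustration; only na.omit, attributes, is.vector, as.vector, and is.na are used.

```r
# Toy vector with one missing value (illustrative, not the question's data).
x <- c(0.0011, -0.0026, NA)

# na.omit() drops the NA but records what it dropped in an "na.action" attribute.
cleaned <- na.omit(x)
attributes(cleaned)
is.vector(cleaned)   # FALSE, because of the extra attribute

# Stripping attributes (or subsetting directly) gives a plain numeric vector.
plain <- as.vector(cleaned)
direct <- x[!is.na(x)]
is.vector(plain)     # TRUE
is.vector(direct)    # TRUE
```

Subsetting with x[!is.na(x)] never attaches the attribute in the first place, so it passes the is.vector() check without an extra step.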

One of the side effects of c() is to drop attributes other than names, so is.vector(c(na.exclude(DFN$LRet))) is TRUE, so

fitdist(c(na.exclude(DFN$LRet)), "norm")

at least doesn't get the "must be a numeric vector" error -- but I still get the "error 100". Investigating further ...

Digging into the guts of fitdist some more, it appears that (as suggested by @42-) optim() is having trouble. Specifically, it actually gets to an answer, but when it tries to calculate the Hessian of the solution it tries a negative value for the standard deviation parameter and barfs.

As an illustration, this works:

nn <- c(na.exclude(DFN$LRet))
fn <- function(x) -sum(dnorm(nn,mean=x[1],sd=x[2],log=TRUE))
optim(fn,par=c(mean(nn),sd(nn)),method="Nelder-Mead")

but this fails:

optim(fn,par=c(mean(nn),sd(nn)),method="Nelder-Mead",hessian=TRUE)
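One possible way around the Hessian failure, sketched here with base R only, is to optimize over log(sd) instead of sd, so that no step can produce a negative standard deviation. This is a different parameterization from the one fitdist uses, not a description of its internals. The data below are simulated stand-ins on roughly the same scale as DFN$LRet, and fn_log is a hypothetical name. The likely trigger for the original failure is that optim's default finite-difference step (ndeps = 1e-3) is of the same order as the sd estimate (about 0.0015).

```r
# Stand-in data (assumption): simulated returns on the same scale as DFN$LRet.
set.seed(1)
nn <- rnorm(99, mean = 0, sd = 0.0015)

# Negative log-likelihood with the sd written as exp(p[2]),
# so every value tried by the optimizer or the Hessian step is a valid sd.
fn_log <- function(p) -sum(dnorm(nn, mean = p[1], sd = exp(p[2]), log = TRUE))

fit <- optim(par = c(mean(nn), log(sd(nn))), fn = fn_log,
             method = "Nelder-Mead", hessian = TRUE)

# Transform the second parameter back to the sd scale.
est <- c(mean = fit$par[1], sd = exp(fit$par[2]))
est
fit$hessian  # Hessian with respect to (mean, log(sd)), not (mean, sd)
```

Note that the Hessian returned this way is on the (mean, log(sd)) scale, so standard errors for sd itself would need a further transformation.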

score 4 · Accepted Answer

(Also found the poorly written is.vector section of the code, but it didn't solve the errors.) The fitdist function seems to have difficulty with vectors of small variance:

var( na.exclude(DFN$LRet))
[1] 2.220427e-06

You can get around that by multiplying by 10:

> f1 <- fitdist(10*c(na.exclude(DFN$LRet)),"norm")
> f1
Fitting of the distribution ' norm ' by maximum likelihood 
Parameters:
          estimate  Std. Error
mean -0.0009090909 0.001490034
sd    0.0148256472 0.001032122

Standard probability theory lets you then correct those estimates: divide by 10 for the mean and by 100 for the variance (or 10 for the sd). The estimates from corrected fitdist-results are reasonably close to the sample values:

> all.equal( 0.0148256472/10 , sd(na.exclude(DFN$LRet) ) )
[1] "Mean relative difference: 0.005089095"
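The rescaling rule can be checked with base R alone. The sketch below uses the first eight LRet values from the question as a small sample; k and sd_mle are illustrative names. It also shows that maximum likelihood estimates the sd with a divisor of n while sd() uses n - 1, which is consistent with the small gap reported above: for n = 99, sqrt(99/98) - 1 is about 0.005089.

```r
# First eight LRet values from the question, used as a small sample.
x <- c(0.0011, 0, -0.0026, 0, -0.0015, 0.0038, 3e-04, -0.0021)
k <- 10
n <- length(x)

# Scaling the data by k scales the mean and sd by k, and the variance by k^2.
same_mean <- isTRUE(all.equal(mean(k * x) / k, mean(x)))
same_sd   <- isTRUE(all.equal(sd(k * x) / k, sd(x)))
same_var  <- isTRUE(all.equal(var(k * x) / k^2, var(x)))

# The ML estimate of sd divides by n; sd() divides by n - 1.
sd_mle <- sqrt(mean((x - mean(x))^2))
mle_matches <- isTRUE(all.equal(sd_mle, sd(x) * sqrt((n - 1) / n)))

# Relative gap between sd() and the ML estimate for a sample of size 99.
gap_99 <- sqrt(99 / 98) - 1
```

So the difference between the rescaled fitdist estimate and sd() looks like the n versus n - 1 divisor rather than an optimization error.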
