
I'm trying to fit a large discrete proportional-hazards model (~100k rows, ~10k events). To do this I used coxph(..., method = "exact"), as recommended by the survival package documentation, which states:

The “exact partial likelihood” is equivalent to a conditional logistic model, and is appropriate when the times are a small set of discrete values. If there are a large number of ties and (start, stop) style survival data the computational time will be excessive.

There were some warnings about computational difficulty with coxph and large numbers of ties, but according to the documentation for clogit in the same package:

The computation of the exact partial likelihood can be very slow, however. If a particular strata had, say 10 events out of 20 subjects we have to add up a denominator that involves all possible ways of choosing 10 out of 20, which is 20!/(10! 10!) = 184756 terms. Gail et al describe a fast recursion method which largely ameliorates this; it was incorporated into version 2.36-11 of the survival package.
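(The 184756 figure quoted there is just the binomial coefficient, which is easy to verify:)

```r
# Number of terms in the exact partial-likelihood denominator for
# 10 events among 20 subjects in one stratum: C(20, 10) = 20!/(10! 10!)
choose(20, 10)  # 184756
```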

So I didn't expect the computational issues to be too bad. Nevertheless, I've run into many segmentation faults when trying to fit variants of a trivial (one-predictor) Cox model on my dataset. One is a "C stack overflow," resulting in the short and sweet (and uninformative) message:

Error: segfault from C stack overflow
Execution halted

The other is a "memory not mapped" error, which occurred when I accidentally flipped the "event" boolean so that I had ~90k events instead of ~10k:

 *** caught segfault ***
address 0xffffffffac577830, cause 'memory not mapped'

Traceback:
 1: fitter(X, Y, strats, offset, init, control, weights = weights,     method = method, row.names(mf))
 2: coxph(Surv(time, status == EVENT.STATUS) ~ litter, data = data,     method = "exact")
aborting ...

For reference, the code I'm running is simply coxph(Surv(t, d) ~ x, data = data, method = 'exact'). t is an integer column, d is Boolean and x is a float.

Are these known issues? Are there workarounds?

EDIT: Here's some code reproducing the problem on the rats dataset (replicated 1000 times):

library(survival)
print("constructing data")
data <- rats
SIZE <- nrow(rats)
# passes with 100 reps, but fails with 1000 on my machine (MacBook Pro, 16 GB RAM)
REPS <- 1000
# set to 0 for "memory not mapped", 1 for "C stack overflow"
EVENT.STATUS <- 0
data <- data[rep(seq_len(SIZE), REPS), ]
print(summary(data$status == EVENT.STATUS))
print("fitting model")
fit <- coxph(Surv(time, status == EVENT.STATUS) ~ litter,
             data = data, method = "exact")

And here's my R version info:

platform       x86_64-apple-darwin14.0.0
arch           x86_64
os             darwin14.0.0
system         x86_64, darwin14.0.0
status
major          3
minor          1.2
year           2014
month          10
day            31
svn rev        66913
language       R
version.string R version 3.1.2 (2014-10-31)
nickname       Pumpkin Helmet

1 Answer


I was able to fit a Poisson model with that dataset. (I have a large dataset of my own, and I wasn't willing to risk a possible segfault.)

fit <- glm(I(status == 0) ~ litter + offset(log(time)),
           data = data, family = poisson)

> fit

Call:  glm(formula = I(status == 0) ~ litter + offset(log(time)), family = poisson, 
    data = data)

Coefficients:
(Intercept)       litter  
  -4.706485    -0.003883  

Degrees of Freedom: 149999 Total (i.e. Null);  149998 Residual
Null Deviance:      60500 
Residual Deviance: 60150    AIC: 280200

This estimate of the effect of litter should be similar to the one you would get from a Cox PH model.
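As a rough sanity check of that correspondence, here is a sketch on the original (unreplicated) rats data, using the conventional event coding status == 1 rather than the flipped indicator above; the exact agreement depends on how reasonable a constant-baseline-hazard assumption is for these data:

```r
library(survival)

# Offset trick: person-time enters the Poisson model as an offset, so the
# litter coefficient approximates the Cox log hazard ratio when the
# baseline hazard is (approximately) constant.
pois <- glm(status ~ litter + offset(log(time)),
            data = rats, family = poisson)
cox  <- coxph(Surv(time, status) ~ litter, data = rats)

# Compare the two estimates of the litter effect side by side
c(poisson = coef(pois)[["litter"]], cox = coef(cox)[["litter"]])
```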

If you want to see the "offset trick" documented, see the classic monograph by Breslow and Day: "Statistical Methods in Cancer Research; Volume II: The Design and Analysis of Cohort Studies." They used the GLIM package, but the code is very similar to R's glm implementation, so the concepts should transfer straightforwardly. (I had the chance to work briefly with Norm Breslow when I used GLIM for my master's thesis. He was brilliant. I think it was my earlier GLIM training that made learning R so easy.)

Answered 2015-03-13T21:02:03.133