I'm trying to fit a large discrete proportional-hazards model (~100k rows, ~10k events). To do this I used coxph(..., method = "exact")
as recommended by the survival package documentation documentation, which states:
The “exact partial likelihood” is equivalent to a conditional logistic model, and is appropriate when the times are a small set of discrete values. If there are a large number of ties and (start, stop) style survival data the computational time will be excessive.
There were some warnings about computational difficulty with coxph
and large numbers of ties, but according to the documentation for clogit
in the same package:
The computation of the exact partial likelihood can be very slow, however. If a particular strata had, say 10 events out of 20 subjects we have to add up a denominator that involves all possible ways of choosing 10 out of 20, which is 20!/(10! 10!) = 184756 terms. Gail et al describe a fast recursion method which largely ameleorates this; it was incorporated into version 2.36-11 of the
survival
package.
So I didn't expect the computational issues to be too bad. Nevertheless, I've run into many segmentation faults when trying to fit variants of a trivial (one-predictor) Cox model on my dataset. One is a "C stack overflow," resulting in the short and sweet (and uninformative) message:
Error: segfault from C stack overflow
Execution halted
The other is a "memory not mapped" error, which occurred when I accidentally flipped the "event" boolean so that I had ~90k events instead of ~10k:
*** caught segfault ***
address 0xffffffffac577830, cause 'memory not mapped'
Traceback:
1: fitter(X, Y, strats, offset, init, control, weights = weights, method = method, row.names(mf))
2: coxph(Surv(time, status == EVENT.STATUS) ~ litter, data = data, method = "exact")
aborting ...
For reference, the code I'm running is simply coxph(Surv(t, d) ~ x, data = data, method = 'exact')
. t
is an integer column, d
is Boolean and x
is a float.
Are these known issues? Are there workarounds?
EDIT: Here's some code reproducing the problem on the rats
dataset (replicated 1000 times):
library(survival)
print("constructing data")
data <- rats
SIZE <- nrow(rats)
# passes with 100 reps, but fails with 100 on my machine (MacBook Pro, 16g RAM)
REPS <- 1000
# set to 0 for "memory not mapped", 1 for "C stack overflow"
EVENT.STATUS <- 0
data <- data[rep(seq_len(SIZE), REPS), ]
print(summary(data$status == EVENT.STATUS))
print("fitting model")
fit <- coxph(Surv(time, status == EVENT.STATUS) ~ litter,
data = data, method = "exact")
And here's version
:
platform x86_64-apple-darwin14.0.0
arch x86_64
os darwin14.0.0
system x86_64, darwin14.0.0
status
major 3
minor 1.2
year 2014
month 10
day 31
svn rev 66913
language R
version.string R version 3.1.2 (2014-10-31)
nickname Pumpkin Helmet