r - Explanation of the formula object used in the coxph function in R

Question

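I am a complete novice when it comes to survival analysis. I am working on a project that requires the coxph function from the "survival" package, but I have run into trouble because I do not understand what the formula object requires.

Most descriptions of the function that I can find read as follows:

"a formula object, with the response on the left of a ~ operator, and the terms on the right. The response must be a survival object as returned by the Surv function."

I know what needs to go on the left side of the operator; the problem is what the function expects on the right side.

Here is a link to my data (the actual data set is much larger; for brevity I am showing only the first 20 data points):

A brief description of the data: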

- Row 1 is the header

- Each row after that is a separate patient

- The first column is the age of the patient at the time of the study

- Columns 2 through 14 (headed by x2-x13), and 19 (x18) and 20 (x19) are covariates such as race, relationship status, medical conditions that take on either true (1) or false (0) values.

- Columns 15 (x14) through 18 (x17) are covariates such as tumor size, which take on whole number values greater than 0.

- The second to last column "sur" is the number of months survived, and "index" is whether or not that is a right-censored time (1 for true, 0 for false).

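Given this data, I need to plot Cox proportional hazards curves, but I end up with incorrect plots because the right side of the formula object is wrong.

Here is my code; "temp4" is the name I gave the data table: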

library("survival")
temp4 <- read.table("~/data.txt", header=TRUE)
seerCox <- coxph(Surv(sur, index)~ temp4$x1 + temp4$x2 + temp4$x3 + temp4$x4 + temp4$x5 + temp4$x6 + temp4$x7 + temp4$x8 + temp4$x9 + temp4$x10 + temp4$x11 + temp4$x12 + temp4$x13 + temp4$x14 + temp4$x15 + temp4$x16 + temp4$x17 + temp4$x18 + temp4$x19, data=temp4, singular.ok=TRUE)
plot(survfit(seerCox), main= "Cox Estimate", mark.time=FALSE, ylab="Probability", xlab="Survival Time in Months", col=c("blue", "red", "green"))

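I should also note that I have tried replacing the right side you see with the number 1, with a period, and leaving it blank. These approaches produce a Kaplan-Meier curve.

Here is the console output:

Each new line is an example of the error produced depending on how I filtered the data (e.g., if I only include patients older than 85, and so on).

If anyone could explain how this works, it would be greatly appreciated.

PS: I have been searching for a solution for over a week, and I am asking for help here as a last resort.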

score 1 · Accepted Answer

You should not use the temp4$ prefix in the formula if you are also supplying a data argument. The whole purpose of the data argument is to let the formula refer to columns by name, so those prefixes can be dropped.

seerCox <- coxph( Surv(sur, index) ~ . , data=temp4, singular.ok=TRUE)

The above would use all of the x-variables in your temp4 data frame (the dot stands for every column not already used on the left-hand side). This will use just the first 3:

seerCox <- coxph( Surv(sur, index) ~ x1+x2+x3 , data=temp4)
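To make the two forms concrete, here is a minimal sketch on simulated data. The data frame toy and its columns are made up for illustration (the real data file is not available here), and in this simulation index is coded 1 = event observed. It checks that the dot expands to the remaining columns, shows one way to build the right-hand side from a vector of column names, and requests curves for specific covariate profiles through newdata, which only works cleanly when the formula uses bare column names.

```r
library(survival)

# Simulated stand-in for temp4 (hypothetical data, not the asker's file)
set.seed(1)
n <- 200
toy <- data.frame(
  x1    = round(runif(n, 40, 90)),     # age-like covariate
  x2    = rbinom(n, 1, 0.4),           # binary covariate
  x3    = rbinom(n, 1, 0.3),           # binary covariate
  sur   = round(rexp(n, 0.05)) + 1,    # months of follow-up
  index = rbinom(n, 1, 0.7)            # in this sketch: 1 = event, 0 = censored
)

# The dot expands to every column of `toy` not used on the left-hand side
fit_dot   <- coxph(Surv(sur, index) ~ ., data = toy)
fit_named <- coxph(Surv(sur, index) ~ x1 + x2 + x3, data = toy)
all.equal(coef(fit_dot), coef(fit_named))

# Building the same formula from a character vector of column names
rhs <- paste0("x", 1:3)
f <- as.formula(paste("Surv(sur, index) ~", paste(rhs, collapse = " + ")))
fit_built <- coxph(f, data = toy)

# Predicted survival curves for two covariate profiles (ages 50 and 80)
profiles <- data.frame(x1 = c(50, 80), x2 = 0, x3 = 0)
sf <- survfit(fit_named, newdata = profiles)
dim(sf$surv)   # one column of survival probabilities per profile
```

Calling plot(sf, col = c("blue", "red")) would then draw one curve per row of profiles, whereas survfit on the fitted model without newdata returns a single curve. One separate thing worth checking: Surv(time, event) expects the second argument to be 1 (or TRUE) when the event was observed and 0 when the time is censored. If index really is 1 for right-censored times, as the question describes, the left-hand side would need to be Surv(sur, index == 0) instead.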

Exactly what the warnings signify depends on the data (as you have, in one sense, already shown by producing different sorts of collinearity with different subsets). If you have collinear columns, the information matrix becomes singular and cannot be inverted, so the software attempts to drop the aliased columns (their coefficients are reported as NA) with a warning. This is really telling you that you do not have enough data to fit the large models you are attempting. Exploring that possibility with table() calls on pairs of covariates is often informative.
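As a rough illustration of that kind of check, the sketch below uses simulated data in which one indicator is the exact complement of another and a third column is constant, as can happen after filtering to a subgroup. The data frame toy and its columns are hypothetical. The cross-tabulation exposes the empty cells, and the aliased terms come back with NA coefficients.

```r
library(survival)

# Simulated data with deliberately redundant columns (hypothetical)
set.seed(2)
n <- 150
toy <- data.frame(
  x2    = rbinom(n, 1, 0.5),
  x3    = rbinom(n, 1, 0.3),
  sur   = round(rexp(n, 0.05)) + 1,
  index = rbinom(n, 1, 0.7)          # in this sketch: 1 = event, 0 = censored
)
toy$x4 <- 1 - toy$x2                 # exact complement of x2: perfectly collinear
toy$x5 <- 0                          # constant column, e.g. after subsetting

# Empty cells in a two-way table point to redundant indicators
tab <- table(x2 = toy$x2, x4 = toy$x4)
tab

# Columns with a single distinct value carry no information at all
n_distinct <- sapply(toy[c("x2", "x3", "x4", "x5")],
                     function(v) length(unique(v)))
names(n_distinct)[n_distinct == 1]

# With singular.ok = TRUE the fit proceeds; aliased terms have NA coefficients
fit <- coxph(Surv(sur, index) ~ x2 + x3 + x4 + x5,
             data = toy, singular.ok = TRUE)
names(which(is.na(coef(fit))))
```

Running the same table() and distinct-value checks on each filtered subset (for example, only patients older than 85) should show which covariates have become redundant in that subset.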

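Bottom line: this is not a problem with how you built your formula, but a problem of not understanding the limitations that the chosen method places on the data set you have assembled. You need to define your goals more carefully. What is the highest priority of this study? Do you really need every variable? Might it be possible to aggregate some of these anonymous variables into clinically meaningful categories, such as diagnostic categories or comorbidities?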
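One possible way to do that kind of aggregation is sketched below on simulated data. It assumes, purely for illustration, that x8 through x13 are comorbidity flags; which of the real columns belong together is a clinical decision that cannot be read off the data. The flags are collapsed into a count and then into a three-level factor, so the model estimates two coefficients for them instead of six.

```r
library(survival)

# Simulated data; treating x8..x13 as comorbidity flags is an assumption
set.seed(3)
n <- 200
flag_names <- paste0("x", 8:13)
flags <- as.data.frame(matrix(rbinom(n * 6, 1, 0.2), ncol = 6))
names(flags) <- flag_names

toy <- data.frame(
  x1    = round(runif(n, 40, 90)),   # age-like covariate
  flags,
  sur   = round(rexp(n, 0.05)) + 1,
  index = rbinom(n, 1, 0.7)          # in this sketch: 1 = event, 0 = censored
)

# Collapse the six indicators into a count, then into three groups
toy$n_comorbid   <- rowSums(toy[flag_names])
toy$comorbid_grp <- cut(toy$n_comorbid,
                        breaks = c(-Inf, 0, 1, Inf),
                        labels = c("none", "one", "two_plus"))
table(toy$comorbid_grp)

# A smaller model: age plus the aggregated comorbidity group
fit_small <- coxph(Surv(sur, index) ~ x1 + comorbid_grp, data = toy)
coef(fit_small)
```

The cut points here are arbitrary; with real data the grouping should follow whatever categories are clinically meaningful for the study.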
