python - How to interpret the output of .predict() from a fitted scikit-survival model in Python?

Question

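I'm confused about how to interpret the output of .predict from a fitted CoxnetSurvivalAnalysis model in scikit-survival. I've read the Intro to Survival Analysis notebook in scikit-survival and the API reference, but I can't find an explanation. Below is a minimal example of what leads to my confusion: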

import pandas as pd
from sksurv.datasets import load_veterans_lung_cancer
from sksurv.linear_model import CoxnetSurvivalAnalysis

# load data
data_X, data_y = load_veterans_lung_cancer()

# one-hot-encode categorical columns in X
categorical_cols = ['Celltype', 'Prior_therapy', 'Treatment']

X = data_X.copy()
for c in categorical_cols:
    dummy_matrix = pd.get_dummies(X[c], prefix=c, drop_first=False)
    X = pd.concat([X, dummy_matrix], axis=1).drop(c, axis=1)

# display final X to fit Cox Elastic Net model on
del data_X
print(X.head(3))

So here is the X going into the model:

   Age_in_years  Celltype  Karnofsky_score  Months_from_Diagnosis  \
0          69.0  squamous             60.0                    7.0   
1          64.0  squamous             70.0                    5.0   
2          38.0  squamous             60.0                    3.0   

  Prior_therapy Treatment  
0            no  standard  
1           yes  standard  
2            no  standard

...moving on to fitting the model and generating predictions:

# Fit Model
coxnet = CoxnetSurvivalAnalysis()
coxnet.fit(X, data_y)    

# What are these predictions?    
preds = coxnet.predict(X)

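preds has the same number of records as X, but its values are wildly different from the values in data_y, even when predicting on the same data the model was fit on.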

print(preds.mean()) 
print(data_y['Survival_in_days'].mean())

Output:

-0.044114643249153422
121.62773722627738

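So what exactly is preds? Clearly .predict means something quite different here than in scikit-learn, but I can't figure out what. The API reference says it returns "the predicted decision function", but what does that mean? And how do I get to the predicted estimate yhat in months for a given X? I'm new to survival analysis, so I'm obviously missing something.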

score 3 · Accepted Answer

I posted this question on GitHub, though the author renamed the issue.

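I got some helpful explanations of what the predict output is, but I'm still not sure how to get a set of predicted survival times, which is what I really want. Here are a couple of helpful explanations from that GitHub thread: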

predictions are risk scores on an arbitrary scale, which means you can 
usually only determine the sequence of events, but not their exact time.

-sebp (library author)

It [predict] returns a type of risk score. Higher value means higher
risk of your event (class value = True)...You were probably looking
for a predicted time. You can get the predicted survival function with
estimator.predict_survival_function as in the example 00
notebook...EDIT: Actually, I’m trying to extract this but it’s been a
bit of a pain to munge

-pavopax
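To make the "sequence of events, but not their exact time" point concrete, here is a minimal sketch of a concordance index in plain Python. The toy times, event flags, and risk scores are hypothetical and not taken from the veterans data, and the function is a simplified illustration rather than the library's implementation (scikit-survival ships its own `concordance_index_censored` in `sksurv.metrics`). It only checks whether subjects with earlier observed events received higher risk scores.

```python
def concordance_index(event_times, event_observed, risk_scores):
    """Fraction of comparable pairs that the risk scores order correctly."""
    concordant = 0.0
    comparable = 0
    n = len(event_times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an observed event
            # strictly before subject j's recorded time.
            if event_observed[i] and event_times[i] < event_times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # earlier event, higher risk: correct order
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    return concordant / comparable


# Hypothetical toy data: survival times in days, event flags, risk scores.
times = [10, 35, 60, 120, 200]
observed = [True, True, False, True, True]
scores = [1.2, 0.4, 0.1, -1.5, -0.3]

c_original = concordance_index(times, observed, scores)
# Shifting and rescaling the scores keeps their order, so the index is unchanged.
c_rescaled = concordance_index(times, observed, [100 + 7 * s for s in scores])
print(c_original, c_rescaled)
```

Because only the ordering of the scores enters the calculation, any monotone rescaling gives the same result, which is why the scores cannot be read as days or months.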

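There is more explanation on the GitHub thread, though I wasn't able to follow all of it. I need to play around with predict_survival_function and predict_cumulative_hazard_function to see whether I can get a set of predictions for the most likely survival time for each row in X, which is what I really want.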
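One common way to turn a predicted survival function into a single time is to take the median, that is, the first time at which the survival probability falls to 0.5 or below. The sketch below shows that step on a hypothetical step function given as two plain lists. The helper name `median_survival_time` and the numbers are illustrative assumptions, not part of scikit-survival. It assumes the times are sorted in ascending order and the probabilities are non-increasing.

```python
def median_survival_time(times, survival_probs, threshold=0.5):
    """Return the first time at which the survival curve is at or below threshold.

    Returns None when the curve never drops that far, which can happen when
    follow-up ends while the predicted survival probability is still high.
    """
    for t, s in zip(times, survival_probs):
        if s <= threshold:
            return t
    return None


# Hypothetical step function for one patient: S(t) at a few time points (days).
times = [8, 25, 52, 99, 162, 287]
survival_probs = [0.95, 0.81, 0.62, 0.48, 0.30, 0.12]
median_days = median_survival_time(times, survival_probs)
print(median_days)

# A curve that never reaches 0.5 has no defined median within follow-up.
no_median = median_survival_time(times, [0.99, 0.97, 0.90, 0.85, 0.80, 0.70])
print(no_median)

# With scikit-survival this might be applied per row roughly as follows
# (assumptions: the model was created with fit_baseline_model=True, and each
# returned step function exposes its time points as .x and its values as .y):
#   surv_fns = coxnet.predict_survival_function(X)
#   medians = [median_survival_time(fn.x, fn.y) for fn in surv_fns]
```

The median is only one possible summary. It is not the same thing as a "most likely" survival time, and it is undefined for rows whose predicted curve stays above 0.5.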

I'm not going to accept this answer here, in case someone else has a better one.

score 0 · Accepted Answer

Given the input X, predict evaluates the model's linear predictor on the input array. This is its source:

def predict(self, X, alpha=None):
    """The linear predictor of the model.
    Parameters
    ----------
    X : array-like, shape = (n_samples, n_features)
        Test data of which to calculate log-likelihood from
    alpha : float, optional
        Constant that multiplies the penalty terms. If the same alpha was used during training, exact
        coefficients are used, otherwise coefficients are interpolated from the closest alpha values that
        were used during training. If set to ``None``, the last alpha in the solution path is used.
    Returns
    -------
    T : array, shape = (n_samples,)
        The predicted decision function
    """
    X = check_array(X)
    coef = self._get_coef(alpha)
    return numpy.dot(X, coef)

check_array is defined in another library (scikit-learn). You can look at the coxnet source code for the rest.
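Since the source above returns `numpy.dot(X, coef)`, each prediction is just a weighted sum of that row's features. Under the Cox model the hazard is proportional to the exponential of this sum, so the difference between two scores is a log hazard ratio. A minimal sketch in plain Python follows. The coefficients are hypothetical values chosen for illustration, not fitted ones, and `linear_predictor` is an illustrative helper rather than a library function.

```python
import math


def linear_predictor(row, coef):
    """Dot product of one feature row with the coefficient vector."""
    return sum(x * b for x, b in zip(row, coef))


# Hypothetical coefficients for (age in years, Karnofsky score); not fitted values.
coef = [0.01, -0.03]
patient_a = [69.0, 60.0]
patient_b = [64.0, 70.0]

score_a = linear_predictor(patient_a, coef)
score_b = linear_predictor(patient_b, coef)

# Under the Cox model, exp(score_a - score_b) is the hazard ratio of a versus b.
hazard_ratio = math.exp(score_a - score_b)
print(score_a, score_b, hazard_ratio)
```

The scores themselves have no time unit. Only their differences are meaningful, which is consistent with the mean of preds in the question being close to zero while the mean survival time is about 122 days.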
