
I'm working with the Titanic dataset from Kaggle and want to fit a simple logistic regression model.

I have read in the train and test data; train$Survived, train$Sex, test$Survived, and test$Sex are all factors.

I want to run a very simple logistic regression with Sex as the only independent variable.

fit <- glm(formula = Survived ~ Sex, family = binomial)

That looks fine to me:

> fit

Call:  glm(formula = Survived ~ Sex, family = binomial)

Coefficients:
(Intercept)      Sexmale  
      1.057       -2.514  

Degrees of Freedom: 890 Total (i.e. Null);  889 Residual
Null Deviance:      1187 
Residual Deviance: 917.8    AIC: 921.8

The problem is that I can't apply this fitted model to the test data. When I do the following:

predict(fit, train$Sex)

I get a vector of 891 values, which is the number of training examples in the training set.

I can't seem to find any information on how to do this properly.

Any help would be much appreciated!


1 Answer


I'm posting an answer to correct a couple of points that seem to have gotten confused. There is no single predict function that does the work; that is what the help page means when it calls predict a "generic function". Some generic functions do have a fun.default method, but predict has no default method, so dispatch rests entirely on the class of the first argument. Each method has its own help page, and the help page for predict lists several of them. Package authors need to write their own predict methods for new classes.
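To make the dispatch concrete, here is a minimal sketch. It assumes a small made-up data frame standing in for the Titanic training set (the values are invented for illustration, not taken from the Kaggle data).

```r
# Toy data standing in for the Titanic training set (values are made up).
train <- data.frame(
  Survived = factor(c(1, 1, 1, 0, 0, 0, 0, 1)),
  Sex      = factor(c("female", "female", "female", "female",
                      "male", "male", "male", "male"))
)
fit <- glm(Survived ~ Sex, family = binomial, data = train)

# The class vector drives S3 dispatch: predict(fit, ...) looks for
# predict.glm first, then predict.lm.
fit_class <- class(fit)
print(fit_class)

# List the predict.* methods visible in the current session.
print(methods(predict))

# Retrieve the method that predict() dispatches to for a glm object.
glm_method <- getS3method("predict", "glm")
print(is.function(glm_method))
```

The help page for the method itself is ?predict.glm, which is where the newdata and type arguments are documented.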

Logistic regression predates the machine learning paradigm, so expecting it to "predict classes" is somewhat unrealistic. Even the fact that you can get a "response" prediction is a gift compared with what the software would have provided 30 years ago, when some of us were taking our regression classes. One needs to understand that probabilities are generally not 0 or 1 but something in between. If the user wants to set a threshold and count how many cases exceed it, that is an analyst's decision, and the analyst needs to make whatever transformation to categories they deem worthwhile.
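A minimal sketch of that analyst-side step follows. It assumes made-up toy data in place of the Titanic files and an arbitrary cutoff of 0.5; neither the data nor the cutoff comes from the question.

```r
# Toy data standing in for the Titanic training set (values are made up).
train <- data.frame(
  Survived = factor(c(1, 1, 1, 0, 0, 0, 0, 1)),
  Sex      = factor(c("female", "female", "female", "female",
                      "male", "male", "male", "male"))
)
fit <- glm(Survived ~ Sex, family = binomial, data = train)

# Hypothetical test rows; the factor levels match the training data.
test <- data.frame(
  Sex = factor(c("male", "female", "female"), levels = levels(train$Sex))
)

# type = "response" returns fitted probabilities rather than log-odds.
p <- predict(fit, newdata = test, type = "response")

# The cutoff is the analyst's choice; the model does not prescribe it.
cutoff <- 0.5
predicted <- ifelse(p > cutoff, 1, 0)
print(data.frame(Sex = test$Sex, probability = p, predicted = predicted))
```

A different cutoff changes the classification without changing the fitted model, which is why the choice belongs to the analyst.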

Executing predict(fit, train$Sex) would be expected to return as many values as there are rows in the training set, so I'm guessing that you meant to try predict(fit, test$Sex) and were disappointed. If that's the case, it should have been predict(fit, list(Sex=test$Sex)). R needs newdata to be a value that can be coerced to a data frame, so a named list of values is the minimum requirement for the predictors.
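As a rough illustration of the corrected call, the sketch below passes newdata both as a named list and as a data frame. The data are invented toy values, not the Kaggle files, and the test set is deliberately a different size from the training set.

```r
# Toy data standing in for the Titanic training set (values are made up).
train <- data.frame(
  Survived = factor(c(1, 1, 1, 0, 0, 0, 0, 1)),
  Sex      = factor(c("female", "female", "female", "female",
                      "male", "male", "male", "male"))
)
fit <- glm(Survived ~ Sex, family = binomial, data = train)

# Hypothetical test set with 3 rows (the training set has 8).
test <- data.frame(
  Sex = factor(c("male", "female", "male"), levels = levels(train$Sex))
)

# newdata as a named list: the name must match the variable in the formula.
p_list <- predict(fit, newdata = list(Sex = test$Sex))

# newdata as a data frame containing a column called Sex.
p_df <- predict(fit, newdata = test)

print(length(p_df))                               # one value per test row
print(all.equal(unname(p_list), unname(p_df)))    # both forms should agree
```

Both calls return values on the link (log-odds) scale by default; adding type = "response" gives probabilities instead.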

If predict.glm is given a malformed value for its second argument, newdata, it falls back on the original data and uses the linear predictors that are retained in the model object.
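The retained linear predictors can be inspected directly. The sketch below, again on made-up toy data, covers only the simplest case, where newdata is omitted altogether; it does not try to reproduce the malformed call from the question.

```r
# Toy data standing in for the Titanic training set (values are made up).
train <- data.frame(
  Survived = factor(c(1, 1, 1, 0, 0, 0, 0, 1)),
  Sex      = factor(c("female", "female", "female", "female",
                      "male", "male", "male", "male"))
)
fit <- glm(Survived ~ Sex, family = binomial, data = train)

# With newdata omitted, predict() returns the values stored in the fit,
# one per training row, on the link (log-odds) scale.
lp <- predict(fit)

print(length(lp) == nrow(train))
print(all.equal(unname(lp), unname(fit$linear.predictors)))
```

A result whose length equals the number of training rows is therefore a sign that the supplied newdata was not actually used.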

answered 2013-10-07T17:06:44.413