r - Applying a regression while looping through the levels of a factor in R

Question

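I am trying to apply a regression function to each individual level of a factor (Subject). The idea is that, for each Subject, I can get predicted reading times based on their actual reading times (RT) and the length of the corresponding printed string (WordLen). A colleague helped me with some code that applies the function to each level of another factor (Region) within Subject. However, neither the original code nor my attempted modification (applying the function across the levels of a single factor) works.

Here is my attempt at some sample data: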

 test0<-structure(list(Subject = c(101L, 101L, 101L, 101L, 101L, 101L, 
101L, 101L, 101L, 101L, 102L, 102L, 102L, 102L, 102L, 102L, 102L, 
102L, 102L, 102L, 103L, 103L, 103L, 103L, 103L, 103L, 103L, 103L, 
103L, 103L), Region = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L), RT = c(294L, 241L, 346L, 339L, 332L, NA, 399L, 
377L, 400L, 439L, 905L, 819L, 600L, 520L, 811L, 1021L, 508L, 
550L, 1048L, 1246L, 470L, NA, 385L, 347L, 592L, 507L, 472L, 396L, 
761L, 430L), WordLen = c(3L, 3L, 3L, 3L, 3L, 3L, 5L, 7L, 3L, 
9L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 7L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 5L, 7L, 3L)), .Names = c("Subject", "Region", "RT", "WordLen"
), class = "data.frame", row.names = c(NA, -30L))

Unfortunately, these data produce a problem that I do not get with my full data set:

"Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
  0 (non-NA) cases"

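Perhaps this is because the sample data set is too small?

In any case, I am hoping someone will spot the problem with the code, despite my inability to provide working data...

Here is the original code (which does not work):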

for(i in 1:length(levels(test0$Subject)))
  for(j in 1:length(levels(test0$Region)))
    {tmp=predict(lm(RT~WordLen,test0[test0$Subject==levels(test0$Subject)[i] & test0$Region==levels(test0$Region)[j],],na.action="na.exclude"))
    test0[names(tmp),"rt.predicted"]=tmp
    }

Here is the modified code (which, unsurprisingly, does not work either):

for(i in 1:length(levels(test0$Subject)))
    {tmp=predict(lm(RT~WordLen,test0[test0$Subject==levels(test0$Subject)[i],],na.action="na.exclude"))
    test0[names(tmp),"rt.predicted"]=tmp
    }

I would greatly appreciate any suggestions.

score 3 · Accepted Answer

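You can get this result with the function ddply() from the library plyr. It splits the data frame by Subject, calculates the predictions of the regression model for each piece, and adds them to the data frame as a new column.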

library(plyr)
ddply(test0,.(Subject),transform, 
   pred=predict(lm(RT~WordLen,na.action="na.exclude")))

   Subject Region   RT WordLen     pred
1      101      1  294       3 327.9778
......
4      101      1  339       3 327.9778
5      101      1  332       3 327.9778
6      101      2   NA       3       NA
7      101      2  399       5 363.8444
.......
13     102      1  600       3 785.4146
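
The `na.action="na.exclude"` argument is what keeps the predictions the same length as each piece of the data frame. The sketch below illustrates this on a small made-up data frame (`toy` and its values are hypothetical, not taken from the question): with `na.omit`, `predict()` returns one value fewer than there are rows, while `na.exclude` pads the result with `NA`.

```r
# Hypothetical toy data: five rows, one missing response.
toy <- data.frame(RT = c(294, NA, 346, 339, 332),
                  WordLen = c(3, 5, 7, 3, 9))

# na.omit drops the incomplete row before fitting, so predict()
# returns only as many values as there are complete rows.
pred_omit <- predict(lm(RT ~ WordLen, data = toy, na.action = na.omit))

# na.exclude also drops the row for fitting, but predict() pads the
# result with NA so that it lines up with the original rows.
pred_exclude <- predict(lm(RT ~ WordLen, data = toy, na.action = na.exclude))

length(pred_omit)     # number of complete rows
length(pred_exclude)  # number of rows in toy
is.na(pred_exclude)   # marks the row whose RT was missing
```

Without the padding, the vector returned inside `transform` would be shorter than the group it is being attached to.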

To split the data by both Subject and Region, put both variables inside .().

ddply(test0,.(Subject,Region),transform,
    pred=predict(lm(RT~WordLen,na.action="na.exclude")))

score 2 · Accepted Answer

The only problem with your test data is that Subject and Region are not factors.

test0$Subject <- factor(test0$Subject)
test0$Region <- factor(test0$Region)

for(i in 1:length(levels(test0$Subject)))
  for(j in 1:length(levels(test0$Region)))
  {tmp=predict(lm(RT~WordLen,test0[test0$Subject==levels(test0$Subject)[i] & test0$Region==levels(test0$Region)[j],],na.action="na.exclude"))
   test0[names(tmp),"rt.predicted"]=tmp
  }
#   26     27     28     29     30 
# 442.25 442.25 560.50 678.75 442.25

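The reason you got the error (0 non-NA cases) is that, when subsetting, you were using the levels of variables that are not factors. With your original data set, try: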

test0[test0$Subject==levels(test0$Subject)[1],]

You get:

# [1] Subject Region  RT      WordLen
# <0 rows> (or 0-length row.names)

And that is what lm() was trying to work with.
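
To make the mechanism concrete, here is a minimal sketch on a made-up data frame (`d` and its values are hypothetical stand-ins for the question's data). It shows that `levels()` returns `NULL` for an integer column, so the comparison selects no rows until the column is converted to a factor.

```r
# Toy stand-in for the question's data: Subject is an integer column.
d <- data.frame(Subject = c(101L, 101L, 102L, 102L),
                RT = c(294, 241, 905, 819))

# Integers have no levels, so levels() returns NULL ...
lev_before <- levels(d$Subject)
# ... and comparing against NULL[1] gives a zero-length logical vector,
# which selects zero rows. This is the empty data frame lm() receives.
n_before <- nrow(d[d$Subject == levels(d$Subject)[1], ])

# After conversion, the same expression selects the first subject's rows.
d$Subject <- factor(d$Subject)
lev_after <- levels(d$Subject)
n_after <- nrow(d[d$Subject == levels(d$Subject)[1], ])

print(lev_before)
print(n_before)
print(lev_after)
print(n_after)
```

Note also that `1:length(levels(x))` evaluates to `c(1, 0)` when there are no levels, so the loop body still runs; `seq_along(levels(x))` would skip it entirely.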

score 2 · Accepted Answer

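Although your question seems to be asking for an explanation of the error, which others have already answered (the data are simply not factors), here is an approach that uses only the base package: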

# ave() writes each group's predictions back to that group's own rows,
# so the result stays aligned with test0 whatever the row order.
test0$rt.predicted <- ave(seq_len(nrow(test0)), test0$Subject, test0$Region,
    FUN = function(i) predict(lm(RT ~ WordLen, test0[i, ], na.action = "na.exclude")))

test0
##    Subject Region   RT WordLen rt.predicted
## 1      101      1  294       3     310.4000
## 2      101      1  241       3     310.4000
## 3      101      1  346       3     310.4000
## 4      101      1  339       3     310.4000
## 5      101      1  332       3     310.4000
## 6      101      2   NA       3           NA
## 7      101      2  399       5     399.0000
## 8      101      2  377       7     408.5000
## 9      101      2  400       3     389.5000
## 10     101      2  439       9     418.0000
## 11     102      1  905       3     731.0000
## 12     102      1  819       3     731.0000
## 13     102      1  600       3     731.0000
## 14     102      1  520       3     731.0000
## 15     102      1  811       3     731.0000
## 16     102      2 1021       3     870.4375
## 17     102      2  508       3     870.4375
## 18     102      2  550       5     877.3750
## 19     102      2 1048       7     884.3125
## 20     102      2 1246       3     870.4375
## 21     103      1  470       3     448.5000
## 22     103      1   NA       3           NA
## 23     103      1  385       3     448.5000
## 24     103      1  347       3     448.5000
## 25     103      1  592       3     448.5000
## 26     103      2  507       3     442.2500
## 27     103      2  472       3     442.2500
## 28     103      2  396       5     560.5000
## 29     103      2  761       7     678.7500
## 30     103      2  430       3     442.2500

score 0 · Accepted Answer

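I expect this is caused by there being no data for some combination of your two categorical variables. What you could do is extract the subset first, check that it is not empty, and run lm only when there is data.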
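
A minimal sketch of that guard, assuming made-up data in which one Subject/Region combination has no rows (the data frame `d` and its values are hypothetical). It loops over `unique()` values, so it does not depend on the columns being factors.

```r
# Hypothetical toy data: Subject 2 has no rows in Region "b".
d <- data.frame(
  Subject = c(1, 1, 1, 1, 1, 1, 2, 2, 2),
  Region  = c("a", "a", "a", "b", "b", "b", "a", "a", "a"),
  RT      = c(300, 340, 390, 410, 380, 450, 600, 640, 700),
  WordLen = c(3, 5, 7, 3, 5, 9, 3, 5, 7)
)
d$rt.predicted <- NA_real_

for (s in unique(d$Subject)) {
  for (r in unique(d$Region)) {
    rows <- which(d$Subject == s & d$Region == r)
    # Skip combinations with no usable observations instead of calling lm().
    if (sum(!is.na(d$RT[rows])) == 0) next
    fit <- lm(RT ~ WordLen, data = d[rows, ], na.action = na.exclude)
    # na.exclude keeps predict() the same length as the subset.
    d$rt.predicted[rows] <- predict(fit)
  }
}

d
```

Checking for non-missing RT values rather than only for a non-empty subset also covers a group whose responses are all NA, which would trigger the same lm() error.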
