r - How to compute variables for each subject in a dataset

Question

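I have reaction-time and accuracy data that needs to be scored for each subject, and I would like to know which R packages or functions would best suit my needs. Below is a snippet of the data for 2 subjects. Each row represents a single trial in which the subject responded to a stimulus.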

 date subject trialn blockcode     trialtype latency response correct
32913      15      1  practice    taskswitch    1765      205       1
32913      15      2  practice     cueswitch    4372      203       1
32913      15      3  practice cuerepetition    2523      203       0
32913      15      1      test     cueswitch    2239      205       1
32913      15      2      test cuerepetition    1244      203       1
32913      15      3      test    taskswitch    1472      203       0
32913      15      4      test     cueswitch    1877      205       1
32913      15      5      test    taskswitch    2271      203       1
30413      16      1  practice    taskswitch    1377      203       1
30413      16      2  practice    taskswitch    1648      203       1
30413      16      3  practice     cueswitch    1181      205       1
30413      16      1      test     cueswitch    1045      205       1
30413      16      2      test cuerepetition     969      203       0
30413      16      3      test     cueswitch     857      203       1
30413      16      4      test    taskswitch    1038      205       1
30413      16      5      test cuerepetition     836      203       0

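Here is a description of what I would like to do:

Looking only at the "test" trials, for each subject, compute
- the total number of trials
- the number of trials with a latency (i.e. reaction time) below 300 ms
- the mean latency
- the mean of correct (i.e. accuracy)
Then, looking only at trials whose latency is within 3 standard deviations of the subject's mean latency, compute the mean latency for each trial type.
Finally, create a new data frame containing all of these variables along with the subject ID and date.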

score 3 · Accepted Answer

The plyr package is handy for this sort of thing (there is also data.table, but I don't know its syntax). Here is an example to get you started:

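library(plyr)

# Summarise one subject's trials; `tmp` holds the rows for that subject
my_function <- function(tmp){
  data.frame(n_trials     = nrow(tmp),                    # count rows; summing trialn would add up trial numbers
             n_trialslat  = sum(tmp[, 'latency'] < 300),  # trials faster than 300 ms
             mean_latency = mean(tmp[, 'latency']))
}

# `d` is the data frame holding the data shown in the question
ddply(subset(d, blockcode == "test"), 'subject', my_function)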
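
For readers without plyr, the same split-apply-combine idea can be sketched in base R. This is only an illustration: the data frame `d` is rebuilt here from the question's sample rows, and the column names `n_fast` and `mean_correct` are my own choices, not part of the answer above.

```r
# Sample data from the question (only the columns needed here)
d <- data.frame(
  subject   = rep(c(15, 16), each = 8),
  blockcode = rep(c(rep("practice", 3), rep("test", 5)), times = 2),
  latency   = c(1765, 4372, 2523, 2239, 1244, 1472, 1877, 2271,
                1377, 1648, 1181, 1045,  969,  857, 1038,  836),
  correct   = c(1, 1, 0, 1, 1, 0, 1, 1,
                1, 1, 1, 1, 0, 1, 1, 0)
)

# Keep only the "test" trials
test_only <- d[d$blockcode == "test", ]

# Split by subject, summarise each piece, then bind the pieces back together
per_subject <- lapply(split(test_only, test_only$subject), function(tmp) {
  data.frame(subject      = tmp$subject[1],
             n_trials     = nrow(tmp),
             n_fast       = sum(tmp$latency < 300),
             mean_latency = mean(tmp$latency),
             mean_correct = mean(tmp$correct))
})
scores <- do.call(rbind, per_subject)
rownames(scores) <- NULL
scores
```

The result has one row per subject, which is the shape the question asks for.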

score 3 · Accepted Answer

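Stack Overflow isn't really meant for tutorials, so be sure to check out the documentation for data.table. The package website is a good place to start, and there are many questions about the package on SO that cover almost everything.

Here I just want to show you how easy this is once you are used to the package's syntax.

First, let's load the package and read in your data: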

library(data.table)
str <- "date subject trialn blockcode     trialtype latency response correct
        32913      15      1  practice    taskswitch    1765      205       1
        32913      15      2  practice     cueswitch    4372      203       1
        32913      15      3  practice cuerepetition    2523      203       0
        32913      15      1      test     cueswitch    2239      205       1
        32913      15      2      test cuerepetition    1244      203       1
        32913      15      3      test    taskswitch    1472      203       0
        32913      15      4      test     cueswitch    1877      205       1
        32913      15      5      test    taskswitch    2271      203       1
        30413      16      1  practice    taskswitch    1377      203       1
        30413      16      2  practice    taskswitch    1648      203       1
        30413      16      3  practice     cueswitch    1181      205       1
        30413      16      1      test     cueswitch    1045      205       1
        30413      16      2      test cuerepetition     969      203       0
        30413      16      3      test     cueswitch     857      203       1
        30413      16      4      test    taskswitch    1038      205       1
        30413      16      5      test cuerepetition     836      203       0"
DT <- as.data.table(read.table(text=str, header=TRUE))

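Now, here is one of the things you asked for:

Looking only at the "test" trials, for each unique subject, compute the total number of trials, the number of trials with a latency (i.e. reaction time) below 300 ms, the mean latency, and the mean of correct (i.e. accuracy).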

DT[blockcode=="test", 
   list(TotalNr = .N,
        NrTrailLat = sum(latency < 300),
        MeanLat = mean(latency),
        MeanCor = mean(correct)), 
   by="subject"]
   subject TotalNr NrTrailLat MeanLat MeanCor
1:      15       5          0  1820.6     0.8
2:      16       5          0   949.0     0.6

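Basically, these few lines of code answer all of those questions, and in my opinion the syntax is simple too. From our DT, we only want to look at observations where blockcode=="test". Next, we want to run all the analyses separately for each subject, which is easily done with the by="subject" argument. The cool thing: if you want to split along several dimensions, just add them. Instead of ignoring the practice trials, let's look at them separately: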

DT[, 
   list(TotalNr = .N,
        NrTrailLat = sum(latency < 300),
        MeanLat = mean(latency),
        MeanCor = mean(correct)), 
   by="subject,blockcode"]
   subject blockcode TotalNr NrTrailLat  MeanLat   MeanCor
1:      15  practice       3          0 2886.667 0.6666667
2:      15      test       5          0 1820.600 0.8000000
3:      16  practice       3          0 1402.000 1.0000000
4:      16      test       5          0  949.000 0.6000000

Now don't tell me that's not awesome!
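
The question's remaining step (keep only trials within 3 standard deviations of the subject's mean latency, then compute the mean latency per trial type) could be sketched along the same lines. This is a minimal illustration, not a definitive solution: the table `test` is rebuilt from the test trials in the sample data, and the names `SubjMean`, `SubjSD`, `trimmed`, `long` and `wide` are my own.

```r
library(data.table)

# Test trials only, copied from the question's sample data
test <- data.table(
  subject   = rep(c(15, 16), each = 5),
  trialtype = c("cueswitch", "cuerepetition", "taskswitch", "cueswitch", "taskswitch",
                "cueswitch", "cuerepetition", "cueswitch", "taskswitch", "cuerepetition"),
  latency   = c(2239, 1244, 1472, 1877, 2271, 1045, 969, 857, 1038, 836)
)

# Add each subject's own mean and SD as columns (by reference)
test[, `:=`(SubjMean = mean(latency), SubjSD = sd(latency)), by = subject]

# Keep trials within 3 SD of the subject's mean
trimmed <- test[abs(latency - SubjMean) <= 3 * SubjSD]

# Mean latency per subject and trial type (long format)
long <- trimmed[, list(MeanLat = mean(latency)), by = list(subject, trialtype)]

# Reshape to one row per subject with one column per trial type
wide <- dcast(long, subject ~ trialtype, value.var = "MeanLat")
wide
```

With only five test trials per subject, no trial in this sample can fall outside 3 SD, so nothing is dropped here; the filter only matters on the full data set.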

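Let's try another one:

Also, create variables containing the last (or first) value of date and subjectID (this is so that the date and subjectID can be put in the new data frame).

I'm not sure what you mean here, because date does not change within a subject in your example. So let's make it a little harder. Suppose we want to know the latency of the first trial for each combination of subject and blockcode. To do that, we should first sort DT so that we know the first row of each group always has trialn equal to 1. (For this sample data that isn't really necessary, because it appears to be sorted already.)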

setkey(DT, subject, blockcode, trialn)
DT[, list(FirstLat = latency[1]) , by="subject,blockcode"]
   subject blockcode FirstLat
1:      15  practice     1765
2:      15      test     2239
3:      16  practice     1377
4:      16      test     1045

However, you wanted to add this as a new column to DT. For that, you can use the := operator:

DT[, FirstLat := latency[1] , by="subject,blockcode"]  
DT
     date subject trialn blockcode     trialtype latency response correct FirstLat
 1: 32913      15      1  practice    taskswitch    1765      205       1     1765
 2: 32913      15      2  practice     cueswitch    4372      203       1     1765
 3: 32913      15      3  practice cuerepetition    2523      203       0     1765
 4: 32913      15      1      test     cueswitch    2239      205       1     2239
 5: 32913      15      2      test cuerepetition    1244      203       1     2239
 6: 32913      15      3      test    taskswitch    1472      203       0     2239
 7: 32913      15      4      test     cueswitch    1877      205       1     2239
 8: 32913      15      5      test    taskswitch    2271      203       1     2239
 9: 30413      16      1  practice    taskswitch    1377      203       1     1377
9: 30413      16      1  practice    taskswitch    1377      203       1     1377
10: 30413      16      2  practice    taskswitch    1648      203       1     1377
11: 30413      16      3  practice     cueswitch    1181      205       1     1377
12: 30413      16      1      test     cueswitch    1045      205       1     1045
13: 30413      16      2      test cuerepetition     969      203       0     1045
14: 30413      16      3      test     cueswitch     857      203       1     1045
15: 30413      16      4      test    taskswitch    1038      205       1     1045
16: 30413      16      5      test cuerepetition     836      203       0     1045
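
If the goal is simply to carry date and subject into the per-subject summary, one option is to group by both columns. The sketch below assumes, as in the sample data, that date is constant within each subject; the objects `summary_dt` and `result` are hypothetical names, and the table is rebuilt here so the example runs on its own.

```r
library(data.table)

# Sample data from the question (only the columns needed here)
DT <- data.table(
  date      = rep(c(32913, 30413), each = 8),
  subject   = rep(c(15, 16), each = 8),
  blockcode = rep(c(rep("practice", 3), rep("test", 5)), times = 2),
  latency   = c(1765, 4372, 2523, 2239, 1244, 1472, 1877, 2271,
                1377, 1648, 1181, 1045,  969,  857, 1038,  836),
  correct   = c(1, 1, 0, 1, 1, 0, 1, 1,
                1, 1, 1, 1, 0, 1, 1, 0)
)

# Grouping by subject and date keeps both as columns in the summary
summary_dt <- DT[blockcode == "test",
                 list(TotalNr    = .N,
                      NrTrailLat = sum(latency < 300),
                      MeanLat    = mean(latency),
                      MeanCor    = mean(correct)),
                 by = list(subject, date)]

# Convert to a plain data frame if that is what downstream code expects
result <- as.data.frame(summary_dt)
result
```

If date could vary within a subject, taking date[1] inside the list (after sorting) would be the alternative, in the same way FirstLat is computed above.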

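So these are just a few ideas to get you started. I did this because I wanted to show you that most things become very easy once you know the basics. That should be motivation to work through the manual, which may feel a bit overwhelming at first. But it is worth it, trust me! I haven't even mentioned the best part: data.table is also very fast. Good luck with your analysis.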
