r - "subset" and "[" on a data frame give slightly different results. Why?

Question

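Can someone explain why I get different results from the last two calls below (the apply() calls that use identical())? The two objects appear to be identical, but I run into trouble when I use them in an apply function: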

df <- data.frame(a = 1:5, b = 6:2, c = rep(7,5))
df_ab <- df[,c(1,2)]
df_AB <- subset(df, select = c(1,2))
identical(df_ab,df_AB)
[1] TRUE

apply(df_ab,2,function(x) identical(1:5,x))
    a     b 
TRUE FALSE

apply(df_AB,2,function(x) identical(1:5,x))
    a     b 
FALSE FALSE

score 13 · Accepted Answer

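Before calling the function on each column, apply() coerces its first argument to a matrix, so your data frame is converted to a matrix object. The result of that conversion is that as.matrix(df_AB) has non-NULL row names, while as.matrix(df_ab) does not: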

> str(as.matrix(df_ab))
 int [1:5, 1:2] 1 2 3 4 5 6 5 4 3 2
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:2] "a" "b"
> str(as.matrix(df_AB))
 int [1:5, 1:2] 1 2 3 4 5 6 5 4 3 2
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:5] "1" "2" "3" "4" ...
  ..$ : chr [1:2] "a" "b"

So when apply() extracts a column of df_AB, you get a named vector, which is not identical to an unnamed vector.

apply(df_AB, 2, str)
 Named int [1:5] 1 2 3 4 5
 - attr(*, "names")= chr [1:5] "1" "2" "3" "4" ...
 Named int [1:5] 6 5 4 3 2
 - attr(*, "names")= chr [1:5] "1" "2" "3" "4" ...
NULL
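The effect of the names can be reproduced without any data frame. This is only a minimal sketch; the variable names plain and named are made up for illustration. identical() compares attributes as well as values, so a names attribute alone is enough to make it return FALSE:

```r
plain <- 1:5
named <- setNames(1:5, as.character(1:5))  # same values, plus a names attribute

identical(plain, named)           # FALSE: the names attribute differs
identical(plain, unname(named))   # TRUE once the names are dropped
all(plain == named)               # TRUE: == compares the values only
```

In the same spirit, wrapping the column in unname() inside the function passed to apply() should sidestep the problem.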

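Contrast that with the subset() function, which selects the rows using a logical vector as the value of i. It looks like subsetting a data.frame with a non-missing value for i is what causes this difference in the row.names attribute: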

> str(as.matrix(df[1:5, 1:2]))
 int [1:5, 1:2] 1 2 3 4 5 6 5 4 3 2
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:5] "1" "2" "3" "4" ...
  ..$ : chr [1:2] "a" "b"
> str(as.matrix(df[, 1:2]))
 int [1:5, 1:2] 1 2 3 4 5 6 5 4 3 2
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:2] "a" "b"

You can see all the gory details of the differences between the data.frames with .Internal(inspect(x)). Take a look yourself if you are interested.

As Roland pointed out in his comment, you can use the .row_names_info() function to see just the difference in the row names.

Note that the result of .row_names_info() is negative when i is missing, but positive when you subset with a non-missing i.

> .row_names_info(df_ab, type=1)
[1] -5
> .row_names_info(df_AB, type=1)
[1] 5

The meaning of these values is explained in ?.row_names_info:

type: integer.  Currently ‘type = 0’ returns the internal
      ‘"row.names"’ attribute (possibly ‘NULL’), ‘type = 2’ the
      number of rows implied by the attribute, and ‘type = 1’ the
      latter with a negative sign for ‘automatic’ row names.
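
If the goal is simply to get rid of the non-automatic row names, one option is to reset them. The following is a sketch rather than a definitive fix; df_fixed is a name introduced here for illustration. Assigning NULL to the row names asks R to restore automatic row names, after which apply() should no longer see named columns:

```r
df <- data.frame(a = 1:5, b = 6:2, c = rep(7, 5))
df_AB <- subset(df, select = c(1, 2))

# Assigning NULL restores automatic row names.
df_fixed <- df_AB
rownames(df_fixed) <- NULL

# Negative value: the row names are flagged as automatic again.
.row_names_info(df_fixed, type = 1L)

# apply() now receives unnamed columns.
res <- apply(df_fixed, 2, function(x) identical(1:5, x))
res
```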

score 8 · Accepted Answer

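If you want to compare the values 1:5 with the values in the columns, you should not use apply, because apply converts the data frame to a matrix before applying the function. Because of the row names in the subset created with subset(), which calls [ internally with a non-missing i (see @Joshua Ulrich's answer), the values 1:5 are not identical to a named vector containing the same values.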

You should use sapply instead to apply the identical function to the columns. This avoids converting the data frame to a matrix:

> sapply(df_ab, identical, 1:5)
    a     b 
 TRUE FALSE 
> sapply(df_AB, identical, 1:5)
    a     b 
 TRUE FALSE

As you can see, in both data frames the values in the first column are identical to 1:5.
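
A stricter variant of the same idea, offered as a sketch rather than as part of the original answer, is vapply(). It also iterates over the columns as list elements, so no matrix coercion takes place, and it additionally checks that every call returns a single logical value:

```r
df <- data.frame(a = 1:5, b = 6:2, c = rep(7, 5))
df_AB <- subset(df, select = c(1, 2))

# logical(1) declares the expected shape of each result;
# the trailing 1:5 is passed on as the second argument of identical().
res <- vapply(df_AB, identical, logical(1), 1:5)
res
#     a     b
#  TRUE FALSE
```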

score 5 · Accepted Answer

In one version (using [) your columns are integers, while in the other version (using subset) your columns are named integers.

apply(df_ab, 2, str)

 int [1:5] 1 2 3 4 5
 int [1:5] 6 5 4 3 2
NULL


apply(df_AB, 2, str)

 Named int [1:5] 1 2 3 4 5
 - attr(*, "names")= chr [1:5] "1" "2" "3" "4" ...
 Named int [1:5] 6 5 4 3 2
 - attr(*, "names")= chr [1:5] "1" "2" "3" "4" ...
NULL

score 3 · Accepted Answer

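Looking at the structure of the two objects before they are passed to apply shows only one difference: the row names. It is not a difference I would have expected to produce the result you are seeing. I don't see Joshua's current description of "subset" as logical indexing as explaining this. Why row.names = c(NA, 5L), rather than c(NA, -5L), should produce a named result when extracting with "[" is not clear.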

> dput(df_AB)
structure(list(a = 1:5, b = c(6L, 5L, 4L, 3L, 2L)), .Names = c("a", 
"b"), row.names = c(NA, 5L), class = "data.frame")
> dput(df_ab)
structure(list(a = 1:5, b = c(6L, 5L, 4L, 3L, 2L)), .Names = c("a", 
"b"), class = "data.frame", row.names = c(NA, -5L))

I agree that it is the as.matrix coercion that needs further investigation:

> attributes(df_AB[,1])
NULL
> attributes(df_ab[,1])
NULL
> attributes(as.matrix(df_AB)[,1])
$names
[1] "1" "2" "3" "4" "5"
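
One way to probe the coercion further is the rownames.force argument of as.matrix() for data frames. The sketch below assumes you are happy to do the conversion by hand before calling apply(); the name m is just for illustration. With rownames.force = FALSE the matrix gets no row names, so the extracted columns carry no names attribute:

```r
df <- data.frame(a = 1:5, b = 6:2, c = rep(7, 5))
df_AB <- subset(df, select = c(1, 2))

# Convert explicitly and suppress row names on the resulting matrix.
m <- as.matrix(df_AB, rownames.force = FALSE)

attributes(m[, 1])   # NULL: the column is a plain integer vector

# apply() on the matrix now compares unnamed vectors.
res <- apply(m, 2, function(x) identical(1:5, x))
res
```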
