r - Why does as.matrix add extra spaces when converting numbers to character?

Question

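If you use apply over the rows of a data.frame that has both character and numeric columns, apply internally uses as.matrix to convert the data.frame to a character matrix. But if the numeric column contains numbers of different lengths, as.matrix pads them with spaces to match the widest ("longest") number.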

An example:

df <- data.frame(id1=c(rep("a",3)),id2=c(100,90,8), stringsAsFactors = FALSE) 
df
##   id1 id2
## 1   a 100
## 2   a  90
## 3   a   8
as.matrix(df)
##      id1 id2  
## [1,] "a" "100"
## [2,] "a" " 90"
## [3,] "a" "  8"

I would have expected the result to be:

     id1 id2  
[1,] "a" "100"
[2,] "a" "90"
[3,] "a" "8"

Why the extra spaces?

These spaces can produce unexpected results when using apply on a data.frame:

myfunc <- function(row){
  paste(row[1], row[2], sep = "")
}
> apply(df, 1, myfunc)
[1] "a100" "a 90" "a  8"

Whereas a loop gives the expected result:

> for (i in 1:nrow(df)){
  print(myfunc(df[i,]))
}
[1] "a100"
[1] "a90"
[1] "a8"

and so does

> paste(df[,1], df[,2], sep = "")
[1] "a100" "a90"  "a8"

Is there any situation in which the extra spaces added by as.matrix are useful?

score 23 · Accepted Answer

This is because of the way non-numeric data are converted in the as.matrix.data.frame method. There is a simple workaround, shown below.

Details

?as.matrix notes that the conversion is done via format(), and this is where the extra spaces are added. Specifically, ?as.matrix has this in the Details section:

 ‘as.matrix’ is a generic function.  The method for data frames
 will return a character matrix if there is only atomic columns and
 any non-(numeric/logical/complex) column, applying ‘as.vector’ to
 factors and ‘format’ to other non-character columns.  Otherwise,
 the usual coercion hierarchy (logical < integer < double <
 complex) will be used, e.g., all-logical data frames will be
 coerced to a logical matrix, mixed logical-integer will give a
 integer matrix, etc.

?format also notes that

character strings are padded with blanks to the display width of the widest.

Consider this example, which illustrates the behaviour:

> format(df[,2])
[1] "100" " 90" "  8"
> nchar(format(df[,2]))
[1] 3 3 3

format does not have to work this way, because it has a trim argument:

trim: logical; if ‘FALSE’, logical, numeric and complex values are
      right-justified to a common width: if ‘TRUE’ the leading
      blanks for justification are suppressed.

For example:

> format(df[,2], trim = TRUE)
[1] "100" "90"  "8"

But there is no way to pass this argument on to the as.matrix.data.frame method.
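To check that the padding comes from format() itself rather than from anything specific to as.matrix, a small experiment along these lines may help. It is only a sketch; the vector x is made up for illustration and is not from the question.

```r
# Hypothetical example vector (an assumption, not from the question)
x <- c(1.5, 10, 100)

padded  <- format(x)               # right-justified to a common width
trimmed <- format(x, trim = TRUE)  # same formatting, leading blanks suppressed

print(padded)
print(trimmed)

# Every padded string has the same number of characters
stopifnot(length(unique(nchar(padded))) == 1)

# trim = TRUE only drops the leading blanks; the digits are unchanged
stopifnot(identical(trimmed, sub("^ +", "", padded)))
```

Note that format() still formats the whole vector with a common number of decimal places, so format(x, trim = TRUE) is not necessarily the same as as.character(x).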

Workaround

One way to work around this is to apply format() yourself, via sapply. There you can pass in trim = TRUE:

> sapply(df, format, trim = TRUE)
     id1 id2  
[1,] "a" "100"
[2,] "a" "90" 
[3,] "a" "8"

Alternatively, with vapply we can state what we expect to be returned (here, character vectors of length 3 [nrow(df)]):

> vapply(df, format, FUN.VALUE = character(nrow(df)), trim = TRUE)
     id1 id2  
[1,] "a" "100"
[2,] "a" "90" 
[3,] "a" "8"
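The workaround can be wrapped in a small helper so that apply receives an unpadded character matrix. This is a minimal sketch: as_char_matrix is a hypothetical name, not a base R function, and the choice to leave character and factor columns unformatted is an assumption.

```r
# Hypothetical helper (not part of base R): build a character matrix
# from a data frame without the padding that as.matrix() introduces
as_char_matrix <- function(d) {
  cols <- lapply(d, function(col) {
    if (is.factor(col)) {
      as.character(col)          # use the labels, not the integer codes
    } else if (is.character(col)) {
      col                        # already character, leave untouched
    } else {
      format(col, trim = TRUE)   # format without leading blanks
    }
  })
  do.call(cbind, cols)           # character columns bind to a character matrix
}

df <- data.frame(id1 = rep("a", 3), id2 = c(100, 90, 8),
                 stringsAsFactors = FALSE)

m <- as_char_matrix(df)
res <- apply(m, 1, paste, collapse = "")
print(res)
```

Unlike as.matrix, this sketch does not restore NA for missing values; format() turns them into the string "NA".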

score 9 · Accepted Answer

It does seem a bit odd. The manual (?as.matrix) explains that format is called for the conversion to a character matrix:

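The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column, applying as.vector to factors and format to other non-character columns.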

You can see that if you call format directly, it does what as.matrix does:

format(df$id2)
[1] "100" " 90" "  8"

What you need to do is pass the trim argument:

format(df$id2,trim=TRUE)
[1] "100" "90"  "8"

Unfortunately, the as.matrix.data.frame function does not allow you to do that:

else if (non.numeric) {
    for (j in pseq) {
        if (is.character(X[[j]])) 
            next
        xj <- X[[j]]
        miss <- is.na(xj)
        xj <- if (length(levels(xj))) 
            as.vector(xj)
        else format(xj) # This could have ... as an argument
        # else format(xj,...)
        is.na(xj) <- miss
        X[[j]] <- xj
    }
}

So you could modify as.matrix.data.frame. I think it would be a nice feature to add to base, though.

However, a quick solution is simply:

as.matrix(data.frame(lapply(df,as.character)))
     id1 id2  
[1,] "a" "100"
[2,] "a" "90" 
[3,] "a" "8"  
# As mentioned in the comments, this also works:
sapply(df,as.character)
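One caveat with the sapply variant, shown here as a small sketch: sapply simplifies its result, so a data frame with a single row gives a named character vector instead of a matrix, whereas the as.matrix(data.frame(lapply(...))) form keeps the matrix shape. The stringsAsFactors = FALSE arguments are added here only to make the behaviour the same across R versions.

```r
df <- data.frame(id1 = rep("a", 3), id2 = c(100, 90, 8),
                 stringsAsFactors = FALSE)
one_row <- df[1, ]

# sapply() simplifies length-1 results to a plain named vector
v <- sapply(one_row, as.character)

# Going through data.frame() and as.matrix() keeps a 1 x 2 matrix
m <- as.matrix(data.frame(lapply(one_row, as.character),
                          stringsAsFactors = FALSE))

stopifnot(is.null(dim(v)))                 # no dim attribute: a vector
stopifnot(identical(dim(m), c(1L, 2L)))    # still a matrix
```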

score 6 · Accepted Answer

as.matrix calls format internally:

> format(df$id2)
[1] "100" " 90" "  8"

That is where the extra spaces come from. format has an additional argument, trim, to remove them:

> format(df$id2, trim = TRUE)
[1] "100" "90"  "8"

However, you cannot supply this argument to as.matrix.
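Since the argument cannot be supplied, another option is to skip the matrix conversion and work on the columns directly. The following is a sketch of that idea for the pasting example from the question; the anonymous function stands in for any row-wise function of the two columns.

```r
df <- data.frame(id1 = rep("a", 3), id2 = c(100, 90, 8),
                 stringsAsFactors = FALSE)

# Vectorised: paste0() works on whole columns, no as.matrix() involved
res1 <- paste0(df$id1, df$id2)

# Row-wise function of several columns, still without as.matrix();
# USE.NAMES = FALSE stops mapply() from naming the result after id1
res2 <- mapply(function(a, b) paste0(a, b), df$id1, df$id2,
               USE.NAMES = FALSE)

print(res1)
stopifnot(identical(res1, c("a100", "a90", "a8")))
stopifnot(identical(res1, res2))
```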

score 1 · Accepted Answer

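The reason for this behaviour has already been explained in the previous answers, but I would like to offer another way of circumventing it: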

df <- data.frame(id1=c(rep("a",3)),id2=c(100,90,8), stringsAsFactors = FALSE) 
do.call(cbind,df)
     id1 id2  
[1,] "a" "100"
[2,] "a" "90" 
[3,] "a" "8"

Note that this does not work with stringsAsFactors = TRUE, because cbind replaces a factor by its integer codes, so the result is a numeric matrix.
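The factor caveat can be seen, and avoided, along these lines. This is a sketch only; converting factor columns with as.character before binding is one possible fix, not the only one.

```r
# Same data as above, but with id1 stored as a factor
df_f <- data.frame(id1 = factor(rep("a", 3)), id2 = c(100, 90, 8))

# cbind() replaces the factor by its integer codes: numeric matrix
m_num <- do.call(cbind, df_f)
stopifnot(is.numeric(m_num))

# Convert factor columns to character first, then bind
df_f[] <- lapply(df_f, function(col) {
  if (is.factor(col)) as.character(col) else col
})
m_chr <- do.call(cbind, df_f)

print(m_chr)
stopifnot(is.character(m_chr))
stopifnot(identical(m_chr[, "id2"], c("100", "90", "8")))
```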

score 0 · Accepted Answer

Another solution: if you do not mind installing a package, trimWhiteSpace(x) (from the limma R package) also does the job.

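# biocLite() has been retired; BiocManager is the current Bioconductor installer
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("limma")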
library(limma)
df <- data.frame(id1=c(rep("a",3)),id2=c(100,90,8), stringsAsFactors = FALSE) 
as.matrix(df)
     id1 id2  
[1,] "a" "100"
[2,] "a" " 90"
[3,] "a" "  8"

trimWhiteSpace(as.matrix(df))
     id1 id2  
[1,] "a" "100"
[2,] "a" "90" 
[3,] "a" "8"
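If installing a package is not desirable, base R's trimws() (available since R 3.2.0) can be used in much the same way. A minimal sketch, assuming it is acceptable to strip leading and trailing whitespace from every cell, including any that was present in the original character columns:

```r
df <- data.frame(id1 = rep("a", 3), id2 = c(100, 90, 8),
                 stringsAsFactors = FALSE)

m <- as.matrix(df)   # padded: "100", " 90", "  8"
m[] <- trimws(m)     # assigning into m[] keeps dim and dimnames

print(m)
stopifnot(identical(unname(m[, "id2"]), c("100", "90", "8")))
```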
