
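This is the code I am using in R to get the standard deviation of the numeric columns of my dataset. However, the for loop finishes without showing any output. What is wrong with my code? I am sure my dataset has numeric columns.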

for(col in colnames(stats)){
  if(is.numeric(stats[, col])){
    cat(paste(col, "sd is ", as.character(round(sd(stats[, col]), 2)), '\n'))
  }
}

Structure of my data frame (stats):

> str(stats)
tibble [3,145 x 11] (S3: tbl_df/tbl/data.frame)
 $ Name                    : chr [1:3145] "A. Urzi" "V. Castellanos" "E. Palacios" "L. Martínez" ...
 $ Age                     : num [1:3145] 19 20 20 21 21 21 21 21 21 21 ...
 $ Nationality             : chr [1:3145] "Argentina" "Argentina" "Argentina" "Argentina" ...
 $ Club                    : chr [1:3145] "Club Athletico Banfield" "New York City FC" "River Plate" "Ajax" ...
 $ Overall                 : num [1:3145] 69 63 77 77 68 73 81 66 66 78 ...
 $ Potential               : num [1:3145] 87 80 87 85 81 87 89 76 79 87 ...
 $ International Reputation: num [1:3145] 1 1 1 1 1 1 1 1 1 1 ...
 $ Skill Moves             : num [1:3145] 3 3 4 3 3 2 4 2 2 4 ...
 $ Team Position           : chr [1:3145] "Attacker" "Attacker" "Midfielder" "Defender" ...
 $ Contract Valid Until    : num [1:3145] 2021 2022 2021 2023 2019 ...
 $ Value in Euros          : num [1:3145] 2.3e+06 8.0e+05 1.4e+07 1.2e+07 1.7e+06 8.0e+06 2.7e+07 9.5e+05 1.2e+06 1.6e+07 ...

> dput(head(stats))
structure(list(Name = c("A. Urzi", "V. Castellanos", "E. Palacios", 
"L. Martínez", "F. Moyano", "C. Romero"), Age = c(19, 20, 20, 
21, 21, 21), Nationality = c("Argentina", "Argentina", "Argentina", 
"Argentina", "Argentina", "Argentina"), Club = c("Club Athletico Banfield", 
"New York City FC", "River Plate", "Ajax", "Argentinos Juniors", 
"Genoa"), Overall = c(69, 63, 77, 77, 68, 73), Potential = c(87, 
80, 87, 85, 81, 87), `International Reputation` = c(1, 1, 1, 
1, 1, 1), `Skill Moves` = c(3, 3, 4, 3, 3, 2), `Team Position` = c("Attacker", 
"Attacker", "Midfielder", "Defender", "Midfielder", "Defender"
), `Contract Valid Until` = c(2021, 2022, 2021, 2023, 2019, 2024
), `Value in Euros` = c(2300000, 8e+05, 1.4e+07, 1.2e+07, 1700000, 
8e+06)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))

1 Answer


The source of the confusion here is that stats is a tibble:

class(stats)
[1] "tbl_df"     "tbl"        "data.frame"

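When you subset a tibble (i.e. a tbl_df object) with [, the result is another tibble object, even when you select a single column. Consider what this means for one of the numeric columns, Overall: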

class(stats[, "Overall"])
[1] "tbl_df"     "tbl"        "data.frame"

This differs from a data.frame:

class(as.data.frame(stats)[, "Overall"])
[1] "numeric"

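This is because the default behavior when subsetting a data.frame in base R is to simplify any result that consists of a single column to a vector. We can avoid this behavior with drop = FALSE: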

class(as.data.frame(stats)[, "Overall", drop = FALSE])
[1] "data.frame"

Likewise, and perhaps unexpectedly:

is.numeric(stats[, "Overall"])
[1] FALSE
is.numeric(as.data.frame(stats)[, "Overall"])
[1] TRUE
is.numeric(as.data.frame(stats)[, "Overall", drop = FALSE])
[1] FALSE

And for good measure, though it may add to the confusion, look at what happens when you subset with double brackets, [[:

class(stats[["Overall"]])
[1] "numeric"
is.numeric(stats[["Overall"]])
[1] TRUE

So, if you want to use your code "as is", you can convert the tbl_df to a plain data.frame in place:

for(col in colnames(stats)) {
  if(is.numeric(as.data.frame(stats)[, col])) {
    cat(paste(col, "sd is", round(sd(as.data.frame(stats)[, col]), 2), '\n'))
  }
}

Alternatively, you can use [[:

for(col in colnames(stats)) {
  if(is.numeric(stats[[col]])) {
    cat(paste(col, "sd is", round(sd(stats[[col]]), 2), '\n'))
  }
}
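
The same idea can also be written without an explicit loop. The following is only a minimal sketch in base R: stats_head is a hypothetical stand-in built from three of the columns in the dput output above, and it is a plain data.frame so the snippet runs without extra packages. Because vapply() walks over the columns as list elements, the same calls should behave identically on a tibble.

```r
# Hypothetical stand-in for the first six rows of `stats` (three columns only).
stats_head <- data.frame(
  Name    = c("A. Urzi", "V. Castellanos", "E. Palacios",
              "L. Martínez", "F. Moyano", "C. Romero"),
  Age     = c(19, 20, 20, 21, 21, 21),
  Overall = c(69, 63, 77, 77, 68, 73),
  stringsAsFactors = FALSE
)

# Test each column as a list element, so no [ , ] simplification is involved.
is_num <- vapply(stats_head, is.numeric, logical(1))

# Standard deviation of every numeric column, as a named numeric vector.
sds <- vapply(stats_head[is_num], sd, numeric(1))

for (col in names(sds)) {
  cat(col, "sd is", round(sds[[col]], 2), "\n")
}
```

Using vapply() rather than sapply() fixes the return type in advance, so a column that unexpectedly yields something other than a single number raises an error instead of silently changing the shape of the result.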

Finally, since I assume you are using the tidyverse because this data is formatted as a tibble, a more tidyverse-flavored approach might be:

library(dplyr)
library(glue)

stats %>%
  summarise_if(is.numeric, sd) %>% 
  glue_data("{colnames(.)} sd is {round(., 2)}")

Age sd is 0.82
Overall sd is 5.53
Potential sd is 3.21
International Reputation sd is 0
Skill Moves sd is 0.63
Contract Valid Until sd is 1.75
Value in Euros sd is 5690577.01
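
In dplyr 1.0.0 and later, summarise_if() is superseded by across(). A possible equivalent is sketched below. It assumes dplyr >= 1.0.0 is installed, and stats_head is a hypothetical stand-in holding three of the columns from the dput output above. It also prints with base cat() instead of glue, purely to keep the sketch to one package.

```r
library(dplyr)

# Hypothetical stand-in for the first six rows of `stats` (three columns only).
stats_head <- tibble(
  Name    = c("A. Urzi", "V. Castellanos", "E. Palacios",
              "L. Martínez", "F. Moyano", "C. Romero"),
  Age     = c(19, 20, 20, 21, 21, 21),
  Overall = c(69, 63, 77, 77, 68, 73)
)

# One-row tibble holding the standard deviation of each numeric column.
sds <- stats_head %>%
  summarise(across(where(is.numeric), sd))

# Flatten the one-row tibble to a named vector and build one line per column.
sd_lines <- paste(names(sds), "sd is", round(unlist(sds), 2))
cat(sd_lines, sep = "\n")
```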

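The lesson to learn here is that if you want to subset a data.frame with [ , ], you should get in the habit of using drop = FALSE. Here is a nice blog post with more details and an explanation of why.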
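
To illustrate why the habit matters, here is a small sketch with a made-up base R data.frame (df and cols are names invented for this example). When the column selection happens to contain a single name, the simplified result is no longer a data frame, so code written for data frames can quietly misbehave.

```r
# Made-up data for illustration only.
df <- data.frame(Age = c(19, 20, 20), Overall = c(69, 63, 77))

cols <- "Age"  # a selection that happens to contain exactly one column

simplified <- df[, cols]                # simplified to a plain numeric vector
kept       <- df[, cols, drop = FALSE]  # still a one-column data.frame

is.data.frame(simplified)  # FALSE
is.data.frame(kept)        # TRUE

nrow(simplified)  # NULL, because a vector has no rows
nrow(kept)        # 3
```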

Answered 2020-05-19T17:19:59.190