r - What's the biggest R-gotcha you've run across?

Question

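Is there a certain R-gotcha that had you really surprised one day? I think we'd all gain from sharing these.

Here's mine: in list indexing, my.list[[1]] is not my.list[1]. Learned this in the early days of R.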

score 43 · Accepted Answer

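When using a sequence as an index for iteration, it's better to use the seq_along() function than something like 1:length(x).

Here I create a vector, and both approaches return the same thing: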

> x <- 1:10
> 1:length(x)
 [1]  1  2  3  4  5  6  7  8  9 10
> seq_along(x)
 [1]  1  2  3  4  5  6  7  8  9 10

Now make the vector NULL:

> x <- NULL
> seq_along(x) # returns an empty integer; good behavior
integer(0)
> 1:length(x) # wraps around and returns a sequence; this is bad
[1] 1 0

This can cause some confusion in a loop:

> for(i in 1:length(x)) print(i)
[1] 1
[1] 0
> for(i in seq_along(x)) print(i)
>
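A minimal sketch of the same pitfall inside a function. The helper count_iterations() is hypothetical (it is not part of the answer) and only counts how often a loop body runs; seq_len(length(x)) is shown as a related base R alternative.

```r
# Hypothetical helper: counts how many times a for loop body runs.
count_iterations <- function(idx) {
  n <- 0
  for (i in idx) n <- n + 1
  n
}

x <- NULL
count_iterations(1:length(x))         # 2: the loop runs for i = 1 and i = 0
count_iterations(seq_along(x))        # 0: the loop body is skipped
count_iterations(seq_len(length(x)))  # 0: seq_len() is safe as well
```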

score 36 · Accepted Answer

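Automatically creating factors when loading data. You unthinkingly treat a column in a data frame as characters, and this works well until you try to change a value to one that isn't a level. This generates a warning but leaves your data frame with NAs in it...

When something goes unexpectedly wrong in your R script, check that factors aren't to blame.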
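A small sketch of the failure, with made-up data. stringsAsFactors = TRUE is passed explicitly here because, since R 4.0.0, data.frame() and read.csv() no longer create factors by default.

```r
# stringsAsFactors = TRUE is set explicitly to reproduce the factor column.
df <- data.frame(city = c("Oslo", "Rome", "Oslo"), stringsAsFactors = TRUE)
df$city[2] <- "Paris"      # warning: invalid factor level, NA generated
is.na(df$city[2])          # TRUE

# One way around it: convert to character before editing.
df2 <- data.frame(city = c("Oslo", "Rome", "Oslo"), stringsAsFactors = TRUE)
df2$city <- as.character(df2$city)
df2$city[2] <- "Paris"
df2$city                   # "Oslo" "Paris" "Oslo"
```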

score 32 · Accepted Answer

Forgetting the drop=FALSE argument when subsetting a matrix down to a single dimension, and thereby dropping the object class as well:

R> X <- matrix(1:4,2)
R> X
     [,1] [,2]
[1,]    1    3
[2,]    2    4
R> class(X)
[1] "matrix"
R> X[,1]
[1] 1 2
R> class(X[,1])
[1] "integer"
R> X[,1, drop=FALSE]
     [,1]
[1,]    1
[2,]    2
R> class(X[,1, drop=FALSE])
[1] "matrix"
R>

score 32 · Accepted Answer

Removing rows from a data frame can lead to non-uniquely named rows being added, which then errors out:

> a<-data.frame(c(1,2,3,4),c(4,3,2,1))
> a<-a[-3,]
> a
  c.1..2..3..4. c.4..3..2..1.
1             1             4
2             2             3
4             4             1
> a[4,1]<-1
> a
Error in data.frame(c.1..2..3..4. = c("1", "2", "4", "1"), c.4..3..2..1. = c(" 4",  : 
  duplicate row.names: 4

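So what is going on here is:

A four-row data.frame is created, so the row names are c(1,2,3,4)
The third row is deleted, so the row names are c(1,2,4)
A fourth row is added, and R automatically sets the row name equal to the index, i.e. 4, so the row names are c(1,2,4,4). This is illegal because row names should be unique. I don't see why this type of behavior should be allowed by R. It seems to me that R should provide a unique row name.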

score 25 · Accepted Answer

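First, let me say that I understand the fundamental problems of representing numbers in a binary system. Nevertheless, one problem that I think could be easily improved is how numbers are displayed when the decimal value is beyond R's typical scope of presentation.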

x <- 10.2 * 100
x
1020
as.integer(x)
1019

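I don't mind if the result is displayed as an integer when it really can be represented as an integer. For example, if the value really was 1020, then printing that for x would be fine. But something as simple as 1020.0 when printing x in this case would have made it more obvious that the value was not an integer and not representable as one. R should default to some kind of indication when there is an extremely small decimal component that isn't shown.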
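A short sketch of ways to spot and sidestep the hidden fraction, using only base R. Rounding before converting is one possible workaround, not something the answer itself proposes.

```r
x <- 10.2 * 100
print(x, digits = 17)        # shows the digits hidden by the default of 7
x == 1020                    # FALSE
isTRUE(all.equal(x, 1020))   # TRUE: equal within numerical tolerance
as.integer(round(x))         # 1020: round first, then convert
```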

score 20 · Accepted Answer

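Having to allow for combinations of NA, NaN and Inf can be annoying. They behave differently, and tests for one won't necessarily work for the others: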

> x <- c(NA,NaN,Inf)
> is.na(x)
[1]  TRUE  TRUE FALSE
> is.nan(x)
[1] FALSE  TRUE FALSE
> is.infinite(x)
[1] FALSE FALSE  TRUE

However, the safest way to test for any of these trouble-makers is:

> is.finite(x)
[1] FALSE FALSE FALSE
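As a usage sketch with made-up values, is.finite() can serve as a single filter that drops NA, NaN and both infinities at once:

```r
x <- c(1.5, NA, NaN, Inf, -Inf, 2)
x[is.finite(x)]      # 1.5 2.0
sum(!is.finite(x))   # 4 problematic entries
```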

score 18 · Accepted Answer

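Always test what happens when you have an NA!

One thing that I always need to pay careful attention to (after many painful experiences) is NA values. R functions are easy to use, but no manner of programming will overcome issues with your data.

For instance, any net vector operation involving an NA is equal to NA. This is "surprising" on the face of it: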

> x <- c(1,1,2,NA)
> 1 + NA
[1] NA
> sum(x)
[1] NA
> mean(x)
[1] NA

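This gets extrapolated out into other higher-level functions.

In other words, missing values frequently carry as much weight as measured values by default. Many functions have na.rm=TRUE/FALSE defaults; it's worth spending some time deciding how to interpret these default settings.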
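For the vector used above, a minimal sketch of opting out of NA propagation with na.rm = TRUE; anyNA() is a base R check for whether any value is missing.

```r
x <- c(1, 1, 2, NA)
sum(x, na.rm = TRUE)    # 4
mean(x, na.rm = TRUE)   # 1.333333
anyNA(x)                # TRUE
```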

Edit 1: Marek makes a great point. NA values can also cause confusing behavior in indexes. For instance:

> TRUE && NA
[1] NA
> FALSE && NA
[1] FALSE
> TRUE || NA
[1] TRUE
> FALSE || NA
[1] NA

The same is true when you're trying to create a conditional expression (for an if statement):

> any(c(TRUE, NA))
[1] TRUE
> any(c(FALSE, NA))
[1] NA
> all(c(TRUE, NA))
[1] NA

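When these NA values end up as your vector indexes, many unexpected things can follow. This is all good behavior for R, because it means that you have to be careful with missing values. But it can cause major headaches at the beginning.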
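A small sketch, with made-up values, of what an NA in a logical index does, and of which() as one way to drop it:

```r
x <- c(10, 20, NA, 40)
x[x > 15]          # 20 NA 40: the NA in the condition yields an NA element
x[which(x > 15)]   # 20 40: which() keeps only the positions that are TRUE
```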

score 13 · Accepted Answer

Forgetting that strptime() and friends return POSIXt POSIXlt, where length() is always nine; converting to POSIXct helps:

R> length(strptime("2009-10-07 20:21:22", "%Y-%m-%d %H:%M:%S"))
[1] 9
R> length(as.POSIXct(strptime("2009-10-07 20:21:22", "%Y-%m-%d %H:%M:%S")))
[1] 1
R>

score 13 · Accepted Answer

The round function always rounds halves to the even number.

> round(3.5)
[1] 4  

> round(4.5)
[1] 4
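A minimal sketch of the round-half-to-even rule next to a hypothetical helper, round_half_up() (not a base R function), for cases where the schoolbook rule is wanted. The inputs are chosen to be exactly representable in binary.

```r
round(c(0.5, 1.5, 2.5))    # 0 2 2: halves go to the even neighbour

# Hypothetical helper: rounds halves away from zero.
round_half_up <- function(x) sign(x) * floor(abs(x) + 0.5)
round_half_up(c(0.5, 1.5, 2.5, -2.5))   # 1 2 3 -3
```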

answered 2011-05-26T16:58:07.507

score 12 · Accepted Answer

Integer math differs subtly from doubles (and sometimes complex is weird too)

UPDATE: They fixed some things in R 2.15

1^NA      # 1
1L^NA     # NA
(1+0i)^NA # NA 

0L %/% 0L # 0L  (NA from R 2.15)
0 %/% 0   # NaN
4L %/% 0L # 0L  (NA from R 2.15)
4 %/% 0   # Inf

score 11 · Accepted Answer

I'm surprised that no one has mentioned this, but:

T & F can be overridden, TRUE & FALSE can't.

Example:

x <- sample(c(0,1,NA), 100, T)
T <- 0:10

mean(x, na.rm=T)
# Warning in if (na.rm) x <- x[!is.na(x)] :
#   the condition has length > 1 and only the first element will be used
# Calls: mean -> mean.default
# [1] NA

plot(rnorm(7), axes=T)
# Warning in if (axes) { :
#   the condition has length > 1 and only the first element will be used
# Calls: plot -> plot.default
# Warning in if (frame.plot) localBox(...) :
#   the condition has length > 1 and only the first element will be used
# Calls: plot -> plot.default

[edit] ctrl+F tricked me. Shane mentions this in his comment.

score 8 · Accepted Answer

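Reading in data can be more problematic than you may think. Today I found that if you use read.csv() and a line in the .csv file is blank, read.csv() automatically skips it. This makes sense for most applications, but if you're automatically extracting data from (for example) row 27 of several thousand files, and some of the preceding rows may or may not be blank, things can go horribly wrong if you're not careful.

I now use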

data1 <- read.table(file_name, blank.lines.skip = FALSE, sep = ",")

When you're importing data, check again and again that you're doing what you actually think you're doing...
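A self-contained sketch of the difference, using an in-memory CSV (made-up data) through the text argument instead of a file. It assumes read.csv() forwards blank.lines.skip to read.table(), which its documented signature allows via the extra arguments.

```r
csv_text <- "a,b\n1,2\n\n3,4\n"   # the third line of the input is blank

skipped <- read.csv(text = csv_text)                            # default
kept    <- read.csv(text = csv_text, blank.lines.skip = FALSE)  # keep blanks

nrow(skipped)   # the blank line is silently dropped
nrow(kept)      # one row more: the blank line stays as a row
```

With the blank row kept, a fixed row number points at the same line of the file whether or not earlier lines were empty.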

score 8 · Accepted Answer

The tricky behaviour of the all.equal() function.

One of my recurring errors is comparing sets of floating point numbers. I have a CSV like:

... mu,  tau, ...
... 0.5, 1.7, ...

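Reading the file and trying to subset the data sometimes works and sometimes fails, of course because of falling into the pits of the floating point trap again and again. At first the data contains only integer values, then later it always turns into real values, you know the story. Comparing should be done with the all.equal() function instead of the == operator, but of course the code I first wrote used the latter approach.

Yeah, cool, but all.equal() returns TRUE for equal numbers, and a textual error message if the comparison fails: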

> all.equal(1,1)
[1] TRUE
> all.equal(1:10, 1:5)
[1] "Numeric: lengths (10, 5) differ"
> all.equal(1:10, c(1:5,1:5))
[1] "Mean relative difference: 0.625"

The solution is to use the isTRUE() function:

if (!isTRUE(all.equal(x, y, tolerance=doubleErrorRate))) {
    ...
}

How many times I had to read the all.equal() description...
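A compact sketch of the pattern with the classic 0.1 + 0.2 case. The helper is_close() and its default tolerance are assumptions made for illustration, not part of the answer.

```r
x <- 0.1 + 0.2
x == 0.3                     # FALSE: exact comparison of doubles
isTRUE(all.equal(x, 0.3))    # TRUE: comparison within a tolerance

# Hypothetical wrapper that always returns a single TRUE or FALSE.
is_close <- function(a, b, tol = 1e-8) {
  isTRUE(all.equal(a, b, tolerance = tol))
}
is_close(1:10, 1:5)          # FALSE rather than a message string
```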

score 7 · Accepted Answer

This one hurt so much that I spent hours adding comments to a bug report. I didn't get my wish, but at least the next version of R will generate an error.

R> nchar(factor(letters))
 [1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

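Update: As of R 3.2.0 (probably earlier), this example now generates an error message. As mentioned in the comments below, a factor is NOT a vector, and nchar() requires a vector.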

R> nchar(factor(letters))
Error in nchar(factor(letters)) : 'nchar()' requires a character vector
R> is.vector(factor(letters))
[1] FALSE

score 6 · Accepted Answer

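Accidentally listing the source code of a function by forgetting to include empty parentheses: e.g. "ls" versus "ls()"
true & false don't cut it as pre-defined constants, like in Matlab, C++, Java, Python; you must use TRUE & FALSE
Invisible return values: e.g. ".packages()" prints nothing, whereas "(.packages())" prints a character vector of package base names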

score 5 · Accepted Answer

For instance, the number 3.14 is a numerical constant, but the expressions +3.14 and -3.14 are calls to the functions + and -:

> class(quote(3.14))
[1] "numeric"
> class(quote(+3.14))
[1] "call"
> class(quote(-3.14))
[1] "call"

See Section 13.2 in John Chambers' book Software for Data Analysis - Programming with R

score 5 · Accepted Answer

Partial matching in the $ operator: this applies to lists, but also to data.frame

df1 <- data.frame(foo=1:10, foobar=10:1)
df2 <- data.frame(foobar=10:1)

df1$foo # Correctly gets the foo column
df2$foo # Expect NULL, but this returns the foobar column!!!

# So, should use double bracket instead:
df1[["foo"]]
df2[["foo"]]

The [[ operator also has an exact flag, but thankfully it is TRUE by default.

Partial matching also affects attr:

x1 <- structure(1, foo=1:10, foobar=10:1)
x2 <- structure(2, foobar=10:1)

attr(x1, "foo") # Correctly gets the foo attribute
attr(x2, "foo") # Expect NULL, but this returns the foobar attribute!!!

# So, should use exact=TRUE
attr(x1, "foo", exact=TRUE)
attr(x2, "foo", exact=TRUE)

score 5 · Accepted Answer

Automatic repetition ("recycling") of vectors used as indices:

R> all.numbers <- c(1:5)
R> all.numbers
[1] 1 2 3 4 5
R> good.idxs <- c(T,F,T)
R> #note unfortunate length mismatch
R> good.numbers <- all.numbers[good.idxs]
R> good.numbers
[1] 1 3 4
R> #wtf? 
R> #why would you repeat the vector used as an index 
R> #without even a warning?
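One possible guard, sketched with a hypothetical helper safe_subset() (not part of base R): check the index length explicitly so that a mismatch fails loudly instead of being recycled.

```r
# Hypothetical helper: refuses logical indices of the wrong length.
safe_subset <- function(x, keep) {
  stopifnot(is.logical(keep), length(keep) == length(x))
  x[keep]
}

safe_subset(1:5, c(TRUE, FALSE, TRUE, FALSE, FALSE))   # 1 3

# The mismatched index from the example above now raises an error.
res <- tryCatch(safe_subset(1:5, c(TRUE, FALSE, TRUE)),
                error = function(e) "length mismatch")
res   # "length mismatch"
```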

score 5 · Accepted Answer

Zero-length vectors have some quirks:

R> kk=vector(mode="numeric",length=0)
R> kk
numeric(0)
R> sum(kk)
[1] 0
R> var(kk)
[1] NA

score 4 · Accepted Answer

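When working with lists, there are a couple of unintuitive things:

Of course, the difference between [ and [[ takes some getting used to. For lists, [ returns a list of (potentially 1) elements, whereas [[ returns the element inside the list.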
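A quick sketch of that difference on a made-up two-element list:

```r
my.list <- list(a = 1:3, b = "text")

class(my.list[1])      # "list": a one-element list containing a
class(my.list[[1]])    # "integer": the element itself
length(my.list[1])     # 1
length(my.list[[1]])   # 3
```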

List creation:

# When you're used to this:
x <- numeric(5) # A vector of length 5 with zeroes
# ... this might surprise you
x <- list(5)    # A list with a SINGLE element: the value 5
# This is what you have to do instead:
x <- vector('list', 5) # A vector of length 5 with NULLS

So, how do you insert NULL into a list?

x <- list("foo", 1:3, letters, LETTERS) # A sample list
x[[2]] <- 1:5        # Put 1:5 in the second element
# The obvious way doesn't work: 
x[[2]] <- NULL       # This DELETES the second element!
# This doesn't work either: 
x[2] <- NULL       # This DELETES the second element!

# The solution is NOT very intuitive:
x[2] <- list(NULL) # Put NULL in the second element

# Btw, now that we think we know how to delete an element:
x <- 1:10
x[[2]] <- NULL  # Nope, gives an ERROR!
x <- x[-2]    # This is the only way for atomic vectors (works for lists too)

Finally, some advanced stuff like indexing through a nested list:

x <- list(a=1:3, b=list(c=42, d=13, e="HELLO"), f='bar')
x[[c(2,3)]] # HELLO (first selects the second element and then its third element)
x[c(2,3)]   # The second and third elements (b and f)

score 4 · Accepted Answer

One of the biggest confusions in R is that [i, drop = TRUE] does drop factor levels, but [i, j, drop = TRUE] does not!

> df = data.frame(a = c("europe", "asia", "oceania"), b = c(1, 2, 3))
> df$a[1:2, drop = TRUE]
[1] europe asia  
Levels: asia europe          <---- drops factor levels, works fine
> df[1:2,, drop = TRUE]$a
[1] europe asia  
Levels: asia europe oceania  <---- does not drop factor levels!

For more info see: drop = TRUE doesn't drop factor levels in data.frame while in vector it does

score 3 · Accepted Answer

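Coming from compiled languages and Matlab, I occasionally get confused about a fundamental aspect of functions in functional languages: they have to be defined before they're used! It's not enough for them just to be parsed by the R interpreter. This mostly rears its head when you use nested functions.

In Matlab you can do: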

function f1()
  v1 = 1;
  v2 = f2();
  fprintf('2 == %d\n', v2);

  function r1 = f2()
    r1 = v1 + 1 % nested function scope
  end
end

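If you try to do the same thing in R, you have to put the nested function first, or you get an error! Just because you've written the function definition doesn't mean it is in the namespace; it isn't there until it has been assigned to a variable! On the other hand, the function can refer to a variable that has not been defined yet.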

f1 <- function() {
  f2 <- function() {
    v1 + 1
  }

  v1 <- 1

  v2 = f2()

  print(sprintf("2 == %d", v2))
}
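For contrast, a sketch of the failing order. The names f_bad and nested_helper are made up for illustration, and the example assumes no function called nested_helper exists elsewhere on the search path.

```r
f_bad <- function() {
  v2 <- nested_helper()          # called before the assignment below has run
  nested_helper <- function() 2
  v2
}

# Capture the error message instead of stopping the script.
msg <- tryCatch(f_bad(), error = function(e) conditionMessage(e))
msg   # reports that the function nested_helper could not be found
```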

score 3 · Accepted Answer

Mine from today: qnorm() takes probabilities and pnorm() takes quantiles.
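A brief sketch of the two directions for the standard normal distribution, showing that the functions are inverses of each other:

```r
pnorm(0)             # 0.5: quantile in, probability out
qnorm(0.5)           # 0: probability in, quantile out
qnorm(pnorm(1.96))   # 1.96: round trip back to the quantile
```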

answered 2010-08-03T15:50:56.987

score 3 · Accepted Answer

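For me it's the counter-intuitive way that, when you export a data.frame to a text file using write.csv and then import it again, you need to add an extra argument to get exactly the same data.frame back, like this: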

write.csv(m, file = 'm.csv')
read.csv('m.csv', row.names = 1) # Note the row.names argument

I also posted this as a question on SO, and @BenBolker suggested it as an answer to this Q.
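An alternative sketch, assuming the row names carry no information worth keeping: leave them out when writing, so that a plain read.csv() returns the same columns. The data frame and the temporary file are made up for illustration.

```r
m <- data.frame(x = 1:3, y = c("a", "b", "c"))
path <- tempfile(fileext = ".csv")

write.csv(m, file = path, row.names = FALSE)   # no row-name column is written
m2 <- read.csv(path)                           # no row.names argument needed

names(m2)   # "x" "y"
dim(m2)     # 3 2
```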

score 1 · Accepted Answer

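The apply family of functions doesn't only work for matrices, but scales up to multi-dimensional arrays. In my research I often have a dataset of, for example, atmospheric temperature. This is stored in a multi-dimensional array with dimensions x,y,level,time, from now on called multi_dim_array. A mock-up example would be: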

multi_dim_array = array(runif(96 * 48 * 6 * 100, -50, 50), 
                        dim = c(96, 48, 6, 100))
> str(multi_dim_array)
#     x     y     lev  time    
 num [1:96, 1:48, 1:6, 1:100] 42.4 16 32.3 49.5 24.9 ...

Using apply one can easily get the:

# mean value per time step (averaged over x, y and level)
> str(apply(multi_dim_array, 4, mean))
 num [1:100] -0.0113 -0.0329 -0.3424 -0.3595 -0.0801 ...
# mean over level and time per gridcell (x,y location)
> str(apply(multi_dim_array, c(1,2), mean))
 num [1:96, 1:48] -1.506 0.4553 -1.7951 0.0703 0.2915 ...
# temporal mean value per gridcell and level (x,y location, level)
> str(apply(multi_dim_array, c(1,2,3), mean))
 num [1:96, 1:48, 1:6] -3.839 -3.672 0.131 -1.024 -2.143 ...
# Spatial mean per level and time step
> str(apply(multi_dim_array, c(3,4), mean))
 num [1:6, 1:100] -0.4436 -0.3026 -0.3158 0.0902 0.2438 ...

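This makes the margin argument to apply seem a lot less counter-intuitive. At first I wondered why it doesn't use "row" and "col" instead of 1 and 2. But the fact that it also works for arrays with more dimensions makes it clear why margin is used like this.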

score 0 · Accepted Answer

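which.min and which.max work opposite to what you would expect when used with a comparison operator, and can even give incorrect answers. For example, suppose you want to find which element in a sorted list of numbers is the largest number that is less than a threshold (i.e. in a sequence from 100 to 200, which is the largest number that is less than 110):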

set.seed(420)
x = seq(100, 200)
which(x < 110)
> [1]  1  2  3  4  5  6  7  8  9 10
which.max(x < 110)
> [1] 1
which.min(x < 110)
> [1] 11
x[11]
> [1] 110
max(which(x < 110))
>[1] 10
x[10]
> [1] 109

score -1 · Accepted Answer

A dirty one that's hard to find! Splitting a multi-line expression like this one:

K <- hyperpar$intcept.sigma2
        + cov.NN.additive(x1$env, x2 = NULL, sigma2_int = hyperpar$env.sigma2_int, sigma2_slope = hyperpar$env.sigma2_slope)
        + hyperpar$env.sigma2 * K.cache$k.env

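R will only assign the first line to K; the other two lines are evaluated as separate expressions and their results are thrown away! And it gives no warning, nothing! This is quite a nasty betrayal of the unsuspecting user. It actually has to be written like this: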

K <- hyperpar$intcept.sigma2 +
        cov.NN.additive(x1$env, x2 = NULL, sigma2_int = hyperpar$env.sigma2_int, sigma2_slope = hyperpar$env.sigma2_slope) +
        hyperpar$env.sigma2 * K.cache$k.env

This is not a very natural way of writing.
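Another option, sketched here with plain numbers rather than the answer's model terms: wrapping the whole right-hand side in parentheses keeps the parser reading across line breaks, so the operators can stay at the start of each line.

```r
a <- 1
  + 2        # evaluated as a separate expression; a is still 1

b <- (1
  + 2)       # the open parenthesis keeps the expression going; b is 3
```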

score -1 · Accepted Answer

This one!

all(c(1,2,3,4) == NULL)
$[1] TRUE

I had this check in my code because I really needed both tables to have the same column names:

stopifnot(all(names(x$x$env) == names(x$obsx$env)))

But the check passed (evaluated to TRUE) when x$x$env didn't even exist!
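A short sketch of why this happens and of one possible stricter check. The column names are made up; identical() compares the two objects as a whole, so a NULL on either side makes it FALSE.

```r
cmp <- c(1, 2, 3, 4) == NULL
length(cmp)   # 0: comparing with NULL gives an empty logical vector
all(cmp)      # TRUE: all() of an empty vector is TRUE

# Stricter check: a missing (NULL) set of names no longer passes.
expected <- c("a", "b")
identical(expected, NULL)          # FALSE
identical(expected, c("a", "b"))   # TRUE
```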

score -1 · Accepted Answer

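You can use options(warn = 2), which, according to the manual:

If warn is two or larger all warnings are turned into errors.

Indeed, warnings get turned into errors, but, gotcha! The code still continues running after such errors!!!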

source("script.R")
# ...
# Loading required package: bayesmeta
# Failed with error:  ‘(converted from warning) there is no package called ‘bayesmeta’’
# computing posterior (co)variances ... 
# (script continues running)
...

PS: But some other errors converted from warnings do stop the script... so I don't know, I'm confused. This one did stop the script:

Error in optimise(psiline, c(0, 2), adiff, a, as.matrix(K), y, d0, mn,  :
  (converted from warning) NA/Inf replaced by maximum positive value
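One possible explanation, offered as an assumption rather than something the post confirms: an error converted from a warning is an ordinary error, so any handler that catches errors also catches it, and the script carries on. The "Failed with error:" prefix above looks like output from require(), which handles load failures itself. A minimal sketch of a caught conversion:

```r
old <- options(warn = 2)          # turn warnings into errors
res <- tryCatch(
  as.integer("x"),                # normally only warns about NA coercion
  error = function(e) "caught"    # the converted warning lands here
)
options(old)                      # restore the previous setting

res   # "caught": execution continues after the handler
```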
