r - Factor levels default to 1 and 2 in R | Dummy variable

Question

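I am transitioning from Stata to R. In Stata, if I label the levels of a factor (say, 0 and 1) as (M and F), the 0 and 1 remain as they are. Moreover, most software, including Excel and SPSS, requires this 0/1 coding for dummy-variable linear regression.

However, I've noticed that R defaults factor levels to 1,2 instead of 0,1. I don't know why R does this, although the regression internally (and correctly) uses 0 and 1 for the factor variable. I would appreciate any help.

Here's what I did:

Try #1: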

sex<-c(0,1,0,1,1)
sex<-factor(sex,levels = c(1,0),labels = c("F","M"))
str(sex)
Factor w/ 2 levels "F","M": 2 1 2 1 1

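It seems that the factor levels have now been reset to 1 and 2. I believe the 1s and 2s are references to the factor levels. However, I have lost the original values, i.e. the 0s and 1s.

Try #2: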

sex<-c(0,1,0,1,1)
sex<-factor(sex,levels = c(0,1),labels = c("F","M"))
str(sex)
Factor w/ 2 levels "F","M": 1 2 1 2 2

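Same as above. My 0s and 1s are now 1s and 2s. Quite surprising. Why is this happening?

Try #3: Now, I wanted to see whether the 1s and 2s have any ill effect on the regression. So, here's what I did:

Here's what my data looks like: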

> head(data.frame(sassign$total_,sassign$gender))
  sassign.total_ sassign.gender
1            357              M
2            138              M
3            172              F
4            272              F
5            149              F
6            113              F

myfit<-lm(sassign$total_ ~ sassign$gender)

myfit$coefficients
    (Intercept) sassign$genderM 
      200.63522        23.00606

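So, it turns out that the means are correct. When running the regression, R did use 0 and 1 values as the dummy.

I did check other threads on SO, but they mostly talk about how R codes factor variables without telling me why. Stata and SPSS generally require the base level to be "0". So, I thought I would ask about this.

I'd appreciate any thoughts.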

score 8 · Accepted Answer

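R is not Stata, and you will need to unlearn a lot of what you were taught about constructing dummy variables. R does it behind the scenes for you. You cannot make R behave exactly like Stata. True, R does have 0s and 1s in the model matrix column for the non-reference level, while the factor itself is stored with the integer codes 1 and 2; those codes mark positions in the list of levels and are not values that enter the regression. In any case, contrasts are always about differences, and the difference between (0,1) is the same as the difference between (1,2).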
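A minimal sketch of that last point, using the same six observations as the example data in this answer. The names codes, dummy, fit_factor, fit_dummy and fit_codes are made up for the illustration. Treating the codes as a plain number is not what R does with a factor; it is included only to show that shifting a two-valued predictor from (0,1) to (1,2) moves the intercept and leaves the slope alone.

```r
# Same six observations as the answer's example data.
total  <- c(357, 138, 172, 272, 149, 113)
gender <- factor(c("M", "M", "F", "F", "F", "F"), levels = c("F", "M"))

codes <- as.integer(gender)   # internal codes: 2 2 1 1 1 1
dummy <- codes - 1L           # 0/1 coding:     1 1 0 0 0 0

fit_factor <- lm(total ~ gender)  # R builds the 0/1 column itself
fit_dummy  <- lm(total ~ dummy)   # numeric predictor coded 0/1
fit_codes  <- lm(total ~ codes)   # numeric predictor coded 1/2 (illustration only)

# Collect the slope of each fit; only the intercept of fit_codes differs,
# because its zero point is shifted by one unit.
slopes <- c(unname(coef(fit_factor)[2]),
            unname(coef(fit_dummy)[2]),
            unname(coef(fit_codes)[2]))
print(slopes)
print(coef(fit_codes))
```

The slope is the difference between the two group means in every case, which is why the 1/2 storage has no effect on the fitted comparison.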

Example data:

dput(dat)
structure(list(total = c(357L, 138L, 172L, 272L, 149L, 113L), 
    gender = structure(c(2L, 2L, 1L, 1L, 1L, 1L), .Label = c("F", 
    "M"), class = "factor")), .Names = c("total", "gender"), row.names = c("1", 
"2", "3", "4", "5", "6"), class = "data.frame")

These two regression models have different model matrices (the model matrix is how R constructs its "dummy variables").

> myfit<-lm(total ~ gender, dat)
> 
> myfit$coefficients
(Intercept)     genderM 
      176.5        71.0 
> dat$gender=factor(dat$gender, levels=c("M","F") )
> myfit<-lm(total ~ gender, dat)
> 
> myfit$coefficients
(Intercept)     genderF 
      247.5       -71.0 
> model.matrix(myfit)
  (Intercept) genderF
1           1       0
2           1       0
3           1       1
4           1       1
5           1       1
6           1       1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$gender
[1] "contr.treatment"

> dat$gender=factor(dat$gender, levels=c("F","M") )
> myfit<-lm(total ~ gender, dat)
> 
> myfit$coefficients
(Intercept)     genderM 
      176.5        71.0 
> model.matrix(myfit)
  (Intercept) genderM
1           1       1
2           1       1
3           1       0
4           1       0
5           1       0
6           1       0
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$gender
[1] "contr.treatment"
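As a rough sketch of what the transcript above shows, the reference level can also be switched with relevel() from the stats package instead of rebuilding the factor. The column name gender_m_ref and the object names are assumptions made for this example; the data are the six rows from the dput() output.

```r
dat <- data.frame(
  total  = c(357L, 138L, 172L, 272L, 149L, 113L),
  gender = factor(c("M", "M", "F", "F", "F", "F"))  # levels sort to F, M
)

# Group means: the intercept is the mean of the reference level.
group_means <- tapply(dat$total, dat$gender, mean)

# relevel() changes only which level comes first (the reference level).
dat$gender_m_ref <- relevel(dat$gender, ref = "M")

coef_f_ref <- coef(lm(total ~ gender, data = dat))        # reference: F
coef_m_ref <- coef(lm(total ~ gender_m_ref, data = dat))  # reference: M

print(group_means)
print(coef_f_ref)
print(coef_m_ref)
```

Either way the second coefficient is the difference between the two group means; only its sign and the intercept depend on which level is the reference.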

score 5 · Accepted Answer

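In short, you are just mixing up two different concepts. I will clarify them one by one below.

The meaning of the integers you see in str()

What you see from str() is the internal representation of a factor variable. A factor is stored internally as an integer vector, where each number gives the position of the level inside the vector of levels. For example: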

x <- gl(3, 2, labels = letters[1:3])
x
#[1] a a b b c c
#Levels: a b c

storage.mode(x)  ## or `typeof(x)`
#[1] "integer"

str(x)
# Factor w/ 3 levels "a","b","c": 1 1 2 2 3 3

as.integer(x)
#[1] 1 1 2 2 3 3

levels(x)
#[1] "a" "b" "c"

A common use of these positions is to perform as.character(x) in the most efficient way:

levels(x)[x]
#[1] "a" "a" "b" "b" "c" "c"
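The same indexing idea addresses the "lost" 0s and 1s from the question. The following is only a sketch under the assumption that the original values are kept as the level labels; the names sex_raw, recovered, sex_lab and indicator are invented for the example.

```r
sex_raw <- c(0, 1, 0, 1, 1)

# Keep the values themselves as labels so they can be recovered later.
sex <- factor(sex_raw, levels = c(0, 1))

codes     <- as.integer(sex)               # internal positions: 1 2 1 2 2
recovered <- as.numeric(levels(sex))[sex]  # original values:    0 1 0 1 1

# With text labels ("F", "M") the numbers are no longer stored in the factor,
# but a 0/1 indicator can still be derived from the positions.
sex_lab   <- factor(sex_raw, levels = c(0, 1), labels = c("F", "M"))
indicator <- as.integer(sex_lab) - 1L      # 0 1 0 1 1

print(recovered)
print(indicator)
```

Note that as.integer() on a factor always returns the positions, never the labels, which is why the label-based route goes through levels().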

Your misconception about the model matrix

It seems to me that you think the model matrix is obtained by

cbind(1L, as.integer(x))
#     [,1] [,2]
#[1,]    1    1
#[2,]    1    1
#[3,]    1    2
#[4,]    1    2
#[5,]    1    3
#[6,]    1    3

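This is not true. Done this way, you are just treating your factor variable as a numerical variable.

The model matrix is actually constructed this way: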

xlevels <- levels(x)
cbind(1L, match(x, xlevels[2], nomatch=0), match(x, xlevels[3], nomatch=0))
#     [,1] [,2] [,3]
#[1,]    1    0    0
#[2,]    1    0    0
#[3,]    1    1    0
#[4,]    1    1    0
#[5,]    1    0    1
#[6,]    1    0    1

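1 and 0 mean "match" / "occurrence" and "no match" / "no occurrence", respectively.

The R routine model.matrix does this efficiently for you, with easy-to-read column names and row names: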

model.matrix(~x)
#  (Intercept) xb xc
#1           1  0  0
#2           1  0  0
#3           1  1  0
#4           1  1  0
#5           1  0  1
#6           1  0  1
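For a slightly different view of the same construction, the coding that model.matrix applies can be inspected with contrasts(). This is a sketch that assumes R's default contrast option for unordered factors (contr.treatment) is in effect; the names ctr, by_hand and built_in are made up for the example.

```r
x <- gl(3, 2, labels = letters[1:3])

# One row per level, one column per non-reference level.
ctr <- contrasts(x)
print(ctr)

# Picking the row of `ctr` that belongs to each observation and prepending
# a column of 1s reproduces the entries of the model matrix.
by_hand  <- cbind(1, ctr[as.integer(x), , drop = FALSE])
built_in <- model.matrix(~ x)

print(all(by_hand == built_in))
```

Here the integer codes are used only to look up rows of the contrast matrix; they never appear as values in the design matrix.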

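Writing an R function to produce a model matrix ourselves

We can write a nominal routine mm to generate a model matrix. Although it is far less efficient than model.matrix, it may help you digest the concept better.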

mm <- function (x, contrast = TRUE) {
  xlevels <- levels(x)
  lst <- lapply(xlevels, function (z) match(x, z, nomatch = 0L))
  if (contrast) do.call("cbind", c(list(1L), lst[-1]))
  else do.call("cbind", lst)
  }

For example, if we have a factor y with 5 levels:

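## The sample shown below was drawn with R < 3.6; newer versions use a
## different default sampling method and give other values for the same seed.
set.seed(1)
y <- factor(sample(1:5, 10, replace=TRUE), labels = letters[1:5])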
y
# [1] b b c e b e e d d a
#Levels: a b c d e
str(y)
#Factor w/ 5 levels "a","b","c","d",..: 2 2 3 5 2 5 5 4 4 1

Its model matrices with and without contrast treatment are, respectively:

mm(y, TRUE)
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    1    1    0    0    0
# [2,]    1    1    0    0    0
# [3,]    1    0    1    0    0
# [4,]    1    0    0    0    1
# [5,]    1    1    0    0    0
# [6,]    1    0    0    0    1
# [7,]    1    0    0    0    1
# [8,]    1    0    0    1    0
# [9,]    1    0    0    1    0
#[10,]    1    0    0    0    0

mm(y, FALSE)
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    0    1    0    0    0
# [2,]    0    1    0    0    0
# [3,]    0    0    1    0    0
# [4,]    0    0    0    0    1
# [5,]    0    1    0    0    0
# [6,]    0    0    0    0    1
# [7,]    0    0    0    0    1
# [8,]    0    0    0    1    0
# [9,]    0    0    0    1    0
#[10,]    1    0    0    0    0

The corresponding model.matrix calls would be, respectively:

model.matrix(~ y)
model.matrix(~ y - 1)
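A quick way to check that claim is to compare the two results entry by entry. The sketch below repeats mm so that it runs on its own and types in the values of y shown above, so it does not depend on the random number generator; same_with_intercept and same_without are names chosen for the example.

```r
# Copy of the mm routine from above, with braces added around the branches.
mm <- function(x, contrast = TRUE) {
  xlevels <- levels(x)
  lst <- lapply(xlevels, function(z) match(x, z, nomatch = 0L))
  if (contrast) {
    do.call("cbind", c(list(1L), lst[-1]))  # intercept + non-reference levels
  } else {
    do.call("cbind", lst)                   # one indicator column per level
  }
}

# The ten values of y printed above, typed in directly.
y <- factor(c("b", "b", "c", "e", "b", "e", "e", "d", "d", "a"))

# Compare entries only; model.matrix also carries names and attributes.
same_with_intercept <- all(mm(y, TRUE)  == model.matrix(~ y))
same_without        <- all(mm(y, FALSE) == model.matrix(~ y - 1))

print(c(same_with_intercept, same_without))
```

The comparison looks only at the matrix entries, since mm does not produce the row names, column names or the "assign" and "contrasts" attributes that model.matrix attaches.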
