72

I have a data.frame consisting of numeric and factor variables, as shown below.

testFrame <- data.frame(First=sample(1:10, 20, replace=TRUE),
           Second=sample(1:20, 20, replace=TRUE), Third=sample(1:10, 20, replace=TRUE),
           Fourth=rep(c("Alice","Bob","Charlie","David"), 5),
           Fifth=rep(c("Edward","Frank","Georgia","Hank","Isaac"),4),
           stringsAsFactors=TRUE)  # needed in R >= 4.0 so Fourth and Fifth are factors

I want to build a matrix that assigns dummy variables to the factors and leaves the numeric variables alone.

model.matrix(~ First + Second + Third + Fourth + Fifth, data=testFrame)

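As expected, and just as when running lm, this leaves out one level of each factor as the reference level. However, I want to build a matrix with a dummy/indicator variable for every level of every factor. I am building this matrix for glmnet, so I am not worried about multicollinearity.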
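To make the behaviour described above concrete, here is a small check on the example data that lists which indicator columns the default call leaves out. It is only a sketch: `set.seed(1)`, `stringsAsFactors = TRUE` and the names `mm_default`, `all_dummies` and `dropped` are my own additions for illustration.

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Second = sample(1:20, 20, replace = TRUE),
  Third  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE  # assumption: R >= 4.0 does not create factors by default
)

mm_default <- model.matrix(~ First + Second + Third + Fourth + Fifth,
                           data = testFrame)

# Column names we would see if every level had its own indicator
all_dummies <- c(paste0("Fourth", levels(testFrame$Fourth)),
                 paste0("Fifth",  levels(testFrame$Fifth)))

# Indicators missing from the default matrix are the reference levels
dropped <- setdiff(all_dummies, colnames(mm_default))
print(dropped)
```

With the default treatment contrasts, the first level of each factor is the one that goes missing.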

Is there a way to have model.matrix create the dummies for every level of each factor?

4

11 Answers

71

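(Trying to redeem myself...) In response to Jared's comment on @Fabians answer about automating it, note that all you need to supply is a named list of contrast matrices. contrasts() takes a vector/factor and produces the contrast matrix from it. We can therefore use lapply() to run contrasts() on each factor in our data set, e.g. for the testFrame example provided: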

> lapply(testFrame[,4:5], contrasts, contrasts = FALSE)
$Fourth
        Alice Bob Charlie David
Alice       1   0       0     0
Bob         0   1       0     0
Charlie     0   0       1     0
David       0   0       0     1

$Fifth
        Edward Frank Georgia Hank Isaac
Edward       1     0       0    0     0
Frank        0     1       0    0     0
Georgia      0     0       1    0     0
Hank         0     0       0    1     0
Isaac        0     0       0    0     1

Which slots nicely into @fabians answer:

model.matrix(~ ., data=testFrame, 
             contrasts.arg = lapply(testFrame[,4:5], contrasts, contrasts=FALSE))
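If the factor columns are not known by position, the same idea can be written so that it finds them by type. This is a sketch under my own assumptions (the names `is_fac`, `full_contrasts`, `X` and `X_glmnet` are illustrative, and the data is rebuilt with `stringsAsFactors = TRUE` so it runs on current R):

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Second = sample(1:20, 20, replace = TRUE),
  Third  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE
)

# Select factor columns by type; list-style indexing keeps a data frame
# even when only one column is a factor
is_fac <- sapply(testFrame, is.factor)
full_contrasts <- lapply(testFrame[is_fac], contrasts, contrasts = FALSE)

X <- model.matrix(~ ., data = testFrame, contrasts.arg = full_contrasts)
print(colnames(X))

# glmnet fits its own intercept by default, so the constant column
# can be dropped before passing the matrix on
X_glmnet <- X[, colnames(X) != "(Intercept)"]
```

Every row has exactly one 1 among the `Fourth*` columns and one among the `Fifth*` columns, which is the full indicator coding the question asks for.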
answered 2010-12-31T09:26:23.613
54

You need to reset the contrasts for the factor variables:

model.matrix(~ Fourth + Fifth, data=testFrame, 
        contrasts.arg=list(Fourth=contrasts(testFrame$Fourth, contrasts=F), 
                Fifth=contrasts(testFrame$Fifth, contrasts=F)))

Or, with a little less typing and without the proper names:

model.matrix(~ Fourth + Fifth, data=testFrame, 
    contrasts.arg=list(Fourth=diag(nlevels(testFrame$Fourth)), 
            Fifth=diag(nlevels(testFrame$Fifth))))
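To illustrate what "without the proper names" means in practice, the sketch below compares the plain diag() version with an identity matrix that carries the level names. `named_diag` is a hypothetical helper written for this example, not a base R function, and the data is rebuilt with `stringsAsFactors = TRUE` as an assumption for current R versions.

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE
)

# Plain identity matrices: indicator columns are labelled by position
mm_diag <- model.matrix(~ Fourth + Fifth, data = testFrame,
  contrasts.arg = list(Fourth = diag(nlevels(testFrame$Fourth)),
                       Fifth  = diag(nlevels(testFrame$Fifth))))
print(colnames(mm_diag))

# Hypothetical helper: identity matrix with the levels as dimnames
named_diag <- function(f) {
  lv <- levels(f)
  m <- diag(length(lv))
  dimnames(m) <- list(lv, lv)
  m
}

mm_named <- model.matrix(~ Fourth + Fifth, data = testFrame,
  contrasts.arg = list(Fourth = named_diag(testFrame$Fourth),
                       Fifth  = named_diag(testFrame$Fifth)))
print(colnames(mm_named))
```

The two matrices hold the same values; only the column labels differ.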
answered 2010-12-30T09:38:21.307
18

caret provides a nice function, dummyVars, that achieves this in two lines:

library(caret)
dmy <- dummyVars(" ~ .", data = testFrame)
testFrame2 <- data.frame(predict(dmy, newdata = testFrame))

Check the resulting columns:

colnames(testFrame2)

"First"  "Second"         "Third"          "Fourth.Alice"   "Fourth.Bob"     "Fourth.Charlie" "Fourth.David"   "Fifth.Edward"   "Fifth.Frank"   "Fifth.Georgia"  "Fifth.Hank"     "Fifth.Isaac"   

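The nice thing here is that you get the original data frame back, plus the dummy variables, with the original variables that were transformed left out.

More info: http://amunategui.github.io/dummyVar-Walkthrough/

answered 2016-12-28T18:08:50.223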
11

dummyVars from caret could also be used. http://caret.r-forge.r-project.org/preprocess.html

answered 2013-03-14T02:29:10.497
3

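OK. Just reading the above and putting it all together. Suppose you want the matrix (e.g. 'X.factors') that is multiplied by the coefficient vector to get the linear predictor. There are still a couple of extra steps: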

X.factors = 
  model.matrix( ~ ., data=X, contrasts.arg = 
    lapply(data.frame(X[,sapply(data.frame(X), is.factor)]),
                                             contrasts, contrasts = FALSE))

(Note that you need to turn X[*] back into a data frame in case you have only one factor column.)

Then say you get something like this:

attr(X.factors,"assign")
[1]  0  1  **2**  2  **3**  3  3  **4**  4  4  5  6  7  8  9 10 #emphasis added

We want to get rid of the **'d reference levels of each factor:

att = attr(X.factors,"assign")
factor.columns = unique(att[duplicated(att)])
unwanted.columns = match(factor.columns,att)
X.factors = X.factors[,-unwanted.columns]
X.factors = (data.matrix(X.factors))
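As a sanity check of the steps above, the following sketch runs them on the question's testFrame and confirms which columns are removed. The names `X.full`, `first.cols` and `X.reduced` are illustrative, and the approach assumes that only factor terms span more than one column (so no poly() or similar multi-column numeric terms).

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Second = sample(1:20, 20, replace = TRUE),
  Third  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE
)

# Full indicator coding for every factor column
fac <- sapply(testFrame, is.factor)
X.full <- model.matrix(~ ., data = testFrame,
  contrasts.arg = lapply(testFrame[fac], contrasts, contrasts = FALSE))

# "assign" maps each column to the formula term it came from
att <- attr(X.full, "assign")
factor.terms <- unique(att[duplicated(att)])  # terms with more than one column
first.cols <- match(factor.terms, att)        # first column of each such term
X.reduced <- X.full[, -first.cols]

print(colnames(X.full)[first.cols])
```

On this data the removed columns are the first level of each factor, so the reduced matrix has the same columns as a plain `model.matrix(~ ., data = testFrame)` call.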
answered 2014-07-24T18:05:51.790
3

A tidyverse answer:

library(dplyr)
library(tidyr)
result <- testFrame %>% 
    mutate(one = 1) %>% spread(Fourth, one, fill = 0, sep = "") %>% 
    mutate(one = 1) %>% spread(Fifth, one, fill = 0, sep = "")

This produces the desired result (same as @Gavin Simpson's answer):

> head(result, 6)
  First Second Third FourthAlice FourthBob FourthCharlie FourthDavid FifthEdward FifthFrank FifthGeorgia FifthHank FifthIsaac
1     1      5     4           0         0             1           0           0          1            0         0          0
2     1     14    10           0         0             0           1           0          0            1         0          0
3     2      2     9           0         1             0           0           1          0            0         0          0
4     2      5     4           0         0             0           1           0          1            0         0          0
5     2     13     5           0         0             1           0           1          0            0         0          0
6     2     15     7           1         0             0           0           1          0            0         0          0
answered 2019-02-16T09:43:12.083
2

Using the R package 'CatEncoders':

library(CatEncoders)
testFrame <- data.frame(First=sample(1:10, 20, replace=T),
           Second=sample(1:20, 20, replace=T), Third=sample(1:10, 20, replace=T),
           Fourth=rep(c("Alice","Bob","Charlie","David"), 5),
           Fifth=rep(c("Edward","Frank","Georgia","Hank","Isaac"),4))

fit <- OneHotEncoder.fit(testFrame)

z <- transform(fit,testFrame,sparse=TRUE) # give the sparse output
z <- transform(fit,testFrame,sparse=FALSE) # give the dense output
answered 2016-09-14T01:56:17.607
2

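I am currently learning about Lasso models and glmnet::cv.glmnet(), model.matrix() and Matrix::sparse.model.matrix() (for high-dimensional matrices, using model.matrix will waste a lot of time, as suggested by the author of glmnet).

Just sharing that there is a tidy way of coding this that gives the same result as @fabians and @Gavin's answers. Meanwhile, @asdf123 has also introduced another package, library('CatEncoders').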

> require('useful')
> # always use all levels
> build.x(First ~ Second + Fourth + Fifth, data = testFrame, contrasts = FALSE)
> 
> # just use all levels for Fourth
> build.x(First ~ Second + Fourth + Fifth, data = testFrame, contrasts = c(Fourth = FALSE, Fifth = TRUE))

Source: R for Everyone: Advanced Analytics and Graphics (page 273)
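Since Matrix::sparse.model.matrix() is mentioned above, here is a sketch of how the full-contrasts trick from the earlier answers might be combined with it. It assumes the Matrix package is installed (it normally ships with R) and that sparse.model.matrix() treats `contrasts.arg` the same way model.matrix() does; the name `X_sparse` is illustrative.

```r
library(Matrix)  # assumption: the recommended Matrix package is available

set.seed(1)  # assumption: fixed seed so the example is reproducible
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Second = sample(1:20, 20, replace = TRUE),
  Third  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE
)

# Full indicator coding for every factor, stored as a sparse matrix
fac <- sapply(testFrame, is.factor)
X_sparse <- sparse.model.matrix(~ ., data = testFrame,
  contrasts.arg = lapply(testFrame[fac], contrasts, contrasts = FALSE))

print(class(X_sparse))
print(dim(X_sparse))
```

glmnet accepts sparse matrices for its `x` argument, so the result can be passed on without converting it to a dense matrix first.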

answered 2017-01-15T17:59:29.870
1
model.matrix(~ First + Second + Third + Fourth + Fifth - 1, data=testFrame)

or

model.matrix(~ First + Second + Third + Fourth + Fifth + 0, data=testFrame)

should be the most straightforward.
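One caveat is worth checking before relying on this. As far as I can tell, removing the intercept gives full indicator coding only to the first factor in the formula, while later factors still lose their reference level. A sketch of that check on the example data (the names `mm_noint`, `fourth_cols` and `fifth_cols` are illustrative):

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible
testFrame <- data.frame(
  First  = sample(1:10, 20, replace = TRUE),
  Second = sample(1:20, 20, replace = TRUE),
  Third  = sample(1:10, 20, replace = TRUE),
  Fourth = rep(c("Alice", "Bob", "Charlie", "David"), 5),
  Fifth  = rep(c("Edward", "Frank", "Georgia", "Hank", "Isaac"), 4),
  stringsAsFactors = TRUE
)

mm_noint <- model.matrix(~ First + Second + Third + Fourth + Fifth - 1,
                         data = testFrame)

# Count the indicator columns produced for each factor
fourth_cols <- grep("^Fourth", colnames(mm_noint), value = TRUE)
fifth_cols  <- grep("^Fifth",  colnames(mm_noint), value = TRUE)
print(fourth_cols)
print(fifth_cols)
```

So with more than one factor, this approach alone does not produce a column for every level; the `contrasts.arg` approach from the other answers does.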

answered 2015-09-04T08:05:07.663
1

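You can use tidyverse to achieve this without specifying each column manually.

The trick is to make a "long" data frame.

Then, munge a few things, and spread it back to wide format to create the indicator/dummy variables.

Code: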

library(tidyverse)

## add index variable for pivoting
testFrame$id <- 1:nrow(testFrame)

testFrame %>%
    ## pivot to "long" format
    gather(feature, value, -id) %>%
    ## add indicator value
    mutate(indicator=1) %>%
    ## create feature name that unites a feature and its value
    unite(feature, value, col="feature_value", sep="_") %>%
    ## convert to wide format, filling missing values with zero
    spread(feature_value, indicator, fill=0)

Output:

   id Fifth_Edward Fifth_Frank Fifth_Georgia Fifth_Hank Fifth_Isaac First_2 First_3 First_4 ...
1   1            1           0             0          0           0       0       0       0
2   2            0           1             0          0           0       0       0       0
3   3            0           0             1          0           0       0       0       0
4   4            0           0             0          1           0       0       0       0
5   5            0           0             0          0           1       0       0       0
6   6            1           0             0          0           0       0       0       0
7   7            0           1             0          0           0       0       1       0
8   8            0           0             1          0           0       1       0       0
9   9            0           0             0          1           0       0       0       0
10 10            0           0             0          0           1       0       0       0
11 11            1           0             0          0           0       0       0       0
12 12            0           1             0          0           0       0       0       0
...
answered 2020-03-27T00:22:31.003
1

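I wrote a package called ModelMatrixModel to improve the functionality of model.matrix(). By default, the ModelMatrixModel() function in the package returns a class containing a sparse matrix with dummy variables for all levels, which is suitable as input to cv.glmnet() in the glmnet package. Importantly, the returned class also stores the transformation parameters, such as the factor level information, which can then be applied to new data. The function can handle most items in an R formula, such as poly() and interactions. It also offers several other options, such as handling invalid factor levels and scaling the output.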

#devtools::install_github("xinyongtian/R_ModelMatrixModel")
library(ModelMatrixModel)
testFrame <- data.frame(First=sample(1:10, 20, replace=T),
                        Second=sample(1:20, 20, replace=T), Third=sample(1:10, 20, replace=T),
                        Fourth=rep(c("Alice","Bob","Charlie","David"), 5))
newdata=data.frame(First=sample(1:10, 2, replace=T),
                   Second=sample(1:20, 2, replace=T), Third=sample(1:10, 2, replace=T),
                   Fourth=c("Bob","Charlie"))
mm=ModelMatrixModel(~First+Second+Fourth, data = testFrame)
class(mm)
## [1] "ModelMatrixModel"
class(mm$x) #default output is sparse matrix
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
data.frame(as.matrix(head(mm$x,2)))
##   First Second FourthAlice FourthBob FourthCharlie FourthDavid
## 1     7     17           1         0             0           0
## 2     9      7           0         1             0           0

#apply the same transformation to new data, note the dummy variables for 'Fourth' includes the levels not appearing in new data     
mm_new=predict(mm,newdata)
data.frame(as.matrix(head(mm_new$x,2))) 
##   First Second FourthAlice FourthBob FourthCharlie FourthDavid
## 1     6      3           0         1             0           0
## 2     2     12           0         0             1           0
answered 2021-08-11T17:02:33.027