I have worked out a very long-winded solution to this using regular expressions, but I am hoping there is a more native way to do it.

Given a model, perhaps like

data(tips, package="reshape2")
mod <- lm(tip ~ total_bill*sex + sex*day, tips)
mod$coefficients

I would like to determine which coefficient goes with which variable in the formula. Something like this:

|    Coefficient     |    Variable    |
|:-------------------|:---------------|
| total_bill         | total_bill     |
| sexMale            | sex            |
| daySat             | day            |
| daySun             | day            |
| dayThur            | day            |
| total_bill:sexMale | total_bill,sex |
| sexMale:daySat     | sex,day        |
| sexMale:daySun     | sex,day        |
| sexMale:dayThur    | sex,day        |

I have looked through model.matrix and model.formula, but those led me to this line of code:

.Internal(model.matrix(t, data))

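I dug into the C code, but I figure there has to be an easier way. Is there?

In response to DWin's nice answer, I constructed a convoluted example where regular expressions might fail. This is one of those edge cases that makes me afraid of regular expressions.

The data.frame is built with variable names and values that can easily be confused with one another, which happens fairly often.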

baseball <- data.frame(Bat=sample(1:100, 20, replace=TRUE),
                       Batter=sample(c("David", "Batley", "Bob", "Ace"), 20, replace=TRUE),
                       Hits=sample(1:20, 20, replace=TRUE))
baseball
bMod <- lm(Hits ~ Bat*Batter, baseball)
bMod$coefficients

col.matx <- sapply(colnames(model.matrix(bMod)), function(cols) sapply(labels(bMod), function(trm) grep(patt=trm, x=cols, value=TRUE)))

This matches the continuous variable Bat to all of the coefficients of the factor Batter.

Yes, this is a silly example, but it could easily happen.
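
To make the failure concrete, here is a minimal sketch of the substring problem. The coefficient names are typed in by hand, on the assumption that they are what `lm(Hits ~ Bat*Batter, baseball)` would produce when all four batters appear in the sample, so nothing random is involved.

```r
# Hand-written coefficient names (an assumption, see above)
coef_names <- c("(Intercept)", "Bat", "BatterBatley", "BatterBob", "BatterDavid",
                "Bat:BatterBatley", "Bat:BatterBob", "Bat:BatterDavid")

# Searching for the term label "Bat" as a pattern hits every name containing "Bat"
bat_hits <- grepl("Bat", coef_names)
setNames(bat_hits, coef_names)

# Main-effect coefficients of the factor Batter that were wrongly matched to Bat
batter_only <- coef_names[bat_hits & !grepl(":", coef_names) & coef_names != "Bat"]
batter_only
```

Everything except `(Intercept)` is flagged, including the pure `Batter` coefficients, which is exactly the confusion described above.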


2 Answers


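This is not an exact answer, but you should be able to see a way to rearrange it to suit your purposes. The first step determines which of the `labels` appear in each of the model.matrix column names.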

col.matx <- sapply(colnames(model.matrix(mod)), function(cols) 
         sapply(labels(mod), function(trm) grep(patt=trm, x=cols, value=TRUE)))
#---------------------------------------
               (Intercept) total_bill   sexMale     daySat      daySun      dayThur    
total_bill     Character,0 "total_bill" Character,0 Character,0 Character,0 Character,0
sex            Character,0 Character,0  "sexMale"   Character,0 Character,0 Character,0
day            Character,0 Character,0  Character,0 "daySat"    "daySun"    "dayThur"  
total_bill:sex Character,0 Character,0  Character,0 Character,0 Character,0 Character,0
sex:day        Character,0 Character,0  Character,0 Character,0 Character,0 Character,0
               total_bill:sexMale   sexMale:daySat   sexMale:daySun   sexMale:dayThur  
total_bill     "total_bill:sexMale" Character,0      Character,0      Character,0      
sex            "total_bill:sexMale" "sexMale:daySat" "sexMale:daySun" "sexMale:dayThur"
day            Character,0          "sexMale:daySat" "sexMale:daySun" "sexMale:dayThur"
total_bill:sex "total_bill:sexMale" Character,0      Character,0      Character,0      
sex:day        Character,0          Character,0      Character,0      Character,0      

You then need to collapse the row names associated with the matrix entries that are not zero-length:

> which(sapply(col.matx, length) != 0 , arr.ind=TRUE)
 [1]  6 12 18 23 28 31 32 34 37 38 42 43 47 48

So this picks the coefficient names out of the matrix above and uses modular arithmetic to associate them with the term.labels:

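idx <- which(sapply(col.matx, length) != 0)   # linear (column-major) indices of the non-empty cells
data.frame(coef = unlist(col.matx[idx]),
           # map each linear index back to its row; (idx - 1) %% n + 1 avoids a 0 index for the last row
           term.label = rownames(col.matx)[(idx - 1) %% nrow(col.matx) + 1])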
                 coef     term.label
1          total_bill     total_bill
2             sexMale            sex
3              daySat            day
4              daySun            day
5             dayThur            day
6  total_bill:sexMale     total_bill
7  total_bill:sexMale            sex
8  total_bill:sexMale total_bill:sex
9      sexMale:daySat            sex
10     sexMale:daySat            day
11     sexMale:daySun            sex
12     sexMale:daySun            day
13    sexMale:dayThur            sex
14    sexMale:dayThur            day

Methods for collapsing values are a common request on SO. One was answered within the last 24 hours.
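
As a hedged sketch of the same idea without the modular arithmetic: if the match results are kept as a logical matrix, `which(..., arr.ind = TRUE)` returns the row and column positions directly, and `split()` plus `paste()` does the collapsing. The built-in `iris` data stands in for `tips` here (an assumption, so that no extra package is needed), and the names `hit`, `long` and `collapsed` are purely illustrative. It is still substring matching, so it inherits the same weakness.

```r
fit  <- lm(Sepal.Length ~ Sepal.Width * Species, data = iris)
cols <- colnames(model.matrix(fit))   # coefficient names
trms <- labels(fit)                   # term labels

# Logical matrix: rows are term labels, columns are coefficient names
hit <- sapply(cols, function(cl) sapply(trms, function(tr) grepl(tr, cl, fixed = TRUE)))

# arr.ind = TRUE now yields "row" and "col" positions because hit has dimensions
idx  <- which(hit, arr.ind = TRUE)
long <- data.frame(coef       = colnames(hit)[idx[, "col"]],
                   term.label = rownames(hit)[idx[, "row"]],
                   stringsAsFactors = FALSE)

# Collapse to one comma-separated string of term labels per coefficient
collapsed <- sapply(split(long$term.label, long$coef), paste, collapse = ",")
collapsed
```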

answered 2013-03-06T23:23:11.797

OK, I found a solution for lm models that uses only information contained in the model itself.
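
For orientation, this is roughly what that information looks like on a small model. It is only an inspection sketch, and it assumes the built-in `iris` data as a stand-in for `tips`; the variable names are illustrative.

```r
fit <- lm(Sepal.Length ~ Sepal.Width * Species, data = iris)
tt  <- terms(fit)

term_labels <- attr(tt, "term.labels")  # one label per term in the formula
assign_idx  <- fit$assign               # for each coefficient, the index of its term (0 = intercept)
factor_mat  <- attr(tt, "factors")      # variables (rows) by terms (columns); non-zero means "appears in"
has_inter   <- attr(tt, "intercept")    # 1 if the model has an intercept

term_labels
assign_idx
factor_mat
```

`assign_idx` links coefficients to terms and `factor_mat` links terms to variables, so chaining the two gives the coefficient-to-variable mapping without looking at any names.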

require(plyr)       # for join function
require(reshape2)   # for melt function

matchCoefs <- function(model)
{
    # get the terms
    theTerms <- model$terms
    # get the assignment position
    thePos <- model$assign
    # get intercept indicator
    inter <- attr(theTerms, "intercept")
    # get coef names
    coefNames <- names(coef(model))
    # get pred names
    predNames <- attr(theTerms, "term.labels")
    # expand out pred names to match coefficient names
    predNames <- predNames[thePos]
    # if there's an intercept term add it to the pred names
    if(inter == 1)
    {
        predNames <- c("(Intercept)", predNames)
    }

    # build data.frame linking term to coefficient name
    matching <- data.frame(Term=predNames, Coefficient=coefNames)

    ## now match individual predictor to term
    # get matrix as data.frame
    factorMat <- as.data.frame(attr(theTerms, "factor"))
    # add column from rownames as identifier
    factorMat$.Pred <- rownames(factorMat)
    # melt it down for comparison
    factorMelt <- melt(factorMat, id.vars=".Pred", variable.name="Term")
    # only keep rows where the predictor appears in the term (the factor matrix codes this as 1 or 2)
    factorMelt <- factorMelt[factorMelt$value > 0, ]
    # again, bring in the intercept if needed
    if(inter == 1)
    {
        factorMelt <- rbind(data.frame(.Pred="(Intercept)", Term="(Intercept)", value=1), factorMelt)
    }
    # join into the matching data.frame
    matching <- join(matching, factorMelt, by="Term")

    return(matching)
}

# fit some models with different terms
mod1 <- lm(tip ~ total_bill * sex + day, tips)
mod2 <- lm(tip ~ total_bill * sex + day - 1, tips)
mod3 <- lm(tip ~ (total_bill + sex + day)^3, tips)
mod4 <- lm(tip ~ total_bill * sex + day + I(total_bill^2), tips)

matchCoefs(mod1)
matchCoefs(mod2)
matchCoefs(mod3)
matchCoefs(mod4)

# now with the convoluted baseball example
baseball <- data.frame(Bat=sample(1:100, 20, replace=TRUE),
                       Batter=sample(c("David", "Batley", "Bob", "Ace"), 20, replace=TRUE),
                       Hits=sample(1:20, 20, replace=TRUE))
bMod <- lm(Hits ~ Bat*Batter, baseball)
matchCoefs(bMod)

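No loops, all built-in functions, and no regular expressions. I need to test this more and bring in the data-type information, but that should be fairly simple.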
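
For anyone who would rather avoid the plyr and reshape2 dependencies, here is a rough base-R sketch of the same idea. `match_coefs_base` is a hypothetical name, the `iris` data is an assumed stand-in for `tips`, and it has only been thought through for ordinary `lm` fits with at least one term, so treat it as a starting point rather than a tested replacement.

```r
# Hypothetical helper: maps each coefficient to its term and to the variables in that term
match_coefs_base <- function(model) {
  tt          <- terms(model)
  term_labels <- attr(tt, "term.labels")
  fac         <- attr(tt, "factors")        # variables (rows) by terms (columns)
  mm          <- model.matrix(model)
  assign_idx  <- attr(mm, "assign")         # term index per column, 0 = intercept

  # Term label for each coefficient; index 0 maps to "(Intercept)"
  term_of <- c("(Intercept)", term_labels)[assign_idx + 1]

  # Variables that take part in each coefficient's term
  vars_of <- vapply(assign_idx, function(i) {
    if (i == 0) "(Intercept)"
    else paste(rownames(fac)[fac[, i] > 0], collapse = ",")
  }, character(1))

  data.frame(Coefficient = colnames(mm), Term = term_of,
             Variable = unname(vars_of), stringsAsFactors = FALSE)
}

fit <- lm(Sepal.Length ~ Sepal.Width * Species, data = iris)
res <- match_coefs_base(fit)
res
```

Unlike `matchCoefs`, this returns one row per coefficient with the variables comma-separated, which is closer to the table in the question.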

answered 2013-03-11T05:00:19.697