r - Combining factor levels in a data frame column

Question

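I have a data frame data with a column named "Project License". It represents a categorical variable and is therefore, in R terms, a factor. I am trying to create a new column in which the open source software licenses are combined into larger categories according to my classification. However, when I try to combine (merge) the levels of that factor, I end up with a column in which all levels are either missing or unchanged, or I get an error message such as the following: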

Error in factor(data[["Project License"]], levels = classification, labels = c("Highly Restrictive",  : invalid 'labels'; length 4 should be 1 or 6

Here is the code I use for this (extracted from a function):

myLevels <- c('gpl', 'lgpl', 'bsd',
              'other', 'artistic', 'public')
myLabels <- c('GPL', 'LGPL', 'BSD',
              'Other', 'Artistic', 'Public')

licenses <- factor(data[["Project License"]],
                   levels = myLevels, labels = myLabels)

data[["Project License"]] <- licenses

classification <- c(highly = c('gpl'),
                    restrictive = c('lgpl', 'public'),
                    permissive = c('bsd', 'artistic'),
                    unknown = c('other'))

restrictiveness <- 
  factor(data[["Project License"]],
         levels = classification,
         labels = c('Highly Restrictive', 'Restrictive',
                    'Permissive', 'Unknown'))

data[["License Restrictiveness"]] <- restrictiveness

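I have also tried a few other approaches (including the one described in Section 8.2.5 of "R Inferno"), but without success so far.

What am I doing wrong, and how can I fix it? Thanks!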

Update (data):

> head(data, n=20)
   Project ID Project License
1       45556            lgpl
2       41636             bsd
3       95627             gpl
4       66930             gpl
5       51103             gpl
6       65637             gpl
7       41834             gpl
8       70998             gpl
9       95064             gpl
10      48810            lgpl
11      95934             gpl
12      90909             gpl
13       6538         website
14      16439             gpl
15      41924             gpl
16      78987             gpl
17      58662            zlib
18       1904             bsd
19      93838          public
20      90047            lgpl

> str(data)
'data.frame':   45033 obs. of  2 variables:
 $ Project ID     : chr  "45556" "41636" "95627" "66930" ...
 $ Project License: chr  "lgpl" "bsd" "gpl" "gpl" ...
 - attr(*, "SQL")=Class 'base64'  chr "ClNFTEVDVCBncm91cF9pZCwgbGljZW5zZQpGUk9NIHNmMDMxNC5ncm91cHMKV0hFUkUgZ3JvdXBfaWQgPCAxMDAwMDA="
 - attr(*, "indicatorName")=Class 'base64'  chr "cHJqTGljZW5zZQ=="
 - attr(*, "resultNames")=Class 'base64'  chr "UHJvamVjdCBJRCwgUHJvamVjdCBMaWNlbnNl"

Update 2 (data):

> unique(data[["Project License"]])
 [1] "lgpl"       "bsd"        "gpl"        "website"    "zlib"
 [6] "public"     "other"      "ibmcpl"     "rpl"        "mpl11"
[11] "mit"        "afl"        "python"     "mpl"        "apache"
[16] "osl"        "w3c"        "iosl"       "artistic"   "apsl"
[21] "ibm"        "plan9"      "php"        "qpl"        "psfl"
[26] "ncsa"       "rscpl"      "sunpublic"  "zope"       "eiffel"
[31] "nethack"    "sissl"      "none"       "opengroup"  "sleepycat"
[36] "nokia"      "attribut"   "xnet"       "eiffel2"    "wxwindows"
[41] "motosoto"   "vovida"     "jabber"     "cvw"        "historical"
[46] "nausite"    "real"

score 3 · Accepted Answer

The problem is that the number of labels in the call to factor is neither equal to the number of levels (6) nor 1.

From ?factor:

labels  
  either an optional character vector of labels for the levels (in the same order as
  levels after removing those in exclude), or a character string of length 1.

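You need to make these agree. c() flattens the nested vectors in classification into a single character vector of length 6, so factor sees six levels, and the names in classification do not tell factor to combine labels.

For example: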

factor(..., levels=classification, labels=c('Highly Restrictive',
                                            'Restrictive.1',
                                            'Restrictive.2',
                                            'Permissive.1',
                                            'Permissive.2',
                                            'Unknown'))
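To see where the "1 or 6" in the error message comes from, it may help to inspect the questioner's vector directly. The sketch below also shows one possible fix that stays with factor(): supplying six labels with repeats. It assumes R 3.4.0 or later, where duplicated labels map several levels onto one. The sample vector x is hypothetical.

```r
# The questioner's vector, exactly as defined in the question.
classification <- c(highly = c('gpl'),
                    restrictive = c('lgpl', 'public'),
                    permissive = c('bsd', 'artistic'),
                    unknown = c('other'))

# c() flattens everything into one vector; the names are only decoration.
n_levels  <- length(classification)   # 6, not 4
flat_names <- names(classification)
print(flat_names)

# Hypothetical sample values ('website' is not in the classification).
x <- c('lgpl', 'bsd', 'gpl', 'website')

# Six labels, one per level; repeated labels merge the matching levels
# (assumes R >= 3.4.0).
restrictiveness <- factor(x,
                          levels = classification,
                          labels = c('Highly Restrictive',
                                     'Restrictive', 'Restrictive',
                                     'Permissive', 'Permissive',
                                     'Unknown'))
print(restrictiveness)
```

Values that are not listed in levels, such as 'website' here, become NA.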

To map a factor onto another factor with fewer levels, you can index a vector by name. Flip the classification vector into a lookup:

 classification <- c(gpl='Highly Restrictive',
                     lgpl='Restrictive', 
                     public='Restrictive',
                     bsd='Permissive',
                     artistic='Permissive',
                     other='Unknown')

To use it as a lookup table:

data[["License Restrictiveness"]] <- 
    as.factor(classification[as.character(data[['Project License']])])

head(data)
##   Project ID Project License License Restrictiveness
## 1      45556            lgpl             Restrictive
## 2      41636             bsd              Permissive
## 3      95627             gpl      Highly Restrictive
## 4      66930             gpl      Highly Restrictive
## 5      51103             gpl      Highly Restrictive
## 6      65637             gpl      Highly Restrictive
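The full data contains many licenses (website, zlib, mit, ...) that the lookup does not cover, and those rows come out as NA. Below is a minimal, self-contained sketch of two ways to handle this. The sample rows are a hypothetical subset of the data in the question, and treating unmapped licenses as "Unknown" is an assumption, not something the question specifies. The second option uses the list form of levels<-, which takes new level names with the old levels they absorb, close to what the questioner's original classification was trying to express.

```r
# Hypothetical sample rows taken from the data shown in the question.
data <- data.frame(
  "Project ID"      = c("45556", "41636", "95627", "6538", "58662"),
  "Project License" = c("lgpl", "bsd", "gpl", "website", "zlib"),
  check.names = FALSE, stringsAsFactors = FALSE
)

classification <- c(gpl = "Highly Restrictive",
                    lgpl = "Restrictive", public = "Restrictive",
                    bsd = "Permissive", artistic = "Permissive",
                    other = "Unknown")
restrict_levels <- c("Highly Restrictive", "Restrictive",
                     "Permissive", "Unknown")

# Option 1: named-vector lookup, then fill the gaps.
mapped <- unname(classification[as.character(data[["Project License"]])])
mapped[is.na(mapped)] <- "Unknown"   # assumption: unmapped means Unknown
data[["License Restrictiveness"]] <- factor(mapped, levels = restrict_levels)

# Option 2: assign a list to levels(); names are the new levels,
# values are the old levels to merge. Unlisted old levels become NA.
grouped <- factor(data[["Project License"]])
levels(grouped) <- list("Highly Restrictive" = "gpl",
                        "Restrictive" = c("lgpl", "public"),
                        "Permissive"  = c("bsd", "artistic"),
                        "Unknown"     = "other")

print(data)
print(grouped)
```

Passing levels explicitly in Option 1 fixes the level order instead of leaving it to alphabetical sorting.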

score 1 · Accepted Answer

Your task may be easier if you convert to character first, for example (untested):

license.map <- c(lgpl="Permissive", bsd="Permissive", 
                 gpl="Restrictive", website="Unknown") # etc.
# as.character() makes the lookup go by name even if the column is a factor
dat <- transform(dat, LicenseType=license.map[as.character(Project.License)])

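Since stringsAsFactors was TRUE by default before R 4.0.0, the new column is a factor there. From R 4.0.0 on, the default is FALSE, so the column stays character unless you wrap the lookup in factor().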
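The reason for converting to character first can be checked with a small sketch: indexing a named vector with a factor uses the factor's integer codes, not its labels, so the lookup silently returns the wrong entries. The data frame dat below is a hypothetical stand-in, and the map is the answer's illustrative one with the spelling of "Unknown" corrected.

```r
license.map <- c(lgpl = "Permissive", bsd = "Permissive",
                 gpl = "Restrictive", website = "Unknown")

# Hypothetical stand-in for the real data, with the column stored as a factor.
dat <- data.frame(Project.License = factor(c("lgpl", "bsd", "gpl", "website")))

# Indexing by the factor itself uses its integer codes (bsd=1, gpl=2, ...),
# so positions in license.map are picked instead of names.
by_code <- unname(license.map[dat$Project.License])

# Converting to character first makes the lookup go by name.
by_name <- unname(license.map[as.character(dat$Project.License)])

# Wrap in factor() explicitly rather than relying on stringsAsFactors.
dat$LicenseType <- factor(by_name)

print(by_code)
print(by_name)
print(dat)
```

Here by_code assigns "Restrictive" to lgpl and "Permissive" to gpl, the reverse of what the map says, without any warning.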
