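I want to create a new variable that takes the value of one of two other variables, depending on the values of further variables. Here is a toy example with fake data.
Each row of the data frame represents a student. Each student can study up to two subjects (subj1 and subj2) and can pursue either a degree ("BA") or a minor ("MN") in each subject. My real data includes thousands of students, several types of degree, about 50 subjects, and students can have up to five majors/minors.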
  ID subj1 degree1 subj2 degree2
1  1   BUS      BA  <NA>    <NA>
2  2   SCI      BA   ENG      BA
3  3   BUS      MN   ENG      BA
4  4   SCI      MN   BUS      BA
5  5   ENG      BA   BUS      MN
6  6   SCI      MN  <NA>    <NA>
7  7   ENG      MN   SCI      BA
8  8   BUS      BA   ENG      MN
...
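Now I want to create a sixth variable, df$major, that equals the value of subj1 if subj1 is the student's primary major, or the value of subj2 if subj2 is the primary major. The primary major is the first subject whose degree equals "BA". I tried the following code: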
df$major[df$degree1 == "BA"] = df$subj1
df$major[df$degree1 != "BA" & df$degree2 == "BA"] = df$subj2
Unfortunately, I get an error message:
> df$major[df$degree1 == "BA"] = df$subj1
Error in df$major[df$degree1 == "BA"] = df$subj1 :
NAs are not allowed in subscripted assignments
I assume this means that a vectorized assignment can't be used if the assignment evaluates to NA for at least one row.
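To make that guess concrete, here is a minimal sketch on a hypothetical three-row data frame (`toy` is made up for illustration and is not my real data). It suggests the NA comes from the logical index itself: comparing a missing degree1 with "BA" gives NA, and R refuses to assign several values through an index that contains NA.

```r
# Hypothetical stand-in for df; row 2 has a missing degree1
toy <- data.frame(
  subj1   = c("SCI", NA, "BUS"),
  degree1 = c("BA", NA, "BA"),
  stringsAsFactors = FALSE
)

# Comparing NA with "BA" yields NA, so the index is TRUE NA TRUE
idx <- toy$degree1 == "BA"
idx

# Assigning more than one value through that index fails;
# tryCatch captures the message instead of stopping the script
res <- tryCatch({
  toy$major[idx] <- toy$subj1
  "no error"
}, error = function(e) conditionMessage(e))
res
```

Separately, the right-hand side here is the whole subj1 column rather than only the selected rows, so the lengths would not line up even without the NA.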
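I feel like I must be missing something basic here, but the code above seemed like the obvious thing to do, and I haven't been able to come up with an alternative.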
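For reference, the rule I'm after (major = the first subject whose degree is "BA") might be written roughly as follows. This is only a sketch on a hypothetical data frame `toy` with character columns, not a confirmed fix: it relies on `%in%` returning FALSE rather than NA for missing values, and factor columns like the ones in my real data would presumably need `as.character()` first.

```r
# Hypothetical three-row stand-in for df (character columns, not factors)
toy <- data.frame(
  subj1   = c("BUS", "SCI", "SCI"),
  degree1 = c("BA",  "MN",  "MN"),
  subj2   = c(NA,    "BUS", NA),
  degree2 = c(NA,    "BA",  NA),
  stringsAsFactors = FALSE
)

# %in% gives FALSE (not NA) when the left-hand value is missing
first_is_ba  <- toy$degree1 %in% "BA"
second_is_ba <- toy$degree2 %in% "BA"

# subj1 if degree1 is BA, otherwise subj2 if degree2 is BA, otherwise NA
toy$major <- ifelse(first_is_ba, toy$subj1,
                    ifelse(second_is_ba, toy$subj2, NA_character_))
toy$major
```

With up to five subject/degree pairs, nesting `ifelse()` like this would get unwieldy, so it probably doesn't scale to my real data as written.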
In case it helps with writing an answer, here is sample data created with dput(), in the same format as the fake data listed above:
structure(list(ID = 1:20, subj1 = structure(c(3L, NA, 1L, 2L,
2L, 3L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 3L, 3L, 1L, 2L, 1L
), .Label = c("BUS", "ENG", "SCI"), class = "factor"), degree1 = structure(c(2L,
NA, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("BA", "MN"), class = "factor"), subj2 = structure(c(1L,
2L, NA, NA, 1L, NA, 3L, 2L, NA, 2L, 2L, 1L, 3L, NA, 2L, 1L, 1L,
NA, 2L, 2L), .Label = c("BUS", "ENG", "SCI"), class = "factor"),
degree2 = structure(c(2L, 2L, NA, NA, 2L, NA, 1L, 2L, NA,
2L, 1L, 1L, 2L, NA, 1L, 2L, 2L, NA, 1L, 2L), .Label = c("BA",
"MN"), class = "factor")), .Names = c("ID", "subj1", "degree1",
"subj2", "degree2"), row.names = c(NA, -20L), class = "data.frame")