我正在收集调查数据(使用开放数据工具包),我的现场团队,祝福他们的心,有时会在人名的拼写上有点创意。所以我有一个“正确”的受访者姓名,以及一些与“家庭成员姓名”变量相关联的记录的年龄变量。有许多不同年龄的家庭成员。我想要受访者的年龄。
这是一些说明我的问题的假数据:
#the respondent
r = data.frame(name = c("Barack Obama", "George Bush", "Hillary Clinton"))
#a male member
m = data.frame(name = c("Barack Obama","George", "Wulliam Clenton"), age = c(55,59,70)); m$name=as.character(m$name)
#a female member
f = data.frame(name = c("Michelle O","Laura Busch", "Hillary Rodham Clinton"), age = c(54,58,69)); f$name=as.character(f$name)
#if the responsent is the the given member, record their age. if not, NA
a = cbind(
ifelse(r$name==m$name,m$age,NA)
,ifelse(r$name==f$name,f$age,NA)
)
#make a function for plyr that gives me the age of the matched respondent
f = function(row){
d = row[is.na(row)==0]
ifelse(length(d)==0,NA,d)
}
require(plyr)
b = aaply(a,.margins=1,.fun=f)
data.frame(names=r$name,age=b)
names age
1 Barack Obama 55
2 George Bush NA
3 Hillary Clinton NA
what.I.would.like = data.frame(names=c("Barack Obama", "George Bush", "Hillary Clinton"),age = c(55,59,70))
1> what.I.would.like
names age
1 Barack Obama 55
2 George Bush 59
3 Hillary Clinton 70
在我的真实数据中,我有数百人和多达 13 个家庭成员。从那以后,我将调查更改为分别记录受访者的年龄,但我有一堆数据需要清理。