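This is a cross-post from stats.stackexchange, where I did not get a satisfactory answer. I have two datasets: the first describes schools, and the second lists the students at each school who failed a standardized test (the emphasis on "failed" is intentional). Fake datasets can be generated as follows (thanks to Tharen):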
#random school data for 30 schools
schools.num = 30
schools.data = data.frame(school_id=seq(1,schools.num)
,tot_white=sample(100:300,schools.num,TRUE)
,tot_black=sample(100:300,schools.num,TRUE)
,tot_asian=sample(100:300,schools.num,TRUE)
,school_rev=sample(4e6:6e6,schools.num,TRUE)
)
#total students in each school
schools.data$tot_students = schools.data$tot_white + schools.data$tot_black + schools.data$tot_asian
#sum of all students all schools
tot_students = sum(schools.data$tot_students)
#generate some random failing students
fail.num = as.integer(tot_students * 0.05)
students = data.frame(student_id=sample(seq_len(tot_students), fail.num, FALSE)
,school_id=sample(1:schools.num, fail.num, TRUE)
,race=sample(c('white', 'black', 'asian'), fail.num, TRUE)
)
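I am trying to estimate P(Fail=1 | Student Race, School Revenue). If I run a multinomial discrete choice model on the students dataset, I will clearly be estimating P(Race | Fail=1), so I need to invert that conditional. Since all the relevant pieces of information (P(Fail), P(Race), Revenue) are available across the two datasets, I see no reason why this cannot be done, but I am stumped as to how to actually implement it in R. Any pointers would be much appreciated. Thanks.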
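One direction that might avoid the inversion altogether, offered as a minimal sketch rather than a settled answer: because the schools table gives the total enrollment per race, the failure counts from the students table can be turned into "failed out of enrolled" counts for every (school, race) cell, and a grouped binomial logit then models P(Fail=1 | Race, Revenue) directly. The names below (`schools`, `failing`, `cells`, `fit`, `rev_millions`) are hypothetical, the fake data is regenerated with a seed so the block runs on its own, and the sketch assumes no cell has more failures than enrolled students.

```r
# Hypothetical sketch: grouped binomial logit for P(Fail = 1 | race, revenue).
set.seed(42)  # only so the fake data is reproducible

# Fake data with the same shape as in the question.
n_schools <- 30
schools <- data.frame(
  school_id  = seq_len(n_schools),
  tot_white  = sample(100:300, n_schools, TRUE),
  tot_black  = sample(100:300, n_schools, TRUE),
  tot_asian  = sample(100:300, n_schools, TRUE),
  school_rev = sample(4e6:6e6, n_schools, TRUE)
)
tot_all  <- sum(schools$tot_white, schools$tot_black, schools$tot_asian)
fail_num <- as.integer(tot_all * 0.05)
failing <- data.frame(
  student_id = sample(seq_len(tot_all), fail_num, FALSE),
  school_id  = sample(seq_len(n_schools), fail_num, TRUE),
  race       = sample(c("white", "black", "asian"), fail_num, TRUE),
  stringsAsFactors = FALSE
)

# Step 1: enrollment per (school, race) cell, in long format.
races <- c("white", "black", "asian")
cells <- do.call(rbind, lapply(races, function(r) {
  data.frame(
    school_id  = schools$school_id,
    race       = r,
    n_total    = schools[[paste0("tot_", r)]],
    school_rev = schools$school_rev,
    stringsAsFactors = FALSE
  )
}))

# Step 2: number of failing students per (school, race) cell.
failing$n_fail <- 1L
fail_counts <- aggregate(n_fail ~ school_id + race, data = failing, FUN = sum)

# Step 3: join; cells with no failing students get a count of zero.
cells <- merge(cells, fail_counts, by = c("school_id", "race"), all.x = TRUE)
cells$n_fail[is.na(cells$n_fail)] <- 0L
cells$n_pass <- cells$n_total - cells$n_fail
stopifnot(all(cells$n_pass >= 0))  # assumption: failures never exceed enrollment

# Step 4: grouped binomial logit; the response is (failures, non-failures).
cells$rev_millions <- cells$school_rev / 1e6  # rescale revenue for stability
fit <- glm(cbind(n_fail, n_pass) ~ race + rev_millions,
           family = binomial, data = cells)

# Step 5: predicted P(Fail = 1) for each race at a revenue of 5 million.
newdata <- data.frame(race = races, rev_millions = 5, stringsAsFactors = FALSE)
p_fail <- predict(fit, newdata = newdata, type = "response")
print(cbind(newdata, p_fail))
```

With the random fake data there is no real relationship, so the predicted probabilities should all sit near the overall 5% failure rate. Whether a plain logit is adequate for the real data (for example, whether school-level random effects are needed) is a separate modelling question that this sketch does not address.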