from patsy import dmatrix
from pandas import DataFrame

# "carbs" is categorical, "score" is numeric
dta = DataFrame([["lo", 1], ["hi", 2.4], ["lo", 1.2], ["lo", 1.4], ["very_high", 1.8]],
                columns=["carbs", "score"])
dmatrix("carbs + score", dta)
DesignMatrix with shape (5, 4)
  Intercept  carbs[T.lo]  carbs[T.very_high]  score
          1            1                   0    1.0
          1            0                   0    2.4
          1            1                   0    1.2
          1            1                   0    1.4
          1            0                   1    1.8
  Terms:
    'Intercept' (column 0), 'carbs' (columns 1:3), 'score' (column 3)
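Question: Instead of specifying the column "names" by hand via DesignInfo (which makes my code less reusable), can I read the names that this DesignMatrix already reports, so that I can feed them into a DataFrame later without knowing beforehand what the "reference level / control group" was?
I.e., when I do dmatrix("C(carbs, Treatment(reference='lo')) + score", dta):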
# How can I get something like this from dmatrix's output, without hardcoding?
# names = <obtained from dmatrix's output above>
# This should give names = ['Intercept', 'carbs[T.lo]', 'carbs[T.very_high]', 'score']
g=DataFrame(dmatrix("carbs + score", dta),columns=names)
Intercept carbs[T.lo] carbs[T.very_high] score
0 1 2 3
0 1 1 0 1.0
1 1 0 0 2.4
2 1 1 0 1.2
3 1 1 0 1.4
4 1 0 1 1.8
type(g)=<class 'pandas.core.frame.DataFrame'>
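So g would be the transformed DataFrame, on which I can do logistic modelling without having to keep a note of (or hardcode) the column names and their reference levels.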
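For what it's worth, here is a minimal sketch of what the question is asking for. It assumes the matrix returned by `dmatrix` carries its labels in `design_info.column_names`, and that `dmatrix` accepts `return_type="dataframe"`; both are part of patsy as far as I know. The helper name `design_frame` is made up for illustration and is not a patsy function.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

dta = pd.DataFrame(
    [["lo", 1], ["hi", 2.4], ["lo", 1.2], ["lo", 1.4], ["very_high", 1.8]],
    columns=["carbs", "score"],
)


def design_frame(formula, data):
    """Hypothetical helper: build a DataFrame whose column labels are
    read from the design matrix itself instead of being hardcoded."""
    dm = dmatrix(formula, data)
    names = dm.design_info.column_names  # labels generated by patsy
    return pd.DataFrame(np.asarray(dm), columns=names)


# Default treatment coding: patsy picks the reference level on its own.
g = design_frame("carbs + score", dta)
print(list(g.columns))

# Explicit reference level: the labels change, but nothing is hardcoded.
g_lo = design_frame("C(carbs, Treatment(reference='lo')) + score", dta)
print(list(g_lo.columns))

# Alternative: ask patsy for a labelled DataFrame directly.
g_df = dmatrix("carbs + score", dta, return_type="dataframe")
print(type(g_df))
```

Because the labels always come from the design matrix, the same code should keep working when the formula or the reference level changes. The `return_type="dataframe"` route is shorter if a DataFrame is all that is needed; the `design_info` route is useful when the plain matrix is wanted as well.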