
Is there an as_formula specifier (like the formula interface in statsmodels) for sklearn.tree.DecisionTreeClassifier in Python, or some way to hack one in? Currently, I must use

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

but I would prefer to have something like

clf = clf.fit(formula='Y ~ X', data=df)

The reason is that I would like to specify more than one X without having to do a lot of array shaping. Thanks.


2 Answers


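Thanks for the information. Although there is currently no Patsy interface for sklearn, Patsy on its own easily provides the functionality I need. For example...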

from sklearn import tree
from patsy import dmatrix

red = [1,0,0,0,0,1,1,0,0,1,1,0]
green = [0,0,0,1,0,1,1,0,0,1,1,0]
blue = [0,0,1,1,0,0,0,1,0,0,0,0]

y = [0,0,0,0,0,1,1,0,0,1,1,0]

# patsy looks up red, green and blue in the calling scope; '+ 0' drops the intercept column
X = dmatrix('red + green + blue + 0')

dt_clf = tree.DecisionTreeClassifier()
dt_clf = dt_clf.fit(X, y)

pred_r = [1,1,0,0,1,1,0,0,0,0,0,0]
pred_g = [1,1,0,0,1,1,0,0,0,0,0,0]
pred_b = [0,0,1,1,0,0,0,1,0,0,0,0]

test = dmatrix('pred_r + pred_g + pred_b + 0')
dt_clf.predict(test) 

Perhaps even more convenient is using sklearn with pandas. Using the same data as above...

import pandas as pd

df = pd.DataFrame()
df['red'] = red
df['green'] = green
df['blue'] = blue
df['y'] = y

dt_clf = dt_clf.fit(df[['red','green','blue']], df['y'])
dt_clf.predict(test)
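The two approaches can also be combined to get close to the fit(formula=..., data=df) call from the question. The following is only a sketch: fit_formula is a hypothetical helper, not part of sklearn or patsy, and it assumes a numeric 0/1 response column. It uses patsy's dmatrices to build both sides of the formula from a DataFrame, and build_design_matrices to encode new data the same way.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices, build_design_matrices
from sklearn import tree

df = pd.DataFrame({
    "red":   [1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0],
    "green": [0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0],
    "blue":  [0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
    "y":     [0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0],
})


def fit_formula(clf, formula, data):
    """Hypothetical helper: fit an sklearn estimator from a patsy formula."""
    y_mat, X_mat = dmatrices(formula, data)
    # dmatrices returns float matrices; flatten y and cast back to int labels
    clf.fit(np.asarray(X_mat), np.asarray(y_mat).ravel().astype(int))
    # Keep the design info so new data can be encoded identically
    return clf, X_mat.design_info


clf, design_info = fit_formula(
    tree.DecisionTreeClassifier(random_state=0),
    "y ~ red + green + blue + 0",
    df,
)

# New observations only need the right-hand-side columns
new_data = pd.DataFrame({"red": [1, 0, 0], "green": [1, 0, 0], "blue": [0, 1, 0]})
(X_new,) = build_design_matrices([design_info], new_data)
pred = clf.predict(np.asarray(X_new))
```

Because the design info is reused at prediction time, the column order of new_data does not matter here; patsy looks the columns up by name.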

Hope this helps anyone who is in the same situation I was.

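Note: be very careful that the order of the X columns stays the same. For example, do not train on df[['red','green','blue']] and then predict on df[['blue','green','red']]. It may seem obvious, but it is an easy way to mess things up.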
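One way to guard against this, sketched below under the assumption that the feature names are known up front, is to keep a single list of column names (feature_cols is just an illustrative name) and index every frame with it before calling fit or predict.

```python
import pandas as pd
from sklearn import tree

# Single source of truth for the feature order
feature_cols = ["red", "green", "blue"]

train = pd.DataFrame({
    "red":   [1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0],
    "green": [0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0],
    "blue":  [0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
    "y":     [0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0],
})

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(train[feature_cols], train["y"])

# A frame whose columns happen to arrive in a different order
shuffled = pd.DataFrame({"blue": [0, 1], "green": [1, 0], "red": [1, 0]})

# Selecting by feature_cols restores the training order before predicting
pred = clf.predict(shuffled[feature_cols])
```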

Answered 2015-08-08T21:57:55.463

This is currently not possible, but a patsy interface for scikit-learn would be great. I don't think anyone is working on it at the moment, though.

Answered 2015-08-08T15:29:24.120