python - How do I use OneHotEncoder in scikit-learn to encode categorical data in conjunction with a DecisionTreeClassifier?

Question

Right now I have the following code that reads feature and label data from a CSV file and uses it to create and fit a DecisionTreeClassifier model.

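import csv
import joblib  # standalone package; sklearn.externals.joblib was removed from scikit-learn
from sklearn import tree

features = []
labels = []

with open('postsBase2.csv', newline='') as f:
    for row in csv.reader(f):
        # csv.reader yields strings; these three columns are assumed to be numeric
        features.append([float(row[2]), float(row[3]), float(row[6])])
        labels.append(row[8])

clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)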

I actually have a few other fields in the CSV that I would like to load, and they are categorical. They are at indexes 7 and 8 of each row. The value at index 7 can be one of 4 categories and the value at index 8 can be one of 5 categories.

I want to add these to my features and then pass them through the OneHotEncoder class somehow, to turn them into a form the model can be fitted with. The updated code, with some pseudocode for what I want to do, is below:

import csv
import joblib  # standalone package; sklearn.externals.joblib was removed from scikit-learn
from sklearn import tree

features = []
labels = []

with open('postsBase2.csv', newline='') as f:
    for row in csv.reader(f):
        # row[7] and row[8] are the categorical (string) fields
        features.append([float(row[2]), float(row[3]), float(row[6]), row[7], row[8]])
        labels.append(row[8])  # note: row[8] is also used as a feature above

# Here I now want to process the features from row indexes 7 and 8 via
# OneHotEncoder somehow to make them acceptable for the DecisionTreeClassifier

clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)

How can I do this?
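
One possible approach, sketched under assumptions rather than taken from the original code: the rows below are hypothetical stand-ins for the CSV contents (three numeric columns, then a 4-category and a 5-category column), and the label is assumed to come from a column separate from the features. The categorical columns go through scikit-learn's OneHotEncoder, and the resulting indicator columns are stacked next to the numeric ones before fitting the tree.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical sample data standing in for the CSV rows:
# three numeric values, one 4-category field, one 5-category field.
rows = [
    (12.0, 3.0, 0.5, "text", "mon"),
    (40.0, 1.0, 0.9, "image", "tue"),
    (7.0, 8.0, 0.1, "video", "wed"),
    (55.0, 2.0, 0.7, "link", "thu"),
    (23.0, 5.0, 0.3, "text", "fri"),
    (31.0, 4.0, 0.8, "image", "mon"),
]
# Hypothetical labels, assumed to live in their own column.
labels = ["low", "high", "low", "high", "low", "high"]

# Split each row into its numeric part and its categorical part.
numeric = np.array([[r[0], r[1], r[2]] for r in rows], dtype=float)
categorical = [[r[3], r[4]] for r in rows]

# One indicator column per category; unseen categories at predict time
# are encoded as all zeros instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(categorical).toarray()

# Numeric columns first, then the 4 + 5 = 9 indicator columns.
features = np.hstack([numeric, encoded])

clf = DecisionTreeClassifier(random_state=0)
clf.fit(features, labels)

# A new row must go through the same fitted encoder before prediction.
new_numeric = np.array([[45.0, 2.0, 0.85]])
new_encoded = encoder.transform([["image", "tue"]]).toarray()
prediction = clf.predict(np.hstack([new_numeric, new_encoded]))
```

The fitted encoder is reused for new rows so that the column order matches what the tree was trained on. If the model is saved to disk, the encoder would need to be saved alongside it for the same reason.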
