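I am trying to write a script that takes a JSON file (pizza-train.json) from this Kaggle competition. I want to extract the request_text field from each dictionary in the list and build a bag-of-words representation of each string (string to list of counts).
The next step is to train a logistic regression classifier to predict the variable "requester_received_pizza". I want to train on 90% of the data and predict on the remaining 10%. The problem is that I don't know how to predict on that 10%. Any suggestions would be very helpful!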
import json
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Load the list of request dictionaries
with open('pizza-train.json') as f:
    f_json = json.load(f)

request_text = []
y = []
for item in f_json[:100]:
    request_text.append(item['request_text'])
    y.append(item['requester_received_pizza'])

# Build the bag-of-words count matrix
vectorizer = CountVectorizer(min_df=1, lowercase=True, stop_words='english')
train_data_features = vectorizer.fit_transform(request_text)
train_data_features = train_data_features.toarray()

print('Shape = ')
print(train_data_features.shape)

vocab = vectorizer.get_feature_names_out()
print('\n')
print('Vocab = ')
print(vocab)

# Hold out 10% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(train_data_features, y, test_size=0.10)
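
One possible way to handle the missing step, as a minimal sketch: assuming scikit-learn's LogisticRegression is the classifier meant here, fit it on x_train and y_train, then call predict on x_test. The ten sample requests and labels below are hypothetical placeholders for the lists built from pizza-train.json, and random_state=0 is an added assumption used only to make the split reproducible.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for request_text and y from pizza-train.json
request_text = [
    "lost my job and could really use a pizza tonight",
    "just want a free pizza for fun",
    "student with no money until payday, hungry family",
    "bored and craving pepperoni",
    "single parent, kids have not eaten since breakfast",
    "pizza sounds good right now",
    "rent took everything this month, will pay it forward",
    "give me pizza please",
    "car broke down and the repair bill emptied my account",
    "testing whether this subreddit really works",
]
y = [True, False, True, False, True, False, True, False, True, False]

# Bag-of-words counts, as in the question
vectorizer = CountVectorizer(lowercase=True, stop_words='english')
features = vectorizer.fit_transform(request_text)

# 90% of the rows for training, 10% held out
x_train, x_test, y_train, y_test = train_test_split(
    features, y, test_size=0.10, random_state=0)

# Fit on the training rows only
clf = LogisticRegression()
clf.fit(x_train, y_train)

# Predict the held-out 10% and compare against the true labels
predictions = clf.predict(x_test)
accuracy = clf.score(x_test, y_test)
print(predictions)
print(accuracy)
```

With the real data, the same fit, predict, and score calls would apply to the x_train, x_test, y_train, and y_test arrays produced above. clf.predict_proba(x_test) returns class probabilities if those are more useful than hard labels.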