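I am using scikit-learn to train a model that predicts gender from a name. The model runs successfully with roughly 80% accuracy. I am now trying to use the LIME module to visualize the features and weights the model uses to make its predictions, but I get the error "string index out of range". Reproducible code follows: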
#Prepare data
import numpy as np
import nltk
from nltk.corpus import names
import random
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
[(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)
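The following function extracts the required features from a single name. This is the function that raises the error, specifically on its third line, where the "y" variable is created. Whatever explain_instance passes to letter_extractor as "x" does not appear to be a string.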
import re

def letter_extractor(x):
    feats = []
    # Prefixes and suffixes of length 1 to 3, joined by spaces
    y = x[0] + " " + x[0:2] + " " + x[0:3] + " " + x[-3:] + " " + x[-2:] + " " + x[-1]
    nmlen = str(len(x))
    feats.append(y + ' ' + nmlen)
    # Name ends in vowel, two consonants, vowel
    if re.search(r"[aeiou][^aeiou]{2}[aeiou]\b", x):
        feats.append("VccV")
    # Name ends in two vowels
    if re.search(r"[aeiou]{2}\b", x):
        feats.append("VV")
    if x[-1] in ['a','e','i']:
        feats.append("_fem_")
    if x[-1] == 'o':
        feats.append("_masc_")
    concat_feats = " ".join(i for i in feats)
    return concat_feats
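For reference, the "y" expression raises exactly this error when it is given an empty string, because `x[0]` and `x[-1]` index into the string while the slices do not. The sketch below isolates that one line in a helper named `prefix_suffix_features` (a hypothetical name used only for this illustration). Whether LIME ever passes an empty string to the pipeline is an assumption I have not verified.

```python
def prefix_suffix_features(x):
    # Same expression as the "y" line in letter_extractor
    return x[0] + " " + x[0:2] + " " + x[0:3] + " " + x[-3:] + " " + x[-2:] + " " + x[-1]

# A non-empty name works as intended
anna_features = prefix_suffix_features("Anna")
print(anna_features)

# An empty string fails on x[0]; slices such as x[0:2] would simply return ""
try:
    prefix_suffix_features("")
    error_message = None
except IndexError as err:
    error_message = str(err)
print(error_message)
```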
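The function below loops the function above over a list of names, but it also accepts a single string/name; the extracted features are then passed through the pipeline.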
def text_vec(data):
    ready = []
    if type(data) == str:
        b = letter_extractor(data)
        ready.append(b)
    elif type(data[0]) == str:
        for i in data:
            y = letter_extractor(i)
            ready.append(y)
    return ready
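To make the dispatch in text_vec concrete, here is a small sketch of how it treats different inputs. It swaps letter_extractor for a hypothetical placeholder, `stand_in_extractor`, so the block runs on its own; the `extractor` parameter is also added only for this illustration and is not part of my real function.

```python
def stand_in_extractor(name):
    # Hypothetical placeholder for letter_extractor: just upper-cases the name
    return name.upper()

def text_vec(data, extractor=stand_in_extractor):
    ready = []
    if type(data) == str:
        # A single name becomes a one-item list
        ready.append(extractor(data))
    elif type(data[0]) == str:
        # A list of names is processed item by item
        for item in data:
            ready.append(extractor(item))
    return ready

single = text_vec("Anna")
several = text_vec(["Anna", "Bob"])
# Input whose first element is not exactly of type str matches neither branch
other = text_vec([1, 2])
print(single, several, other)
```

Note that input matching neither branch silently yields an empty list instead of raising an error.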
Next, I create the train/test split:
from sklearn.model_selection import train_test_split
from collections import Counter
train, test = train_test_split(labeled_names, test_size = 0.33, random_state=42)
X_train = [i[0] for i in train]
X_test = [i[0] for i in test]
y_train = [i[1] for i in train]
y_test = [i[1] for i in test]
Create the tf-idf representation of the features:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
feat_ext = text_vec(X_train) #applying feature extraction to list of names
tfidf_vec = TfidfVectorizer(min_df = 10, token_pattern = r'[a-zA-Z]+', lowercase = False)
tfidf_feats_train = tfidf_vec.fit_transform(feat_ext) #create tfidf of features
chk_mod = text_vec(X_test)
tfidf_feats_test = tfidf_vec.transform(chk_mod)
Prepare the pipeline:
import lime
from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import Pipeline
from sklearn import svm
from sklearn.preprocessing import FunctionTransformer
text_vec_trfmr = FunctionTransformer(text_vec)
pipeline = Pipeline([
    ("text_vectorizer", text_vec_trfmr),
    ("feature_vectorizer", tfidf_vec),
    ("classifier", svm.SVC(C=4, kernel='linear', probability=True))])
pipeline.fit(X_train,y_train)
View the statistics for a single random example:
import textwrap
names_test = X_test
gender_test = y_test
idx = 63
text_sample = X_test[idx]
class_names = ['female', 'male']
print('Review ID-{}:'.format(idx))
print('-'*50)
print('Review Text:\n', textwrap.fill(text_sample,400))
print('-'*50)
print('Probability(male) =', pipeline.predict_proba([text_sample])[0,1])
print('Probability(female) =', pipeline.predict_proba([text_sample])[0,0])
print('Predicted class: %s' % pipeline.predict([text_sample]))
print('True class: %s' % y_test[idx])
Finally, the problem occurs when visualizing the single example ("text_sample", created above):
import matplotlib
matplotlib.rcParams['figure.dpi']=300
%matplotlib inline
explainer = LimeTextExplainer(class_names=class_names)
# The error occurs here, but traces all the way back to the letter_extractor function created earlier
explanation = explainer.explain_instance(text_sample,
                                         pipeline.predict_proba,
                                         num_features=10)
explanation.show_in_notebook(text=True)
The pipeline itself works fine and handles both a single string and a list without any problems:
print(pipeline.predict_proba(text_sample))
print(pipeline.predict(text_sample))
print(pipeline.predict_proba(X_train[0:4]))
print(pipeline.predict(X_train[0:4]))
I am just confused about why the pipeline can handle a single string, yet explain_instance throws an error.
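One way I could check what explain_instance actually hands to the pipeline is to wrap the prediction function and record each call. The sketch below is only a diagnostic idea: `recording_wrapper` and `dummy_predict_proba` are hypothetical names, and the dummy stands in for pipeline.predict_proba so the block runs on its own.

```python
def recording_wrapper(predict_fn, log):
    """Wrap predict_fn so that each call records the input type and a short preview."""
    def wrapped(texts):
        if isinstance(texts, str):
            preview = [repr(texts)]
        else:
            preview = [repr(t) for t in list(texts)[:5]]
        log.append((type(texts).__name__, preview))
        return predict_fn(texts)
    return wrapped

def dummy_predict_proba(texts):
    # Hypothetical stand-in for pipeline.predict_proba
    return [[0.5, 0.5] for _ in texts]

calls = []
wrapped_fn = recording_wrapper(dummy_predict_proba, calls)
result = wrapped_fn(["Anna", ""])
print(calls)

# Intended use with the real objects from the question (not run here):
# explainer.explain_instance(text_sample,
#                            recording_wrapper(pipeline.predict_proba, calls),
#                            num_features=10)
```

Inspecting `calls` after the failure should show whether the input is a list, and whether any of its items are empty or otherwise unexpected.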