
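I'm using scikit-learn to train a model that predicts gender from a name. The model trains and runs successfully, with an accuracy of about 80%. I'm now trying to use the LIME module to visualize the features and weights the model uses to make its predictions. However, I get the error "string index out of range". Reproducible code follows: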

#Prepare data 
import numpy as np
import nltk
from nltk.corpus import names
import random


labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

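The following function extracts the required features from a single name. This is the function that raises the error, specifically on its third line, where the variable `y` is created. Whatever the explain_instance function passes to letter_extractor as `x`, it does not seem to be a string.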

import re  # needed for the re.search calls below

def letter_extractor(x):
    feats = []
    y = x[0] + " " + x[0:2] + " " + x[0:3] + " " + x[-3:] + " " + x[-2:] + " " + x[-1]
    nmlen = str(len(x))
    # Prefixes, suffixes and name length as one space-separated string
    feats.append(y + ' ' + nmlen)
    # Vowel, two non-vowels, vowel at the end of the name
    if re.search(r"[aeiou][^aeiou]{2}[aeiou]\b", x):
        feats.append("VccV")
    # Two vowels at the end of the name
    if re.search(r"[aeiou]{2}\b", x):
        feats.append("VV")
    if x[-1] in ['a','e','i']:
        feats.append("_fem_")
    if x[-1] == 'o':
        feats.append("_masc_")
    concat_feats = " ".join(feats)
    return concat_feats
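
As a quick check that is independent of LIME, the sketch below shows which kind of string input makes integer indexing such as `x[0]` fail with exactly this message. The helper names `first_char` and `indexing_error` are hypothetical and exist only for illustration. The sketch does not establish what explain_instance actually passes in; it only shows that an empty string is still a `str` and still triggers the error, while the slices on the same line do not.

```python
def first_char(x):
    # Same kind of integer indexing as x[0] in letter_extractor.
    return x[0]

def indexing_error(x):
    """Return the message of the IndexError raised by x[0], or None if it succeeds."""
    try:
        first_char(x)
        return None
    except IndexError as exc:
        return str(exc)

# Slices never raise on short strings; they simply come back shorter or empty.
print(repr(""[0:2]), repr(""[-3:]))

print(indexing_error("Anna"))  # indexing works on a non-empty name
print(indexing_error(""))      # the empty string raises IndexError
```

So the error message alone does not rule out that `x` is a string; it may be a string with no characters in it.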

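The function below loops the function above over a list of names, but it also accepts a single string/name and extracts its features, so that the result can then be passed through the pipeline.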

def text_vec(data):
    ready = []
    if type(data) == str:
        b = letter_extractor(data)
        ready.append(b)
    elif type(data[0]) == str:
        for i in data:
            y = letter_extractor(i)
            ready.append(y)
    return ready

Next, I create the train/test split:

from sklearn.model_selection import train_test_split
from collections import Counter

train, test = train_test_split(labeled_names, test_size = 0.33, random_state=42)

X_train = [i[0] for i in train]
X_test = [i[0] for i in test]
y_train = [i[1] for i in train]
y_test = [i[1] for i in test]

Create the tf-idf of the features:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

feat_ext = text_vec(X_train) #applying feature extraction to list of names
tfidf_vec = TfidfVectorizer(min_df = 10, token_pattern = r'[a-zA-Z]+', lowercase = False)
tfidf_feats_train = tfidf_vec.fit_transform(feat_ext) #create tfidf of features

chk_mod = text_vec(X_test)
tfidf_feats_test = tfidf_vec.transform(chk_mod)
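
To see which tokens survive the `token_pattern` used above, here is a minimal sketch. It assumes the vectorizer's tokenizer behaves like `re.findall` with that pattern, and the feature string is a hand-written example of what letter_extractor would return for the name "Alicia"; neither is taken from actual output.

```python
import re

# Same pattern as the token_pattern argument passed to TfidfVectorizer above.
token_pattern = r'[a-zA-Z]+'

# Hand-written example of a letter_extractor result for "Alicia" (an assumption).
feature_string = "A Al Ali cia ia a 6 VV _fem_"

tokens = re.findall(token_pattern, feature_string)
print(tokens)
```

Under that assumption, the name length (`6`) is dropped because it contains no letters, and `_fem_` is reduced to `fem` because underscores are not matched.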

Prepare the pipeline:

import lime
from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import Pipeline
from sklearn import svm
from sklearn.preprocessing import FunctionTransformer
text_vec_trfmr = FunctionTransformer(text_vec)
pipeline = Pipeline([
    ("text_vectorizer", text_vec_trfmr),
    ("feature_vectorizer", tfidf_vec),
    ("classifier", svm.SVC(C=4, kernel='linear',probability=True))])
pipeline.fit(X_train,y_train)

View the stats for a single random example:

import textwrap
names_test = X_test
gender_test = y_test

idx = 63
text_sample = X_test[idx]
class_names = ['female', 'male']

print('Review ID-{}:'.format(idx))
print('-'*50)
print('Review Text:\n', textwrap.fill(text_sample,400))
print('-'*50)
print('Probability(male) =', pipeline.predict_proba([text_sample])[0,1])
print('Probability(female) =', pipeline.predict_proba([text_sample])[0,0])
print('Predicted class: %s' % pipeline.predict([text_sample]))
print('True class: %s' % y_test[idx])

Finally, the problem occurs when visualizing a single example (the `text_sample` created above):

import matplotlib
matplotlib.rcParams['figure.dpi']=300
%matplotlib inline

explainer = LimeTextExplainer(class_names=class_names)

# The error occurs here, but it traces all the way back to the letter_extractor function created earlier
explanation = explainer.explain_instance(text_sample,       
                                         pipeline.predict_proba,
                                         num_features=10)
explanation.show_in_notebook(text=True) 

The pipeline itself works fine and handles both a single string and a list without any problem:

print(pipeline.predict_proba(text_sample))
print(pipeline.predict(text_sample))
print(pipeline.predict_proba(X_train[0:4]))
print(pipeline.predict(X_train[0:4]))

I'm just confused as to why the pipeline can handle a string, but explain_instance throws an error.
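
One way to narrow this down is to record what the explainer actually hands to the prediction function before the pipeline sees it. The sketch below is only an illustration: `record_inputs` and `fake_predict_proba` are hypothetical names, and the stand-in function replaces `pipeline.predict_proba` so the snippet runs on its own.

```python
def record_inputs(fn, seen):
    """Wrap fn so that every argument it receives is stored in seen first."""
    def wrapper(texts):
        # Stored before fn runs, so the input is kept even if fn raises.
        seen.append(texts)
        return fn(texts)
    return wrapper

# Stand-in for pipeline.predict_proba (an assumption, for illustration only).
def fake_predict_proba(texts):
    return [[0.5, 0.5] for _ in texts]

seen = []
wrapped = record_inputs(fake_predict_proba, seen)
wrapped(["Anna", ""])
print(seen)
```

With the objects from the code above, the same wrapper could be passed as `explainer.explain_instance(text_sample, record_inputs(pipeline.predict_proba, seen), num_features=10)`, after which `seen` shows the inputs that reached letter_extractor, including any that are empty.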
