我正在使用 scikit-learn 构建一个分类器来预测两个句子是否是释义(例如释义:爱因斯坦有多高与阿尔伯特爱因斯坦的长度是多少)。
我的数据由 2 个带有字符串(短语对)的列和 1 个带有 0 和 1 的目标列组成(= 没有释义,释义)。我想尝试不同的算法。
我希望下面的最后一行代码适合模型。相反,预处理管道不断产生我无法解决的错误:“AttributeError:'numpy.ndarray'对象没有属性'lower'。”
代码如下,我已经隔离了显示的最后一行中发生的错误(为简洁起见,我排除了其余部分)。我怀疑这是因为目标列包含 0 和 1,不能转为小写。
我已经在stackoverflow上尝试过类似问题的答案,但到目前为止还没有运气。
你怎么能解决这个问题?
question1 question2 is_paraphrase
How long was Einstein? How tall was Albert Einstein? 1
Does society place too How do sports contribute to the 0
much importance on society?
sports?
What is a narcissistic What is narcissistic personality 1
personality disorder? disorder?
======
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
para = "paraphrases.tsv"
df = pd.read_csv(para, usecols = [3, 5], nrows = 100, header=0, sep="\t")
y = df["is_paraphrase"].values
X = df.drop("is_paraphrase", axis=1).values
X = X.astype(str) # I have tried this
X = np.char.lower(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3,
random_state = 21, stratify = y)
text_clf = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),
('clf', MultinomialNB())])
text_clf.fit(X_train, y_train)