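Hi, I'm preparing some SMS data for KNN modelling.
I'm trying to split the data into training and test sets, but my data list appears to have a length of only 1. That's strange, because when I call print() I can see all of the text in there.
This leads to a value error: "ValueError: Found input variables with inconsistent numbers of samples: [3, 2]"
Is there a way for me to get the correct list length?
Or can anyone help me work out where I went wrong?
Thanks in advance.
P.S. I'm a student and have only just started learning Python.
Here is the code: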
import zipfile
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Read each file inside the spam archive as one decoded string
archive = zipfile.ZipFile('SMS_Spam01.csv.zip', 'r')
files_names = archive.namelist()
sms_SPAM01df01 = [str(archive.read(item), encoding='utf8', errors='ignore') for item in files_names]

# Same for the ham archive
archive = zipfile.ZipFile('SMS_Ham01.csv.zip', 'r')
files_names = archive.namelist()
sms_HAM01df01 = [str(archive.read(item), encoding='utf8', errors='ignore') for item in files_names]

# Lowercase, tokenize, and drop English stop words
sms_SPAM_prep01 = [sms.lower() for sms in sms_SPAM01df01]
sms_HAM_prep01 = [sms.lower() for sms in sms_HAM01df01]
sms_SPAM_prep01 = [word_tokenize(sms) for sms in sms_SPAM_prep01]
sms_HAM_prep01 = [word_tokenize(sms) for sms in sms_HAM_prep01]
list_stopwords = stopwords.words('english')
sms_SPAM_prep01 = [[word for word in sms if word not in list_stopwords] for sms in sms_SPAM_prep01]
sms_HAM_prep01 = [[word for word in sms if word not in list_stopwords] for sms in sms_HAM_prep01]

# Build bag-of-words features from the raw strings (the tokenized lists above are not used here)
CountVec = CountVectorizer(lowercase=True, analyzer='word', stop_words='english')
feature_vectors = CountVec.fit_transform(sms_HAM01df01 + sms_SPAM01df01)
CountVec.get_feature_names_out()
CountVec.vocabulary_

# Labels: 0 = ham, 1 = spam
X_train, X_test, y_train, y_test = train_test_split(feature_vectors, [0] * len(sms_HAM01df01) + [1] * len(sms_SPAM01df01), random_state=0, test_size=0.2)
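
One possible explanation for the length of 1, offered as a guess rather than a confirmed diagnosis: archive.read(item) returns the whole file as a single string, so a zip that holds one CSV produces a list with one element, even though print() shows every message inside it. The sketch below assumes each CSV has one message per line, with no header row and no quoted fields that span several lines. The helper name read_messages and the in-memory demo archive are hypothetical and only stand in for the real files.

```python
import io
import zipfile


def read_messages(zip_source):
    """Return one string per non-empty line across all files in the archive."""
    messages = []
    with zipfile.ZipFile(zip_source, "r") as archive:
        for name in archive.namelist():
            text = archive.read(name).decode("utf8", errors="ignore")
            # splitlines() breaks the single big string into one entry per line;
            # blank lines are skipped so they do not become empty samples.
            messages.extend(line.strip() for line in text.splitlines() if line.strip())
    return messages


# Hypothetical in-memory archive standing in for SMS_Spam01.csv.zip
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr(
        "demo.csv",
        "win a free prize now\ncall this number today\n\nclaim your reward\n",
    )
buffer.seek(0)

demo_messages = read_messages(buffer)
print(len(demo_messages))  # 3
```

With the real files this would be called as read_messages('SMS_Spam01.csv.zip'), and len() of the result should then match the number of messages rather than the number of files. If the CSV has a header row or quoted multi-line fields, the csv module or pandas.read_csv is likely a better fit than splitting on line breaks.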