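Hi, I'm preparing some SMS data for KNN modelling.

I'm trying to split the data into training and test sets, but my list of messages seems to have a length of only 1. That is odd, because when I call print() I can see all of the text in there.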

This leads to a value error: "ValueError: Found input variables with inconsistent numbers of samples: [3, 2]"
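For context, the two numbers in that message are the sample counts of the inputs being compared. The following is only a plain-Python sketch of that kind of check, using made-up data; `check_same_length` is a hypothetical helper written for illustration, not scikit-learn's own code.

```python
def check_same_length(*arrays):
    """Raise ValueError if the inputs do not all have the same number of samples."""
    lengths = [len(a) for a in arrays]
    if len(set(lengths)) > 1:
        raise ValueError(f"inconsistent numbers of samples: {lengths}")
    return lengths[0]


features = ["sms one", "sms two", "sms three"]  # made-up data: 3 samples
labels = [0, 1]                                 # deliberately one label short

try:
    check_same_length(features, labels)
except ValueError as err:
    message = str(err)
    print(message)
```

Running a check like this just before the split makes it easier to see which of the two inputs has the unexpected length.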

Is there a way for me to determine the correct list length?

Or can anyone help me work out where I'm going wrong?
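One way to inspect this might be to compare the length of the list with the number of lines inside its first element. The sketch below uses a small in-memory archive with made-up messages; the file name `sample.csv` and its contents are assumptions, not the real data.

```python
import io
import zipfile

# Build a tiny in-memory archive holding one CSV with three made-up messages.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("sample.csv", "win a prize now\nsee you at lunch\ncall me later\n")

# Read it back the same way as in the question: one string per file.
buffer.seek(0)
with zipfile.ZipFile(buffer, "r") as zf:
    texts = [zf.read(name).decode("utf8", errors="ignore") for name in zf.namelist()]

n_files = len(texts)                  # number of files in the archive
n_lines = len(texts[0].splitlines())  # number of lines inside the first file
print(n_files, n_lines)               # prints: 1 3
```

Here `len(texts)` counts files rather than messages, so printing the list still shows every message even though its length is 1.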

Thanks in advance.

P.S. I'm a student and have only just started learning Python.

Here is the code:

import zipfile

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Read every file in each archive; each list element is the full text of one file
archive = zipfile.ZipFile('SMS_Spam01.csv.zip', 'r')
files_names = archive.namelist()
sms_SPAM01df01 = [str(archive.read(item), encoding='utf8', errors='ignore') for item in files_names]

archive = zipfile.ZipFile('SMS_Ham01.csv.zip', 'r')
files_names = archive.namelist()
sms_HAM01df01 = [str(archive.read(item), encoding='utf8', errors='ignore') for item in files_names]

# Lowercase the text
sms_SPAM_prep01 = [sms.lower() for sms in sms_SPAM01df01]
sms_HAM_prep01 = [sms.lower() for sms in sms_HAM01df01]

# Tokenize into words
sms_SPAM_prep01 = [word_tokenize(sms) for sms in sms_SPAM_prep01]
sms_HAM_prep01 = [word_tokenize(sms) for sms in sms_HAM_prep01]

# Remove English stop words
list_stopwords = stopwords.words('english')
sms_SPAM_prep01 = [[word for word in sms if word not in list_stopwords] for sms in sms_SPAM_prep01]
sms_HAM_prep01 = [[word for word in sms if word not in list_stopwords] for sms in sms_HAM_prep01]

# Vectorize the raw texts (CountVectorizer lowercases and drops English stop words itself)
CountVec = CountVectorizer(lowercase=True, analyzer='word', stop_words='english')
feature_vectors = CountVec.fit_transform(sms_HAM01df01 + sms_SPAM01df01)

CountVec.get_feature_names_out()

CountVec.vocabulary_

# Label ham as 0 and spam as 1, then split into training and test sets
labels = [0] * len(sms_HAM01df01) + [1] * len(sms_SPAM01df01)
X_train, X_test, y_train, y_test = train_test_split(feature_vectors, labels, random_state=0, test_size=0.2)
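Because `archive.read(item)` returns the whole file at once, each list above holds one string per file rather than one per message. A possible alternative, sketched here under assumptions, is to parse each CSV row into its own list entry. `read_messages` is a hypothetical helper, and it assumes the message text sits in the first column and that the files have no header row.

```python
import csv
import io
import zipfile


def read_messages(zip_path):
    """Return one string per CSV row across all files in the archive."""
    messages = []
    with zipfile.ZipFile(zip_path, "r") as archive:
        for name in archive.namelist():
            raw = archive.read(name).decode("utf8", errors="ignore")
            for row in csv.reader(io.StringIO(raw)):
                if row:  # skip blank lines
                    messages.append(row[0])  # assumption: text is in the first column
    return messages


# Intended use with the archives from the question (not run here):
# ham = read_messages('SMS_Ham01.csv.zip')
# spam = read_messages('SMS_Spam01.csv.zip')
# labels = [0] * len(ham) + [1] * len(spam)
```

With lists built this way, `len(ham)` and `len(spam)` would count messages, so the label list passed to `train_test_split` would line up with the rows of the feature matrix.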
