Actually, SMOTE expects X to be numeric data only. The labels are not the problem; they can be strings.
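It is worth reading up on how SMOTE works internally. Basically, it creates synthetic data points for the minority class using convex combinations of the selected neighbors.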
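To make the convex-combination idea concrete, here is a minimal sketch of the interpolation step only. It is not imbalanced-learn's actual implementation; the vectors `x_i` and `x_nn` and the fixed seed are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Two hypothetical minority-class samples (rows of a numeric feature matrix).
x_i = np.array([0.0, 1.0, 0.5])
x_nn = np.array([1.0, 0.0, 0.5])  # assumed to be a nearest neighbour of x_i

# SMOTE-style interpolation: draw lam in [0, 1) and move from x_i towards x_nn.
lam = rng.uniform(0.0, 1.0)
x_new = x_i + lam * (x_nn - x_i)  # equals (1 - lam) * x_i + lam * x_nn

print(x_new)
```

Because the new point is computed arithmetically from existing rows, the features have to be numeric; there is no meaningful way to interpolate between two raw strings.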
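Hence, convert your text data (the transcripts) into numeric form using TfidfVectorizer or CountVectorizer. You can use the inverse_transform method of these vectorizers to get the text back, but the catch is that you lose the order of the words.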
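import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import SMOTE

# Toy data: three 'bad' transcripts and two (identical) 'good' ones
df = pd.DataFrame({'transcripts': ['I want to check this',
                                   'how about one more sentence',
                                   'hopefully this works well fr you',
                                   'I want to check this',
                                   'This is the last sentence or transcript'],
                   'labels': ['good', 'bad', 'bad', 'good', 'bad']})

# Turn the text into a numeric TF-IDF matrix, which is what SMOTE needs
vec = TfidfVectorizer()
X = vec.fit_transform(df['transcripts'])

# k_neighbors=1 because the minority class has only two samples
sm = SMOTE(k_neighbors=1, random_state=2)
# fit_resample replaces the older fit_sample, which was removed in recent imbalanced-learn releases
X_train_res, y_train_res = sm.fit_resample(X, df.labels)

# Map each row (including the synthetic one) back to the terms it contains
vec.inverse_transform(X_train_res)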
# [array(['this', 'check', 'to', 'want'], dtype='<U10'),
# array(['sentence', 'more', 'one', 'about', 'how'], dtype='<U10'),
# array(['you', 'fr', 'well', 'works', 'hopefully', 'this'], dtype='<U10'),
# array(['this', 'check', 'to', 'want'], dtype='<U10'),
# array(['transcript', 'or', 'last', 'the', 'is', 'sentence', 'this'],
# dtype='<U10'),
# array(['want', 'to', 'check', 'this'], dtype='<U10')]