I have been experimenting with some datasets I found on GitHub, both to get a feel for doing sentiment analysis on different datasets and to understand how the code works. I have one dataset that I want to plug into the code, and the only problem I found is that it is highly imbalanced: roughly 5,000 tweets for the negative class versus roughly 15,000 for the positive class. So I looked into different ways of handling this. The first is upsampling with sklearn's resample, using the following code:
import pandas as pd
from sklearn.utils import resample
from sklearn.model_selection import train_test_split

# separate the majority (positive) and minority (negative) classes
df_majority = my_df[my_df.target==1]
df_minority = my_df[my_df.target==0]

# upsample the minority class with replacement to roughly match the majority size
df_minority_upsampled = resample(df_minority,
                                 replace=True,
                                 n_samples=15025,
                                 random_state=123)

df_upsampled = pd.concat([df_majority, df_minority_upsampled])

x = df_upsampled.Tweet
y = df_upsampled.target

SEED = 2000
# 98% train, then split the remaining 2% evenly into validation and test
x_train, x_validation_and_test, y_train, y_validation_and_test = train_test_split(x, y, test_size=.02, random_state=SEED)
x_validation, x_test, y_validation, y_test = train_test_split(x_validation_and_test, y_validation_and_test, test_size=.5, random_state=SEED)
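Just to check that the resampling did what I expected, I also printed the class counts before and after (this is just a quick check I added on top of the code above):

# quick check that the two classes are now roughly balanced
print(my_df.target.value_counts())         # before: ~15,000 positive vs ~5,000 negative
print(df_upsampled.target.value_counts())  # after: both classes around 15,000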
But with that code the results didn't feel quite right to me. I then went on to read a lot about SMOTE, which apparently works very well on imbalanced datasets. The only problem is that I don't know how to incorporate it into the code I found online. To be honest I'm very much an amateur at coding, so any help would be greatly appreciated. This is the code I'm using:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from time import time

cvec = CountVectorizer()
lr = LogisticRegression()
n_features = np.arange(1000,20000,1000)

def nfeature_accuracy_checker(vectorizer=cvec, n_features=n_features, stop_words=None, ngram_range=(1, 1), classifier=lr):
    result = []
    print (classifier)
    print ("\n")
    for n in n_features:
        vectorizer.set_params(stop_words=stop_words, max_features=n, ngram_range=ngram_range)
        checker_pipeline = Pipeline([
            ('vectorizer', vectorizer),
            ('classifier', classifier)
        ])
        print ("Validation result for {} features".format(n))
        # accuracy_summary is defined elsewhere in my notebook
        nfeature_accuracy,tt_time = accuracy_summary(checker_pipeline, x_train, y_train, x_validation, y_validation)
        result.append((n,nfeature_accuracy,tt_time))
    return result
My idea was to build the pipeline like this:
SMOTE_pipeline = make_pipeline(tvec, SMOTE(random_state=777),lr)
and to change the function definition above to:
def nfeature_accuracy_checker(pipeline, vectorizer=cvec, n_features=n_features, stop_words=None, ngram_range=(1, 1), classifier=lr):
and then get the results by calling:
print ("RESULT FOR UNIGRAM WITH STOP WORDS (Tfidf)\n")
feature_result_ugt = nfeature_accuracy_checker(SMOTE_pipeline, vectorizer=tvec)
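To make it clearer, here is roughly what I imagine the whole thing would look like after that change. I'm assuming tvec is a TfidfVectorizer, that SMOTE and make_pipeline come from imblearn, and that accuracy_summary can fit and score the passed-in pipeline the same way it did before (I'm not at all sure about that last part):

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

tvec = TfidfVectorizer()
SMOTE_pipeline = make_pipeline(tvec, SMOTE(random_state=777), lr)

def nfeature_accuracy_checker(pipeline, vectorizer=cvec, n_features=n_features, stop_words=None, ngram_range=(1, 1), classifier=lr):
    result = []
    print (classifier)
    print ("\n")
    for n in n_features:
        # tvec is the same object that sits inside SMOTE_pipeline, so setting
        # its params here also changes the vectorizer step of that pipeline
        vectorizer.set_params(stop_words=stop_words, max_features=n, ngram_range=ngram_range)
        print ("Validation result for {} features".format(n))
        # use the pipeline that was passed in (vectorizer + SMOTE + classifier)
        # instead of building a plain CountVectorizer + classifier pipeline
        nfeature_accuracy, tt_time = accuracy_summary(pipeline, x_train, y_train, x_validation, y_validation)
        result.append((n, nfeature_accuracy, tt_time))
    return result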
Is my idea on the right track, or have I completely butchered the whole thing? If anyone doesn't fully understand what I'm trying to do, I'm happy to explain further. Thank you.