
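I have been experimenting with some datasets I found on GitHub, both to see how well I can do sentiment analysis on different data and to understand how the code works. There is one dataset I want to plug into the code, and the only problem is that it is highly imbalanced: roughly 5,000 tweets with negative sentiment versus roughly 15,000 with positive sentiment. I looked into different ways of dealing with this. The first is sklearn's resample, using the following code: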

import pandas as pd
from sklearn.utils import resample

# Split the data by class (1 = positive, the majority; 0 = negative, the minority)
df_majority = my_df[my_df.target == 1]
df_minority = my_df[my_df.target == 0]

# Upsample the minority class by sampling with replacement
df_minority_upsampled = resample(df_minority,
                                 replace=True,
                                 n_samples=15025,
                                 random_state=123)

df_upsampled = pd.concat([df_majority, df_minority_upsampled])
x = df_upsampled.Tweet
y = df_upsampled.target

from sklearn.model_selection import train_test_split
SEED = 2000
# 98% for training; the remaining 2% is split evenly into validation and test
x_train, x_validation_and_test, y_train, y_validation_and_test = train_test_split(x, y, test_size=.02, random_state=SEED)
x_validation, x_test, y_validation, y_test = train_test_split(x_validation_and_test, y_validation_and_test, test_size=.5, random_state=SEED)
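
One possible reason the results look off, offered as a guess rather than a diagnosis: the upsampling above happens before train_test_split, so copies of the same minority tweet can land in both the training set and the validation/test sets, which tends to inflate the scores. A minimal sketch of the alternative ordering (split first, then upsample only the training portion) is below. The toy DataFrame, the 80/20 split, and the names train_df, test_df and train_balanced are assumptions made for illustration, not part of my actual data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy stand-in for my_df (made-up data): 15 positive and 5 negative tweets.
my_df = pd.DataFrame({
    "Tweet": ["tweet {}".format(i) for i in range(20)],
    "target": [1] * 15 + [0] * 5,
})

# 1) Hold out evaluation data first, preserving the original class ratio.
train_df, test_df = train_test_split(
    my_df, test_size=0.2, stratify=my_df.target, random_state=2000
)

# 2) Upsample the minority class inside the training portion only.
train_majority = train_df[train_df.target == 1]
train_minority = train_df[train_df.target == 0]
train_minority_upsampled = resample(
    train_minority,
    replace=True,                     # sample with replacement
    n_samples=len(train_majority),    # match the majority count
    random_state=123,
)
train_balanced = pd.concat([train_majority, train_minority_upsampled])

x_train, y_train = train_balanced.Tweet, train_balanced.target
x_test, y_test = test_df.Tweet, test_df.target

print(y_train.value_counts())
print(y_test.value_counts())
```

With this ordering the held-out tweets are never duplicated into training, and the evaluation set keeps the real-world class imbalance.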

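But with the code above, the results don't look right to me. I then read a lot of articles about SMOTE, which is said to work very well on imbalanced datasets. The only problem is that I don't know how to incorporate it into the code I found online. I am honestly very much an amateur at coding, so any help would be appreciated. This is the code I am using: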

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from time import time

cvec = CountVectorizer()
lr = LogisticRegression()
# Vocabulary sizes to try: 1000, 2000, ..., 19000
n_features = np.arange(1000, 20000, 1000)

def nfeature_accuracy_checker(vectorizer=cvec, n_features=n_features, stop_words=None, ngram_range=(1, 1), classifier=lr):
    result = []
    print(classifier)
    print("\n")
    for n in n_features:
        # Reconfigure the vectorizer for this vocabulary size
        vectorizer.set_params(stop_words=stop_words, max_features=n, ngram_range=ngram_range)
        checker_pipeline = Pipeline([
            ('vectorizer', vectorizer),
            ('classifier', classifier)
        ])
        print("Validation result for {} features".format(n))
        # accuracy_summary is a helper that is not shown in this post
        nfeature_accuracy, tt_time = accuracy_summary(checker_pipeline, x_train, y_train, x_validation, y_validation)
        result.append((n, nfeature_accuracy, tt_time))
    return result
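
The function above calls accuracy_summary, which is not shown here. For readers who want to run the loop, the following is a hypothetical stand-in, assuming the helper simply fits the pipeline, predicts on the validation data, and returns the accuracy together with the elapsed time. The body, the printed messages, and the toy corpus are all assumptions.

```python
from time import time

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline


def accuracy_summary(pipeline, x_train, y_train, x_test, y_test):
    """Hypothetical helper: fit, predict, and time the whole round trip."""
    t0 = time()
    fitted = pipeline.fit(x_train, y_train)
    y_pred = fitted.predict(x_test)
    train_test_time = time() - t0
    accuracy = accuracy_score(y_test, y_pred)
    print("accuracy score: {0:.2f}%".format(accuracy * 100))
    print("train and test time: {0:.2f}s".format(train_test_time))
    return accuracy, train_test_time


# Tiny made-up corpus, only to show the call shape.
toy_x_train = ["good great happy", "bad awful sad", "great happy", "awful sad"]
toy_y_train = [1, 0, 1, 0]
toy_x_val = ["happy good", "sad bad"]
toy_y_val = [1, 0]

toy_pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression()),
])
acc, elapsed = accuracy_summary(
    toy_pipeline, toy_x_train, toy_y_train, toy_x_val, toy_y_val
)
```

On an imbalanced validation set, accuracy alone can be misleading, so it may be worth also looking at something like sklearn.metrics.classification_report.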

My idea was to incorporate this:

SMOTE_pipeline = make_pipeline(tvec, SMOTE(random_state=777),lr)

and to change the function signature above to:

def nfeature_accuracy_checker(pipeline, vectorizer=cvec, n_features=n_features, stop_words=None, ngram_range=(1, 1), classifier=lr):

and then get the results by calling it like this:

print ("RESULT FOR UNIGRAM WITH STOP WORDS (Tfidf)\n")
feature_result_ugt = nfeature_accuracy_checker(SMOTE_pipeline, vectorizer=tvec)

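Is my idea right, or have I completely butchered the whole thing? If anything about what I'm trying to do is unclear, I'm happy to explain further. Thank you.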
