machine-learning - Low accuracy in BERT multi-class sentiment analysis?

Question

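I am working on a small dataset that:

Contains 1,500 news articles.
All of the articles were rated by humans on a 5-point scale for sentiment/degree of positivity.
Is clean in terms of spelling errors. I used Google Sheets to check the spelling before importing the data for analysis. Some characters are still incorrectly encoded, but not many.
Has an average length greater than 512 words.
Is slightly imbalanced.

I treat this as a multi-class classification problem, and I want to fine-tune BERT on this dataset. To do that, I used the ktrain package and basically followed the tutorial. Below is my code: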

import ktrain
from ktrain import text

# Tokenize and encode the raw article texts for BERT.
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_array(
    x_train=x_train,
    y_train=y_train,
    x_test=x_test,
    y_test=y_test,
    class_names=categories,
    preprocess_mode='bert',
    maxlen=510,
    max_features=35000)

# Build the BERT classifier and fine-tune it with the 1cycle policy.
model = text.text_classifier('bert', train_data=(x_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_train, y_train), batch_size=6)
learner.fit_onecycle(2e-5, 4)

However, my validation accuracy is only around 25%, which is far too low.

              precision    recall  f1-score   support

           1       0.33      0.40      0.36        75
           2       0.27      0.36      0.31        84
           3       0.23      0.24      0.23        58
           4       0.18      0.09      0.12        54
           5       0.33      0.04      0.07        24

    accuracy                           0.27       295
   macro avg       0.27      0.23      0.22       295
weighted avg       0.26      0.27      0.25       295

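I also tried a head+tail truncation strategy, since some of the articles are quite long, but the performance stayed the same.

Can anyone give me some suggestions?

Thank you very much!

Best,

Xu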

================== Update 7.21 ==================

Following Kartikey's suggestion, I tried lr_find. The result is below. It seems that 2e-5 is a reasonable learning rate.

simulating training for different learning rates... this may take a few moments...
Train on 1182 samples
Epoch 1/2
1182/1182 [==============================] - 223s 188ms/sample - loss: 1.6878 - accuracy: 0.2487
Epoch 2/2
432/1182 [=========>....................] - ETA: 2:12 - loss: 3.4780 - accuracy: 0.2639
done.
Visually inspect loss plot and select learning rate associated with falling loss

[Figure: learning rate.jpg]

I also tried running it with some class weights:

{0: 0,
 1: 0.8294736842105264,
 2: 0.6715909090909091,
 3: 1.0844036697247708,
 4: 1.1311004784688996,
 5: 2.0033898305084747}

Here is the result. Not much changed.

          precision    recall  f1-score   support

       1       0.43      0.27      0.33        88
       2       0.22      0.46      0.30        69
       3       0.19      0.09      0.13        64
       4       0.13      0.13      0.13        47
       5       0.16      0.11      0.13        28

accuracy                            0.24       296
macro avg       0.23      0.21      0.20       296
weighted avg    0.26      0.24      0.23       296

array([[24, 41,  9,  8,  6],
       [13, 32,  6, 12,  6],
       [ 9, 33,  6, 14,  2],
       [ 4, 25, 10,  6,  2],
       [ 6, 14,  0,  5,  3]])
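
The class weights listed above are consistent with the common "balanced" heuristic, weight = n_samples / (n_classes * class_count). The sketch below reproduces them in plain Python. The per-class counts are not given in the post; they are back-solved from the posted weights and the 1182 training samples in the log, so treat them as an assumption.

```python
# Per-class training counts: an ASSUMPTION, inferred from the posted
# weights and the 1182 training samples reported in the lr_find log.
class_counts = {1: 285, 2: 352, 3: 218, 4: 209, 5: 118}


def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


weights = balanced_class_weights(class_counts)
for label, weight in weights.items():
    print(label, round(weight, 4))
```

Rarer classes receive larger weights (class 5 gets roughly 2.0, class 2 roughly 0.67). Keras-style class_weight dictionaries are keyed by integer class index, which is 0-based when labels are one-hot encoded, so the extra key 0 in the posted dictionary may be worth double-checking against the label encoding.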

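============== Update 7.22 ==============

To get some baseline results, I collapsed the 5-point classification problem into a binary one that only predicts positive or negative. This time the accuracy increased to around 55%. Here is a detailed description of my strategy: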

training data: 956 samples (excluding those classified as neutral)
truncation strategy: use the first 128 and last 128 tokens
(x_train, y_train), (x_test, y_test), preproc_l1 = text.texts_from_array(
    x_train=x_train, y_train=y_train,
    x_test=x_test, y_test=y_test,
    class_names=categories_1,
    preprocess_mode='bert',
    maxlen=256,
    max_features=35000)
Results:
              precision    recall  f1-score   support

       1       0.65      0.80      0.72       151
       2       0.45      0.28      0.35        89

accuracy                               0.61       240
macro avg          0.55      0.54      0.53       240
weighted avg       0.58      0.61      0.58       240

array([[121,  30],
       [ 64,  25]])

However, I think 55% is still not a satisfactory accuracy; it is only slightly better than random guessing.
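
One way to put the reported accuracy in context is to compare it with a trivial classifier that always predicts the most frequent class. The sketch below does this for the binary confusion matrix posted above, assuming rows are true labels and columns are predictions (the scikit-learn convention, which agrees with the support column of the report).

```python
# Confusion matrix from the binary run above. Rows are ASSUMED to be
# true labels and columns predicted labels.
confusion = [[121, 30],
             [64, 25]]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total

# Accuracy of a classifier that always predicts the largest true class.
majority_baseline = max(sum(row) for row in confusion) / total

print(f"model accuracy:    {accuracy:.3f}")
print(f"majority baseline: {majority_baseline:.3f}")
```

On these numbers the model (about 0.61) sits slightly below the majority-class baseline (about 0.63), which suggests it has learned little beyond the class prior so far.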

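============ Update 7.26 ============

Following Marcos Lima's suggestion, I added several steps to my procedure:

Remove all numbers, punctuation, and redundant spaces before the Ktrain pkg preprocessing. (I thought the Ktrain pkg would do this for me, but I am not sure.)
I use the first 384 and the last 128 tokens of each text in my sample. This is what I call the "head+tail" strategy.
The task is still binary classification (positive vs. negative).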
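
The cleaning and "head+tail" steps above could look roughly like the following sketch. It is not the poster's actual code: the function names are illustrative, and it truncates whitespace-separated tokens, whereas BERT counts WordPiece tokens, so the real limits would apply after BERT tokenization.

```python
import re


def clean_text(text):
    """Drop digits and punctuation, then collapse runs of whitespace."""
    text = re.sub(r"\d+", " ", text)          # remove numbers
    text = re.sub(r"[^\w\s]|_", " ", text)    # remove punctuation
    return re.sub(r"\s+", " ", text).strip()  # squeeze extra spaces


def head_tail(tokens, head=384, tail=128):
    """Keep the first `head` and last `tail` tokens of a long sequence."""
    if len(tokens) <= head + tail:
        return list(tokens)  # short inputs pass through unchanged
    return list(tokens[:head]) + list(tokens[len(tokens) - tail:])


article = "GDP rose 3.5%,  they said!"
tokens = clean_text(article).split()
print(head_tail(tokens))
```

Note that 384 + 128 = 512 already fills BERT's 512-position limit, which also has to hold the special [CLS] and [SEP] tokens, so a budget of 510 content tokens in total is the safer choice.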

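Here is the plot of the learning curve. It is the same as the one I posted earlier. It still looks very different from the one posted by Marcos Lima:

[Figure: updated learning curve]

Below are my results, which are probably the best set I have obtained.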

begin training using onecycle policy with max lr of 1e-05...
Train on 1405 samples
Epoch 1/4
1405/1405 [==============================] - 186s 133ms/sample - loss: 0.7220 - accuracy: 0.5431
Epoch 2/4
1405/1405 [==============================] - 167s 119ms/sample - loss: 0.6866 - accuracy: 0.5843
Epoch 3/4
1405/1405 [==============================] - 166s 118ms/sample - loss: 0.6565 - accuracy: 0.6335
Epoch 4/4
1405/1405 [==============================] - 166s 118ms/sample - loss: 0.5321 - accuracy: 0.7587

             precision    recall  f1-score   support

       1       0.77      0.69      0.73       241
       2       0.46      0.56      0.50       111

accuracy                           0.65       352
macro avg       0.61      0.63      0.62       352
weighted avg    0.67      0.65      0.66       352

array([[167,  74],
       [ 49,  62]])

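Note: I think the reason the pkg struggles on my task may be that the task is something like a combination of classification and sentiment analysis. The classic classification task for news articles is to decide which category an article belongs to, for example biology, economics, or sports, and the words used in different categories are very different. The classic examples of sentiment classification, on the other hand, are Yelp or IMDB reviews. My guess is that those texts express sentiment quite directly, whereas the texts in my sample (economic news) are carefully edited and organized before publication, so the sentiment may always be expressed implicitly in a way that BERT may fail to detect.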

score 2 · Accepted Answer

Try hyperparameter optimization.

Before running learner.fit_onecycle(2e-5, 4), try: learner.lr_find(show_plot=True, max_epochs=2)

Do all the classes have a weight of around 20%? Maybe try it this way:

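# text.Transformer expects a Hugging Face model name; 'bert-base-uncased'
# is an assumed choice here (the original answer wrote 'bert').
MODEL_NAME = 'bert-base-uncased'
t = text.Transformer(MODEL_NAME, maxlen=500, class_names=train_b.target_names)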

.....
.....

# the one we got most wrong
learner.view_top_losses(n=1, preproc=t)

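Increase the weight for the class identified above.

Was the validation set drawn by stratified sampling or by random sampling?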
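
For reference, a stratified split keeps the class proportions the same in the training and validation sets, which matters with an imbalanced 5-class dataset of this size. Below is a minimal plain-Python sketch; the function name and the toy labels are made up for illustration. In practice, scikit-learn's train_test_split with its stratify argument is the usual way to do this.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class at ~test_frac."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_frac)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels for illustration only (not the poster's data).
labels = [1] * 50 + [2] * 30 + [3] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```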

score 1 · Accepted Answer

The shape of your learning curve is not what I would expect.

My curve (above) shows that the LR should be around 1e-5, but yours is flat.

Try to pre-process your data:

Remove numbers and emojis.
Recheck your data for errors (usually in y_train).
Use a model for your language, or a multilingual model, if the text is not English.

You said that:

The average length is greater than 512 words.

Try breaking each text into chunks of at most 512 tokens, because you can lose a lot of information for classification when the BERT model truncates it.
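
A minimal sketch of the chunking idea, assuming that each chunk inherits its article's label during training and that per-chunk class probabilities are averaged at prediction time (one of several possible aggregation choices). The window of 510 tokens leaves room for BERT's [CLS] and [SEP] tokens; the function names are illustrative and not part of ktrain.

```python
def chunk_tokens(tokens, size=510):
    """Split a token list into consecutive windows of at most `size` tokens."""
    return [tokens[start:start + size]
            for start in range(0, len(tokens), size)]


def average_probs(chunk_probs):
    """Average per-chunk class probabilities into one document-level vector."""
    n_chunks = len(chunk_probs)
    return [sum(column) / n_chunks for column in zip(*chunk_probs)]


# A 1200-token article becomes three chunks of 510, 510 and 180 tokens.
chunks = chunk_tokens(list(range(1200)))
print([len(chunk) for chunk in chunks])

# Hypothetical per-chunk probabilities for a two-class model.
doc_probs = average_probs([[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]])
predicted_class = doc_probs.index(max(doc_probs))
```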

score 0 · Accepted Answer


Try treating the problem as a text regression task, for example like the Yelp sentiment model trained with ktrain.
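
Framing the 5-point scale as regression keeps the ordering of the ratings, so predicting 4 for a true 5 is penalized less than predicting 1. The sketch below shows one way continuous predictions could be mapped back to the scale and scored; the helper names and the sample predictions are hypothetical and not part of ktrain.

```python
import math


def to_rating(score, low=1, high=5):
    """Round a continuous prediction half-up and clip it to the rating scale."""
    rounded = math.floor(score + 0.5)
    return max(low, min(high, rounded))


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical regression outputs, for illustration only.
true_ratings = [1, 3, 5, 2]
predicted = [1.4, 2.5, 6.1, 2.2]

ratings = [to_rating(p) for p in predicted]
mae = mean_absolute_error(true_ratings, predicted)
print(ratings, round(mae, 2))
```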

Answered 2020-07-21T00:45:18.157
