
I'm building a multiclass text classification model using Keras and BERT (HuggingFace), but I have a very imbalanced dataset. I've used SMOTE (from imbalanced-learn) to generate additional samples for the underrepresented classes (I have 45 classes in total), which works fine when I use the input ids from the BERT tokenizer.

However, I would like to be able to apply SMOTE to the input mask ids as well, so that the model can still tell where the padded positions are.

My question is: how can I apply SMOTE to both the input ids and the mask ids? I've done the following so far, and the model doesn't complain, but I'm not sure whether the resampled masks match the resampled input ids row for row. SMOTE takes two inputs, features and labels, so I've simply run the process twice with the same random state and returned the required pieces:

from imblearn.over_sampling import SMOTE

def smote(input_ids, input_masks, labels):

    # "not majority": oversample every class except the largest one
    smote = SMOTE("not majority", random_state=27)

    # resample ids and masks separately, relying on the shared random state
    # (fit_sample is named fit_resample in newer imbalanced-learn releases)
    input_ids_resampled, labels_resampled = smote.fit_sample(input_ids, labels)
    input_masks_resampled, _ = smote.fit_sample(input_masks, labels)

    return input_ids_resampled, input_masks_resampled, labels_resampled

Is this acceptable? Is there a better way to do this?


2 Answers


I do not think the given code is a good idea.

Since the mask ids tell the model which tokens are real and which come from padding, if you SMOTE-sample them independently of the input ids, you will end up with synthetic input ids (generated from real input ids) that the model disregards, because the corresponding synthetic mask ids (generated from completely unrelated rows) mark those positions as padding.

Silly example:

  • t_1: input ids = [1209, 80183, 290], mask ids = [1, 1, 0]
  • t_2: input ids = [39103, 38109, 2931], mask ids = [1, 1, 1]
  • t_3: input ids = [1242, 1294, 3233], mask ids = [1, 0, 0]

Suppose for simplicity that the synthetic sample is created by averaging two rows. If the random SMOTE sampling averages the input ids of t_1 and t_2, but the mask ids of t_2 and t_3, the resulting synthetic t_4 has no meaning whatsoever: it is not the average of any pair of real observations.

A reasonable fix for the problem above: SMOTE-sample the input ids only, and as the mask ids of your synthetic sample take the average of the mask ids of the same rows. I say average, but the per-entry median of the mask id vectors is probably more appropriate. You could arrange this by concatenating the input ids and mask ids of each sample into a single 1d vector and applying SMOTE to that (I think, assuming SMOTE interpolates component-wise); see the sketch below.
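For what it's worth, here is a minimal sketch of that "resample ids and masks together" idea; the helper name and the column-splitting logic are mine, purely for illustration, and it uses the current imbalanced-learn API (fit_resample):

import numpy as np
from imblearn.over_sampling import SMOTE

def smote_ids_and_masks(input_ids, input_masks, labels, random_state=27):
    # Stack ids and masks side by side so every synthetic row is built
    # from the same pair of neighbours for both parts.
    input_ids = np.asarray(input_ids)
    input_masks = np.asarray(input_masks)
    combined = np.hstack([input_ids, input_masks])

    sm = SMOTE(sampling_strategy="not majority", random_state=random_state)
    combined_resampled, labels_resampled = sm.fit_resample(combined, labels)

    # Split the columns back into ids and masks.
    seq_len = input_ids.shape[1]
    ids_resampled = combined_resampled[:, :seq_len]
    masks_resampled = combined_resampled[:, seq_len:]
    return ids_resampled, masks_resampled, labels_resampled

Note that the synthetic rows are interpolated, so the ids are no longer valid integer token ids and the masks are no longer exactly 0/1, which is exactly the problem discussed next.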

However, I think the above fix still does not make much sense. I am far from a BERT expert, but my understanding is that every token corresponds to a specific integer id (up to, perhaps, hash collisions). If that is the case, simply averaging token ids will produce total gibberish. Even picking the per-position median (of, say, 5 rows in the same class) would result in a completely gibberish sentence.

So the conclusion is that I do not know how to fix this problem properly. Maybe one could apply SMOTE somewhere halfway through the BERT model, once the tokens have already been processed into floats, or even at the output of the pretrained BERT model, before fine-tuning for your specific task.

Finally, I want to leave this here for the next person who comes across it: apparently there is a variant of SMOTE, SMOTENC, that is designed (among other things) for data with integer-valued categorical features. For the reasons explained above, I do not think it is suited for this purpose either, but it is good to know about.

answered 2020-07-13T00:55:19.257

I just want to clarify that this is the wrong way to apply SMOTE to input_ids. You need to take the embedding corresponding to the [CLS] token: use BERT to get the [CLS] embedding for each text, apply SMOTE to those embeddings, and then pass them to a classifier (any classifier). This should be done without fine-tuning.
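For what it's worth, a minimal sketch of that pipeline, assuming a frozen bert-base-uncased from HuggingFace Transformers and logistic regression as the downstream classifier (the model name, max length and classifier choice are illustrative, not part of the answer):

import numpy as np
from transformers import BertTokenizer, TFBertModel
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = TFBertModel.from_pretrained("bert-base-uncased")  # kept frozen, no fine-tuning

def cls_embeddings(texts, max_length=128):
    # Tokenize and keep the hidden state of the [CLS] token for each text.
    enc = tokenizer(texts, padding=True, truncation=True,
                    max_length=max_length, return_tensors="tf")
    out = bert(enc)
    return out.last_hidden_state[:, 0, :].numpy()  # shape (n_samples, hidden_size)

# texts: your list of raw strings, labels: your array of class ids
X = cls_embeddings(texts)
X_resampled, y_resampled = SMOTE(sampling_strategy="not majority",
                                 random_state=27).fit_resample(X, labels)

# Any classifier can then be trained on the resampled embeddings.
clf = LogisticRegression(max_iter=1000).fit(X_resampled, y_resampled)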

answered 2020-08-12T14:47:22.320