I'm building a multiclass text classification model using Keras and BERT (HuggingFace), but I have a very imbalanced dataset. I've used SMOTE from imbalanced-learn (imblearn) to generate additional samples for the underrepresented classes (I have 45 in total), which works fine when I use the input ids from the BERT tokenizer.
However, I would also like to apply SMOTE to the input mask ids, so that the model can determine where the padded values are.
My question is: how can I use SMOTE for both the input ids and the mask ids? I've done the following so far, and the model doesn't complain, but I'm not sure whether the resampled masks match the resampled input ids row for row. SMOTE requires two inputs, the features and the labels, so I've run the process twice with the same random state and returned only the elements I need:
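from imblearn.over_sampling import SMOTE

def smote(input_ids, input_masks, labels):
    # One sampler, with a fixed seed, is used for both calls
    sampler = SMOTE(sampling_strategy="not majority", random_state=27)
    # Resample the ids and the masks separately against the same labels
    input_ids_resampled, labels_resampled = sampler.fit_resample(input_ids, labels)
    input_masks_resampled, _ = sampler.fit_resample(input_masks, labels)
    return input_ids_resampled, input_masks_resampled, labels_resampled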
Is this acceptable? Is there a better way to do this?
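For comparison, one alternative that should keep the rows paired by construction is to place each row's ids and mask side by side, resample once, and split the columns again afterwards. This is only a sketch under assumptions: `resample_jointly` is a hypothetical helper name, both arrays are assumed to be 2-D with the same shape, and `sampler` is assumed to be any object exposing `fit_resample(X, y)`, as imbalanced-learn's `SMOTE` does.

```python
import numpy as np


def resample_jointly(sampler, input_ids, input_masks, labels):
    """Resample ids and masks in a single pass so their rows stay paired.

    `sampler` is assumed to expose `fit_resample(X, y)`.
    """
    input_ids = np.asarray(input_ids)
    input_masks = np.asarray(input_masks)
    if input_ids.shape != input_masks.shape:
        raise ValueError("input_ids and input_masks must have the same shape")

    seq_len = input_ids.shape[1]
    # Put each row's ids and mask side by side: shape (n_samples, 2 * seq_len).
    stacked = np.hstack([input_ids, input_masks])
    stacked_resampled, labels_resampled = sampler.fit_resample(stacked, labels)

    # Split the columns back into the two original blocks.
    stacked_resampled = np.asarray(stacked_resampled)
    input_ids_resampled = stacked_resampled[:, :seq_len]
    input_masks_resampled = stacked_resampled[:, seq_len:]
    return input_ids_resampled, input_masks_resampled, labels_resampled
```

With the setup above, this could be called as `resample_jointly(SMOTE(sampling_strategy="not majority", random_state=27), input_ids, input_masks, labels)`. One trade-off to be aware of: the sampler then sees the mask columns as extra features, so the nearest neighbours it picks may differ from those found on the input ids alone.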