I'm using the transformers TFBertModel to classify a bunch of input strings, and I'd like to access the CLS embedding so that I can rebalance my data.
When I pass a single element of my data to the predict method of my simplified BERT model (in order to get the CLS data), I take the first array of the last_hidden_state, and voilà. However, when I pass in more than one row of data, the shape of the output changes as expected, but the actual CLS embedding (of the first row I originally passed in) seems to change too.
My dataset contains the input ids and the attention masks, and the model is built as follows:
from transformers import TFBertModel
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
import numpy as np

# Load BERT with frozen weights (le is a fitted label encoder defined elsewhere)
bert = TFBertModel.from_pretrained('bert-base-multilingual-cased', trainable=False,
                                   num_labels=len(le.classes_))
input_ids_layer = Input(shape=(256,), dtype=np.int32)
input_mask_layer = Input(shape=(256,), dtype=np.int32)
bert_layer = bert([input_ids_layer, input_mask_layer])
model = Model(inputs=[input_ids_layer, input_mask_layer], outputs=bert_layer)
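For context, data_x was built along these lines (a minimal sketch, not my exact preprocessing; texts stands in for my actual list of input strings, and I'm assuming the standard BertTokenizer padded/truncated to 256 tokens):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
texts = ["first example sentence", "second example sentence"]  # hypothetical inputs
encoded = tokenizer(texts, padding='max_length', truncation=True,
                    max_length=256, return_tensors='np')
# data_x[0] holds the input ids, data_x[1] the attention masks
data_x = [encoded['input_ids'], encoded['attention_mask']]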
Then, to get the CLS embeddings, I just call the predict method and dig into the result. So, for the first row of data (data_x[0] being the input ids and data_x[1] being the masks):
output1 = model.predict([data_x[0][0], data_x[1][0]])
TFBaseModelOutputWithPooling([('last_hidden_state',
array([[[ 0.35013607, -0.5340336 , 0.28577858, ..., -0.03405955,
-0.0165604 , -0.36481357]],
[[ 0.34572566, -0.5361709 , 0.281771 , ..., -0.03687727,
-0.01690093, -0.35451806]],
[[ 0.34878412, -0.5399749 , 0.28948805, ..., -0.03613809,
-0.01503076, -0.35425758]],
...,
My understanding is that the CLS representation of the sentence is the first array of the last_hidden_state, i.e.:
lhs1 = output1[0]
lhs1.shape
>> (256, 1, 768)
cls1 = lhs1[0][0]
cls1
>> [0.35013607 ... -0.36481357] (as above)
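As a side note, output1[0] and output1.last_hidden_state refer to the same array (indexing a TFBaseModelOutputWithPooling with [0] returns its first field), so the extraction above can also be written as:

lhs1 = output1.last_hidden_state  # identical to output1[0]
cls1 = lhs1[0, 0]                 # same element as lhs1[0][0]; shape (768,)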
So far so good. My confusion arises when I want to obtain the CLS embeddings for the first two rows of my dataset:
output_both = model.predict([data_x[0][:2], data_x[1][:2]])
lhs_both = output_both[0] # last hidden states
lhs_both.shape
>> (2, 256, 768)
cls_both = lhs_both[0][0] # I thought this would give me two CLS arrays, including the first one above
Inspecting cls_both:
array([[[ 0.11075249, -0.02257648, -0.40831113, ..., 0.18384863,
0.17032738, -0.05989586],
[-0.22926208, -0.5627498 , 0.2617012 , ..., 0.20701236,
0.3141808 , -0.8650396 ],
[-0.22352833, -0.49676323, -0.5286081 , ..., 0.23819353,
0.3742358 , -0.69018203],
...,
[ 0.5120927 , -0.09863365, 0.7378716 , ..., -0.19551781,
0.45915398, 0.22804889],
[-0.13397002, 0.1617202 , 0.15663634, ..., -0.511597 ,
0.3959382 , 0.30565232],
[-0.14100523, 0.22792323, -0.15898004, ..., -0.2690729 ,
0.4730471 , 0.18431285]],
[[-0.20033133, -0.08412935, -0.0411438 , ..., 0.34706163,
0.1919156 , -0.08740871],
[-0.12536147, -0.44519228, 1.2984221 , ..., 0.07149828,
0.7915938 , 0.08048639],
[ 0.4596323 , -0.3316555 , 1.2545322 , ..., -0.02128018,
0.5344383 , 0.32054782],
...,
[-0.54777217, 0.23129587, 0.5007771 , ..., 0.70299244,
0.27277255, -0.2848366 ],
[-0.49410668, 0.37352908, 0.8732239 , ..., 0.6065303 ,
0.152081 , -0.9312557 ],
[-0.33172935, -0.35368383, 0.5942321 , ..., 0.7171531 ,
0.24436645, 0.08909844]]], dtype=float32)
I'm not sure how to interpret this. My expectation was to see the first row's CLS embedding cls1 contained within cls_both, but as you can see, the first row of the first sub-array is different. Can anyone explain this?
Furthermore, if I run only the second row through, I get exactly the same CLS token as for the first row, despite the two rows containing totally different input ids/masks:
output2 = model.predict([data_x[0][1], data_x[1][1]])
lhs2 = output2[0]
cls2 = lhs2[0][0]
cls2
>> [ 0.35013607, -0.5340336 ,  0.28577858, ..., -0.03405955,
     -0.0165604 , -0.36481357]
(cls1 == cls2).all()
>> True
Edit
BERT sentence embeddings: how to obtain sentence embeddings vector
The post above explains that output[0][:, 0, :] is the correct way to obtain exactly the CLS tokens, which makes things easier.
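Concretely, applying that slicing to the batched output from above (a short sketch; cls_embeddings is just an illustrative name, and the shape follows from the (2, 256, 768) shape reported earlier):

cls_embeddings = output_both[0][:, 0, :]  # one CLS vector per input row
cls_embeddings.shape
>> (2, 768)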
When I run three rows through, I get consistent results, but any time I run a single row through, I get the result shown in cls1 above. Why does this not differ each time?