
I'm using transformers' TFBertModel to classify a bunch of input strings; however, I'd also like to access the CLS embedding so that I can rebalance my data.

When I pass a single element of my data to the predict method of my simplified BERT model (in order to get the CLS data), I take the first array of the last_hidden_state, and voila. However, when I pass in more than one row of data, the shape of the output changes as expected, but the actual CLS embedding (of the first row that I originally passed in) seems to change too.

My dataset contains the input ids and the masks, and the model:

from transformers import TFBertModel
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
import numpy as np

model = TFBertModel.from_pretrained('bert-base-multilingual-cased', trainable=False, num_labels=len(le.classes_))

input_ids_layer = Input(shape=(256,), dtype=np.int32)
input_mask_layer = Input(shape=(256,), dtype=np.int32)

bert_layer = model([input_ids_layer, input_mask_layer])

model = Model(inputs=[input_ids_layer, input_mask_layer], outputs=bert_layer)

Then, to get the CLS embeddings, I just call the predict method and dig into the result. So for the first row of data (data_x[0] being the input ids, and data_x[1] being the masks):

output1 = model.predict([data_x[0][0], data_x[1][0]])

TFBaseModelOutputWithPooling([('last_hidden_state',
                               array([[[ 0.35013607, -0.5340336 ,  0.28577858, ..., -0.03405955,
                                        -0.0165604 , -0.36481357]],
                               
                                      [[ 0.34572566, -0.5361709 ,  0.281771  , ..., -0.03687727,
                                        -0.01690093, -0.35451806]],
                               
                                      [[ 0.34878412, -0.5399749 ,  0.28948805, ..., -0.03613809,
                                        -0.01503076, -0.35425758]],
                               
                                      ...,

My understanding is that the CLS representation of the sentence is the first array of the last_hidden_state i.e:

lhs1 = output1[0]

lhs1.shape
>> (256, 1, 768)

cls1 = lhs1[0][0]

cls1
>> [0.35013607 ... -0.36481357] (as above)

So far so good. My confusion arises when I now want to obtain the first 2 of the CLS embeddings from my dataset:

output_both = model.predict([data_x[0][:2], data_x[1][:2]])
lhs_both = output_both[0] # last hidden states

lhs_both.shape
>> (2, 256, 768)

cls_both = lhs_both[0][0] # I thought this would give me two CLS arrays including the first one above

Inspecting cls_both:

array([[[ 0.11075249, -0.02257648, -0.40831113, ...,  0.18384863,
          0.17032738, -0.05989586],
        [-0.22926208, -0.5627498 ,  0.2617012 , ...,  0.20701236,
          0.3141808 , -0.8650396 ],
        [-0.22352833, -0.49676323, -0.5286081 , ...,  0.23819353,
          0.3742358 , -0.69018203],
        ...,
        [ 0.5120927 , -0.09863365,  0.7378716 , ..., -0.19551781,
          0.45915398,  0.22804889],
        [-0.13397002,  0.1617202 ,  0.15663634, ..., -0.511597  ,
          0.3959382 ,  0.30565232],
        [-0.14100523,  0.22792323, -0.15898004, ..., -0.2690729 ,
          0.4730471 ,  0.18431285]],

       [[-0.20033133, -0.08412935, -0.0411438 , ...,  0.34706163,
          0.1919156 , -0.08740871],
        [-0.12536147, -0.44519228,  1.2984221 , ...,  0.07149828,
          0.7915938 ,  0.08048639],
        [ 0.4596323 , -0.3316555 ,  1.2545322 , ..., -0.02128018,
          0.5344383 ,  0.32054782],
        ...,
        [-0.54777217,  0.23129587,  0.5007771 , ...,  0.70299244,
          0.27277255, -0.2848366 ],
        [-0.49410668,  0.37352908,  0.8732239 , ...,  0.6065303 ,
          0.152081  , -0.9312557 ],
        [-0.33172935, -0.35368383,  0.5942321 , ...,  0.7171531 ,
          0.24436645,  0.08909844]]], dtype=float32)

I'm not sure how to interpret this - my expectation was to see the first rows CLS cls1 contained within cls_both, but as you can see, the first row in the first sub array is different. Can anyone explain this?

Furthermore, if I run only the second row through, I get exactly the same CLS token as the first, despite them containing totally different input_ids/masks:

output2 = model.predict([data_x[0][1], data_x[1][1]])
lhs2 = output2[0]
cls2 = lhs2[0][0]


cls2
>>
[ 0.35013607, -0.5340336 ,  0.28577858, ..., -0.03405955,
         -0.0165604 , -0.36481357]

cls1 == cls2
>> True

Edit

BERT sentence embeddings: how to obtain sentence embeddings vector

The above post explains that output[0][:,0,:] is the correct way to obtain exactly the CLS tokens, which makes things easier.
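As a quick sanity check on that slicing (a standalone numpy sketch, with random data standing in for the model's last_hidden_state), `[:, 0, :]` pulls the first-token (CLS) vector for every row in the batch:

```python
import numpy as np

# Stand-in for last_hidden_state with shape (batch_size, seq_len, hidden_size)
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(2, 256, 768)).astype(np.float32)

# One (768,) CLS vector per sentence in the batch
cls_tokens = last_hidden_state[:, 0, :]
print(cls_tokens.shape)  # (2, 768)

# For a single row this is the same as taking token 0 of sentence 0
print(np.array_equal(cls_tokens[0], last_hidden_state[0, 0]))  # True
```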

When I run three rows through, I get consistent results, but any time I run a single row through, I get the result shown in cls1 - why does this not differ each time?


1 Answer


I think there was an issue with the shape of the sliced data_x that you passed in.

Since you did not specify the shape of data_x, I first attempted to replicate it below:

from transformers import BertTokenizer
import numpy as np

text = ['a sample text', 'another text', 'the third text']

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer_output = bert_tokenizer(text, return_tensors='np', max_length=256, padding='max_length')

data_x = np.array([
    tokenizer_output['input_ids'],
    tokenizer_output['attention_mask']
])

print(data_x.shape)

where the shape of data_x is (2, 3, 256).

data_x[0][0] is not the correct way to slice

For your first row of data, you prepared the input_ids and attention_mask by slicing with data_x[0][0] and data_x[1][0], so the shape of both your input_ids and attention_mask became (256,):

print(data_x[0][0].shape) # (256,)
print(data_x[1][0].shape) # (256,)

while the TF model expects inputs of shape (batch_size, 256) for both input_ids_layer and input_mask_layer. Note that the shape argument supplied to Input does not include the batch size, as quoted from its documentation: "shape: A shape tuple (integers), not including the batch size."
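If you already have a single (256,) row, a minimal numpy sketch of restoring the batch dimension (so the array matches the expected (batch_size, 256)) looks like this; the zero array here is just a placeholder for one sliced row:

```python
import numpy as np

single_ids = np.zeros(256, dtype=np.int32)  # one row, sliced like data_x[0][0]
print(single_ids.shape)                     # (256,)

# Add the batch axis back before calling predict
batched_ids = single_ids[None, :]           # or np.expand_dims(single_ids, 0)
print(batched_ids.shape)                    # (1, 256)
```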

In fact, when I attempted to pass the input [data_x[0][0], data_x[1][0]] (both with shape (256,)), I received the following warning from TensorFlow:

WARNING:tensorflow:Model was constructed with shape (None, 256) for input KerasTensor(type_spec=TensorSpec(shape=(None, 256), dtype=tf.int32, name='input_3'), name='input_3', description="created by layer 'input_3'"), but it was called on an input with incompatible shape (32, 1).
...

The correct way to slice the input data

You should slice them without dropping the batch dimension, so that your input_ids and attention_mask remain of shape (batch_size, 256):

# For the first sentence only
input_1 = [data_x[0][0:1], data_x[1][0:1]]  # Shapes: (1, 256), (1, 256)

# For the second sentence only
input_2 = [data_x[0][1:2], data_x[1][1:2]]  # Shapes: (1, 256), (1, 256)

# For the first and second sentences
input_12 = [data_x[0][:2], data_x[1][:2]]   # Shapes: (2, 256), (2, 256)
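You can verify those shapes without running the tokenizer at all; a small sketch with a zero-filled stand-in for data_x (same (2, 3, 256) layout as above) shows the difference between integer indexing, which drops an axis, and slicing, which keeps it:

```python
import numpy as np

# Stand-in for data_x: (ids/mask, sentences, tokens)
data_x = np.zeros((2, 3, 256), dtype=np.int32)

# Integer indexing drops the batch axis; slicing preserves it
print(data_x[0][0].shape)    # (256,)   -> what triggers the warning
print(data_x[0][0:1].shape)  # (1, 256) -> what the model expects
print(data_x[0][:2].shape)   # (2, 256)
```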

After passing the above inputs to your model, you can get the CLS embeddings using output[0][:,0,:], as described in the link you shared.

You can confirm that the embeddings for the 1st and 2nd sentences are the same, whether you pass in input_1, input_2 or input_12:

output1 = model.predict(input_1)
output2 = model.predict(input_2)
output12 = model.predict(input_12)

lhs1 = output1[0]    # Shape: (1, 256, 768)
lhs2 = output2[0]    # Shape: (1, 256, 768)
lhs12 = output12[0]  # Shape: (2, 256, 768)

cls1 = lhs1[:,0,:]    # Shape: (1, 768)
cls2 = lhs2[:,0,:]    # Shape: (1, 768)
cls12 = lhs12[:,0,:]  # Shape: (2, 768)

# Check that cls1 is exactly the same as cls12[0]
print((cls1 == cls12[0]).all()) # True

# Likewise, cls2 is exactly the same as cls12[1]
print((cls2 == cls12[1]).all()) # True

Hope this clears things up for you. Always remember to check the input and output shapes of the model when you are in doubt.

Answered 2021-01-16T17:18:38.877