In the last self-attention layer of a transformer, it seems that the larger the attention score between two tokens, the more similar their representations are after that layer, i.e. they end up very close together in the vector space. I don't understand why this happens. Can someone explain it?
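
To make the observation concrete, here is a toy sketch of what I mean (plain NumPy, single-head scaled dot-product attention with Q = K = V = X, no learned projections, residual connection, or LayerNorm; the embeddings are made up, and tokens 0 and 1 are deliberately constructed so that they attend strongly to each other):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # embedding dimension
X = rng.normal(size=(4, d))             # 4 toy token embeddings
X[1] = X[0] + 0.1 * rng.normal(size=d)  # token 1 nearly aligned with token 0

scores = X @ X.T / np.sqrt(d)                 # attention logits
scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
W = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
out = W @ X                                   # attention outputs

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print("attention weight 0 -> 1:", W[0, 1])
print("cosine(out[0], out[1]):", cosine(out[0], out[1]))  # high-attention pair
print("cosine(out[0], out[3]):", cosine(out[0], out[3]))  # unrelated pair
```

In this toy case the pair with the large mutual attention weight ends up with outputs that are almost identical in direction, while the unrelated pair does not, which is the pattern I also see in the last layer of a real model.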