In the last self-attention layer of a transformer, it seems that the larger the attention score between two tokens, the more similar their representations are after that layer, i.e. they end up very close together in the vector space. I don't understand why this happens. Can someone explain it?
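
To make the observation concrete, here is a toy sketch of what I mean (plain NumPy, single-head scaled dot-product attention with Q = K = V = X, no learned projections, residual connection, or LayerNorm; the embeddings are made up, and tokens 0 and 1 are deliberately constructed so that they attend strongly to each other):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # embedding dimension
X = rng.normal(size=(4, d))             # 4 toy token embeddings
X[1] = X[0] + 0.1 * rng.normal(size=d)  # token 1 nearly aligned with token 0

scores = X @ X.T / np.sqrt(d)                 # attention logits
scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
W = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
out = W @ X                                   # attention outputs

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print("attention weight 0 -> 1:", W[0, 1])
print("cosine(out[0], out[1]):", cosine(out[0], out[1]))  # high-attention pair
print("cosine(out[0], out[3]):", cosine(out[0], out[3]))  # unrelated pair
```

In this toy case the pair with the large mutual attention weight ends up with outputs that are almost identical in direction, while the unrelated pair does not, which is the pattern I also see in the last layer of a real model.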