I have an interesting question about BERT.
Can I simplify the architecture of the model by saying that the similarity of two words in different contexts will depend on the similarity of the input embeddings making up those contexts? For example, can I say that the similarity of the embeddings of GLASS in the context DRINK_GLASS and of WINE in the context LOVE_WINE will depend on the similarity of the input embeddings of GLASS and WINE (last position) and of DRINK and LOVE (first position)? Or should I also take into account the similarity between DRINK (first context, first position) and WINE (second context, second position), and between LOVE and GLASS (vice versa)?
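To make this concrete, here is roughly the kind of experiment I'm trying to run (a minimal sketch in Python; I'm assuming the Hugging Face transformers library, bert-base-uncased, and the last hidden layer as the "contextual embedding" — those are just my choices, not necessarily the right ones):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def contextual_embedding(sentence, target_word):
    # Tokenize the sentence and locate the target token's position
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    idx = tokens.index(target_word)  # assumes the word is a single token
    with torch.no_grad():
        outputs = model(**inputs)
    # Last hidden state at the target position = contextual embedding
    return outputs.last_hidden_state[0, idx]

glass = contextual_embedding("drink glass", "glass")
wine = contextual_embedding("love wine", "wine")
sim = torch.nn.functional.cosine_similarity(glass, wine, dim=0)
print(f"cosine similarity: {sim.item():.3f}")
```

My question is essentially whether this similarity can be decomposed into the position-wise input-embedding similarities, or whether the cross-position pairs matter too.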
Thanks for your help. For now it is really difficult for me to understand the architecture of BERT exactly, but I'm trying to run some experiments, so I need to understand the basics.