I'm trying to understand how ELMo is designed and how it works, and I have a couple of questions:
- Is the ELMo architecture (visualized in the figure below) used for training the model, or for generating the context-dependent embeddings with the pre-trained model? Or is it the same for both?
- Before the input reaches the Bi-LSTM layers, it is passed through a character-based convolutional neural network (CNN) that converts each word into a raw (context-independent) word vector. How does the CNN do this? Are there any helpful references?
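To make the second question concrete, here is my rough understanding of that character-CNN step as a toy NumPy sketch. All dimensions, filter counts, and the lowercase-only character vocabulary are made up for illustration; the real ELMo character CNN is much larger and also adds a highway network and a linear projection on top:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hyperparameters (assumed, not ELMo's real values)
CHAR_EMB_DIM = 4          # embedding size per character
FILTER_WIDTHS = [2, 3]    # convolution window sizes over characters
FILTERS_PER_WIDTH = 3     # number of filters for each width

# Character embedding table for lowercase letters (assumed vocabulary)
char_emb = rng.standard_normal((26, CHAR_EMB_DIM))

# One weight matrix per filter width: (width * emb_dim, n_filters)
conv_weights = {w: rng.standard_normal((w * CHAR_EMB_DIM, FILTERS_PER_WIDTH))
                for w in FILTER_WIDTHS}

def word_vector(word):
    """Character CNN: embed chars, convolve over them, max-pool over positions."""
    chars = np.stack([char_emb[ord(c) - ord('a')] for c in word])   # (len, emb)
    pooled = []
    for w, W in conv_weights.items():
        # Slide a window of w characters along the word; each window
        # is flattened and multiplied by the filters of that width.
        windows = np.stack([chars[i:i + w].ravel()
                            for i in range(len(word) - w + 1)])     # (n_win, w*emb)
        feats = np.maximum(windows @ W, 0)                          # ReLU activation
        pooled.append(feats.max(axis=0))    # max-pool over character positions
    # Concatenating the pooled features gives one fixed-size vector
    # per word, regardless of the word's length.
    return np.concatenate(pooled)

print(word_vector("elmo").shape)   # (6,) = sum of filters over all widths
print(word_vector("bank").shape)   # same size for a different word
```

The key point, as I understand it, is that max-pooling over character positions yields a fixed-size vector for any word length, so even out-of-vocabulary words get an embedding from their spelling.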
Thank you.