Although this is a typical use case, I can't find a single, clear guide on the canonical way to compute the loss on a padded minibatch in PyTorch when it is sent through an RNN.
I think a canonical pipeline could be:
1) The PyTorch RNN expects a padded batch tensor of shape (max_seq_len, batch_size, emb_size).
2) So we feed an Embedding layer, for example, this tensor:
tensor([[1, 1],
[2, 2],
[3, 9]])
9 is the padding index and the batch size is 2. The Embedding layer turns this into a tensor of shape (max_seq_len, batch_size, emb_size). The sequences in the batch are sorted in descending order of length, so we can pack it.
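Concretely, I have something like this in mind (emb_size=4 and the vocabulary size of 10 are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

PAD_IDX = 9  # padding index from the example above

# padded batch of token indices, shape (max_seq_len=3, batch_size=2)
batch = torch.tensor([[1, 1],
                      [2, 2],
                      [3, PAD_IDX]])

emb_size = 4  # arbitrary embedding size for illustration
embedding = nn.Embedding(num_embeddings=10, embedding_dim=emb_size,
                         padding_idx=PAD_IDX)

embedded = embedding(batch)
print(embedded.shape)  # torch.Size([3, 2, 4])
```

Passing padding_idx makes the embedding of the pad token a fixed zero vector, which seems like the sensible thing to do here.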
3) We apply pack_padded_sequence, run the RNN, and finally apply pad_packed_sequence. At this point we have a tensor of shape (max_seq_len, batch_size, hidden_size).
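Sketching that step (I'm using a GRU with hidden_size=5 as an arbitrary stand-in for any RNN, and random embeddings in place of the output of the Embedding layer):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

max_seq_len, batch_size = 3, 2
emb_size, hidden_size = 4, 5  # arbitrary sizes for illustration

# stand-in for the embedded batch from step 2
embedded = torch.randn(max_seq_len, batch_size, emb_size)
# actual (unpadded) sequence lengths, in descending order
lengths = torch.tensor([3, 2])

rnn = nn.GRU(emb_size, hidden_size)

packed = pack_padded_sequence(embedded, lengths)
packed_out, _ = rnn(packed)
output, out_lengths = pad_packed_sequence(packed_out)
print(output.shape)  # torch.Size([3, 2, 5])
```

Note that pad_packed_sequence fills the positions beyond each sequence's length with zeros, so the RNN never actually computes anything for the padded steps.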
4) Now we apply the linear output layer to the result and, let's say, log_softmax. So at the end we have a batch of scores of shape (max_seq_len, batch_size, linear_out_size).
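That is, something like this (linear_out_size=10 is arbitrary, e.g. a vocabulary size, and the random tensor stands in for the RNN output from step 3):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

max_seq_len, batch_size, hidden_size = 3, 2, 5  # arbitrary sizes
linear_out_size = 10  # e.g. the vocabulary size

# stand-in for the RNN output after pad_packed_sequence
rnn_out = torch.randn(max_seq_len, batch_size, hidden_size)

out_layer = nn.Linear(hidden_size, linear_out_size)
scores = F.log_softmax(out_layer(rnn_out), dim=-1)
print(scores.shape)  # torch.Size([3, 2, 10])
```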
How should I compute the loss from here, masking out the padded part (with an arbitrary target)? Thanks!
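For reference, here is the one option I've considered: flattening the scores and targets and using ignore_index in NLLLoss so the padded positions are skipped. I don't know whether this is the canonical approach, though (the random scores just stand in for step 4's output):

```python
import torch
import torch.nn as nn

PAD_IDX = 9  # padding index used in the targets
max_seq_len, batch_size, linear_out_size = 3, 2, 10

# stand-in for the log-softmax scores from step 4, plus an arbitrary padded target
scores = torch.randn(max_seq_len, batch_size, linear_out_size).log_softmax(dim=-1)
targets = torch.tensor([[1, 1],
                        [2, 2],
                        [3, PAD_IDX]])

# ignore_index makes NLLLoss skip the padded positions entirely
criterion = nn.NLLLoss(ignore_index=PAD_IDX)
loss = criterion(scores.view(-1, linear_out_size), targets.view(-1))
print(loss.item())
```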