In many seq2seq implementations, I have seen that they use the accuracy metric when compiling the model and the BLEU score only on predictions.
Why don't they use the BLEU score during training as well, which seems like it would be more effective (if I understand it correctly)?
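To illustrate what I mean, here is a minimal, made-up sketch of the pattern I keep seeing, assuming a Keras/TensorFlow setup (the tiny model and random data are just placeholders): accuracy goes into `compile()`, while BLEU would only be computed on the decoded predictions afterwards.

```python
import numpy as np
import tensorflow as tf

vocab_size, seq_len = 50, 10

# Toy sequence model standing in for a real seq2seq architecture.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 32),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])

# Accuracy is the metric used during training: it compares each predicted
# token id with the reference token id available in the training data.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy data so the snippet runs end to end.
x = np.random.randint(0, vocab_size, size=(64, seq_len))
y = np.random.randint(0, vocab_size, size=(64, seq_len))
model.fit(x, y, epochs=1, verbose=0)

# BLEU would only come into play here, on the decoded (argmax) predictions.
pred_ids = model.predict(x, verbose=0).argmax(axis=-1)
print(pred_ids.shape)  # (64, 10)
```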
The Bilingual Evaluation Understudy (BLEU) score was meant to stand in for human evaluators, hence the word "understudy" in its name.
When you are training the model, you already have the target sequence for every example, so you can compare the generated output with it directly (token-level accuracy does exactly that). But when you predict, you have no per-token way of judging whether the sentence you translated into is a good translation. That is why BLEU is used: no human can check every machine translation to see whether the prediction is correct, so BLEU provides that sanity check by comparing the predicted sentence against reference translations.
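Here is a minimal sketch of that sanity check using NLTK's `corpus_bleu`; the sentences are made up purely for illustration, and in practice the hypotheses would be your model's decoded outputs.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each hypothesis is a tokenized model output; each entry in `references`
# is a list of acceptable reference translations for that hypothesis.
hypotheses = [
    "the cat sits on the mat".split(),
    "there is a dog in the garden".split(),
]
references = [
    ["the cat is sitting on the mat".split()],
    ["a dog is in the garden".split()],
]

# Smoothing avoids zero scores when some higher-order n-grams never match,
# which is common for short sentences.
smooth = SmoothingFunction().method1
score = corpus_bleu(references, hypotheses, smoothing_function=smooth)
print(f"corpus BLEU: {score:.3f}")
```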
P.S. An understudy is someone who learns from a mentor so they can step in for them if need be; BLEU "learns" from human reference translations and is then able to score a translation.
For further reference, check out https://www.youtube.com/watch?v=9ZvTxChwg9A&list=PL1w8k37X_6L_s4ncq-swTBvKDWnRSrinI&index=28
If you have any questions, comment below.