What exactly should we do in the preprocessing step for GPT-2? Are there any guidelines?
Would the following be appropriate preprocessing steps?
1. Remove any \n from the sentence
2. Remove extra spaces from the sentence
3. Leave everything else that is part of the sentence but is not exactly a word (e.g. URLs, non-English words that may appear in an English sentence, emojis, etc.)
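
To make the three steps above concrete, here is a minimal sketch of what I have in mind. The function name `preprocess` is just something I made up for illustration, and this is only my current guess at a reasonable pipeline, not an official GPT-2 recipe.

```python
import re


def preprocess(sentence: str) -> str:
    """Hypothetical helper: applies only the three steps listed above."""
    # Step 1: replace newlines with a space so words on adjacent lines do not merge.
    text = sentence.replace("\r\n", " ").replace("\n", " ")
    # Step 2: collapse runs of whitespace into one space and trim both ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Step 3: deliberately do nothing else, so URLs, non-English words,
    # emojis, and punctuation are left untouched.
    return text


print(preprocess("Hello\nworld   see https://example.com  😀 "))
# prints: Hello world see https://example.com 😀
```

Note that nothing here strips punctuation or non-ASCII characters; the only thing that changes is whitespace.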
Wouldn't it be better to remove extra punctuation or any non-English characters?