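As commented, your problem lies in sentence segmentation, since your data allows any input (with or without proper punctuation). But somehow your capitalization is good, so you can try the following recipe to split the sentences by capitalization.
Disclaimer: if one of your sentences starts with "I", the recipe below won't help much =)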
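"Something gotta change It must be rearranged I'm sorry, I did not mean to hurt my little girl It's beyond me I cannot carry the weight of the heavy world So good night, good night, good night, good night Good night, good night, good night, good night, good night Hope that things work out all right So much to love, so much to learn But I won't be there to teach you Oh, I know I can be close But I try my best to reach you I'm so sorry I didn't not mean to hurt my little girl It's beyond me I cannot carry the weight of the heavy world So good night, good night, good night, good night Good night, good night, good night, good night Good night, good night, good night good night, good night Hope that things work out all right, yeah Thank you."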
In Python, you can try this to segment the sentences:
sentence = "Something gotta change It must be rearranged I'm sorry, I did not mean to hurt my little girl It's beyond me I cannot carry the weight of the heavy world So good night, good night, good night, good night Good night, good night, good night, good night, good night Hope that things work out all right So much to love, so much to learn But I won't be there to teach you Oh, I know I can be close But I try my best to reach you I'm so sorry I didn't not mean to hurt my little girl It's beyond me I cannot carry the weight of the heavy world So good night, good night, good night, good night Good night, good night, good night, good night Good night, good night, good night good night, good night Hope that things work out all right, yeah Thank you."
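temp = []       # words of the sentence currently being built
sentences = []  # finished sentences
for word in sentence.split():
    # A capitalized word other than the pronoun "I" starts a new sentence.
    if word[0].isupper() and word != "I":
        sentences.append(" ".join(temp))
        temp = [word]
    else:
        temp.append(word)
sentences.append(" ".join(temp))  # flush the last sentence
sentences.pop(0)  # drop the empty string added before the very first word
print(sentences)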
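If you want to reuse the recipe, it could be wrapped in a function roughly like the sketch below. The name `split_on_capitals` and the `exceptions` parameter are my own, just for illustration, and the short `sample` string is cut down from the lyrics above. It also shows the limitation from the disclaimer: "I cannot carry the weight" stays glued to the previous sentence.

```python
def split_on_capitals(text, exceptions=("I",)):
    """Split text into sentences at capitalized words.

    Words listed in `exceptions` (hypothetical parameter) never
    start a new sentence, even though they are capitalized.
    """
    sentences = []
    current = []  # words of the sentence being built
    for word in text.split():
        starts_new = word[0].isupper() and word not in exceptions
        # Only close a sentence if there is something to close, so no
        # empty string is produced before the first word.
        if starts_new and current:
            sentences.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        sentences.append(" ".join(current))
    return sentences


sample = ("It's beyond me I cannot carry the weight "
          "So good night, good night Hope that things work out")
print(split_on_capitals(sample))
# ["It's beyond me I cannot carry the weight",
#  'So good night, good night',
#  'Hope that things work out']
```

Contractions such as "I'm" still start a new sentence here, same as in the loop above, because only the exact word "I" is excluded.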
Then follow this "Stanford Parser and NLTK" post to parse each sentence.