
I'm using the Stanford Parser from the command line:

java -mx1500m -cp stanford-parser.jar;stanford-parser-models.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat "penn"  edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz {file}

When I run the command on a 27-word sentence, the Java process consumes about 100 MB of memory and parsing takes 1.5 seconds. When I run it on a 148-word sentence, the Java process consumes about 1.5 GB of memory and parsing takes 1.5 minutes.

The machine I'm using is a Windows 7 box with an Intel i5 at 2.53 GHz.
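For reference, this is roughly how I'm timing the runs (a rough sketch: input.txt is just a stand-in for {file}, and the jars are assumed to be in the working directory):

import subprocess, time

# Same invocation as above; the list form avoids shell quoting issues.
cmd = ['java', '-mx1500m',
       '-cp', 'stanford-parser.jar;stanford-parser-models.jar',
       'edu.stanford.nlp.parser.lexparser.LexicalizedParser',
       '-outputFormat', 'penn',
       'edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz',
       'input.txt']  # stand-in for {file}

t0 = time.time()
subprocess.call(cmd)
print(time.time() - t0)  # wall-clock seconds, JVM startup included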

Are these processing times reasonable? Are there any official performance benchmarks for the parser?

1 Answer

As noted in the comments, your problem lies in sentence splitting, since your data allows arbitrary input (with or without proper punctuation). But luckily your capitalization is intact, so you can try the following approach: split the sentences on capitalized words.

Disclaimer: if your sentences start with I, the recipe below won't help much =)


In Python, you can try something like this to split the sentences:

sentence = "Something gotta change It must be rearranged I'm sorry, I did not mean to hurt my little girl It's beyond me I cannot carry the weight of the heavy world So good night, good night, good night, good night Good night, good night, good night, good night, good night Hope that things work out all right So much to love, so much to learn But I won't be there to teach you Oh, I know I can be close But I try my best to reach you I'm so sorry I didn't not mean to hurt my little girl It's beyond me I cannot carry the weight of the heavy world So good night, good night, good night, good night Good night, good night, good night, good night Good night, good night, good night good night, good night Hope that things work out all right, yeah Thank you."

temp = []          # words of the sentence currently being built
sentences = []     # completed sentences
for word in sentence.split():
    # A capitalized word (other than bare "I") starts a new sentence.
    if word[0].isupper() and word != "I":
        sentences.append(" ".join(temp))
        temp = [word]
    else:
        temp.append(word)
sentences.append(" ".join(temp))  # flush the last sentence
sentences.pop(0)  # drop the empty string appended on the very first word
print(sentences)
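With the lyrics above, the first few splits come out as:

['Something gotta change', 'It must be rearranged', "I'm sorry, I did not mean to hurt my little girl", "It's beyond me I cannot carry the weight of the heavy world", ...]

Note how the != "I" guard keeps the bare I in "beyond me I cannot carry" from starting a new sentence, while I'm and It's still do.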

Then follow Stanford Parser and NLTK to parse the split sentences.
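A minimal sketch of that route, assuming NLTK's old nltk.parse.stanford interface and that the jar paths below match your setup:

from nltk.parse.stanford import StanfordParser

# Jar and model paths are assumptions -- point them at your actual files.
parser = StanfordParser(
    path_to_jar='stanford-parser.jar',
    path_to_models_jar='stanford-parser-models.jar',
    model_path='edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz')

# raw_parse_sents takes an iterable of sentence strings and yields
# an iterator of candidate parse trees for each sentence.
for parses in parser.raw_parse_sents(sentences):
    for tree in parses:
        print(tree)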

answered 2013-06-14T10:50:20.833