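You really should use a parser like BeautifulSoup for this job. BeautifulSoup can parse badly malformed HTML/XML and tries to repair it. Your code might look like this (I'm assuming you have some tags before and after the broken Story tag; otherwise you would simply follow the advice in David's comment):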
from BeautifulSoup import BeautifulStoneSoup
html = '''
<Document>
<PrevTag></PrevTag>
<Story>
<Sentence id="1"> some text </Sentence>
<Sentence id="2"> some text </Sentence>
<Sentence id="3"> some text </Sentence>
<EndTag></EndTag>
</Document>
'''
# Parse the document:
soup = BeautifulStoneSoup(html)
See how BeautifulSoup parsed it:
print soup.prettify()
#<document>
# <prevtag>
# </prevtag>
# <story>
#  <sentence id="1">
#   some text
#  </sentence>
#  <sentence id="2">
#   some text
#  </sentence>
#  <sentence id="3">
#   some text
#  </sentence>
#  <endtag>
#  </endtag>
# </story>
#</document>
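Note that BeautifulSoup closed Story only just before closing the tag that encloses it (Document), so EndTag ended up inside Story. You therefore have to move Story's closing tag up so that it sits right after the last sentence.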
# Find the last sentence:
last_sentence = soup.findAll('sentence')[-1]
# Find the Story tag:
story = soup.find('story')
# Move all tags after the last sentence outside the Story tag:
sib = last_sentence.nextSibling
while sib:
    # extract() detaches sib from Story, so the following node becomes the new nextSibling
    story.parent.append(sib.extract())
    sib = last_sentence.nextSibling
print soup.prettify()
#<document>
# <prevtag>
# </prevtag>
# <story>
#  <sentence id="1">
#   some text
#  </sentence>
#  <sentence id="2">
#   some text
#  </sentence>
#  <sentence id="3">
#   some text
#  </sentence>
# </story>
# <endtag>
# </endtag>
#</document>
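The final result should be exactly what you want. Note that this code assumes there is only one Story in the document; if there are several, it needs a small modification. Good luck!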
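The snippets above use the BeautifulSoup 3 API and the Python 2 print statement. If you are on Python 3, here is a rough sketch of the same idea for the bs4 package, extended to loop over every Story rather than just one. It makes several assumptions: bs4 is installed, the built-in "html.parser" backend is used instead of BeautifulStoneSoup (it also lowercases tag names and closes the unclosed Story when it reaches the end of Document), and the function name close_story_after_last_sentence is made up for this example. How several unclosed Story tags get nested depends on the parser's recovery, so check the result against your real data.

```python
from bs4 import BeautifulSoup


def close_story_after_last_sentence(markup):
    """Move every node that follows the last <sentence> of a <story> out of that story."""
    # Assumption: html.parser is an acceptable stand-in for BeautifulStoneSoup here.
    soup = BeautifulSoup(markup, "html.parser")
    for story in soup.find_all("story"):
        # Only look at sentences that are direct children of this story.
        sentences = story.find_all("sentence", recursive=False)
        if not sentences:
            continue
        # Snapshot the trailing siblings first, because moving nodes changes the tree.
        trailing = list(sentences[-1].next_siblings)
        anchor = story
        for node in trailing:
            # Re-insert each node right after the previous one to keep the original order.
            anchor.insert_after(node.extract())
            anchor = node
    return soup


sample = """
<Document>
<PrevTag></PrevTag>
<Story>
<Sentence id="1"> some text </Sentence>
<Sentence id="2"> some text </Sentence>
<Sentence id="3"> some text </Sentence>
<EndTag></EndTag>
</Document>
"""

fixed = close_story_after_last_sentence(sample)
print(fixed.prettify())
```

Unlike the loop above, which appends the moved nodes to the end of the parent, this version inserts them directly after each Story, so content that follows a later Story is not reordered.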