python - XML parsing terminates inexplicably

Question

I have a file containing sentences wrapped in well-formed XML (xmllint and tidylib both say so). The XML looks like this:

<a id="100" attr1="text" attr1="text" attr1="text">
<tagname id="1">
This is my sentence.
</tagname>
</a>
<a id="101" attr1="text" attr1="text" attr1="text">
<tagname id="1">
This is my sentence.
</tagname>
</a>

and so on.

I use the following code to extract the sentences that have attributes (in this case from id 1 to 85):

a1 = open(r"file.xml",'r')
a = a1.readlines()
a1.close()
soup = BeautifulSoup(str(a))
for i in range(1,85):
    a = soup.find('a', {'id': i})
    achild = a.find('tagname')
    tagnametext = achild.contents
    print tagnametext

Everything prints fine until the 84th sentence, where I get the error: achild = a.find('tagname') AttributeError: 'NoneType' object has no attribute 'find'

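Each set of ... was generated with a for loop, so the XML is all the same. I have tried different files with different numbers of sentences, and the id at which the error occurs changes as well. Is this a limitation of BeautifulSoup? Can it not scan more than a certain number of lines?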

score 0 · Accepted Answer

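It is failing on the last line. This could be a file-encoding problem: the line may contain some funny EOF character, or it may not be interpreted as a string. Can you print out the last line before it fails and check what type it is?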

score 0 · Accepted Answer

Most likely a = soup.find('a', {'id': i}) with i = 84 does not return what you expect. find() returns None if no tag is found, which explains the AttributeError.
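As a minimal sketch of guarding against that None, the snippet below looks up an id that exists and one that does not. The sample markup, the html.parser choice, and the helper name sentence_for are assumptions made for illustration; they are not taken from the question's file.

```python
from bs4 import BeautifulSoup

# Hypothetical sample data standing in for file.xml.
xml = """<a id="100"><tagname id="1">
This is my sentence.
</tagname></a>
<a id="101"><tagname id="1">
Another sentence.
</tagname></a>"""

soup = BeautifulSoup(xml, "html.parser")


def sentence_for(soup, wanted_id):
    """Return the sentence under <a id=wanted_id>, or None if it is missing."""
    # Attribute values are parsed as strings, so convert the id explicitly.
    node = soup.find("a", {"id": str(wanted_id)})
    if node is None:
        # find() found no matching tag; calling node.find() here would raise
        # AttributeError: 'NoneType' object has no attribute 'find'.
        return None
    child = node.find("tagname")
    if child is None:
        return None
    return child.get_text().strip()


found = sentence_for(soup, 100)
missing = sentence_for(soup, 84)
print(found)
print(missing)
```

Checking for None before chaining a second find() turns the crash into a value the loop can skip or report.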

Also, in your code you appear to be BeautifulSouping a list (represented as a string):

soup = BeautifulSoup(str(a))
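To make the difference concrete, here is a small standard-library sketch of what str() does to a list like the one returned by readlines(), compared with joining the lines back together. The sample lines and the variable names are hypothetical.

```python
# Hypothetical stand-in for the list returned by a1.readlines().
lines = [
    '<a id="100">\n',
    '<tagname id="1">\n',
    'This is my sentence.\n',
    '</tagname>\n',
    '</a>\n',
]

# str() on a list gives its repr: brackets, quotes, commas, and a literal
# backslash followed by "n" in place of each real newline.
as_repr = str(lines)

# Joining the lines reproduces the original file contents.
as_text = "".join(lines)

print(as_repr)
print(as_text)
```

The first print shows the extra list punctuation that the parser would have to cope with; the second shows the markup as it appears in the file.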

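You are stringifying a list and then souping the result, which is silly. How about souping the whole file and then looping over every a tag that has an id?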

from bs4 import BeautifulSoup
with open('file.xml', 'r') as myfile:
    soup = BeautifulSoup(myfile.read())
    for i in soup.find_all('a', id=True):
        print i.tagname.contents

Prints:

[u'\nThis is my sentence.\n']
[u'\nThis is my sentence.\n']
