python - Double question mark in the XML declaration with Python BeautifulSoup

Question

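I believe this must be a bug, so I have filed a bug report. On the other hand, I may be missing something, so I would like a second look at the code.

The problem is that when I initialize BeautifulSoup with the contents of an .xhtml file, two question marks appear at the end of the XML declaration.

Can you reproduce the problem? Is there a way to avoid it? Am I missing a function, a method, an argument, or something else?

Edit0: This is BeautifulSoup 4 on Python 2.x.

Edit1: Why the downvote?

The problem: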

<?xml version="1.0" encoding="UTF-8"??>

Terminal output:

>>> from bs4 import BeautifulSoup as bs
>>> with open('example.xhtml', 'r') as f:
...     txt = f.read()
...     soup = bs(txt)
... 
>>> print txt
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <meta charset="utf-8"/>
    </head>
    <body>
    </body>
</html>

>>> print soup
<?xml version="1.0" encoding="UTF-8"??>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8"/>
</head>
<body>
</body>
</html>

score 2 · Accepted Answer

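This is a bug. I have committed a fix, and it will be released in the next version of Beautiful Soup.

Root cause:

The HTMLParser class uses the SGML syntactic rules for processing instructions. An XHTML processing instruction that uses the trailing '?' causes the '?' to be included in the data. When the instruction is written back out, a second '?' is appended, which produces the '??' seen above.

In general, as ThiefMaster suggested, you will get better results parsing XHTML with the "xml" parser.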
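The root cause can be observed with the standard-library parser alone, without Beautiful Soup. The following is a minimal sketch, assuming Python 3, where the module is named html.parser (the question itself uses Python 2). The class name PICollector and its instructions attribute are made up for this illustration.

```python
from html.parser import HTMLParser


class PICollector(HTMLParser):
    """Hypothetical helper that records every processing instruction it sees."""

    def __init__(self):
        super().__init__()
        self.instructions = []

    def handle_pi(self, data):
        # HTMLParser passes everything between '<?' and '>' as data,
        # so the trailing '?' of an XML-style instruction is part of it.
        self.instructions.append(data)


parser = PICollector()
parser.feed('<?xml version="1.0" encoding="UTF-8"?>')
parser.close()

# The recorded data is expected to end with '?'.
print(parser.instructions)
```

Any code that later serializes this data by wrapping it in '<?' and '?>' would emit the doubled question mark, which matches the output shown in the question.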

score 1 · Accepted Answer

Consider using the XML parser:

soup = bs(txt, 'xml')

Answered 2012-04-18T06:42:03.917

score 0 · Accepted Answer

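Using the contents of the variable "txt" from the example.xhtml file, I cannot reproduce the problem with Python 2.7 and the corresponding BeautifulSoup module (not bs4). It works fine and dandy for me.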

>>> soup = BeautifulSoup.BeautifulSoup(txt)
>>> print soup
<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8" />
</head>
<body>
</body>
</html>

What problem are you running into, and what is your end goal? Maybe someone can suggest a workaround.
