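I'm using Python's lxml to read an XML document, modify it, and write it back, but the original DOCTYPE and XML declaration disappear. Is there a simple way to put them back, either through lxml or some other solution?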
Viewed 7464 times
2 answers
13
tl;dr
# adds declaration with version and encoding regardless of
# which attributes were present in the original declaration
# expects utf-8 encoding (encode/decode calls)
# depending on your needs you might want to improve that
from lxml import etree
from xml.dom.minidom import parseString

xml1 = '''\
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root SYSTEM "example.dtd">
<root>...</root>
'''
xml2 = '''\
<root>...</root>
'''

def has_xml_declaration(xml):
    # minidom leaves version as None when the input has no XML declaration
    return parseString(xml).version

def process(xml):
    t = etree.fromstring(xml.encode()).getroottree()
    if has_xml_declaration(xml):
        print(etree.tostring(t, xml_declaration=True, encoding=t.docinfo.encoding).decode())
    else:
        print(etree.tostring(t).decode())

process(xml1)
process(xml2)
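As the comments above say, that snippet assumes UTF-8 and prints its result. One possible refinement is to work on bytes throughout and return the serialized document. This is only a sketch: `reserialize` is a hypothetical helper name, and the `startswith(b"<?xml")` check assumes an ASCII-compatible encoding with no byte order mark in front of the declaration.

```python
from lxml import etree


def reserialize(xml_bytes):
    """Parse and re-serialize, emitting a declaration only if the input had one."""
    tree = etree.fromstring(xml_bytes).getroottree()
    # Assumption: ASCII-compatible encoding and no BOM before the declaration.
    had_declaration = xml_bytes.startswith(b"<?xml")
    # Fall back to UTF-8 in case the parser reports no encoding.
    encoding = tree.docinfo.encoding or "UTF-8"
    return etree.tostring(tree, xml_declaration=had_declaration, encoding=encoding)


with_decl = reserialize(
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<!DOCTYPE root SYSTEM "example.dtd">\n'
    b'<root>...</root>\n'
)
without_decl = reserialize(b'<root>...</root>')
print(with_decl.decode("utf-8"))
print(without_decl.decode("utf-8"))
```

Returning bytes leaves it to the caller to decide whether to decode the result or write it straight to a file opened in binary mode.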
The following will include the DOCTYPE and the XML declaration:
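from io import BytesIO
from lxml import etree

# feed the parser bytes so that it honors the declared encoding
tree = etree.parse(BytesIO(b'''<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "eggs"> ]>
<root>
<a>&tasty;</a>
</root>
'''))
docinfo = tree.docinfo
# tostring() returns bytes in the requested encoding; decode them for printing
print(etree.tostring(tree, xml_declaration=True, encoding=docinfo.encoding).decode(docinfo.encoding))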
Note that tostring does not preserve the DOCTYPE if you create an Element (e.g. using fromstring); it only works when you process the XML using parse.
Update: as J.F. Sebastian pointed out, my assertion about fromstring is not correct.
Here is some code to highlight the differences between Element and ElementTree serialization:
from io import BytesIO
from lxml import etree

# bytes, because lxml's fromstring() rejects str input with an encoding declaration
xml_bytes = b'''<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "eggs"> ]>
<root>
<a>&tasty;</a>
</root>
'''

# get the ElementTree using parse
parse_tree = etree.parse(BytesIO(xml_bytes))
encoding = parse_tree.docinfo.encoding
result = etree.tostring(parse_tree, xml_declaration=True, encoding=encoding)
print("%s\nparse ElementTree:\n%s\n" % ('-'*20, result.decode(encoding)))

# get the ElementTree using fromstring
fromstring_tree = etree.fromstring(xml_bytes).getroottree()
encoding = fromstring_tree.docinfo.encoding
result = etree.tostring(fromstring_tree, xml_declaration=True, encoding=encoding)
print("%s\nfromstring ElementTree:\n%s\n" % ('-'*20, result.decode(encoding)))

# DOCTYPE is lost, and no access to encoding
fromstring_element = etree.fromstring(xml_bytes)
result = etree.tostring(fromstring_element, xml_declaration=True)
print("%s\nfromstring Element:\n%s\n" % ('-'*20, result.decode('ascii')))
The output is:
--------------------
parse ElementTree:
<?xml version='1.0' encoding='iso-8859-1'?>
<!DOCTYPE root SYSTEM "test" [
<!ENTITY tasty "eggs">
]>
<root>
<a>eggs</a>
</root>
--------------------
fromstring ElementTree:
<?xml version='1.0' encoding='iso-8859-1'?>
<!DOCTYPE root SYSTEM "test" [
<!ENTITY tasty "eggs">
]>
<root>
<a>eggs</a>
</root>
--------------------
fromstring Element:
<?xml version='1.0' encoding='ASCII'?>
<root>
<a>eggs</a>
</root>
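The declaration and DOCTYPE details live on the tree's `docinfo`, not on the root Element, which is why serializing the bare Element drops them. The sketch below, assuming Python 3 and a reasonably recent lxml, inspects those fields for the same sample document; the variable names are illustrative only.

```python
from lxml import etree

xml_bytes = b'''<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "eggs"> ]>
<root>
<a>&tasty;</a>
</root>
'''

root = etree.fromstring(xml_bytes)
# getroottree() gives access to the document-level information
docinfo = root.getroottree().docinfo

print(docinfo.xml_version)   # version from the XML declaration
print(docinfo.encoding)      # encoding from the XML declaration
print(docinfo.root_name)     # root element name from the DOCTYPE
print(docinfo.system_url)    # system identifier from the DOCTYPE

# Serializing only the Element leaves the DOCTYPE out
element_only = etree.tostring(root)
print(b"<!DOCTYPE" in element_only)
```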
answered 2012-10-19T03:14:19.157
7
You can also preserve the DOCTYPE and the XML declaration with fromstring():
import sys
from lxml import etree

# bytes, because lxml's fromstring() rejects str input with an encoding declaration
xml = br'''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>example</title>
</head>
<body>
<p>This is an example</p>
</body>
</html>'''
tree = etree.fromstring(xml).getroottree() # or etree.parse(file)
# write() emits encoded bytes, so target the binary buffer behind sys.stdout
tree.write(sys.stdout.buffer, xml_declaration=True, encoding=tree.docinfo.encoding)
Output:
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>example</title>
</head>
<body>
<p>This is an example</p>
</body>
</html>
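Note that the XML declaration (with the correct encoding) and the doctype are both present. It even (possibly incorrectly) uses ' instead of " in the XML declaration and adds Content-Type to the <head>.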
For @John Keyes' example input it produces the same results as etree.tostring() in his answer.
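For the read, modify, write-back workflow in the question, the same idea can be applied to files. This is a rough sketch rather than a definitive recipe: `rewrite_xml` and its `modify` callback are hypothetical names introduced here, and the UTF-8 fallback is an assumption for documents where no encoding is reported.

```python
from lxml import etree


def rewrite_xml(src_path, dst_path, modify):
    """Parse src_path, apply modify(root), and write to dst_path
    with the XML declaration and DOCTYPE included."""
    tree = etree.parse(src_path)
    modify(tree.getroot())
    # Assumption: fall back to UTF-8 if no encoding is reported
    encoding = tree.docinfo.encoding or "UTF-8"
    tree.write(dst_path, xml_declaration=True, encoding=encoding)
```

A call might look like `rewrite_xml("in.xml", "out.xml", lambda root: root.set("checked", "yes"))`, where the file names and the attribute are placeholders.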
answered 2012-10-19T03:43:42.017