How do I resolve external unparsed entities while parsing with lxml?
Here is my code sample:
import io
from lxml import etree
content = b"""\
<?xml version="1.0"?>
<!DOCTYPE sample [
<!NOTATION jpeg SYSTEM "image/jpeg">
<!ENTITY ref1 SYSTEM "python-logo-small.jpg" NDATA jpeg>
<!ELEMENT sample EMPTY>
<!ATTLIST sample src ENTITY #REQUIRED>
]>
<sample src="ref1"/>
"""
parser = etree.XMLParser(dtd_validation=True, resolve_entities=True)
doc = etree.parse(io.BytesIO(content), parser=parser)
print(etree.tostring(doc))
Note: I am using lxml >= 3.4
Currently I get the following result:
<!DOCTYPE sample [
<!NOTATION jpeg SYSTEM "image/jpeg" >
<!ENTITY ref1 SYSTEM "python-logo-small.jpg" NDATA jpeg>
<!ELEMENT sample EMPTY>
<!ATTLIST sample src ENTITY #REQUIRED>
]>
<sample src="ref1"/>
Here, the ref1 entity is not resolved to "python-logo-small.jpg". I expected <sample src="python-logo-small.jpg"/>. What is wrong?
I also tried:
parser = etree.XMLParser(dtd_validation=True, resolve_entities=True, load_dtd=True)
But I get the same result.
Alternatively, I would like to resolve the entities myself. To do that, I tried listing the entities this way:
for entity in doc.docinfo.internalDTD.iterentities():
    msg_fmt = "{entity.name!r}, {entity.content!r}, {entity.orig!r}"
    print(msg_fmt.format(entity=entity))
But I only get the names of the entity and of the notation, not the entity's definition:
'ref1', 'jpeg', None
How can I access the entity's definition?
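One possible workaround, sketched with the standard library's xml.parsers.expat instead of lxml (a swapped-in technique, not lxml's API): its EntityDeclHandler callback receives the system identifier and notation name of each declared entity. The helper name collect_unparsed_entities is hypothetical, and the sample document is the same one used in the question.

```python
import xml.parsers.expat

content = b"""\
<?xml version="1.0"?>
<!DOCTYPE sample [
<!NOTATION jpeg SYSTEM "image/jpeg">
<!ENTITY ref1 SYSTEM "python-logo-small.jpg" NDATA jpeg>
<!ELEMENT sample EMPTY>
<!ATTLIST sample src ENTITY #REQUIRED>
]>
<sample src="ref1"/>
"""


def collect_unparsed_entities(xml_bytes):
    """Return {entity name: (system id, notation name)} for NDATA entities.

    Hypothetical helper for illustration; it relies on expat, not lxml.
    """
    entities = {}

    def on_entity_decl(name, is_parameter_entity, value, base,
                       system_id, public_id, notation_name):
        # Unparsed entities are the ones declared with an NDATA notation,
        # so notation_name is set only for them.
        if notation_name is not None:
            entities[name] = (system_id, notation_name)

    parser = xml.parsers.expat.ParserCreate()
    parser.EntityDeclHandler = on_entity_decl
    parser.Parse(xml_bytes, True)  # True marks the final chunk of input
    return entities


unparsed_entities = collect_unparsed_entities(content)
print(unparsed_entities)
```

The value of the src attribute ("ref1") could then be looked up in the returned mapping to obtain the system identifier.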