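I have an XML file (actually an XML stylesheet). Using Python, I want to remove all the tags from it and keep only the text between the tags.
What is the simplest solution? I saw a similar question here: How to remove all html tags from a downloaded page
But for some reason that does not seem to work in this case. Note that I do not want to keep the quote-delimited text inside the tags. I really do want to remove everything that starts with "<" and ends with ">".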
You can use xml.parsers.expat:
from xml.parsers.expat import ParserCreate

def char_data(data):
    if data.strip():  # skip whitespace-only text if you want
        print(data)

parser = ParserCreate()
parser.CharacterDataHandler = char_data
parser.Parse(doc, True)  # True marks this as the final chunk of input
Or xml.sax:
from xml.sax import make_parser, handler

class extract_text(handler.ContentHandler):
    def characters(self, data):
        if data.strip():
            print(data)

parser = make_parser()
parser.setContentHandler(extract_text())
parser.feed(doc)
parser.close()  # signal end of input so an incomplete document is reported
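If it is not well-formed XML, you can also try HTMLParser:
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

class extract_text(HTMLParser):
    def handle_data(self, data):
        if data.strip():
            print(data)

parser = extract_text()
parser.feed(doc)
parser.close()  # flush any text still buffered at the end of input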
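If you would rather get the text back as a string than print it, the HTMLParser approach can be wrapped in a small function. This is only a minimal sketch for Python 3 (where the module is html.parser); TextCollector and strip_tags are hypothetical names made up for this example, not part of the standard library:

```python
from html.parser import HTMLParser  # Python 3 module name


class TextCollector(HTMLParser):
    """Hypothetical helper: stores non-blank text chunks instead of printing them."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; tags and attribute values never get here.
        if data.strip():
            self.chunks.append(data.strip())


def strip_tags(markup):
    """Return the text content of markup, with chunks joined by single spaces."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()  # flush text that is still buffered at end of input
    return " ".join(collector.chunks)


print(strip_tags("<a href='x'>Hello</a> <b>world</b>"))
```

Joining with a single space is a choice made for this sketch; join with "" or "\n" if the original spacing matters to you.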
Use the ElementTree API (or lxml, a faster API-compatible equivalent), then use the etree.tostring(tree, method='text') function to serialize the tree back to its text content:
>>> from xml.etree import ElementTree as ET
>>> doc='''\
... <?xml-stylesheet href="common.css"?>
... <?xml-stylesheet href="modern.css"
... title="Modern" media="screen"
... type="text/css"?>
... <?xml-stylesheet href="classic.css"
... alternate="yes" title="Classic"
... media="screen, print" type="text/css"?>
... <ARTICLE>
... <HEADLINE>Fredrick the Great meets
... Bach</HEADLINE>
... <AUTHOR>Johann Nikolaus Forkel</AUTHOR>
... <PARA>
... One evening, just as he was
... getting his
... <INSTRUMENT>flute</INSTRUMENT>
... ready and his musicians were
... assembled, an officer brought him a
... list of the strangers who had arrived.
... </PARA>
... </ARTICLE>
... '''
>>> tree = ET.fromstring(doc)
>>> ET.tostring(tree, method='text')
'\n Fredrick the Great meets\n Bach\n Johann Nikolaus Forkel\n \n One evening, just as he was\n getting his\n flute\n ready and his musicians were\n assembled, an officer brought him a\n list of the strangers who had arrived.\n \n'
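The session above looks like Python 2 output. On Python 3, tostring() returns bytes unless you pass encoding="unicode", and joining itertext() is another way to get the same text. A minimal sketch, assuming Python 3 and a shortened version of the document (the one-line doc string here is an abbreviation made up for the example):

```python
import xml.etree.ElementTree as ET

# Shortened, single-line stand-in for the ARTICLE document above.
doc = (
    "<ARTICLE><HEADLINE>Fredrick the Great meets Bach</HEADLINE>"
    "<PARA>getting his <INSTRUMENT>flute</INSTRUMENT> ready</PARA></ARTICLE>"
)

root = ET.fromstring(doc)

# encoding="unicode" makes tostring() return str instead of bytes.
text_a = ET.tostring(root, encoding="unicode", method="text")

# itertext() yields every text and tail string in document order.
text_b = "".join(root.itertext())

# Optional: collapse runs of whitespace (newlines, indentation) to single spaces.
normalized = " ".join(text_b.split())

print(text_a)
```

Note that no separator is inserted between elements, so text from adjacent elements runs together unless the source has whitespace between them.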
lxml may be a problem; you can do exactly what Martijn Pieters described using ElementTree from the standard library, or its C version, cElementTree.
>>> from xml.etree import ElementTree
>>> doc='''
... <?xml-stylesheet href="common.css"?>
... <?xml-stylesheet href="modern.css"
... title="Modern" media="screen"
... type="text/css"?>
... <?xml-stylesheet href="classic.css"
... alternate="yes" title="Classic"
... media="screen, print" type="text/css"?>
... <ARTICLE>
... <HEADLINE>Fredrick the Great meets
... Bach</HEADLINE>
... <AUTHOR>Johann Nikolaus Forkel</AUTHOR>
... <PARA>
... One evening, just as he was
... getting his
... <INSTRUMENT>flute</INSTRUMENT>
... ready and his musicians were
... assembled, an officer brought him a
... list of the strangers who had arrived.
... </PARA>
... </ARTICLE>
... '''
>>> xml = ElementTree.fromstring(doc)
>>> xml
<Element 'ARTICLE' at 0x9295e6c>
>>> ElementTree.tostring(xml,method='text')
'\n Fredrick the Great meets\n Bach\n Johann Nikolaus Forkel\n \n One evening, just as he was\n getting his\n flute\n ready and his musicians were\n assembled, an officer brought him a\n list of the strangers who had arrived.\n \n '
Note that cElementTree is faster and is also in the standard library, but I think it has some problems with UTF-8, so if you need UTF-8, use "ElementTree".
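For what it's worth, xml.etree.cElementTree was deprecated in Python 3.3 (plain ElementTree picks up the C accelerator on its own there) and removed in Python 3.9. If the same script has to run on both old and new interpreters, the usual import fallback can be sketched like this; the sample string is made up for the example:

```python
try:
    # Only exists on older interpreters (removed in Python 3.9).
    import xml.etree.cElementTree as ET
except ImportError:
    # Modern Pythons: ElementTree already uses the C implementation if available.
    import xml.etree.ElementTree as ET

# Non-ASCII text passes through as ordinary str on Python 3.
root = ET.fromstring("<a>caf\u00e9 <b>au lait</b></a>")
text = "".join(root.itertext())
print(text)
```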