python - LXML: Remove x

Question

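I am using LXML to build a sitemap parser and want to extract the tags along with their values. However, the resulting tags always include the xmlns information, e.g. {http://www.sitemaps.org/schemas/sitemap/0.9}loc.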

body = cStringIO.StringIO(item['body'])
parser = etree.XMLParser(recover=True, load_dtd=True, ns_clean=True)
tree = etree.parse(body, parser)

for sitemap in tree.xpath('./*'):
    print sitemap.xpath('./*')[0].tag
    # prints: {http://www.sitemaps.org/schemas/sitemap/0.9}loc

The sitemap string:

<sitemap xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <loc>http://www.some_page.com/sitemap-page-2010-11.xml</loc>
  <lastmod>2011-12-22T15:46:17+00:00</lastmod>
</sitemap>

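I only want to extract the tag name, here 'loc', without the {http://www.sitemaps.org/schemas/sitemap/0.9} prefix. Is there a way in LXML to configure the parser, or LXML itself, to do this?

Note: I know I could use a simple regex replace. A friend told me to ask for help whenever an implementation feels more complicated than it should be.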

score 2 · Accepted Answer

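In a perfect world, you would use an XML parsing or HTML scraping library to parse your HTML, so that you are sure you have exactly the tags you need, in context. In this case, simply using a regular expression to match what you need will almost certainly be simpler, faster, and good enough.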

>>> import re
>>> samp = """<sitemap xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
...     <loc>http://www.some_page.com/sitemap-page-2010-11.xml</loc>
...     <lastmod>2011-12-22T15:46:17+00:00</lastmod>
... </sitemap>"""
>>> re.findall(r'<loc>(.*)</loc>', samp)
['http://www.some_page.com/sitemap-page-2010-11.xml']
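
If the input holds several loc entries, possibly on a single line, a greedy `(.*)` would capture everything between the first opening tag and the last closing tag. The following is a minimal Python 3 sketch of a non-greedy variant; the sample string `samp_multi` and the helper name `extract_locs` are made up for illustration and are not part of the original question.

```python
import re

# Hypothetical sample: two <loc> elements on a single line.
samp_multi = (
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    '<sitemap><loc>http://www.some_page.com/a.xml</loc></sitemap>'
    '<sitemap><loc>http://www.some_page.com/b.xml</loc></sitemap>'
    '</sitemapindex>'
)


def extract_locs(xml_text):
    """Return the text of every <loc> element (illustrative helper)."""
    # .*? is non-greedy, so each match stops at the nearest </loc>;
    # re.DOTALL lets a value span line breaks.
    matches = re.findall(r'<loc>(.*?)</loc>', xml_text, re.DOTALL)
    return [m.strip() for m in matches]


print(extract_locs(samp_multi))
```

This still assumes the tags are written without a namespace prefix or attributes, which holds for the sample above but not for XML in general.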

score 1 · Accepted Answer

Not sure this is the best way, but it uses lxml as you asked, and it works:

import cStringIO
from lxml import etree


text = """<sitemap xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <loc>http://www.some_page.com/sitemap-page-2010-11.xml</loc>
    <lastmod>2011-12-22T15:46:17+00:00</lastmod>
</sitemap>"""

body = cStringIO.StringIO(text)
parser = etree.XMLParser(recover=True, load_dtd=True, ns_clean=True)
tree = etree.parse(body, parser)

for item in tree.xpath("./*"):
    if 'loc' in item.tag:
        print item.text

Prints:

http://www.some_page.com/sitemap-page-2010-11.xml

Hope that helps.
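
Note that `'loc' in item.tag` is a substring test, so it would also match a tag such as `location`, or any tag whose namespace URI happens to contain "loc". One way to get the bare tag name, sketched here in Python 3 under a few assumptions: it uses the standard library's `xml.etree.ElementTree` so that it runs without extra packages (lxml reports element tags in the same `{namespace}name` form), and `local_name` is a hypothetical helper, not an lxml function.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in; tags look like "{uri}name"

text = """<sitemap xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <loc>http://www.some_page.com/sitemap-page-2010-11.xml</loc>
    <lastmod>2011-12-22T15:46:17+00:00</lastmod>
</sitemap>"""


def local_name(tag):
    """Strip a leading "{namespace}" from a tag name (hypothetical helper)."""
    # Everything after the last "}" is the local name; tags without a
    # namespace contain no "}" and are returned unchanged.
    return tag.rsplit('}', 1)[-1]


root = ET.fromstring(text)
pairs = [(local_name(child.tag), child.text) for child in root]
print(pairs)
```

With lxml itself, `etree.QName(element).localname` should give the same result without string handling. When iterating lxml children, comment nodes do not have a string tag, so a guard would be needed before calling the helper on them.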

score 0 · Accepted Answer

I would try this with the following tool.

htmlparser.sourceforge.net/

A friend told me it is simple, and it really is! Much better than BeautifulSoup or anything like it.

from ehp import *

data = '''
<sitemap xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <loc>http://www.some_page.com/sitemap-page-2010-11.xml</loc>
  <lastmod>2011-12-22T15:46:17+00:00</lastmod>
</sitemap>'''

html = Html()
dom  = html.feed(data)
seq  = [ind.text() for ind in dom.find('loc')]

print seq

# It gives me.
# ['http://www.some_page.com/sitemap-page-2010-11.xml']

score 0 · Accepted Answer

I was not sure whether you wanted to remove the tag and keep the text, so here is another answer.

from ehp import *

data = '''
<sitemap xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <loc>http://www.some_page.com/sitemap-page-2010-11.xml</loc>
  <lastmod>2011-12-22T15:46:17+00:00</lastmod>
</sitemap>'''

html = Html()
dom  = html.feed(data)

for root, ind in dom.find_with_root('loc'):
    root.remove(ind)
    root.append(Data(ind.text()))


# It would give me.
print dom



""" <sitemap xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" >

  <lastmod >2011-12-22T15:46:17+00:00</lastmod>
http://www.some_page.com/sitemap-page-2010-11.xml</sitemap>
"""
