
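I have an xhtml file formatted as shown below. I am trying to extract all of the text between the tags, in order. I can get everything except BAC by calling this_list = get_e('td') and then passing that list to another function to get the text, as get_text(this_list). I was wondering whether a slight modification to my functions would get all of the text. Can anyone offer some suggestions?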

<tr>
  <td colspan="1" rowspan="1" class="lft">
    <a shape="rect" href="http://www.usatoday.idmanagedsolutions.com/stocks/new/quote.idms?SYMBOL_US=BAC">
        BAC</a>
  </td>
  <td colspan="1" rowspan="1" class="lft">
    Bank Of America Corporation</td>
  <td colspan="1" rowspan="1">
    9.79
   </td>
  <td colspan="1" rowspan="1">
    -0.07
  </td>
  <td colspan="1" rowspan="1">
    <span class="neg-arrw">
        -0.71%
    </span>
   </td>
   <td colspan="1" rowspan="1">
    71,370,166
   </td>
</tr>
<tr class="evenrow">
   <td colspan="1" rowspan="1" class="lft">
    VALE
   </td>
   <td colspan="1" rowspan="1" class="lft">
    Vale S A
   </td>
<td colspan="1" rowspan="1">
    17.52
   </td>
   <td colspan="1" rowspan="1">
    +0.09
   </td>
   <td colspan="1" rowspan="1">
    <span class="pos-arrw">
        +0.49%
    </span>
   </td>
   <td colspan="1" rowspan="1">
    15,461,788</td>
</tr>

I am using the following functions:

def get_e(tag):
    l=[]
    els=dom.getElementsByTagName(tag)
    for e in els:
        for child_el in e.childNodes:
            l.append(child_el)
    return l

def get_text(els):
    l=[]
    for e in els:
        if e.nodeType == e.TEXT_NODE:
            l.append(e.data)
    return l

1 Answer


The get_text function expects a list that contains only text nodes. Some of your td elements contain an embedded a element, which is an element node, so the text inside it sits one level deeper and is skipped. I've updated get_e to call itself recursively when it sees an element node.

from xml.dom import minidom

def get_e(dom, tag):
    l=[]
    els=dom.getElementsByTagName(tag)
    for e in els:
        for child_el in e.childNodes:
            # if this was an element node get its children
            if child_el.nodeType == e.ELEMENT_NODE:
                l.extend(get_e(e, child_el.tagName))
            else:
                l.append(child_el)
    return l

def get_text(els):
    l=[]
    for e in els:
        if e.nodeType == e.TEXT_NODE:
            l.append(e.data)
    return l

dom = minidom.parse('s.xml')
print(get_text(get_e(dom, 'td')))
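
If you would rather not re-query by tag name on every nested element, a plain recursive walk over childNodes is another way to do the same thing. This is only a sketch, not a drop-in replacement: the names iter_text and cell_texts are made up for illustration, and SAMPLE is a cut-down copy of your rows wrapped in a table element that I added so the fragment parses as one document.

```python
from xml.dom import minidom

# Hypothetical sample: a few cells from the question, wrapped in a <table>
# root (an assumption) so the fragment is a single well-formed document.
SAMPLE = """<table>
<tr>
  <td class="lft"><a href="#">
        BAC</a></td>
  <td class="lft">
    Bank Of America Corporation</td>
  <td><span class="neg-arrw">
        -0.71%
    </span></td>
</tr>
<tr class="evenrow">
  <td class="lft">
    VALE
  </td>
</tr>
</table>"""


def iter_text(node):
    """Yield the data of every text node below `node`, in document order."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child.data
        elif child.nodeType == child.ELEMENT_NODE:
            # Descend into nested elements such as <a> or <span>.
            for text in iter_text(child):
                yield text


def cell_texts(dom, tag="td"):
    """Return one stripped string per `tag` element, skipping empty cells."""
    cells = []
    for el in dom.getElementsByTagName(tag):
        text = "".join(iter_text(el)).strip()
        if text:
            cells.append(text)
    return cells


dom = minidom.parseString(SAMPLE)
print(cell_texts(dom))
```

Joining the pieces per td and stripping them also gets rid of the whitespace-only text nodes that the indentation in the file produces.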

Or you could consider this shorter version:

import xml.etree.ElementTree as ET
et = ET.parse('s.xml')
print([e.findtext('.') for e in et.findall('.//*')])
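
If you want the text grouped per row and per cell instead of one flat list over every element, ElementTree's itertext() may be handy, since it walks an element's own text plus the text and tails of everything nested inside it. Again just a sketch: SAMPLE is a cut-down copy of your data with a table root that I added, and it assumes the tags carry no XML namespace. A real XHTML file that declares xmlns="http://www.w3.org/1999/xhtml" would need the namespace-qualified tag names instead of plain "td" and "tr".

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a few cells from the question, wrapped in a <table>
# root (an assumption) and without an XHTML namespace declaration.
SAMPLE = """<table>
<tr>
  <td class="lft"><a href="#">
        BAC</a>
  </td>
  <td>
    9.79
  </td>
</tr>
<tr class="evenrow">
  <td class="lft">
    VALE
  </td>
  <td>
    <span class="pos-arrw">
        +0.49%
    </span>
  </td>
</tr>
</table>"""

root = ET.fromstring(SAMPLE)

# One inner list per <tr>; each entry is the stripped, concatenated text of
# a <td>, including text inside nested elements such as <a> or <span>.
rows = [
    ["".join(td.itertext()).strip() for td in tr.iter("td")]
    for tr in root.iter("tr")
]
print(rows)
```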
answered 2012-12-06T10:25:29.583