python - Parsing a date string from HTML with lxml

Question

 s = """
      <tbody>
      <tr>
       <td style="border-bottom: none">
       <span class="graytext" style="font-weight: bold;"> Reply #3 - </span>
        <span class="graytext" style="font-size: 11px">
        05/13/09  2:02am
        <br>
       </span>
      </td>
     </tr>
    </tbody>
 """

I need to extract the date string from this HTML string.

I tried this:

  import lxml.html
  doc = lxml.html.fromstring(s)
  doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]')

But this doesn't work. I only need the date string.

score 1 · Accepted Answer

Your query selects the span element itself; you still need to get the text out of it:

>>> doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]')
[<Element span at 1c9d4c8>]

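Most queries return a sequence, so I usually use a helper function that returns the first item (or a default when the sequence is empty).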

from lxml import etree
s = """
<tbody>
 <tr>
   <td style="border-bottom: none">
   <span class="graytext" style="font-weight: bold;"> Reply #3 - </span>
    <span class="graytext" style="font-size: 11px">
    05/13/09  2:02am
    <br>
   </span>
  </td>
 </tr>
</tbody>
"""
doc = etree.HTML(s)

def first(sequence,default=None):
  for item in sequence:
    return item
  return default

Then:

>>> doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]')
[<Element span at 1c9d4c8>]
>>> doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]/text()')
['\n    05/13/09  2:02am\n    ']
>>> first(doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]/text()'),'').strip()
'05/13/09  2:02am'
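
If the goal is an actual date value rather than a string, the extracted text still needs parsing. The following is a minimal sketch, not part of the original answer. It swaps the first(...)/strip() combination for XPath's normalize-space() function, which trims and collapses whitespace and yields an empty string when nothing matches, and then passes the result to datetime.strptime. The format string "%m/%d/%y %I:%M%p" is an assumption about the page's date layout (month/day/two-digit year, 12-hour clock), parse_reply_date is a hypothetical helper name, and matching "am"/"pm" with %p assumes an English or C locale.

```python
from datetime import datetime
from lxml import etree

s = """
<tbody>
 <tr>
   <td style="border-bottom: none">
   <span class="graytext" style="font-weight: bold;"> Reply #3 - </span>
    <span class="graytext" style="font-size: 11px">
    05/13/09  2:02am
    <br>
   </span>
  </td>
 </tr>
</tbody>
"""
doc = etree.HTML(s)

QUERY = '//span[@class="graytext" and @style="font-size: 11px"]'


def parse_reply_date(tree):
    """Hypothetical helper: return the reply date as a datetime, or None."""
    # normalize-space() returns a plain string: leading/trailing whitespace
    # is removed and internal runs of whitespace become a single space.
    text = tree.xpath("normalize-space(%s)" % QUERY)
    if not text:
        return None  # no matching span, or the span held no text
    # Assumed layout: MM/DD/YY H:MMam|pm
    return datetime.strptime(text, "%m/%d/%y %I:%M%p")


date_text = doc.xpath("normalize-space(%s)" % QUERY)
parsed = parse_reply_date(doc)
print(date_text, parsed)
```

Note that normalize-space() also collapses the double space between the date and the time, which is what lets a single-space format string match.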

score 0 · Accepted Answer

Try the following instead of the last line:

print(doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]/text()')[0])

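The first part of the XPath expression is correct: //span[@class="graytext" and @style="font-size: 11px"] selects all matching span nodes. You then need to specify what to select from those nodes; text() here selects the text nodes that are direct children of each matched span. The result still carries the surrounding whitespace from the markup, so call .strip() on it if you need the bare string.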
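
To make the difference concrete, here is a small sketch (assuming lxml is installed, and reusing the fragment from the question) that runs the same predicate with and without /text(). The names QUERY, elements, and texts are illustrative only.

```python
from lxml import etree

s = """
<tbody>
 <tr>
   <td style="border-bottom: none">
   <span class="graytext" style="font-weight: bold;"> Reply #3 - </span>
    <span class="graytext" style="font-size: 11px">
    05/13/09  2:02am
    <br>
   </span>
  </td>
 </tr>
</tbody>
"""
doc = etree.HTML(s)

QUERY = '//span[@class="graytext" and @style="font-size: 11px"]'

elements = doc.xpath(QUERY)            # list of Element objects
texts = doc.xpath(QUERY + "/text()")   # list of text-node strings

# An element exposes the text before its first child as .text,
# so both routes reach the same raw, whitespace-padded string.
raw_from_element = elements[0].text
raw_from_xpath = texts[0]

# Indexing with [0] raises IndexError when nothing matches; that is
# acceptable here only because the input is known.
date_text = raw_from_xpath.strip()
print(date_text)
```

Only the second span matches, because the first one has a different style attribute value.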
