python - Parsing HTML with lxml - how do I specify a 1-3 digit wildcard to make my code less brittle?

Question

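I am trying to scrape the "Sector" and "Industry" fields from Yahoo Finance using lxml.

I noticed that the href URL is always http://biz.yahoo.com/ic/xyz.html, where xyz is a number.

Can you suggest a way to include a wildcard that matches one or more digits? I tried several approaches based on Google and Stack Overflow searches, but none of them worked.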

import lxml.html
url = 'http://finance.yahoo.com/q?s=AAPL'
root = lxml.html.parse(url).getroot()
# Pseudocode: <3 digit integer wildcard> marks the part I do not know how to write
for a in root.xpath('//a[@href="http://biz.yahoo.com/ic/' + <3 digit integer wildcard> + '.html"]'):
    print(a.text)

score 5 · Accepted Answer

A pure XPath 1.0 solution (no extension functions):

//a[starts-with(@href, 'http://biz.yahoo.com/ic/')
  and
    substring(@href, string-length(@href)-4) = '.html'
  and
    string-length
      (substring-before
          (substring-after(@href, 'http://biz.yahoo.com/ic/'), 
           '.')
      ) = 3
  and
    translate(substring-before
               (substring-after(@href, 'http://biz.yahoo.com/ic/'), 
                '.'),
              '0123456789',
              ''
              )
     = ''
   ]

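This XPath expression can be "read in English" like this:

Select any a in the document whose href attribute has a string value that starts with the string "http://biz.yahoo.com/ic/" and ends with the string ".html", where the substring between that start and end has length 3 and consists only of digits.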
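The question asks for one to three digits, while the expression above requires exactly three. Below is a minimal sketch of one way to relax the length test, assuming lxml is installed. The `MIDDLE` helper string and the sample links are made up for illustration and are not part of the original answer.

```python
import lxml.html

# The part of @href between the fixed prefix and the first dot after it.
MIDDLE = "substring-before(substring-after(@href, 'http://biz.yahoo.com/ic/'), '.')"

# Same structure as the expression above, but the length test accepts 1 to 3.
XPATH_1_TO_3_DIGITS = (
    "//a[starts-with(@href, 'http://biz.yahoo.com/ic/')"
    " and substring(@href, string-length(@href) - 4) = '.html'"
    f" and string-length({MIDDLE}) >= 1"
    f" and string-length({MIDDLE}) <= 3"
    f" and translate({MIDDLE}, '0123456789', '') = '']"
)

# Hypothetical sample document, not taken from Yahoo Finance.
sample = lxml.html.fromstring("""
<html>
  <body>
    <a href="http://biz.yahoo.com/ic/7.html">One digit</a>
    <a href="http://biz.yahoo.com/ic/42.html">Two digits</a>
    <a href="http://biz.yahoo.com/ic/123.html">Three digits</a>
    <a href="http://biz.yahoo.com/ic/1234.html">Four digits</a>
    <a href="http://biz.yahoo.com/ic/x23.html">Not numeric</a>
    <a href="http://biz.yahoo.com/ic/.html">Empty</a>
  </body>
</html>
""")

matched = [a.text for a in sample.xpath(XPATH_1_TO_3_DIGITS)]
print(matched)
```

Only the first three links should be selected; the four-digit, non-numeric, and empty cases fail either the length test or the digits-only test.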

XSLT-based verification:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
   "//a[starts-with(@href, 'http://biz.yahoo.com/ic/')
      and
        substring(@href, string-length(@href)-4) = '.html'
      and
        string-length
          (substring-before
              (substring-after(@href, 'http://biz.yahoo.com/ic/'),
               '.')
          ) = 3
      and
        translate(substring-before
                   (substring-after(@href, 'http://biz.yahoo.com/ic/'),
                    '.'),
                  '0123456789',
                  ''
                  )
         = ''
       ]
   "/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the following XML document:

<html>
  <body>
    <a href="http://biz.yahoo.com/ic/123.html">Link1</a>
    <a href="http://biz.yahoo.com/ic/1234.html">Incorrect</a>
    <a href="http://biz.yahoo.com/ic/x23.html">Incorrect</a>
    <a href="http://biz.yahoo.com/ic/621.html">Link2</a>
  </body>
</html>

the XPath expression is evaluated and the selected nodes are copied to the output:

<a href="http://biz.yahoo.com/ic/123.html">Link1</a>
<a href="http://biz.yahoo.com/ic/621.html">Link2</a>

As we can see, only the correct, wanted a elements are selected.
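The same check can be reproduced from Python without XSLT. This is a small sketch, assuming lxml is installed; it evaluates the answer's expression unchanged against the sample document above. The variable names are illustrative.

```python
from lxml import etree

# The answer's XPath 1.0 expression, copied without changes.
EXPRESSION = """
//a[starts-with(@href, 'http://biz.yahoo.com/ic/')
  and
    substring(@href, string-length(@href)-4) = '.html'
  and
    string-length
      (substring-before
          (substring-after(@href, 'http://biz.yahoo.com/ic/'),
           '.')
      ) = 3
  and
    translate(substring-before
               (substring-after(@href, 'http://biz.yahoo.com/ic/'),
                '.'),
              '0123456789',
              ''
              )
     = ''
   ]
"""

# The sample XML document from the verification above.
document = etree.fromstring("""
<html>
  <body>
    <a href="http://biz.yahoo.com/ic/123.html">Link1</a>
    <a href="http://biz.yahoo.com/ic/1234.html">Incorrect</a>
    <a href="http://biz.yahoo.com/ic/x23.html">Incorrect</a>
    <a href="http://biz.yahoo.com/ic/621.html">Link2</a>
  </body>
</html>
""")

find_links = etree.XPath(EXPRESSION)  # compile once, reuse on any document
selected = [(a.get("href"), a.text) for a in find_links(document)]
for href, text in selected:
    print(href, text)
```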

score 1 · Accepted Answer

root.xpath(r'''//a[re:match(@href, "http://biz\.yahoo\.com/ic/[0-9]{1,3}\.html")]''',
           namespaces={'re': 'http://exslt.org/regular-expressions'})

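The XPath expression selects all a tags whose href matches the regular expression. re:match evaluates as true if the href attribute starts with http://biz.yahoo.com/ic/, continues with 1 to 3 digits ([0-9]{1,3}), and ends with .html.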
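For a version that runs without network access, here is a sketch of the same call against an inline document. It assumes lxml is installed; the sample links and the names `EXSLT_RE` and `page` are made up for illustration.

```python
import lxml.html

# Namespace that lxml maps to its EXSLT regular-expression functions.
EXSLT_RE = {"re": "http://exslt.org/regular-expressions"}

# Hypothetical sample page, not taken from Yahoo Finance.
page = lxml.html.fromstring("""
<html>
  <body>
    <a href="http://biz.yahoo.com/ic/7.html">One digit</a>
    <a href="http://biz.yahoo.com/ic/123.html">Three digits</a>
    <a href="http://biz.yahoo.com/ic/1234.html">Four digits</a>
    <a href="http://biz.yahoo.com/ic/x23.html">Not numeric</a>
  </body>
</html>
""")

links = page.xpath(
    r'''//a[re:match(@href, "http://biz\.yahoo\.com/ic/[0-9]{1,3}\.html")]''',
    namespaces=EXSLT_RE,
)
link_texts = [a.text for a in links]
print(link_texts)
```

Only the one-digit and three-digit links should be selected, since the other two do not have 1 to 3 digits directly between the prefix and .html.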

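I used \. because a bare . would match any character; with a backslash in front of it, it is treated as a literal dot.

r'''...''' means the string is raw (Python does not process backslash escapes in it, so \ is passed through as-is), and it can even contain ', because the delimiter is '''.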
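Both points can be checked with Python's standard re module. This is an illustrative sketch only: the test strings are made up, and it uses re directly rather than the EXSLT function.

```python
import re

# Unescaped dot: "." matches any character, so "xhtml" slips through.
loose = re.compile(r"ic/[0-9]{1,3}.html")
# Escaped dot: "\." matches only a literal dot.
strict = re.compile(r"ic/[0-9]{1,3}\.html")

loose_hit = loose.search("http://biz.yahoo.com/ic/123xhtml") is not None
strict_hit = strict.search("http://biz.yahoo.com/ic/123xhtml") is not None
strict_ok = strict.search("http://biz.yahoo.com/ic/123.html") is not None

# A raw string keeps the backslash, so it equals the doubled form
# that a normal string literal would need.
raw_equals_escaped = r"\." == "\\."

# Triple single quotes let the raw string contain ' and " without escaping.
xpath_text = r'''//a[@title='it"s fine']'''

print(loose_hit, strict_hit, strict_ok, raw_equals_escaped)
```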

Credit goes to another answer on Stack Overflow.
