How can I parse the HTML table results with the following code? A sample of the HTML is shown further down.

import sys
from io import BytesIO

import requests
from lxml import etree


def http_request():
    try:
        url = "http://somehost/somehtml.html"
        r = requests.get(url, auth=("theUser", "thepass"))
        # Without this call, requests does not raise HTTPError on 4xx/5xx.
        r.raise_for_status()
        # Note: r.encoding only affects r.text, not the raw bytes in r.content.
        r.encoding = 'ISO-8859-1'
        html = r.content
        parse_result(html)
    except requests.HTTPError:
        sys.exit(1)


def parse_result(result):
    parser = etree.HTMLParser()
    # r.content is bytes, so wrap it in BytesIO for etree.parse.
    tree = etree.parse(BytesIO(result), parser)

    # Here should be the logic to parse the html result :)


if __name__ == '__main__':
    http_request()

This is the HTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta name="generator" content=
  "HTML Tidy for Linux/x86 (vers 25 March 2009), see www.w3.org" />

  <title></title>
</head>

<body>
  <table border="1">
    <tr>
    <td valign="top"><B>name</B></td>
    <td>result name a</td>
    </tr>
    <tr>
    <td valign="top"><B>inUse</B></td>
    <td>false</td>
    </tr>
  </table>
  <table border="1">
    <tr>
    <td valign="top"><B>name</B></td>
    <td>result name b</td>
    </tr>
    <tr>
    <td valign="top"><B>inUse</B></td>
    <td>false</td>
    </tr>
  </table>
  <table border="1">
    <tr>
    <td valign="top"><B>name</B></td>
    <td>result name c</td>
    </tr>
    <tr>
    <td valign="top"><B>inUse</B></td>
    <td>true</td>
    </tr>
  </table>
</body>
</html>

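The expected result is to retrieve the values of the name and inUse fields from each table, that is, the result name (e.g. "result name a") and the inUse value (e.g. "false").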

1 Answer

Assuming the HTML you are pulling in has exactly this format:

nodes = etree.XPath("/html/body/table")
for node in nodes(tree):
    # node[0] and node[1] are the two <tr> rows; [1] is the value <td> in each.
    print('%s %s' % (node[0][1].text, node[1][1].text))

With your sample HTML, this outputs:

result name a false
result name b false
result name c true

If the format can vary beyond the sample HTML, you will probably have to get more creative with the XPath and add more input checking.
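As a rough sketch of what "more creative" could look like, the version below looks cells up by their label text instead of by position, so it tolerates reordered rows and skips rows that are not label/value pairs. The function name `parse_tables` and the shortened `SAMPLE` document are made up for illustration, and it assumes every table still uses two-cell rows with the label in the first cell:

```python
from lxml import etree

# Shortened, hypothetical variant of the sample HTML; the second table has
# its rows in the opposite order to show that position no longer matters.
SAMPLE = b"""
<html><body>
  <table border="1">
    <tr><td valign="top"><B>name</B></td><td>result name a</td></tr>
    <tr><td valign="top"><B>inUse</B></td><td>false</td></tr>
  </table>
  <table border="1">
    <tr><td valign="top"><B>inUse</B></td><td>true</td></tr>
    <tr><td valign="top"><B>name</B></td><td>result name c</td></tr>
  </table>
</body></html>
"""


def parse_tables(html_bytes):
    """Return one {label: value} dict per <table> found in the document."""
    root = etree.fromstring(html_bytes, etree.HTMLParser())
    records = []
    for table in root.xpath("//table"):
        record = {}
        for row in table.xpath(".//tr"):
            cells = row.xpath("./td")
            if len(cells) < 2:
                continue  # input check: ignore rows without a label/value pair
            # itertext() also collects text nested in tags such as <b>.
            label = "".join(cells[0].itertext()).strip()
            value = "".join(cells[1].itertext()).strip()
            record[label] = value
        records.append(record)
    return records


records = parse_tables(SAMPLE)
for record in records:
    # .get() returns None instead of raising if a field is missing.
    print("%s %s" % (record.get("name"), record.get("inUse")))
```

Returning dictionaries rather than printing directly keeps the parsing separate from the output, so the caller can decide what to do when a table lacks a name or inUse entry.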

answered 2013-05-06T17:02:18.130