python - Python/lxml: How do I capture a row in an HTML table?

Question

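For my stock screening tool, I have to switch from BeautifulSoup to lxml in my script. After my Python script downloads the web pages I need to process, BeautifulSoup parses them correctly, but it is too slow. Analyzing the balance sheet, income statement, and cash flow statement for a single stock takes BeautifulSoup about 10 seconds, which is unacceptably slow given that my script has more than 5,000 stocks to analyze.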

According to some benchmarks (http://www.crummy.com/2012/1/22/0), lxml is nearly 100 times faster than BeautifulSoup. So lxml should be able to finish in under 10 minutes a job that would take BeautifulSoup about 14 hours.
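As a rough check of those figures, here is a back-of-the-envelope calculation. It assumes exactly 5,000 stocks, 10 seconds per stock, and a flat 100x speedup; real timings would depend on the pages and the machine.

```python
stocks = 5000
seconds_per_stock = 10   # BeautifulSoup estimate from the question
speedup = 100            # assumed from the cited benchmark

total_seconds = stocks * seconds_per_stock
bs_hours = total_seconds / 3600.0
lxml_minutes = total_seconds / float(speedup) / 60.0

print("BeautifulSoup: %.1f hours" % bs_hours)
print("lxml: %.1f minutes" % lxml_minutes)
```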

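How can I use lxml to capture the contents of a row in an HTML table? An example of the HTML pages that my script has downloaded and needs to parse is at http://www.smartmoney.com/quote/FAST/?story=financials&opt=YB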

The source code that parses this HTML table with BeautifulSoup is:

    url_local = local_balancesheet (symbol_input)
    url_local = "file://" + url_local
    page = urllib2.urlopen (url_local)
    soup = BeautifulSoup (page)
    soup_line_item = soup.findAll(text=title_input)[0].parent.parent.parent
    list_output = soup_line_item.findAll('td') # List of elements

If I am looking for Cash & Short Term Investments, title_input = "Cash & Short Term Investments".

How can I do the same thing in lxml?

score 1 · Accepted Answer

You can use the lxml parser with BeautifulSoup, so I don't see why you would need to switch.

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

    soup = BeautifulSoup(markup, "lxml")

Edit: here is some code to play with. It takes about six seconds for me.

    import urllib2
    from bs4 import BeautifulSoup

    def get_page_data(url):
        f = urllib2.urlopen(url)
        soup = BeautifulSoup(f, 'lxml')
        f.close()
        trs = soup.findAll('tr')
        data = {}
        for tr in trs:
            try:
                label = tr.div.text.strip()
                if label in ('Cash & Short Term Investments', 'Property, Plant & Equipment - Gross',
                             'Total Liabilities', 'Preferred Stock (Carrying Value)'):
                    # drop the thousands separators before converting each cell
                    data[label] = [int(e.text.strip().replace(',', '')) for e in tr.findAll('td')]
            except (AttributeError, ValueError):
                # rows without a div make tr.div None, which raises AttributeError
                # 'Fiscal Year Ending in 2011' raises ValueError
                pass
        return data
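For comparison, here is a minimal sketch of the same row lookup using lxml directly, without BeautifulSoup. It is not tested against the real page: the helper name `get_row_values`, the sample markup, and the placeholder figures are all made up for illustration, and it assumes (as the code above does) that the row label sits in a div inside the row while the numbers sit in that row's td cells.

```python
import lxml.html

# Hypothetical sample markup with placeholder numbers, not real financial data.
SAMPLE = """
<html><body><table>
  <tr><th><div>Fiscal Year Ending in 2011</div></th><td>2011</td><td>2010</td></tr>
  <tr><th><div>Cash &amp; Short Term Investments</div></th><td>1,234</td><td>5,678</td></tr>
</table></body></html>
"""

def get_row_values(html_text, title):
    """Return the text of each <td> in the row whose <div> label equals title."""
    doc = lxml.html.fromstring(html_text)
    # $title is an XPath variable, so labels containing '&' or quotes
    # do not have to be escaped inside the expression.
    rows = doc.xpath('//div[normalize-space(.) = $title]/ancestor::tr[1]',
                     title=title)
    if not rows:
        return []
    return [td.text_content().strip() for td in rows[0].findall('td')]

cells = get_row_values(SAMPLE, 'Cash & Short Term Investments')
# Strip thousands separators before converting, as in the code above.
numbers = [int(c.replace(',', '')) for c in cells]
```

To parse a downloaded file instead of a string, `lxml.html.parse(path).getroot()` returns the root element, and the same XPath expression can be run against it.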
