html - Python: extracting an HTML table without losing the axis headers

Question

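Q1. Is there a way to extract the data from a table while still being able to keep track of the axis headers? Q2. Which approach is better for extracting data from an HTML table: HTMLParser, BeautifulSoup, or something else?

I am trying to extract this income statement: http://investing.businessweek.com/research/stocks/financials/financials.asp?ticker=TSCO:LN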

I would like to end up with

"Currency in Millions of British Pounds","2009","2010","2011","2012"

"Revenues","53,898.0","56,910.0","60,455.0","64,539.0"

"Total Revenues","53,898.0","56,910.0","60,455.0","64,539.0"

At the same time, I want to be able to tell that "56,910.0" is the revenue for 2010.

But I ran into two problems:

HTMLParser.HTMLParseError: malformed start tag, at line 1148, column 47, or HTMLParser.HTMLParseError: bad end tag: "", at line 225, column 104
I cannot keep track of the axis headers

Many thanks

score 0 · Accepted Answer

I do a fair amount of scraping, and BeautifulSoup rarely lets me down.


from urllib.request import urlopen  # Python 3 standard library

from bs4 import BeautifulSoup  # BeautifulSoup 4 (the beautifulsoup4 package)

URL = "http://investing.businessweek.com/research/stocks/financials/financials.asp?ticker=TSCO:LN"
html = urlopen(URL)
soup = BeautifulSoup(html, "html.parser")
# The income statement is the table with class "financialStatement"
statement = soup.find("table", {"class": "financialStatement"})
rows = statement.find_all("tr")

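At this point, I think you will find that rows has a length of 25, that its first item is the header row, and that its last item is the last row of the table you want.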
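To keep the axis headers attached to each value, one option is to turn every row into a list of cell texts and then key each value by its row label and column header. The following is only a minimal sketch: the helper name index_by_headers is hypothetical, the sample rows are copied from the question rather than fetched from the page, and it assumes the first row holds the column headers and the first cell of each later row holds the row label.

```python
import csv
import io


def index_by_headers(table):
    """Map (row label, column header) -> cell text for a list of text rows."""
    header, *body = table
    columns = header[1:]  # e.g. the years along the top axis
    lookup = {}
    for row in body:
        if not row:
            continue  # skip empty spacer rows
        label, values = row[0], row[1:]
        for column, value in zip(columns, values):
            lookup[(label, column)] = value
    return lookup


# Sample values copied from the question. With BeautifulSoup, a list of the
# same shape could be built from the <tr> elements, for example:
#   table = [[c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
#            for tr in rows]
table = [
    ["Currency in Millions of British Pounds", "2009", "2010", "2011", "2012"],
    ["Revenues", "53,898.0", "56,910.0", "60,455.0", "64,539.0"],
    ["Total Revenues", "53,898.0", "56,910.0", "60,455.0", "64,539.0"],
]

lookup = index_by_headers(table)
print(lookup[("Revenues", "2010")])

# Write the rows as fully quoted CSV, the layout asked for in the question.
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerows(table)
print(buffer.getvalue())
```

Keying on the (label, header) pair means a value can always be traced back to both axes. If the real table contains spacer or section rows with a different number of cells, they would likely need to be filtered out before indexing.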
