python - How to extract table information using BeautifulSoup?

Question

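I am trying to scrape information from pages like this one.

I need the information listed under Internship, Residency, and Fellowship. I can extract values from tables, but here I cannot decide which table to use: each heading (such as Internship) appears as plain text inside a div tag outside the table, and the table holding the values I need comes after it. I also have many such pages, and not every page contains all of these values. On some pages Residency may be missing entirely, which reduces the total number of tables on the page. An example of such a page is this one, where Internship does not appear at all.

The main problem is that all the tables have the same attribute values, so I cannot tell which table to use on a given page. If a value I am interested in is not present on a page, I must return an empty string for that value.

I am using BeautifulSoup in Python. Can someone point out how I should go about extracting these values?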

score 1 · Accepted Answer

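It looks like the ids of the headings and of the data each have unique values with a standard suffix. You can use that to look up the matching value for each heading. Here is my solution: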

from bs4 import BeautifulSoup

# Insert whatever networking stuff you're doing here. I'm going to assume
# that you've already downloaded the page and assigned it to a variable
# named 'html'

soup = BeautifulSoup(html, 'html.parser')
headings = ['Internship', 'Residency', 'Fellowship']
values = []
for heading in headings:
    # Find the <span> whose text is exactly the heading
    span = soup.find('span', string=heading)
    if span:
        span_id = span['id']
        # The data cell shares the span's id except for the suffix
        table_id = span_id.replace('dnnTITLE_lblTitle', 'Display_HtmlHolder')
        values.append(soup.find('td', attrs={'id': table_id}).text)
    else:
        # Heading is absent on this page, so return an empty string for it
        values.append('')

print(list(zip(headings, values)))
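
To try the id-suffix idea without downloading a page, here is a minimal sketch that runs against a small inline document. The sample markup, the `dnn_ctr1_`/`dnn_ctr2_` id prefixes, the cell contents, and the helper name `extract_sections` are all made up for illustration; only the two suffixes come from the answer. It assumes the `bs4` package (Beautiful Soup 4) is installed, and it also returns an empty string when a heading exists but its data cell cannot be found.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the id prefixes and cell contents are invented for
# illustration; only the two id suffixes are taken from the answer.
SAMPLE_HTML = """
<div><span id="dnn_ctr1_dnnTITLE_lblTitle">Residency</span></div>
<table><tr><td id="dnn_ctr1_Display_HtmlHolder">General Hospital</td></tr></table>
<div><span id="dnn_ctr2_dnnTITLE_lblTitle">Fellowship</span></div>
<table><tr><td id="dnn_ctr2_Display_HtmlHolder">City Clinic</td></tr></table>
"""

TITLE_SUFFIX = "dnnTITLE_lblTitle"
DATA_SUFFIX = "Display_HtmlHolder"


def extract_sections(html, headings):
    """Return {heading: cell text}, with '' for headings that are absent."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for heading in headings:
        # Match a <span> whose only content is exactly the heading text.
        span = soup.find("span", string=heading)
        cell = None
        if span is not None and span.get("id"):
            # Swap the title suffix for the data suffix to get the cell id.
            cell_id = span["id"].replace(TITLE_SUFFIX, DATA_SUFFIX)
            cell = soup.find("td", id=cell_id)
        # A missing heading or a missing cell both yield an empty string.
        result[heading] = cell.get_text(strip=True) if cell is not None else ""
    return result


sections = extract_sections(SAMPLE_HTML, ["Internship", "Residency", "Fellowship"])
print(sections)
```

Returning a dictionary keyed by heading keeps each value tied to its name, so a page that lacks Internship still produces an entry for it rather than shifting the other values.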
