
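I'm getting my BeautifulSoup and Python bearings by scraping a friend's (structured, if clunky) website, with the long-term goal of migrating the whole thing into a content management system.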

If I pull out exactly one cell in the console (after soup = BeautifulSoup(urllib2.urlopen("http://www.bicyclepaintings.com/archive/index.html"))):

cell = soup.find_all('td',{'valign':'bottom'})[3]

I can play around with pulling out substrings. These both work fine: cell.br.next_sibling and cell.find('b').text. But when I try to iterate over all the cells with a for loop:

def parse_archive(url):
    soup = get_soup(url)
    paintings = []
    for cell in soup.find_all('td',{'valign':'bottom'}):
        painting_title = cell.find('b').text
        painting_media = cell.br.next_sibling 
        record = painting_title, painting_media
        paintings.append(record)
    return paintings

I get an attribute error (AttributeError: 'NoneType' object has no attribute 'text'). I can get some of the same information by looping:

    for item in cell.find_all('b'):
        painting_title = item.text

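But I haven't found a way to get at the siblings of <br/> that way, and (more importantly) I don't understand why it works when I pull out a single item but fails when I access the cells through a for loop. What am I missing here?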


1 Answer


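Your problem is that the page you are trying to scrape has a bunch of <td> tags at the end that contain no <b> tag. For those cells cell.find('b') returns None, so accessing .text on it raises the AttributeError. The single cell at index 3 happens to contain a <b>, which is why it works in the console: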

<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data, use &quot; for quotes --></font></p></td>
<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data, use &quot; for quotes --></font></p></td>
<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data, use &quot; for quotes --></font></p></td>
<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data, use &quot; for quotes --></font></p></td>
<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data, use &quot; for quotes --></font></p></td>
<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data, use &quot; for quotes --></font></p></td>

You just need to modify your code to ignore these tags:

for cell in soup.find_all('td',{'valign':'bottom'}):
    title = cell.find('b')
    if title is None:
        continue
    painting_title = title.text
    painting_media = cell.br.next_sibling 
    record = painting_title, painting_media
    paintings.append(record)
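
To try the None check without hitting the live site, here is a minimal sketch that runs the same loop over a small inline snippet. The sample markup, the painting titles, and the parse_cells name are made up for illustration, and the code assumes the bs4 package (BeautifulSoup 4) with Python's built-in "html.parser"; the real page may parse slightly differently under another parser.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: two populated cells followed by an empty
# placeholder cell like the ones at the end of the archive page.
SAMPLE_HTML = """
<table><tr>
<td nowrap valign="bottom"><p><font><b>Red Bike</b><br/>oil on canvas</font></p></td>
<td nowrap valign="bottom"><p><font><b>Blue Bike</b><br/>watercolor</font></p></td>
<td nowrap valign="bottom"><!-- painting image -->
<p><font><!-- painting data --></font></p></td>
</tr></table>
"""


def parse_cells(html):
    """Return (title, media) tuples, skipping cells that have no <b> tag."""
    soup = BeautifulSoup(html, "html.parser")
    paintings = []
    for cell in soup.find_all("td", {"valign": "bottom"}):
        title = cell.find("b")
        if title is None:
            # Placeholder cell: nothing to extract, so skip it.
            continue
        br = cell.find("br")
        media = br.next_sibling if br is not None else None
        # next_sibling may be None or another tag; keep only plain text.
        media = media.strip() if isinstance(media, str) else ""
        paintings.append((title.text, media))
    return paintings


paintings = parse_cells(SAMPLE_HTML)
print(paintings)
```

The third cell is dropped because find('b') returns None for it, which is the same case that raised the AttributeError in the original loop.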

As for matching painting_media, you can use:

painting_media = list(cell.br.children)
painting_media = painting_media[0].strip() if painting_media else ''
answered 2012-10-26T04:28:45.400