I am trying to retrieve a few <p> tags from the following HTML. Here is only part of it:

<td class="eelantext">
    <a class="fBlackLink"></a>
    <center></center>
    <span> … </span><br></br>
    <table width="402" vspace="5" cellspacing="0" cellpadding="3" 
        border="0" bgcolor="#ffffff" align="Left">
    <tbody> … </tbody></table>
      <!--edstart-->
    <p> … </p>
    <p> … </p>
    <p> … </p>
    <p> … </p>
    <p> … </p>
</td>

You can find the webpage here

My Python code is the following:

from bs4 import BeautifulSoup

soup = BeautifulSoup(page)
div = soup.find('td', attrs={'class': 'eelantext'})
print(div)
text = div.find_all('p')

But the text variable is empty, and if I print the div variable, I get exactly the same HTML as above, except that the <p> tags are missing.

1 Answer

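BeautifulSoup can use different parsers to handle the HTML input. The HTML here is somewhat broken, and the default HTMLParser parser does not handle it very well.

Use the html5lib parser instead: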

>>> len(BeautifulSoup(r.text, 'html').find('td', attrs={'class': 'eelantext'}).find_all('p'))
0
>>> len(BeautifulSoup(r.text, 'lxml').find('td', attrs={'class': 'eelantext'}).find_all('p'))
0
>>> len(BeautifulSoup(r.text, 'html5lib').find('td', attrs={'class': 'eelantext'}).find_all('p'))
22
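
The comparison above can be wrapped in a small helper to see how each parser treats a given document. This is only a sketch: the function name count_paragraphs and the sample markup are made up for illustration, and parsers that are not installed (lxml and html5lib are separate packages) are simply skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_paragraphs(html, parsers=("html.parser", "lxml", "html5lib")):
    """Map each available parser name to the number of <p> tags
    found inside the <td class="eelantext"> cell."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # This parser is not installed (or not known); skip it.
            continue
        cell = soup.find("td", attrs={"class": "eelantext"})
        # If the parser dropped the cell entirely, report zero paragraphs.
        counts[name] = len(cell.find_all("p")) if cell is not None else 0
    return counts


# Hypothetical, well-formed sample; the real page is broken HTML,
# which is where the parsers start to disagree.
sample = '<table><tr><td class="eelantext"><p>one</p><p>two</p></td></tr></table>'
print(count_paragraphs(sample))
```

On well-formed markup like the sample, the parsers should agree; running the helper on the actual page text (r.text above) is what exposes the difference between them.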
answered 2013-09-04T12:55:36.740