I'm trying to parse tables with BeautifulSoup. The first table on my page is easy, but I can't parse a similar table on the same page, and I don't understand why.

Here is the code. Thanks in advance for your help.

import urllib2
from bs4 import BeautifulSoup


url = urllib2.urlopen("https://dl.dropboxusercontent.com/u/956261/poftext.html")
contentHTML = url.read()

soup = BeautifulSoup(contentHTML)

tableUserDetails = soup.find("table", {"class" : "user-details"})

i = 0
tableUserDetailsList = []
for row in tableUserDetails.findAll('tr'):
    for col in row.findAll('td'):
        contentTd = col.contents[0].string.strip()

        if contentTd:
            print "TD Number %d : %s" % (i, contentTd)
            tableUserDetailsList.append(contentTd)
            i += 1

# This first table is OK
print tableUserDetailsList


# But now this one
tableUserDetails = soup.find("table", {"class" : "secondpart"})

i = 0
tableUserDetailsList = []
for row in tableUserDetails.findAll('tr'):
    for col in row.findAll('td'):
        contentTd = col.contents[0].string.strip()

        if contentTd:
            print "TD Number %d : %s" % (i, contentTd)
            tableUserDetailsList.append(contentTd)
            i += 1

print tableUserDetailsList

# The list is empty :(

Here is a simplified version of the HTML I'm trying to parse:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <title>
        French.Kiss
        Sorties, Sport, Voyages, Nouvelles Expériences</title> 

</head>
<body style='background-color: #fff;' leftMargin='0' topMargin='0' marginwidth='0' marginheight='0' link='#1E55D6' vlink='#1E55D6'  TEXT='#6551b0'>

            <table class="user-details">
                <tr>
                    <td class="headline txtBlue size15" style="width:80px">
                        About
                    </td>
                    <td style="width:10px">
                        &nbsp;
                    </td>
                    <td class="txtGrey size15">
                        Fume occasionnellement with Silhouette mince
                    </td>
                    <td width="25px;">
                        &nbsp;
                    </td>
                    <td class="headline txtBlue size15">
                        City
                    </td>
                    <td class="txtGrey size15">
                        Paris Ile-de-France
                    </td>
                </tr>
                <tr>
                    <td class="headline txtBlue size15">
                        Details
                    </td>
                    <td style="width:10px">
                        &nbsp;
                    </td>
                    <td class="txtGrey size15">
                        26 year old Un homme, 185cm, Sans religion
                    </td>
                    <td>
                    </td>
                    <td class="headline txtBlue size15">
                        Ethnicity
                    </td>
                    <td class="txtGrey size15">
                        Caucasienne Balance with Châtains
                    </td>
                </tr>
                <tr>
                    <td class="headline txtBlue size15">
                        Intent
                    </td>
                    <td style="width:10px">
                        &nbsp;
                    </td>
                    <td class="txtGrey size15">
                        French.Kiss Cherche une relation amoureuse.
                    </td>
                    <td>
                    </td>
                    <td class="headline txtBlue size15" style="width:90px">
                        Education
                    </td>
                    <td class="txtGrey size15">
                        Diplôme universitaire/Licence
                    </td>
                </tr>

                <tr>
                    <td class="headline txtBlue size15">
                        Personnalité
                    </td>
                    <td style="width:10px">
                        &nbsp;
                    </td>
                    <td class="txtGrey size15">

                    </td>   <td>
                    </td>
                <td>
                            <span class="headline txtBlue size15">Profession </span>
                        </td>
                        <td>
                            <span class="txtGrey size15">
                                Visioconférence</span>
                        </td>
                </tr>

            </table> 





















                <table width="85%" class="secondpart">
                    <tr height="25px">
                        <td width="200px">
                            <span class="headline txtBlue size14">I am Seeking a</span>
                        </td>
                        <td width="300px">
                            <span class="txtGrey size14">
                                Une femme</span>
                        </td>
                        <td width="25px">
                        </td>
                        <td width="200px">
                            <span class="headline txtBlue size14">For</span>
                        </td>
                        <td width="200px">
                            <span class="txtGrey size14">
                                Sorties</span>
                        </td>
                    </tr>
                    <tr height="25px">
                        <td>
                            <span class="headline txtBlue size14"><a href='needs_test.aspx'>Needs Test</a></span>
                        </td>
                        <td>
                            <span class="txtGrey size14"><a href='needs_test.aspx'>


                                <a href="needs_view.aspx?id=38028200">View
                                    his
                                    relationship needs</a></a></span>
                        </td>
                        <td>
                        </td>
                        <td>
                            <span class="headline txtBlue size14"><a href='poftest.aspx'>Chemistry</a></span>
                        </td>
                        <td>
                            <span class="txtGrey size14"><a href='poftest.aspx'>

                                <a href="personality.aspx?id=26&user_id=41724176" rel="nofollow">View
                                    his
                                    chemistry results</a></a></span>
                        </td>
                    </tr>
                    <tr height="25px">
                        <td>
                            <span class="headline txtBlue size14">Do you drink?</span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Occasionnellement</span>
                        </td>
                        <td>
                        </td>
                        <td>
                            <span class="headline txtBlue size14">Do you want children?</span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Non divulgué</span>
                        </td>
                    </tr>
                    <tr height="25px">
                        <td>
                            <span class="headline txtBlue size14">Marital Status</span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Célibataire</span>
                        </td>
                        <td>
                        </td>
                        <td>
                            <span class="headline txtBlue size14">Do you do drugs?</span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Non</span>
                        </td>
                    </tr>

                    <tr height="25px">
                        <td>
                            <span class="headline txtBlue size14">Pets </span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Aucun</span>
                        </td>
                        <td>
                        </td>
                        <td>
                            <span class="headline txtBlue size14">Eye Color</span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Bruns</span>
                        </td>
                    </tr>

                    <tr height="25px">
                        <td>
                            <span class="headline txtBlue size14">Do you have a car? </span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                N/A</span>
                        </td>
                        <td>
                        </td>
                        <td>
                            <span class="headline txtBlue size14">Do you have children?</span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Non</span>
                        </td>
                    </tr>
                    <tr height="25px">
                        <td>
                         <span class="headline txtBlue size14">Longest Relationship</span>
                        </td>

                        <td>
                            <span class="txtGrey size14">
                                Plus de 2 ans</span>
                        </td>
                        <td>
                        </td>
                        <td>

                        </td>
                        <td>

                        </td>
                    </tr>

                </table> 
</body>
</html>

Here are tableUserDetails.content, tableUserDetails and tableUserDetailsList for both tables:

*First table*

print tableUserDetails.content = None

print tableUserDetails =

  <table class="user-details">
                <tr>
                    <td class="headline txtBlue size15" style="width:80px">
                        About
                    </td>
                    <td style="width:10px">
                        &nbsp;
                    </td>
                    <td class="txtGrey size15">
                        Fume occasionnellement with Silhouette mince
                    </td>
                    <td width="25px;">
                        &nbsp;
                    </td>
                    <td class="headline txtBlue size15">
                        City
                    </td>
                    <td class="txtGrey size15">
                        Paris Ile-de-France
                    </td>
                </tr>
                <tr>
                    <td class="headline txtBlue size15">
                        Details
                    </td>
                    <td style="width:10px">
                        &nbsp;
                    </td>
                    <td class="txtGrey size15">
                        26 year old Un homme, 185cm, Sans religion
                    </td>
                    <td>
                    </td>
                    <td class="headline txtBlue size15">
                        Ethnicity
                    </td>
                    <td class="txtGrey size15">
                        Caucasienne Balance with Châtains
                    </td>
                </tr>
                <tr>
                    <td class="headline txtBlue size15">
                        Intent
                    </td>
                    <td style="width:10px">
                        &nbsp;
                    </td>
                    <td class="txtGrey size15">
                        French.Kiss Cherche une relation amoureuse.
                    </td>
                    <td>
                    </td>
                    <td class="headline txtBlue size15" style="width:90px">
                        Education
                    </td>
                    <td class="txtGrey size15">
                        Diplôme universitaire/Licence
                    </td>
                </tr>

                <tr>
                    <td class="headline txtBlue size15">
                        Personnalité
                    </td>
                    <td style="width:10px">
                        &nbsp;
                    </td>
                    <td class="txtGrey size15">

                    </td>   <td>
                    </td>
                <td>
                            <span class="headline txtBlue size15">Profession </span>
                        </td>
                        <td>
                            <span class="txtGrey size15">
                                Visioconférence</span>
                        </td>
                </tr>

            </table> 

print tableUserDetailsList = [u'About', u'Fume occasionnellement with Silhouette mince', u'City', u'Paris Ile-de-France', u'Details', u'26 year old Un homme, 185cm, Sans religion', u'Ethnicity', u'Caucasienne Balance with Ch\xe2tains', u'Intent', u'French.Kiss Cherche une relation amoureuse.', u'Education', u'Dipl\xf4me universitaire/Licence', u'Personnalit\xe9']

*Second table*

print tableUserDetails.content = None

print tableUserDetails =

 <table width="85%" class="secondpart">
                    <tr height="25px">
                        <td width="200px">
                            <span class="headline txtBlue size14">I am Seeking a</span>
                        </td>
                        <td width="300px">
                            <span class="txtGrey size14">
                                Une femme</span>
                        </td>
                        <td width="25px">
                        </td>
                        <td width="200px">
                            <span class="headline txtBlue size14">For</span>
                        </td>
                        <td width="200px">
                            <span class="txtGrey size14">
                                Sorties</span>
                        </td>
                    </tr>
                    <tr height="25px">
                        <td>
                            <span class="headline txtBlue size14"><a href='needs_test.aspx'>Needs Test</a></span>
                        </td>
                        <td>
                            <span class="txtGrey size14"><a href='needs_test.aspx'>


                                <a href="needs_view.aspx?id=38028200">View
                                    his
                                    relationship needs</a></a></span>
                        </td>
                        <td>
                        </td>
                        <td>
                            <span class="headline txtBlue size14"><a href='poftest.aspx'>Chemistry</a></span>
                        </td>
                        <td>
                            <span class="txtGrey size14"><a href='poftest.aspx'>

                                <a href="personality.aspx?id=26&user_id=41724176" rel="nofollow">View
                                    his
                                    chemistry results</a></a></span>
                        </td>
                    </tr>
                    <tr height="25px">
                        <td>
                            <span class="headline txtBlue size14">Do you drink?</span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Occasionnellement</span>
                        </td>
                        <td>
                        </td>
                        <td>
                            <span class="headline txtBlue size14">Do you want children?</span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Non divulgué</span>
                        </td>
                    </tr>
                    <tr height="25px">
                        <td>
                            <span class="headline txtBlue size14">Marital Status</span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Célibataire</span>
                        </td>
                        <td>
                        </td>
                        <td>
                            <span class="headline txtBlue size14">Do you do drugs?</span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Non</span>
                        </td>
                    </tr>

                    <tr height="25px">
                        <td>
                            <span class="headline txtBlue size14">Pets </span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Aucun</span>
                        </td>
                        <td>
                        </td>
                        <td>
                            <span class="headline txtBlue size14">Eye Color</span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Bruns</span>
                        </td>
                    </tr>

                    <tr height="25px">
                        <td>
                            <span class="headline txtBlue size14">Do you have a car? </span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                N/A</span>
                        </td>
                        <td>
                        </td>
                        <td>
                            <span class="headline txtBlue size14">Do you have children?</span>
                        </td>
                        <td>
                            <span class="txtGrey size14">
                                Non</span>
                        </td>
                    </tr>
                    <tr height="25px">
                        <td>
                         <span class="headline txtBlue size14">Longest Relationship</span>
                        </td>

                        <td>
                            <span class="txtGrey size14">
                                Plus de 2 ans</span>
                        </td>
                        <td>
                        </td>
                        <td>

                        </td>
                        <td>

                        </td>
                    </tr>

                </table> 

print tableUserDetailsList = []

2 Answers

This works:

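i = 0  # reset the counter before parsing this table
tableUserDetailsList = []
for row in tableUserDetails.findAll('tr'):
    for col in row.findAll('td'):
        # stripped_strings yields each string in the cell, stripped, skipping empty ones
        contents = list(col.stripped_strings)
        if contents:
            contentTd = contents[0]
            print "TD Number %d : %s" % (i, contentTd)
            tableUserDetailsList.append(contentTd)
            i += 1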

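The problem is that the cells of your second table contain spans. The line break before each span is also treated as content and is returned in the col.contents list, so col.contents[0] is only whitespace. After strip() it is an empty string, and the `if contentTd` check skips the cell.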
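To make the difference visible, here is a minimal sketch. It assumes Python 3 and the built-in "html.parser" backend (the question uses Python 2), and the one-cell HTML fragment is a made-up stand-in shaped like a cell of the "secondpart" table.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one cell laid out like those in the second table.
html = """
<table><tr>
  <td>
      <span class="txtGrey size14">
          Une femme</span>
  </td>
</tr></table>
"""

cell = BeautifulSoup(html, "html.parser").find("td")

# The first child of the cell is the whitespace before <span>,
# not the span itself.
first_child = cell.contents[0]
print(repr(first_child))

# Stripping that whitespace leaves an empty string, which is falsy.
print(repr(first_child.strip()))

# stripped_strings walks every string below the cell, strips it,
# and skips the ones that end up empty.
texts = list(cell.stripped_strings)
print(texts)
```

Printing `cell.contents` on one of your real cells should show the same pattern: whitespace, then the span, then whitespace again.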

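The stripped_strings loop also works for the first table. As Anubhav commented, you should really consider iterating over the tables instead of using the same code twice.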
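One possible way to avoid the duplication is to move the loop into a function and call it once per table. This is only a sketch: the name `table_cells` is made up for illustration, the HTML is a heavily shortened stand-in for the page in the question, and it assumes Python 3 with the "html.parser" backend.

```python
from bs4 import BeautifulSoup


def table_cells(soup, css_class):
    """Return the first non-empty string of each <td> in the table with this class."""
    table = soup.find("table", {"class": css_class})
    if table is None:
        return []  # no such table on the page
    cells = []
    for row in table.find_all("tr"):
        for col in row.find_all("td"):
            texts = list(col.stripped_strings)
            if texts:
                cells.append(texts[0])
    return cells


# Hypothetical, shortened version of the two tables from the question.
html = """
<table class="user-details">
  <tr><td>
        City
      </td><td>&nbsp;</td><td>
        Paris Ile-de-France
      </td></tr>
</table>
<table class="secondpart">
  <tr><td>
        <span>Pets </span>
      </td><td></td><td>
        <span>
            Aucun</span>
      </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Same code path for both tables, keyed by their class names.
results = {name: table_cells(soup, name) for name in ("user-details", "secondpart")}
for name, cells in results.items():
    print(name, cells)
```

Returning an empty list when the table is missing is a design choice for this sketch; raising an error may suit you better if a missing table means the page layout changed.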

answered 2013-05-15T07:17:40.060

Instead of using table = soup.find('table')

use table = soup.find_all('table')

This returns a list of all the tables in your HTML, and you can then pick the right one from the list.
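A small sketch of that idea, assuming Python 3 and the "html.parser" backend; the two-table HTML is a cut-down placeholder, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical placeholder with the same two table classes as the question.
html = """
<table class="user-details"><tr><td>About</td></tr></table>
<table width="85%" class="secondpart"><tr><td><span>For</span></td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag, in document order.
tables = soup.find_all("table")
print(len(tables))

# Pick a table by position...
second = tables[1]
print(second.get("class"))

# ...or filter the list by class, which does not depend on table order.
by_class = [t for t in tables if "secondpart" in t.get("class", [])]
print(len(by_class))
```

Picking by index breaks if the page gains or loses a table, so filtering on the class attribute is probably the safer of the two.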

answered 2014-04-16T22:34:27.080