I am trying to parse a simple HTML table with BeautifulSoup, but I am running into some problems.

Here is my input:

<table id="people" class="tt" width="99%" border="0" cellpadding="0" cellspacing="1">
 <tr>
  <td colspan="3" bgcolor="#d3d3d3">
   <p align="center" style="border: 1px solid #c0c0c0; padding: 0.02in">
    <a name="faculty">
    </a>
    <b>
     Faculty
    </b>
   </p>
  </td>
 </tr>
 <tr>
  <td>
   <p align="center">
    <font color="#000080">
     <a href="http://www.website.com/%7Empop">
      <font color="#000080">
       <img src="images/mpop.jpg" name="graphics1" align="bottom" width="70" height="85" border="1" />
      </font>
     </a>
    </font>
   </p>
  </td>
  <td>
   <p>
    <b>
     John Doe, Ph.D.
    </b>
    <br />
    Associate Professor, Computer
                Science
    <br />

   </p>
  </td>
  <td>
   <p>
    Office:  Sciences Bldg.
    <br />
    Phone:
                xxx-xxx-xxxx
    <br />
    jd [at] website.com
    <br />
       </p>
  </td>
 </tr>
 <tr>
  <td>
   <p align="center">
    <font color="#000080">
     <a href="http://www.website.com/%7Ercolwell">
      <font color="#000080">
       <img src="images/rcolwell.jpg" name="graphics2" align="bottom" width="70" height="97" border="1" />
      </font>
     </a>
    </font>
   </p>
  </td>
  <td>
   <p>
    <b>
     Jane Doe, Ph.D.
    </b>
    <br />
     Professor
    <br />
  School of Public Health
    <br />
   </p>
  </td>
  <td>
   <p>
    Sciences Bldg
    <br />
    jd [at]
                website.com
    <br />

    </a>
   </p>
  </td>
 </tr>
</table>

Here is my code:

t = soup.findAll("table",id="people")
for table in t:
    rows = table.findAll("tr")
    for tr in rows:
        cols = tr.findAll("td")
        for td in cols:
            print(str(td.find(text=True))) # tried also print(td.find(text=True))
            print(",")
        print("\n")

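This produces output that contains only commas and no actual text. When I print(td) I do see the information I need, but it comes out as HTML with all the tags. Can anyone point out what I should be doing here? I just want to extract the cell contents.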

Cheers

1 Answer

Maybe you are looking for something like this:

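# BeautifulSoup 3 import (with the newer bs4 package: from bs4 import BeautifulSoup)
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup("<table id=people><tr><td>x<a>y</a>z</td><td>x<a>y</a>z</td></tr></table>")
t = soup.findAll("table", id="people")
for table in t:
    rows = table.findAll("tr")
    for tr in rows:
        cols = tr.findAll("td")
        # td.text concatenates all the text inside the cell and drops the tags
        print(','.join([td.text for td in cols]))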
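The snippet above targets BeautifulSoup 3. As a minimal sketch, assuming the newer `bs4` package on Python 3 is installed, the same loop might look like the following. It also suggests why the question's code printed only commas: each `<td>` in that input starts with a newline and indentation, so `td.find(text=True)` returns that whitespace-only string rather than the visible text. The helper name `cell_text` and the shortened sample table are illustrative and not part of the original post.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the table in the question (illustrative only).
html = """
<table id="people">
 <tr>
  <td>
   <p><b>John Doe, Ph.D.</b><br />Associate Professor, Computer
        Science<br /></p>
  </td>
  <td>
   <p>Office:  Sciences Bldg.<br />jd [at] website.com<br /></p>
  </td>
 </tr>
</table>
"""


def cell_text(td):
    # get_text(" ") gathers every text node under the cell, separated by
    # spaces; split()/join() then collapses newlines and indentation.
    return " ".join(td.get_text(" ").split())


soup = BeautifulSoup(html, "html.parser")

# The first text node inside a cell is just the indentation before <p>.
first_node = soup.find("td").find(string=True)

rows = []
for table in soup.find_all("table", id="people"):
    for tr in table.find_all("tr"):
        rows.append([cell_text(td) for td in tr.find_all("td")])

for row in rows:
    # Cells are joined with " | " here because the names contain commas.
    print(" | ".join(row))
```

Collapsing whitespace is a choice made for this sketch; if the line breaks inside a cell matter, `td.get_text("\n", strip=True)` may be a better fit.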

Alternatively, you can use u''.join(map(unicode, td.contents)), depending on what you want to print.
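To make the difference between the two options concrete, here is a small sketch, again assuming `bs4` on Python 3 (where `str` takes the place of Python 2's `unicode`). The variable names `text_only` and `inner_html` are made up for this illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<table id=people><tr><td>x<a>y</a>z</td></tr></table>", "html.parser"
)
td = soup.find("td")

# Text only: the tags inside the cell are dropped.
text_only = td.get_text()

# Cell contents with the inner markup preserved.
inner_html = "".join(map(str, td.contents))

print(text_only)   # xyz
print(inner_html)  # x<a>y</a>z
```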

answered 2012-07-13T19:29:01.350