I'm trying to get all the review text from this page (http://www.amazon.com/Learning-Java-Patrick-Niemeyer/dp/1449319246%3FSubscriptionId%3DAKIAIZJQKUHUCXRLH6MQ%26tag%3Dyuplayit-20%26linkCode%3Dxm2%26camp%3D2025%26creative%3D165953%26creativeASIN%3D1449319246), that is, the text inside the <div class="drkgry">....</div> tags, but it always returns []. I don't know what's going on.

Python:

import bs4 from BeautifulSoup
data = open("example_1.html").read()
soup = BeautifulSoup(data)
soup.find_all("div",class="drkgry")

I have also tried soup.findall("div",class="drkgry") and soup.find_all('div', attrs ={'class':'drkgry'}), but neither of them works.

The source I want to scrape:

</div>  <div class="txtsmall mt4 fvavp"><span class="inlineblock formatVariation"><span class="gr3 gry formatKey">Format:</span><span class="formatValue">Paperback</span></span></div>  <div class="mt9 reviewText">






<div class="drkgry">
  Learning Java (Fourth Edition) is book for Java practitioner as reference book. This covers lot of topics.<br><br>This is an excellent book for someone who knows basics of programming. This book is not beginners. This book lacks examples and exercises which may disappoint few people.<br><br>Book has 24 chapters covering almost all of basic Java.  The chapter one talks about historical aspects. Second chapter is brief introduction of java but it assumes that reader is aware of programming, OOP, threading etc which is difficult for any beginner.
</div>

</div>  <div class="clearboth txtsmall gt9 vtStripe">    <div class="fl cmt">

Can anyone help me solve this?

2 Answers

I ran this exact script:

import urllib
from bs4 import BeautifulSoup as BS

html =urllib.urlopen('http://www.amazon.com/dp/1449319246/?tag=stackoverfl08-20').read()

soup = BS(html)

print soup.findAll('div',{'class':'drkgry'})[1].get_text()

It prints:

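Learning Java (Fourth Edition) is book for Java practitioner as reference book. This covers lot of topics.This is an excellent book for someone who knows basics of programming. This book is not beginners. This book lacks examples and exercises which may disappoint few people.Book has 24 chapters covering almost all of basic Java.  The chapter one talks about historical aspects. Second chapter is brief introduction of java but it assumes that reader is aware of programming, OOP, threading etc which is difficult for any beginner.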

If you run soup.findAll without the index, it gives you a list of every matching review element on the page.
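
For readers on Python 3, here is a minimal sketch of the same idea. It is not the script I ran above: SAMPLE_HTML and extract_reviews are hypothetical names made up for illustration, the sample markup only imitates the snippet in the question, and the live page's markup may differ. Fetching the page (for example with urllib.request.urlopen) is left out so the example runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<div class="mt9 reviewText">
<div class="drkgry">
  First review.<br><br>Second paragraph.
</div>
</div>
<div class="mt9 reviewText">
<div class="drkgry">
  Another review.
</div>
</div>
"""


def extract_reviews(html):
    """Return the stripped text of every <div class="drkgry"> in html."""
    # Name the parser explicitly so the result does not depend on
    # which optional parsers happen to be installed.
    soup = BeautifulSoup(html, "html.parser")
    divs = soup.find_all("div", {"class": "drkgry"})
    # separator=" " keeps text on either side of <br> from running together.
    return [div.get_text(separator=" ", strip=True) for div in divs]


reviews = extract_reviews(SAMPLE_HTML)
for text in reviews:
    print(text)
```

Returning a list rather than indexing with [1] avoids an IndexError when a page has fewer matching divs than expected.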

answered 2013-10-09T03:46:00.940

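Use:

class_="drkgry"

instead of:

class = "drkgry"

class is a reserved word in Python, so it cannot be used as a keyword argument name. That's my guess, anyway.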
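
To make the difference concrete, here is a small sketch of three ways to filter by CSS class in bs4 that sidestep the reserved word. The one-line html string is a made-up example, not the Amazon page, and this only covers the keyword syntax; if the attrs form also returns [] for you, the cause is probably somewhere else.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
html = '<div class="drkgry">review text</div><div class="other">x</div>'
soup = BeautifulSoup(html, "html.parser")

# 1. The class_ keyword (note the trailing underscore).
by_keyword = soup.find_all("div", class_="drkgry")

# 2. An attrs dictionary, where "class" is just a string key.
by_attrs = soup.find_all("div", attrs={"class": "drkgry"})

# 3. A CSS selector.
by_css = soup.select("div.drkgry")

texts = [div.get_text() for div in by_keyword]
print(texts)
```

All three should match the same single div here.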

answered 2013-10-09T03:43:19.807