我想提取:
image
来自标签的以下 src 的文本和div
类数据内的锚标记的文本
我成功地提取了 img src,但无法从锚标记中提取文本。
<a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&ie=UTF8&qid=1343628292&sr=1-1&keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>
这是整个HTML 页面的链接。
这是我的代码:
for div in soup.findAll('div', attrs={'class':'image'}):
print "\n"
for data in div.findNextSibling('div', attrs={'class':'data'}):
for a in data.findAll('a', attrs={'class':'title'}):
print a.text
for img in div.findAll('img'):
print img['src']
我想要做的是提取图像 src (链接)和里面的标题div class=data
,例如:
<a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&ie=UTF8&qid=1343628292&sr=1-1&keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>
应该提取:
Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)