47

我想提取:

  • image来自标签的以下 src 的文本和
  • div类数据内的锚标记的文本

我成功地提取了 img src,但无法从锚标记中提取文本。

<a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&amp;ie=UTF8&amp;qid=1343628292&amp;sr=1-1&amp;keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> 

这是整个HTML 页面的链接。

这是我的代码:

for div in soup.findAll('div', attrs={'class':'image'}):
    print "\n"
    for data in div.findNextSibling('div', attrs={'class':'data'}):
        for a in data.findAll('a', attrs={'class':'title'}):
            print a.text
    for img in div.findAll('img'):
        print img['src']

我想要做的是提取图像 src (链接)和里面的标题div class=data,例如:

 <a class="title" href="http://www.amazon.com/Nikon-COOLPIX-Digital-Camera-NIKKOR/dp/B0073HSK0K/ref=sr_1_1?s=electronics&amp;ie=UTF8&amp;qid=1343628292&amp;sr=1-1&amp;keywords=digital+camera">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> 

应该提取:

Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)

4

7 回答 7

78

这将有助于:

from bs4 import BeautifulSoup

data = '''<div class="image">
        <a href="http://www.example.com/eg1">Content1<img  
        src="http://image.example.com/img1.jpg" /></a>
        </div>
        <div class="image">
        <a href="http://www.example.com/eg2">Content2<img  
        src="http://image.example.com/img2.jpg" /> </a>
        </div>'''

soup = BeautifulSoup(data)

for div in soup.findAll('div', attrs={'class':'image'}):
    print(div.find('a')['href'])
    print(div.find('a').contents[0])
    print(div.find('img')['src'])

如果您正在研究亚马逊产品,那么您应该使用官方 API。至少有一个 Python 包可以缓解您的抓取问题,并使您的活动保持在使用条款范围内。

于 2012-07-30T12:00:42.287 回答
29

就我而言,它是这样工作的:

from BeautifulSoup import BeautifulSoup as bs

url="http://blabla.com"

soup = bs(urllib.urlopen(url))
for link in soup.findAll('a'):
        print link.string

希望能帮助到你!

于 2015-12-01T11:37:00.637 回答
8

我建议走 lxml 路线并使用 xpath。

from lxml import etree
# data is the variable containing the html
data = etree.HTML(data)
anchor = data.xpath('//a[@class="title"]/text()')
于 2012-07-30T13:19:15.090 回答
3

以上所有答案确实帮助我构建了我的答案,因此我投票支持其他用户提出的所有答案:但我最终对我正在处理的确切问题提出了自己的答案:

正如问题明确定义的那样,我必须访问 dom 结构中的一些兄弟姐妹及其孩子:此解决方案将遍历 dom 结构中的图像并使用产品标题构造图像名称并将图像保存到本地目录。

import urlparse
from urllib2 import urlopen
from urllib import urlretrieve
from BeautifulSoup import BeautifulSoup as bs
import requests

def getImages(url):
    #Download the images
    r = requests.get(url)
    html = r.text
    soup = bs(html)
    output_folder = '~/amazon'
    #extracting the images that in div(s)
    for div in soup.findAll('div', attrs={'class':'image'}):
        modified_file_name = None
        try:
            #getting the data div using findNext
            nextDiv =  div.findNext('div', attrs={'class':'data'})
            #use findNext again on previous object to get to the anchor tag
            fileName = nextDiv.findNext('a').text
            modified_file_name = fileName.replace(' ','-') + '.jpg'
        except TypeError:
            print 'skip'
        imageUrl = div.find('img')['src']
        outputPath = os.path.join(output_folder, modified_file_name)
        urlretrieve(imageUrl, outputPath)

if __name__=='__main__':
    url = r'http://www.amazon.com/s/ref=sr_pg_1?rh=n%3A172282%2Ck%3Adigital+camera&keywords=digital+camera&ie=UTF8&qid=1343600585'
    getImages(url)
于 2012-07-30T21:40:39.387 回答
1
>>> txt = '<a class="title" href="http://rads.stackoverflow.com/amzn/click/B0073HSK0K">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> '
>>> fragment = bs4.BeautifulSoup(txt)
>>> fragment
<a class="title" href="http://rads.stackoverflow.com/amzn/click/B0073HSK0K">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a> 
>>> fragment.find('a', {'class': 'title'})
<a class="title" href="http://rads.stackoverflow.com/amzn/click/B0073HSK0K">Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)</a>
>>> fragment.find('a', {'class': 'title'}).string
u'Nikon COOLPIX L26 16.1 MP Digital Camera with 5x Zoom NIKKOR Glass Lens and 3-inch LCD (Red)'
于 2012-07-30T17:57:38.823 回答
1
print(link_addres.contents[0])

它将打印锚标签的上下文

例子:

 statement_title = statement.find('h2',class_='briefing-statement__title')
 statement_title_text = statement_title.a.contents[0]
于 2020-12-16T12:21:15.577 回答
1

要从锚标记中获取 hreftag.get("href")并获取您使用的 img src tag.img.get("src")

例如,使用此数据:

data = """
            <div class="image">
            <a href="http://www.example.com/eg1">Content1<img src="http://image.example.com/img1.jpg" /></a>
            </div>
            <div class="image">
            <a href="http://www.example.com/eg2">Content2<img src="http://image.example.com/img2.jpg" /> </a>
            </div>
        """

获取链接和文本:

import requests
from bs4 import BeautifulSoup

def get_soup(url):
    response = requests.get(url)
    if response.ok:
        return BeautifulSoup(response.text, features="html.parser")

def get_links(soup):
    links = []
    for tag in soup.findAll("a", href=True):
        if img := tag.img:
            img = img.get("src")
        links.append(dict(url=tag.get("href"), text=tag.text, img=img))
    return links

# soup = get_soup('www.example.com')
soup = BeautifulSoup(data, features="html.parser")
links = get_links(soup)

输出:

[{'url': 'http://www.example.com/eg1', 'text': 'Content1', 'img': 'http://image.example.com/img1.jpg'},
{'url': 'http://www.example.com/eg2', 'text': 'Content2 ', 'img': 'http://image.example.com/img2.jpg'}]
于 2021-05-07T11:07:52.287 回答