
I want to download the images from the page http://wordpandit.com/learning-bin/visual-vocabulary/page/2/. I fetch it with urllib and parse it with BeautifulSoup. The page contains many URLs, but I only want the ones that end in .jpg and that also carry rel="prettyPhoto[gallery]". How can I do this with BeautifulSoup? For example, this link: http://wordpandit.com/wp-content/uploads/2013/02/Obliterate.jpg

#http://wordpandit.com/learning-bin/visual-vocabulary/page/2/
import urllib
import BeautifulSoup
import lxml
baseurl='http://wordpandit.com/learning-bin/visual-vocabulary/page/'
count=2


for count in range(1,2):
    url=baseurl+count+'/'
    soup1=BeautifulSoup.BeautifulSoup(urllib2.urlopen(url))#read will not be needed
    #find all links to imgs
    atag=soup.findAll(rel="prettyPhoto[gallery]")
    for tag in atag:
        soup2=BeautifulSoup.BeautifulSoup(tag)
        imgurl=soup2.find(href).value
        urllib2.urlopen(imgurl)

1 Answer


Your code has a lot of unnecessary things in it. Maybe you plan to use them later, but assigning count = 2 and then reusing count as the loop variable of a for ... in range() loop is pointless. Here is code that does what you want:

import urllib2
from bs4 import BeautifulSoup

baseurl = 'http://wordpandit.com/learning-bin/visual-vocabulary/page/'

for count in range(1, 2):
    url = baseurl + str(count) + "/"
    html_page = urllib2.urlopen(url)
    soup = BeautifulSoup(html_page)
    # Keep only tags that have rel="prettyPhoto[gallery]" and an href attribute.
    atag = soup.findAll(rel="prettyPhoto[gallery]", href=True)
    for tag in atag:
        # Skip any link that does not point at a .jpg file.
        if tag['href'].endswith(".jpg"):
            imgurl = tag['href']
            img = urllib2.urlopen("http://wordpandit.com" + imgurl)
            # Save the image under its own file name in the current directory.
            with open(imgurl.split("/")[-1], "wb") as local_file:
                local_file.write(img.read())
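
For what it's worth, urllib2 only exists on Python 2. A minimal Python 3 sketch of the same filtering idea, assuming the page markup is unchanged and using the standard-library urllib.request, urljoin, and html.parser instead, might look like this:

import os
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup

baseurl = 'http://wordpandit.com/learning-bin/visual-vocabulary/page/'

for count in range(1, 3):  # pages 1 and 2
    url = baseurl + str(count) + '/'
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    # Keep only <a> tags with rel="prettyPhoto[gallery]" and an href attribute.
    for tag in soup.find_all('a', rel='prettyPhoto[gallery]', href=True):
        href = tag['href']
        if not href.endswith('.jpg'):
            continue
        imgurl = urljoin(url, href)          # handles relative and absolute hrefs
        filename = os.path.basename(imgurl)  # e.g. Obliterate.jpg
        with open(filename, 'wb') as local_file:
            local_file.write(urlopen(imgurl).read())

Using urljoin avoids having to hard-code the "http://wordpandit.com" prefix, since it leaves already-absolute hrefs (like the example link in the question) untouched.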
answered 2013-08-21T09:14:02.703