python - Unable to open images downloaded using the Beautiful Soup library

Question

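I have a script that uses the BeautifulSoup library to download images from a web page. When I use a site such as http://www.google.com, the image downloads correctly to a folder on my desktop, and I can open and view it. However, when I use a site such as https://sites.google.com/site/imagesizetesting/one-1, the image appears to download to the correct folder on my desktop, but I get an error message saying "Paint cannot read this file. This is not a valid bitmap file, or its format is not currently supported." I think it may have something to do with the fact that the file path in the HTML of the Google home page is relative (it is /images/srpr/logo4w.png), whereas the path of the image on https://sites.google.com/site/imagesizetesting/one-1 is not relative: it is https://sites.google.com/site/imagesizetesting//rsrc/1370373631437/one-1/Test.png. I don't know whether the difference in the image source is what causes this or whether it is something else. Any ideas? Here is the code I use to parse and download the images.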

for image in soup.findAll("img"):
        print "Old Image Path: %(src)s" % image
        #Get file name
        filename = image["src"].split("/")[-1]
        #Get full path name if url has to be parsed
        parsedURL[2] = image["src"]
        image["src"] = '%s\%s' % (phonepath,filename)
        print 'New Path: %s' % image["src"]
        outpath = os.path.join(out, filename)

        #retrieve images
        if image["src"].lower().startswith("http"):
            urlretrieve(image["src"], outpath)
            print image["src"].lower()
        else:
            urlretrieve(urlparse.urlunparse(parsedURL), outpath) #Constructs URL from tuple (parsedURL)
            print image["src"].lower()
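
One way to check whether the saved file is really an image, rather than something else stored under a .png name (an HTML page, for instance), is to look at its first bytes. This is a minimal sketch in Python 3, not part of the original script; `looks_like_png` is a hypothetical helper name introduced here for illustration.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the 8 bytes every PNG file starts with


def looks_like_png(path):
    """Return True if the file at path begins with the PNG signature.

    Hypothetical helper for illustration; it only inspects the header
    and does not validate the rest of the file.
    """
    with open(path, "rb") as f:
        return f.read(len(PNG_SIGNATURE)) == PNG_SIGNATURE
```

Calling it on the downloaded Test.png would indicate whether the download stored PNG data or some other content that Paint cannot read.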

score 0 · Accepted Answer

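I figured it out! Here is my updated code, in case anyone else runs into a similar problem. The change is in the download check: it now tests the original src saved in parsedURL[2] rather than image["src"], which by that point has already been overwritten with the new path.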

for image in soup.findAll("img"):
    print "Old Image Path: %(src)s" % image
    # Get the file name (last segment of the src)
    filename = image["src"].split("/")[-1]
    # Save the original src in the path slot before image["src"] is overwritten
    parsedURL[2] = image["src"]
    # Rewrite the tag so it points at phonepath
    image["src"] = '%s\\%s' % (phonepath, filename)
    print 'New Path: %s' % image["src"]
    outpath = os.path.join(out, filename)

    # Retrieve images: test the original src, not the rewritten image["src"]
    if parsedURL[2].lower().startswith("http"):
        # src is already an absolute URL, so download it directly
        urlretrieve(parsedURL[2], outpath)
        print image["src"].lower()
    else:
        # src is relative: construct the full URL from parsedURL
        print "HTTP INFO " + urlparse.urlunparse(parsedURL)
        print "HTTP INFO " + image["src"].lower()
        urlretrieve(urlparse.urlunparse(parsedURL), outpath)
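
The sketch below shows what the two branches produce for the two src values from the question. It is written in Python 3 (the code in this post is Python 2), and `build_with_urlunparse` and `resolve_image_url` are hypothetical helper names. The second helper uses `urljoin` from the standard library, which is a swapped-in alternative to the manual `startswith("http")` branch and not what the code above does.

```python
from urllib.parse import urljoin, urlparse, urlunparse

PAGE_URL = "https://sites.google.com/site/imagesizetesting/one-1"


def build_with_urlunparse(page_url, src):
    """Mimic the else-branch: put src into the path slot (index 2)."""
    parts = list(urlparse(page_url))
    parts[2] = src
    return urlunparse(parts)


def resolve_image_url(page_url, src):
    """Resolve src against the page URL.

    urljoin leaves an absolute src unchanged and resolves a
    relative one against page_url, so no explicit check is needed.
    """
    return urljoin(page_url, src)


relative_src = "/images/srpr/logo4w.png"
absolute_src = (
    "https://sites.google.com/site/imagesizetesting/"
    "/rsrc/1370373631437/one-1/Test.png"
)

# A relative src works with either approach.
print(build_with_urlunparse("http://www.google.com", relative_src))
print(resolve_image_url("http://www.google.com", relative_src))

# An absolute src pushed through urlunparse ends up nested inside the
# page's host, which is not the image's real address.
print(build_with_urlunparse(PAGE_URL, absolute_src))

# urljoin returns the absolute src as it was.
print(resolve_image_url(PAGE_URL, absolute_src))
```

With this approach, the download step could become a single `urlretrieve(resolve_image_url(page_url, original_src), outpath)` call, where `page_url` and `original_src` stand for the page address and the unmodified src value; both names are assumptions for illustration.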
