-1

Why doesn't BeautifulSoup manage to download information from Wix?

I'm trying to use BeautifulSoup to download images from my website. The code works on other sites (example of the code actually working), but it does not work on my Wix site. Is there anything I can change in my site's settings to make it work?

EDIT: CODE

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from urllib.parse import urljoin
import shutil
import time

import requests


def make_soup(url):
    # Send a browser-like User-Agent; some servers reject the default one.
    req = Request(url, headers={'User-Agent': "Magic Browser"})
    html = urlopen(req)
    return BeautifulSoup(html, 'html.parser')


def get_images(url):
    soup = make_soup(url)
    images = soup.find_all('img')
    print(str(len(images)) + " images found.")
    print('Downloading images to current working directory.')
    image_links = [each.get('src') for each in images]
    for each in image_links:
        try:
            # Use the last path segment of the src as the local file name.
            filename = each.strip().split('/')[-1].strip()
            src = urljoin(url, each)
            print('Getting: ' + filename)
            response = requests.get(src, stream=True)
            # delay to avoid corrupted previews
            time.sleep(1)
            with open(filename, 'wb') as out_file:
                shutil.copyfileobj(response.raw, out_file)
        except Exception:
            print('  An error occurred. Continuing.')
    print('Done.')


def main():
    url = HIDDEN ADDRESS
    get_images(url)

if __name__ == '__main__':
    main()

2 Answers

1

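BeautifulSoup can only parse HTML. Wix sites are generated by JavaScript that runs when the page loads. When you request the page's HTML through urllib, you don't get the rendered HTML; you only get the base HTML containing the scripts that build the rendered page. To get the rendered version, you need something like Selenium or a headless Chrome browser to render the site through its JavaScript, then take the rendered HTML and feed it to BeautifulSoup.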
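As a rough sketch of that approach, assuming Selenium 4 with a local Chrome install: the function names `render_html` and `image_urls` are made up for illustration, and waiting for the first `<img>` to appear is only one possible way to decide that the page has finished rendering.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def render_html(url, timeout=15):
    """Load url in headless Chrome and return the HTML after scripts have run."""
    # Imported here so the parsing helper below also works without Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Assumption: the page counts as rendered once one <img> exists.
        # Raises a TimeoutException if none shows up within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "img"))
        )
        return driver.page_source
    finally:
        driver.quit()


def image_urls(html, base_url):
    """Return an absolute URL for every <img> in html that has a src."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, img["src"]) for img in soup.find_all("img", src=True)]
```

Calling `image_urls(render_html(url), url)` would then give a list of links that can be downloaded the same way as in the question's loop.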

Here is an example of the body of a Wix site. You can see that it contains nothing except a single div that gets populated through JavaScript.

...
    <body>
        <div id="SITE_CONTAINER"></div>
    </body>
...
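To see what this means for a scraper, here is a small check that parses a shell like the one above. The `STATIC_HTML` string is an abridged stand-in rather than a real download, so treat it as an illustration only.

```python
from bs4 import BeautifulSoup

# The static HTML as served, before any JavaScript has run (abridged).
STATIC_HTML = """
<html>
    <body>
        <div id="SITE_CONTAINER"></div>
    </body>
</html>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")
container = soup.find("div", id="SITE_CONTAINER")
images = soup.find_all("img")

# There are no <img> tags yet, so an image scraper has nothing to download.
print(len(images), "images found.")
print("container is empty:", not container.contents)
```

This is why the script from the question reports no images on a page like this: the `<img>` elements only exist after the browser has executed the scripts.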
Answered 2018-03-29T22:10:20.190
0

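For anyone trying to download images from a Wix website, I managed to come up with a simple idea. Add an HTML code frame to your page and, in that code, put img tags whose src points to the pictures on your site. When you run BeautifulSoup on the URL of that HTML code, all the images linked in the code will be downloaded!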
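A minimal sketch of the idea, assuming the HTML frame contains plain `<img>` tags: `EMBED_HTML` and its image URLs are hypothetical placeholders, and `filenames_and_sources` is a helper name made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical content of the HTML code frame; the URLs are placeholders.
EMBED_HTML = """
<html><body>
  <img src="https://static.example.com/media/photo1.jpg">
  <img src="https://static.example.com/media/photo2.png">
</body></html>
"""


def filenames_and_sources(html):
    """Return (filename, src) pairs for every <img> with a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for img in soup.find_all("img", src=True):
        src = img["src"].strip()
        # Same naming rule as the question's script: last path segment.
        pairs.append((src.split("/")[-1], src))
    return pairs


for filename, src in filenames_and_sources(EMBED_HTML):
    print(filename, "<-", src)
```

Because the `<img>` tags are written directly in the frame's HTML, they are present without running any JavaScript, which is what lets the original script find them.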

Answered 2018-03-30T12:06:09.513