
I am trying to scrape a website that contains JavaScript code and whose content is generated by that JavaScript.

I have installed Scrapy and Splash.

Splash is running, started with this command:

sudo docker run -p 8050:8050 -v /etc/splash/proxy-profiles:/etc/splash/proxy-profiles scrapinghub/splash
2015-08-21 07:21:06+0000 [-] Log opened.
2015-08-21 07:21:06.483344 [-] Splash version: 1.7
2015-08-21 07:21:06.490230 [-] Qt 4.8.1, PyQt 4.9.1, WebKit 534.34, sip 4.13.2, Twisted 15.2.1, Lua 5.2
2015-08-21 07:21:06.490505 [-] Open files limit: 524288
2015-08-21 07:21:06.490745 [-] Open files limit increased from 524288 to 1048576
2015-08-21 07:21:06.699607 [-] Xvfb is started: ['Xvfb', ':1087', '-screen', '0', '1024x768x24']
2015-08-21 07:21:06.808450 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
2015-08-21 07:21:06.929580 [-] verbosity=1
2015-08-21 07:21:06.929964 [-] slots=50
2015-08-21 07:21:06.930484 [-] Web UI: enabled, Lua: enabled (sandbox: enabled), Proxy Server: enabled
2015-08-21 07:21:06.931420 [-] Site starting on 8050
2015-08-21 07:21:06.931640 [-] Starting factory <twisted.web.server.Site instance at 0x1b5b3f8>
2015-08-21 07:21:06.938232 [-] SplashProxyServerFactory starting on 8051
2015-08-21 07:21:06.938468 [-] Starting factory <splash.proxy_server.SplashProxyServerFactory instance at 0x1b5bcf8>

When I try to fetch the page source, render.html returns "Javascript is not enabled. Please enable JavaScript in your browser".

import scrapy

class xxxxxSpider(scrapy.Spider):
    start_urls = ["xxxxx"]
    name = "sahibinden"
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5, 'proxy':'xxxxx'}
                }
            })

    def parse(self, response):
        with open("result.txt", "a") as myfile:
            myfile.write(str(response.css('body').extract()))

All the settings are OK.

DOWNLOADER_MIDDLEWARES = {
    'scrapyjs.SplashMiddleware': 725,
}

SPLASH_URL = 'http://localhost:8050/'

DUPEFILTER_CLASS = 'scrapyjs.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapyjs.SplashAwareFSCacheStorage'

I managed to scrape the site successfully once. After that I started getting the "Javascript is not enabled in your browser" error.

In case it helps, this is the Splash output when I render the page.

2015-08-21 08:06:09.838076 [-] "172.17.42.1" - - [21/Aug/2015:08:06:09 +0000] "POST /render.html HTTP/1.1" 200 4048 "-" "Scrapy/1.0.3.post1+g83a06ed (+http://scrapy.org)"
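
One way to narrow this down would be to request the same page from Splash directly, without Scrapy in between, and compare the HTML that comes back. The sketch below only builds the GET URL for the render.html endpoint from the url, wait and proxy arguments already used in the spider. The helper name build_render_url is hypothetical, and the base address is assumed to be the same as SPLASH_URL in the settings.

```python
from urllib.parse import urlencode

# Assumption: same address as SPLASH_URL in the Scrapy settings.
SPLASH_URL = "http://localhost:8050/"


def build_render_url(page_url, wait=0.5, proxy=None):
    """Build a GET URL for Splash's render.html endpoint (hypothetical helper)."""
    params = {"url": page_url, "wait": wait}
    # Only send the proxy argument when a proxy profile name is given.
    if proxy is not None:
        params["proxy"] = proxy
    return SPLASH_URL.rstrip("/") + "/render.html?" + urlencode(params)
```

The resulting URL can be opened in a browser or fetched with curl. If the "Javascript is not enabled" page also comes back there, the problem is probably between Splash and the site rather than in the Scrapy settings.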

I don't understand what the problem is. Can anyone help?

More information

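I deleted the virtual machine, so the IP address changed, and then I tried again. The first request returned the result successfully, but the second request returned nothing. I think the website is blocking my IP address.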
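
To check the blocking theory, the parse callback could test each response for the error page before writing it out. This is only a minimal sketch: the function name looks_blocked and the marker strings are assumptions based on the message quoted above, not the exact text the site sends.

```python
# Assumption: lowercase fragments of the error page text; adjust to the real page.
BLOCK_MARKERS = ("javascript is not enabled", "enable javascript")


def looks_blocked(html):
    """Return True if the HTML looks like the 'Javascript is not enabled' page."""
    lowered = html.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)
```

Inside parse, looks_blocked(response.text) could decide whether to log a warning instead of appending the body to result.txt, which would show whether every request after the first one gets the same page.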
