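When I pull up a mindbodyonline client's schedule in a browser, I can easily get the XPath for the items I want to scrape from the page. However, when I try to scrape the site with scrapy shell, my XPaths never return any objects.

For example, I tried fetching the following URL from scrapy shell: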

$ scrapy shell https://clients.mindbodyonline.com/ASP/adm/home.asp?studioid=2260

2013-07-15 15:50:45-0700 [scrapy] INFO: Scrapy 0.14.4 started (bot: scrapybot)
2013-07-15 15:50:46-0700 [scrapy] DEBUG: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2013-07-15 15:50:46-0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-15 15:50:46-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-15 15:50:46-0700 [scrapy] DEBUG: Enabled item pipelines: 
2013-07-15 15:50:46-0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-15 15:50:46-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-15 15:50:46-0700 [default] INFO: Spider opened
2013-07-15 15:50:53-0700 [default] DEBUG: Redirecting (302) to <GET https://clients.mindbodyonline.com/ASP/ws.asp?studioid=2260> from <GET https://clients.mindbodyonline.com/ASP/adm/home.asp?studioid=2260>
2013-07-15 15:50:55-0700 [default] DEBUG: Redirecting (302) to <GET https://clients.mindbodyonline.com/ASP/ws.asp?studioid=2260&sessionChecked=true> from <GET https://clients.mindbodyonline.com/ASP/ws.asp?studioid=2260>
2013-07-15 15:51:01-0700 [default] DEBUG: Crawled (200) <GET https://clients.mindbodyonline.com/ASP/ws.asp?studioid=2260&sessionChecked=true> (referer: None)
[s] Available Scrapy objects:
[s]   hxs        <HtmlXPathSelector xpath=None data=u'<html>\r\n\t<head>\r\n\t<title>Yoga Now Online'>
[s]   item       {}
[s]   request    <GET https://clients.mindbodyonline.com/ASP/adm/home.asp?studioid=2260>
[s]   response   <200 https://clients.mindbodyonline.com/ASP/ws.asp?studioid=2260&sessionChecked=true>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <BaseSpider 'default' at 0x99480ac>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
Python 2.7.4 (default, Apr 19 2013, 18:32:33) 

In [1]: response.body
Out[1]: '\r\n\t<html>\r\n\t<head>\r\n\t<title>Yoga Now Online</title>\r\n\t<meta http-equiv="Content-Type" content="text/html">\r\n\t<LINK REL="ICON" HREF="/favicon.ico">\r\n\t<LINK REL="SHORTCUT ICON" HREF="/favicon.ico">\r\n\t<script type="text/javascript">\r\n\r\nvar _gaq = _gaq || [];\r\n_gaq.push([\'_setAccount\', \'UA-19985881-2\']);\r\n_gaq.push([\'_setDomainName\', \'none\']);\r\n_gaq.push([\'_setAllowLinker\', true]);\r\n_gaq.push([\'_trackPageview\']);\r\n\r\n(function() {var ga = document.createElement(\'script\'); ga.type = \'text/javascript\'; ga.async = true;\r\nga.src = (\'https:\' == document.location.protocol ? \'https://ssl\' : \'http://www\') + \'.google-analytics.com/ga.js\';\r\nvar s = document.getElementsByTagName(\'script\')[0]; s.parentNode.insertBefore(ga, s);\r\n})();\r\n\r\n</script><link rel="stylesheet" type="text/css" href="https://static.mindbodyonline.com/v33438/styles/jquery.tooltip.css"  /><link rel="stylesheet" type="text/css" href="https://static.mindbodyonline.com/v33438/styles/base/jquery.ui.all.css"  /><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-1.8.2.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/plugins/jquery.cookie-1.0.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/plugins/jquery.mb.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/plugins/jquery.libasync.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.core.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.widget.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.mouse.js"></script><script type="text/javascript" 
src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.draggable.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.droppable.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.resizable.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.dialog.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.autocomplete.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.position.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.effects.core.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.effects.highlight.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/jquery-ui-1.8.23/jquery.ui.datepicker.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/plugins/jquery.tooltip.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/plugins/jquery.ba-resize.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/plugins/jquery.lightboxLib.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/plugins/jquery.hoverIntent.js"></script><script type="text/javascript" src="https://static.mindbodyonline.com/v33438/scripts/plugins/jquery.smartFocus-0.1.js"></script>\r\n\r\n\r\n<script type="text/javascript">\r\n// filePath must be absolute with leading slash\r\nfunction contentUrl(filePath) {\r\n\t\r\n\treturn 
"https://static.mindbodyonline.com/v33438" + filePath;\r\n\t\r\n}\r\n\r\n(function ($) {\r\n\t//$.fn.extend({\r\n\t$.contentUrl = function (filePath) {\r\n\t\t//contentUrl: function (filePath) {\r\n\t\t\t\r\n\t\t\treturn "https://static.mindbodyonline.com/v33438" + filePath;\r\n\t\t\t\r\n\t};\r\n})(jQuery);\r\n\r\n$(function() {\r\n\r\n\t\r\n\t\t// init tooltips\r\n\t\t$("img[title],span[title],select[title],input[title],legend[title]").tooltip({\r\n\t\t\ttrack: true,\r\n\t\t\tshowURL: false,\r\n\t\t\tfade: 250\r\n\t\t});\r\n\t\t\r\n\t\r\n\t$(\'fieldset.collapsible\').setCollapseEvents();\r\n\t\r\n});\r\n</script>\r\n\r\n\r\n<script type="text/javascript">\r\n\r\nfunction launchHome() {\r\n\t\r\n\t\t\tdocument.wsLaunch.action = "home.asp?studioid=2260";\r\n\t\t\r\n\t\tdocument.wsLaunch.submit();\r\n\t}\r\n\t</script>\r\n\t</head>\r\n\t<body onLoad="launchHome();">\r\n\t<form name="wsLaunch" action="home.asp?studioid=2260" method="post">\r\n\t<input type="hidden" name="tg" value="" /> <input type="hidden" name="vt" value="" /> <input type="hidden" name="lvl" value="" /> <input type="hidden" name="stype" value="" /> <input type="hidden" name="qParam" value="" /> <input type="hidden" name="view" value="" /> <input type="hidden" name="trn" value="0" /> <input type="hidden" name="page" value="" /> <input type="hidden" name="catid" value="" /> <input type="hidden" name="prodid" value="" /> <input type="hidden" name="date" value="7/16/2013" /> <input type="hidden" name="classid" value="0" /> <input type="hidden" name="sSU" value="" /> <input type="hidden" name="optForwardingLink" value="" /> \r\n\t<input type="hidden" name="launchGUID" value="" />\r\n\t<input type="hidden" name="launchUID" value="" />\r\n\t<input type="hidden" name="launchPWDChange" value="" />\r\n\t<input type="hidden" name="launchPWDChangeKey" value="" />\r\n\t<input type="hidden" name="launchLostPWD" value="" />\r\n\t\r\n\t\r\n\t<input type="hidden" name="extLink" value="" 
/>\r\n\t</form>\r\n\t<noscript>\r\n\tYou must have javascript enabled to use Yoga Now Online.\r\n\t</noscript>\r\n\t</body>\r\n\t</html>\r\n'

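Sorry that you have to wade through that HTML; I'll try to attach a prettified version later. The point is that the data I need is not in the response that Scrapy fetches. However, when I visit the URL manually, or even use view(response), the following HTML is present (this is the data I want to scrape):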

<tr class="oddRow" style="width: 929px;">
<td style="width: 90px;">&nbsp;&nbsp;&nbsp;4:00&nbsp;pm </td><td style="width: 167px;"></td>
<td style="width: 172px;"><a class="modalClassDesc" name="cid617" href="javascript:;">Vinyasa (Level 1-2)</a></td>
<td style="width: 172px;"><a class="modalBio" name="bio100000375" href="javascript:;">Dietrich McGaffey</a></td>
<td style="width: 106px;">Main Yoga Room</td><td style="width: 162px;">&nbsp;1&nbsp;hour&nbsp;&amp;&nbsp;30&nbsp;minutes</td></tr>

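So that's the big picture; I hope it gives you a good idea of what I'm trying to accomplish. The HTML I want to scrape is available in the browser but not through scrapy shell. I know Scrapy is being redirected. After hours of investigating, I think the problem is either that the site uses JavaScript detection to block bots, or that Scrapy isn't handling cookies correctly.

To confuse myself further, here's the output from cURL: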

curl https://clients.mindbodyonline.com/ASP/adm/home.asp?studioid=2260
<head><title>Object moved</title></head>
<body><h1>Object Moved</h1>This object may be found <a HREF="/ASP/ws.asp?studioid=2260">here</a>.</body>

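When I follow the link from the cURL output, it seems to send me into an infinite loop of "Object Moved" links.

Apologies for the length, but I wanted to describe my problem thoroughly. If anyone has a solution or pointers on how to investigate further, I'd value your input. Thanks for taking the time to read through this and help me.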

1 Answer

Using Chrome, I get redirected from https://clients.mindbodyonline.com/ASP/adm/home.asp?studioid=2260 to https://clients.mindbodyonline.com/ASP/home.asp?studioid=2260 (see the explanation in the edit below).

Still using Chrome, View Source on https://clients.mindbodyonline.com/ASP/home.asp?studioid=2260 shows that the page contains a frameset:

<frameset id="mainFrameset" frameborder="0" framespacing="0" NORESIZE>   
  <frame name="mainFrame" src="main_class.asp?tg=&amp;vt=&amp;lvl=&amp;stype=&amp;view=&amp;trn=0&amp;page=&amp;catid=&amp;prodid=&amp;date=7%2F16%2F2013&amp;classid=0&amp;sSU=&amp;optForwardingLink=&amp;qParam=&amp;justloggedin=&amp;nLgIn=&amp;pMode=" frameborder="10"  scrolling="YES" width="320">
</frameset>
<noframes> 
<body style="background-color:#FFFFFF;" text="#000000">
</body>
</noframes> 
</html>

So I think you need to fetch the page that the @src attribute of frame[@name="mainFrame"] points to.
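A minimal sketch of that step, using only the Python 3 standard library rather than Scrapy's selectors: it pulls the src of the mainFrame frame out of a frameset page and resolves it against the page URL. The class FrameSrcParser, the helper frame_url, and the shortened SAMPLE markup are illustrative names and data, not part of the site or of Scrapy.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameSrcParser(HTMLParser):
    """Collect the src attribute of every <frame>, keyed by its name."""

    def __init__(self):
        super().__init__()
        self.frames = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag/attribute names and unescapes
        # entities such as &amp; inside attribute values.
        if tag == "frame":
            attrs = dict(attrs)
            if "src" in attrs:
                self.frames[attrs.get("name")] = attrs["src"]


def frame_url(page_url, html, frame_name):
    """Return the absolute URL that the named frame points to."""
    parser = FrameSrcParser()
    parser.feed(html)
    return urljoin(page_url, parser.frames[frame_name])


# Shortened, hypothetical copy of the frameset markup shown above.
SAMPLE = (
    '<frameset id="mainFrameset">'
    '<frame name="mainFrame" '
    'src="main_class.asp?tg=&amp;date=7%2F16%2F2013&amp;classid=0">'
    '</frameset>'
)
PAGE = "https://clients.mindbodyonline.com/ASP/home.asp?studioid=2260"

print(frame_url(PAGE, SAMPLE, "mainFrame"))
```

In scrapy shell, the resulting URL could then be passed to fetch() to load the frame's content.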

Still in Chrome, View Source on https://clients.mindbodyonline.com/ASP/main_class.asp?tg=&vt=&lvl=&stype=&view=&trn=0&page=&catid=&prodid=&date=7%2F16%2F2013&classid=0&sSU=&optForwardingLink=&qParam=&justloggedin=&nLgIn=&pMode= does indeed contain the <table id="classSchedule-mainTable" class="" cellspacing="0"> you are looking for.


Edit: I tested this in scrapy shell like so (I like to use lxml.etree directly):
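Once that page is in hand, the schedule rows can be reduced to plain cell text. The following is only a sketch with the Python 3 standard library, run against the single row quoted in the question; RowTextParser and SAMPLE_ROW are illustrative names, and the real table may contain header or spacer rows that need extra filtering.

```python
from html.parser import HTMLParser


class RowTextParser(HTMLParser):
    """Collect the text of each <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # list of text chunks while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # &nbsp; arrives as U+00A0; normalise it and collapse whitespace.
            text = "".join(self._cell).replace("\xa0", " ")
            self.rows[-1].append(" ".join(text.split()))
            self._cell = None


# The row quoted in the question, copied as-is.
SAMPLE_ROW = (
    '<tr class="oddRow" style="width: 929px;">'
    '<td style="width: 90px;">&nbsp;&nbsp;&nbsp;4:00&nbsp;pm </td>'
    '<td style="width: 167px;"></td>'
    '<td style="width: 172px;"><a class="modalClassDesc" name="cid617" '
    'href="javascript:;">Vinyasa (Level 1-2)</a></td>'
    '<td style="width: 172px;"><a class="modalBio" name="bio100000375" '
    'href="javascript:;">Dietrich McGaffey</a></td>'
    '<td style="width: 106px;">Main Yoga Room</td>'
    '<td style="width: 162px;">&nbsp;1&nbsp;hour&nbsp;&amp;&nbsp;30&nbsp;minutes</td>'
    '</tr>'
)

parser = RowTextParser()
parser.feed(SAMPLE_ROW)
for row in parser.rows:
    print(row)
```

The same extraction could be written with XPath on Scrapy's selector (for example, selecting the tr elements under the classSchedule-mainTable table); the standard-library version is used here only to keep the example self-contained.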

  import lxml.etree
  import lxml.html
  doc = lxml.etree.fromstring(response.body, parser=lxml.html.HTMLParser())
  print lxml.etree.tostring(doc.xpath('head')[0], pretty_print=True)

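It turns out that the redirect in the browser comes from a little bit of JavaScript (I'm not sure exactly what it does, but it seems to match the behavior):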

    <script type="text/javascript">&#13;
&#13;
function launchHome() {&#13;
    &#13;
            document.wsLaunch.action = "home.asp?studioid=2260";&#13;
        &#13;
        document.wsLaunch.submit();&#13;
    }&#13;
    </script>
  </head>
  <body onload="launchHome();">&#13;

With response.url being:

  response.url
  'https://clients.mindbodyonline.com/ASP/ws.asp?studioid=2260&sessionChecked=true'

you would be redirected to https://clients.mindbodyonline.com/ASP/home.asp?studioid=2260
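Because Scrapy does not execute JavaScript, the onload form submission has to be replayed by hand. The sketch below (standard-library Python 3; FormParser and build_form_post are hypothetical helper names, and SAMPLE_FORM is a shortened copy of the wsLaunch form from the response body) collects the form's action and hidden inputs and builds the POST that launchHome() would trigger. It assumes the server needs nothing beyond these fields and the session cookies, which has not been confirmed.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormParser(HTMLParser):
    """Record action, method and named inputs of one form, chosen by name."""

    def __init__(self, form_name):
        super().__init__()
        self.form_name = form_name
        self.in_form = False
        self.action = None
        self.method = "GET"
        self.fields = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("name") == self.form_name:
            self.in_form = True
            self.action = attrs.get("action")
            self.method = (attrs.get("method") or "GET").upper()
        elif tag == "input" and self.in_form and attrs.get("name"):
            self.fields.append((attrs["name"], attrs.get("value") or ""))

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def build_form_post(page_url, html, form_name):
    """Return (method, absolute_url, urlencoded_body) for the named form."""
    parser = FormParser(form_name)
    parser.feed(html)
    return parser.method, urljoin(page_url, parser.action), urlencode(parser.fields)


# Shortened copy of the form in the response body; most hidden inputs omitted.
SAMPLE_FORM = (
    '<form name="wsLaunch" action="home.asp?studioid=2260" method="post">'
    '<input type="hidden" name="trn" value="0" />'
    '<input type="hidden" name="date" value="7/16/2013" />'
    '<input type="hidden" name="classid" value="0" />'
    '</form>'
)
RESPONSE_URL = (
    "https://clients.mindbodyonline.com/ASP/ws.asp"
    "?studioid=2260&sessionChecked=true"
)

method, url, body = build_form_post(RESPONSE_URL, SAMPLE_FORM, "wsLaunch")
print(method, url)
print(body)
```

Inside a spider, Scrapy's FormRequest.from_response(response, formname="wsLaunch") is intended for this kind of replay and would carry over the cookies set during the earlier redirects; that route is untested against this site.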

answered 2013-07-17T01:38:20.880