0

I am trying to crawl this page https://www.stickyguide.com/dispensaries/leaf-lab/ using scrapy. I am now having trouble crawling reviews from this page for a long time. If any one has any experience dealing with Ajax or Javascript, please share your thoughts.

1) I can easily get the Xpath for the review:

response.xpath('//*[@id="reviews_section"]/div')    

However, I believe the review part of the page is loaded by javascript. Every time when I crawled this page, I got the following value of Xpath:

<Selector xpath='//*[@id="reviews_section"]/div' data=u'<div id="loader">\n<div class="loader"></'>

If there any method I can use to assure that scrapy crawls before javascript has been loaded? When I looked up the method online, using selenium package may be a solution, but it may be not efficient.

2) Another problem I met is that I only want to crawl the data from dispensaries. I need to choose the option "VIEW: Dispensary Only" from the dropdown menu next to the Review module. I took a look at the HTML code and it tends out to be an Ajax object.

<select id="sort" name="sort" onchange="new Ajax.Request('/update_reviews_section/2487', {asynchronous:true, evalScripts:true, parameters:'sort_by=' + $('sort').value + '&amp;authenticity_token=' + encodeURIComponent('69BgJpnnj0tx/0lYwjIk75iFj0/l2R9EDj1No1FJX9o=')})">

If there any method I can use to request the content of the option "VIEW: Dispensary Only"? I have tried a lot of methods on stackoverflow but I still can't work this out.

Thank you in advance

4

1 回答 1

0

您需要打开您的开发工具 (F12),查找这段 html 的加载位置,然后请求获取它。我可以看到评论已加载此请求(POST 到 /update_review_section,表单正文中包含三个键和值),请记住在开发工具中保留复选框“preserve_log”,这将允许您查看页面加载时发生的情况。在屏幕截图中查看突出显示的请求

在此处输入图像描述

于 2014-11-17T18:30:43.417 回答