I ran a very small test:

var page = require('webpage').create()
  , fs   = require('fs');

page.open("http://www.google.it/search?q=web+design", function(status){

    if (status === 'success')
    {
        page.render('google.png');
        fs.write("source.html", page.content, 'w'); 
    }

    phantom.exit(); 
})

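As you can see, I am searching google.it for "web design".

Now, looking at source.html, I notice differences between the source generated by PhantomJS and the real HTML (as shown by Chrome's element inspector).

In my source, a result has the following code: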

<li class="g">
   <h3 class="r"><a href="/url?q=http://www.html.it/web-design/&amp;sa=U&amp;ei=Z2LZUbSaBcGV7Abm54BI&amp;ved=0CCwQFjAB&amp;usg=AFQjCNGagkxLs36cXSzGjyhnBX7duCI6dA"><b>WebDesign</b> - Guide e approfondimenti per webdesigner - HTML.it</a></h3>
   <div class="s">
      <div class="kv" style="margin-bottom:2px"><cite>www.html.it/<b>web</b>-<b>design</b>/</cite><span class="flc"> - <a href="/url?q=http://webcache.googleusercontent.com/search%3Fq%3Dcache:3GWnT4NPDr0J:http://www.html.it/web-design/%252Bweb%2Bdesign%26hl%3Dit%26ct%3Dclnk&amp;sa=U&amp;ei=Z2LZUbSaBcGV7Abm54BI&amp;ved=0CC0QIDAB&amp;usg=AFQjCNE_1Gt5RL9WQAGZpM_3f-oxZ1VR9w">Copia cache</a></span></div>
      <span class="st">WebDesign: progettazione Web, User Experience, Architettura dell'informazione, <br>  i consigli di esperti designer in guide e articoli di approfondimento in italiano.</span><br>
   </div>
</li>

But the real source (read through Chrome's element inspector) is:

<li class="g">
   <!--m-->
   <div data-hveid="55" class="rc">
      <span style="float:left"></span>
      <h3 class="r"><a href="/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=2&amp;cad=rja&amp;ved=0CDgQFjAB&amp;url=http%3A%2F%2Fwww.html.it%2Fweb-design%2F&amp;ei=wmTZUfHdOYSO7AagwIHwDw&amp;usg=AFQjCNFaDZWWczDbce8TlYh9oqYluJ-E5g&amp;bvm=bv.48705608,d.ZGU" onmousedown="return rwt(this,'','','','2','AFQjCNFaDZWWczDbce8TlYh9oqYluJ-E5g','','0CDgQFjAB','','',event)"><em>WebDesign</em> - Guide e approfondimenti per webdesigner - HTML.it</a></h3>
      <div class="s">
         <div>
            <div class="f kv" style="white-space:nowrap">
               <cite>www.html.it/<b>web</b>-<b>design</b>/</cite>‎
               <div class="action-menu ab_ctl">
                  <a href="#" data-ved="0CDkQ7B0wAQ" class="clickable-dropdown-arrow ab_button" id="am-b1" aria-label="Dettagli risultato" jsaction="ab.tdd; keydown:ab.hbke; keypress:ab.mskpe" role="button" aria-haspopup="true" aria-expanded="false"><span class="mn-dwn-arw"></span></a>
                  <div data-ved="0CDoQqR8wAQ" class="action-menu-panel ab_dropdown" jsaction="keydown:ab.hdke; mouseover:ab.hdhne; mouseout:ab.hdhue" role="menu" tabindex="-1">
                     <ul>
                        <li class="action-menu-item ab_dropdownitem" role="menuitem"><a href="http://webcache.googleusercontent.com/search?q=cache:3GWnT4NPDr0J:www.html.it/web-design/+&amp;cd=2&amp;hl=it&amp;ct=clnk&amp;gl=it&amp;client=ubuntu" onmousedown="return rwt(this,'','','','2','AFQjCNEaothLaL83HBobw4UE8q_OpkIPrw','','0CDsQIDAB','','',event)" class="fl">Copia&nbsp;cache</a></li>
                     </ul>
                  </div>
               </div>
            </div>
            <div class="f slp"></div>
            <span class="st"><em>WebDesign</em>: progettazione Web, User Experience, Architettura dell'informazione, i consigli di esperti designer in guide e articoli di approfondimento in italiano.</span>
         </div>
      </div>
   </div>
   <!--n-->
</li>

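As you can see, the second snippet is more complete.

So my question is:

Why do these results have different code?

I read that PhantomJS executes all the JS in the page just like a browser, so why are there these differences?

Thanks!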

2 Answers

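Because PhantomJS sends a different user agent, and Google serves different markup depending on it. If you change the user agent to Google Chrome's, you should receive the same result as Google Chrome.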

You can change the user agent through the page.settings.userAgent property.
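
As a minimal sketch of that suggestion, the script from the question could set the property before calling page.open. The applyUserAgent helper and the CHROME_UA value are illustrative assumptions, not part of PhantomJS; any recent Chrome user agent string could be substituted, and whether Google then returns identical markup is not guaranteed.

```javascript
// Example Chrome-style user agent string (an assumption for illustration;
// substitute the string reported by your own browser).
var CHROME_UA = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 ' +
                '(KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36';

// Hypothetical helper: sets the user agent on a PhantomJS page object
// and returns the same object.
function applyUserAgent(page, userAgent) {
    page.settings.userAgent = userAgent;
    return page;
}

// Only run the scraping part when executed by PhantomJS.
if (typeof phantom !== 'undefined') {
    var fs = require('fs');
    var page = applyUserAgent(require('webpage').create(), CHROME_UA);

    page.open('http://www.google.it/search?q=web+design', function (status) {
        if (status === 'success') {
            page.render('google.png');
            fs.write('source.html', page.content, 'w');
        }
        phantom.exit();
    });
}
```

The user agent has to be set before page.open is called, since it is sent with the initial request.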

answered 2013-07-07T13:27:04.170

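Maybe try waiting until all the DOM transformations made by Google's JS code have been executed. For example, this can be done by waiting for the .action-menu element to become available (disclaimer: as the author of casperjs, I'm using casperjs here):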

var fs = require('fs');

require('casper').create()
    .start("http://www.google.it/search?q=web+design")
    .waitForSelector(".action-menu", function() {
        this.capture('google.png');
        fs.write("source.html", this.getPageContent(), 'w'); 
    }).run();
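
For readers who want the same idea without casperjs, here is a rough sketch in plain PhantomJS that polls the page until the selector appears. The waitFor helper, the 5000 ms timeout, and the 100 ms polling interval are assumptions made for illustration; they are not casperjs's implementation or a PhantomJS built-in.

```javascript
// Hypothetical polling helper: calls test() every intervalMs milliseconds
// until it returns true (then onReady) or timeoutMs elapses (then onTimeout).
function waitFor(test, onReady, onTimeout, timeoutMs, intervalMs) {
    var start = Date.now();
    var timer = setInterval(function () {
        if (test()) {
            clearInterval(timer);
            onReady();
        } else if (Date.now() - start >= timeoutMs) {
            clearInterval(timer);
            onTimeout();
        }
    }, intervalMs);
}

// Only run the scraping part when executed by PhantomJS.
if (typeof phantom !== 'undefined') {
    var fs = require('fs');
    var page = require('webpage').create();

    page.open('http://www.google.it/search?q=web+design', function (status) {
        if (status !== 'success') {
            phantom.exit(1);
            return;
        }
        waitFor(function () {
            // Runs inside the page: true once Google's JS has added the menu.
            return page.evaluate(function () {
                return document.querySelector('.action-menu') !== null;
            });
        }, function () {
            page.render('google.png');
            fs.write('source.html', page.content, 'w');
            phantom.exit();
        }, function () {
            // Assumed behaviour on timeout: give up with a non-zero exit code.
            phantom.exit(1);
        }, 5000, 100);
    });
}
```

If the .action-menu element never appears (for instance because Google served the simpler markup for the default user agent), this sketch exits on timeout without writing any files.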
answered 2013-07-07T14:30:29.230