node.js - Saving the HTML response of a GET request in Node.js

Question

Apparently I'm newer to JavaScript than I'd care to admit. I'm trying to fetch a web page with Node.js and save its contents in a variable, so I can parse it however I like.

In Python, I would do this:

from bs4 import BeautifulSoup # for parsing
import urllib

text = urllib.urlopen("http://www.myawesomepage.com/").read()

parse_my_awesome_html(text)

How would I do this in Node? I've gotten this far:

var request = require("request");
request("http://www.myawesomepage.com/", function (error, response, body) {
    /*
     Something here that lets me access the text
     outside of the closure

     This doesn't work:
     this.text = body;
    */ 
})

score 10 · Accepted Answer

var request = require("request");

var parseMyAwesomeHtml = function(html) {
    //Have at it
};

request("http://www.myawesomepage.com/", function (error, response, body) {
    if (!error) {
        parseMyAwesomeHtml(body);
    } else {
        console.log(error);
    }
});

Edit: As Kishore pointed out, there are good parsing options available. If you run into python/gyp problems with jsdom on Windows, see Cheerio as well. Cheerio on GitHub
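If you would rather receive the HTML as a value than nest everything inside the callback, one option is to wrap the callback-style call in a Promise and await it. The following is only a sketch under assumptions: toPromise, stubRequest, fetchHtml and main are hypothetical names that do not appear in the question, and stubRequest is a stand-in that mimics request's (error, response, body) callback so the snippet runs without a network connection.

```javascript
// Wrap any function with request's (url, callback) shape so it returns a Promise.
// toPromise is a hypothetical helper name, not part of the request library.
function toPromise(requestFn) {
    return function (url) {
        return new Promise(function (resolve, reject) {
            requestFn(url, function (error, response, body) {
                if (error) {
                    reject(error);   // surface transport errors to the caller
                } else {
                    resolve(body);   // hand the HTML text back as the result
                }
            });
        });
    };
}

// Hypothetical stub so the sketch runs offline; in real code you would
// pass require("request") to toPromise instead.
function stubRequest(url, callback) {
    setImmediate(function () {
        callback(null, { statusCode: 200 }, "<html><body>hi</body></html>");
    });
}

var fetchHtml = toPromise(stubRequest);

async function main() {
    // Inside an async function the body can be used like an ordinary value.
    var html = await fetchHtml("http://www.myawesomepage.com/");
    return html.length; // parse here instead, e.g. parseMyAwesomeHtml(html)
}

main().then(function (length) {
    console.log("got " + length + " characters");
});
```

The text is still only available after the request completes; await just lets the code that uses it read top to bottom.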

score 3 · Accepted Answer

The request() call is asynchronous, so the response is only available inside the callback. You have to call your parse function from there:

function parse_my_awesome_html(text){
    ...
}

request("http://www.myawesomepage.com/", function (error, response, body) {
    parse_my_awesome_html(body)
})

Get used to chaining callbacks; that is essentially how any I/O happens in JavaScript :)
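To see why assigning the body to an outer variable appears not to work, it can help to watch the order in which things run. This is a minimal sketch, not the request library itself: fakeRequest is a hypothetical stand-in that calls back on a later turn of the event loop, and text and order are illustrative names.

```javascript
// Hypothetical stand-in for request(): it invokes the callback on a later
// turn of the event loop, the way a real network call would.
function fakeRequest(url, callback) {
    setImmediate(function () {
        callback(null, { statusCode: 200 }, "<html><body>hi</body></html>");
    });
}

var text;        // stays undefined until the callback has run
var order = [];  // records the order in which things happen

fakeRequest("http://www.myawesomepage.com/", function (error, response, body) {
    text = body; // the assignment does work, but only at this later point
    order.push("callback ran");
    console.log(order.join(", ")); // prints "line after the call ran, callback ran"
});

order.push("line after the call ran");
console.log(typeof text); // prints "undefined": the callback has not fired yet
```

The assignment itself succeeds; the problem is that any code placed after the call runs before the callback does, which is why the parsing has to start from inside the callback.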

score 2 · Accepted Answer

If you want to parse the response, jsdom handles this kind of thing well.

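var request = require('request'),
    jsdom = require('jsdom');

request({ uri:'http://www.myawesomepage.com/' }, function (error, response, body) {
  // Stop on a transport error or a non-200 status.
  // response is undefined when error is set, so check error first.
  if (error || response.statusCode !== 200) {
    console.log('Error when contacting myawesomepage.com');
    return;
  }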

  jsdom.env({
    html: body,
    scripts: [
      'http://code.jquery.com/jquery-1.5.min.js'
    ]
  }, function (err, window) {
    var $ = window.jQuery;

    // jQuery is now loaded on the jsdom window created from 'body'
    console.log($('body').html());
  });
});

Also, if your page loads a lot of its content through JavaScript/AJAX, you may want to consider PhantomJS. Source: http://blog.nodejitsu.com/jsdom-jquery-in-5-lines-on-nodejs/
