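I'm trying to scrape data from a page using cheerio and request, in the following way:
- 1) Go to url 1a ( http://example.com/0 )
- 2) Extract url 1b ( http://example2.com/52 )
- 3) Go to url 1b
- 4) Extract some data and save it
- 5) Go to url 1a+1 ( http://example.com/1 , let's call it 2a)
- 6) Extract url 2b ( http://example2.com/693 )
- 7) Go to url 2b
- 8) Extract some data and save it, and so on.
I'm struggling to work out how to do this. (Note: I'm only familiar with Node.js and cheerio/request for this task, even though it may not be elegant, so I'm not looking for alternative libraries or languages to do this, sorry.) I think I'm missing something, because I can't even see how this could work.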
EDIT
Let me try this another way. Here is the first part of the code:
var request = require('request'),
    cheerio = require('cheerio');

// First request: one search result per page (n=1), starting at offset s=0.
request('http://api.trove.nla.gov.au/result?key=6k6oagt6ott4ohno&zone=book&l-advformat=Thesis&sortby=dateDesc&q=+date%3A[2000+TO+2014]&l-availability=y&l-australian=y&n=1&s=0', function (error, response, html) {
  if (!error && response.statusCode == 200) {
    // Parse the response as XML rather than HTML.
    var $ = cheerio.load(html, { xmlMode: true });
    var id = $('work').attr('id');          // id of the work on this page
    var total = $('records').attr('total'); // total number of results (the element is <records>)
  }
});
The first page returned looks like this:
<response>
<query>date:[2000 TO 2014]</query>
<zone name="book">
<records s="0" n="1" total="69977" next="/result?l-advformat=Thesis&sortby=dateDesc&q=+date%3A%5B2000+TO+2014%5D&l-availability=y&l-australian=y&n=1&zone=book&s=1">
<work id="189231549" url="/work/189231549">
<troveUrl>http://trove.nla.gov.au/work/189231549</troveUrl>
<title>
Design of physiological control and magnetic levitation systems for a total artificial heart
</title>
<contributor>Greatrex, Nicholas Anthony</contributor>
<issued>2014</issued>
<type>Thesis</type>
<holdingsCount>1</holdingsCount>
<versionCount>1</versionCount>
<relevance score="0.001961126">vaguely relevant</relevance>
<identifier type="url" linktype="fulltext">http://eprints.qut.edu.au/65642/</identifier>
</work>
</records>
</zone>
</response>
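The URL above needs to be requested 'total' times, with s incremented each time (s=0, s=1, and so on). The 'id' from each response then needs to be inserted into the URL of a second request: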
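// Second request: fetch the full record for the id found by the first request.
// The URL is built by concatenation, so each string piece needs matching quotes.
request('http://api.trove.nla.gov.au/work/' + id + '?key=6k6oagt6ott4ohno&reclevel=full', function (error, response, html) {
  if (!error && response.statusCode == 200) {
    var $ = cheerio.load(html, { xmlMode: true });
    // extract data here etc.
  }
});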
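Since the two request URLs differ only in the offset s and the work id, they could be built by small helper functions. This is only a minimal sketch: the names API_BASE, API_KEY, searchUrl and workUrl are made up for illustration and are not part of the Trove API, and the query parameters are copied from the two requests above.

```javascript
// Hypothetical helpers; the names are illustrative only.
var API_BASE = 'http://api.trove.nla.gov.au';
var API_KEY = '6k6oagt6ott4ohno'; // key copied from the requests above

// URL of the search page at offset s (one record per page, n=1).
function searchUrl(s) {
  return API_BASE + '/result?key=' + API_KEY +
    '&zone=book&l-advformat=Thesis&sortby=dateDesc' +
    '&q=+date%3A[2000+TO+2014]&l-availability=y&l-australian=y&n=1&s=' + s;
}

// URL of the full record for a work id taken from a search page.
function workUrl(id) {
  return API_BASE + '/work/' + id + '?key=' + API_KEY + '&reclevel=full';
}
```

With these, searchUrl(0) reproduces the first request's URL and workUrl(id) the second's.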
For example, when using id="189231549" returned by the first request, the second page returned looks like this:
<work id="189231549" url="/work/189231549">
<troveUrl>http://trove.nla.gov.au/work/189231549</troveUrl>
<title>
Design of physiological control and magnetic levitation systems for a total artificial heart
</title>
<contributor>Greatrex, Nicholas Anthony</contributor>
<issued>2014</issued>
<type>Thesis</type>
<subject>Total Artificial Heart</subject>
<subject>Magnetic Levitation</subject>
<subject>Physiological Control</subject>
<abstract>
Total Artificial Hearts are mechanical pumps which can be used to replace the failing natural heart. This novel study developed a means of controlling a new design of pump to reproduce physiological flow bringing closer the realisation of a practical artificial heart. Using a mathematical model of the device, an optimisation algorithm was used to determine the best configuration for the magnetic levitation system of the pump. The prototype device was constructed and tested in a mock circulation loop. A physiological controller was designed to replicate the Frank-Starling like balancing behaviour of the natural heart. The device and controller provided sufficient support for a human patient while also demonstrating good response to various physiological conditions and events. This novel work brings the design of a practical artificial heart closer to realisation.
</abstract>
<language>English</language>
<holdingsCount>1</holdingsCount>
<versionCount>1</versionCount>
<tagCount>0</tagCount>
<commentCount>0</commentCount>
<listCount>0</listCount>
<identifier type="url" linktype="fulltext">http://eprints.qut.edu.au/65642/</identifier>
</work>
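So my question now is: how do I tie these two parts (loops) together to get the result (downloading and parsing roughly 70,000 pages)?
I don't know how to code this in JavaScript for Node.js. I'm new to JavaScript.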
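To make the question concrete, here is a rough sketch of the control flow I think I need: fetch search page s, take the id from it, fetch that work, save the data, then move on to s+1. Every name in it is hypothetical. fetchXml(url, callback) stands for a wrapper around request, parseSearch and parseWork stand for the cheerio extraction, and urls is an object with search(s) and work(id) functions that build the two URLs. The limit argument is only there so a trial run can stop early.

```javascript
// Control-flow sketch only; every function passed in is a hypothetical stand-in:
//   fetchXml(url, cb)  calls cb(err, body), e.g. a thin wrapper around request
//   parseSearch(xml)   returns { id: ..., total: ... } from a search page
//   parseWork(xml)     returns the data to keep for one work
//   urls.search(s), urls.work(id) build the two request URLs
function scrapeAll(fetchXml, parseSearch, parseWork, urls, limit, done) {
  var results = [];

  // Handle search page s, then its work page, then move on to s + 1.
  function step(s, total) {
    if (s >= total || s >= limit) return done(null, results);
    fetchXml(urls.search(s), function (err, searchXml) {
      if (err) return done(err);
      var page = parseSearch(searchXml);
      fetchXml(urls.work(page.id), function (err2, workXml) {
        if (err2) return done(err2);
        results.push(parseWork(workXml));
        // The next page is requested only after this one has finished.
        step(s + 1, Number(page.total));
      });
    });
  }

  // total is unknown until the first search page has been parsed.
  step(0, Infinity);
}
```

Each page is requested only after the previous one completes, so no more than one request is in flight at a time. With a real asynchronous fetchXml each callback starts on a fresh call stack; a synchronous stand-in would nest one level deeper per page, which is fine for a short trial but not for 70,000 pages.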