
I'm trying to parse some HTML pages:

<div class="gs_r"><h3 class="gs_rt"><span class="gs_ctc">[BOOK]</span> <a href="http://exampleA.com" onmousedown="return scife_clk(this.href,'','res','1')">titleA</a></h3><div class="gs_ggs gs_fl"><a href="http://exampleApdf.pdf" onmousedown="return scife_clk(this.href,'gga','gga','1')">
<div class="gs_r"><h3 class="gs_rt"><span class="gs_ctc">[BOOK]</span> <a href="http://exampleB.com" onmousedown="return scife_clk(this.href,'','res','1')">titleB</a></h3><div class="gs_ggs gs_fl"><a href="http://exampleB.doc" onmousedown="return scife_clk(this.href,'gga','gga','1')">

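From that HTML page we can extract the page links (http://exampleA.com, http://exampleB.com), the titles (titleA, titleB), and the document links (http://exampleApdf.pdf, http://exampleB.doc). However, I only want the information for documents that have a PDF link. So from this example I only want: http://exampleA.com, titleA, http://exampleApdf.pdf. My attempt gives me an empty result. How can I get them? Thank you! :) Here is the code: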

<?php

include 'simple_html_dom.php';
$url = 'http://scholar.google.com/scholar?hl=en&q=data+mining&btnG=&as_sdt=1%2C5&as_sdtp=';
$html = file_get_html($url);
foreach($html->find('div[class=gs_ggs gs_fl]')as $pdfLink){
    if (preg_match('/\.pdf$/i', $pdfLink)) {
       $html2->find('span[class=gs_ctc]');
       echo $html2.$pdfLink;
    }
 }

?>

1 Answer


You cannot determine from the URL what kind of resource will be returned.

Not everyone serves up PDF files with .pdf extensions. Not all web services reveal the file names of files on disk. Only the Content-Type HTTP response header should be used for determining the type of the resource.

You can get this efficiently by doing a HEAD request for each URL you find.
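
As a rough illustration of the HEAD-request approach, here is a minimal sketch using PHP's cURL extension. The helper names `isPdfContentType` and `fetchContentType` are hypothetical, not part of simple_html_dom or cURL, and the sketch assumes the server answers HEAD requests and reports PDFs as `application/pdf`; some servers do neither, so treat it as a starting point.

```php
<?php
// Hypothetical helper: true when a Content-Type header value denotes a PDF.
function isPdfContentType(?string $contentType): bool
{
    if ($contentType === null || $contentType === '') {
        return false;
    }
    // Drop parameters such as "; charset=binary" and normalise case.
    $mediaType = strtolower(trim(explode(';', $contentType)[0]));
    return $mediaType === 'application/pdf';
}

// Hypothetical helper: sends a HEAD request and returns the reported
// Content-Type, or null if the request failed or no type was sent.
function fetchContentType(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true, // HEAD request: headers only, no body
        CURLOPT_FOLLOWLOCATION => true, // follow redirects to the final resource
        CURLOPT_RETURNTRANSFER => true, // do not echo anything
        CURLOPT_TIMEOUT        => 10,   // assumed timeout in seconds
    ]);
    $ok = curl_exec($ch);
    $type = ($ok === false) ? null : curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    return is_string($type) ? $type : null;
}
```

In the question's loop, one could take the `href` of the `a` element inside each `div.gs_ggs` and keep the result only when `isPdfContentType(fetchContentType($href))` returns true.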

answered 2012-07-18T01:19:04.753