I am trying to parse this piece of HTML with the Symfony DomCrawler component.

<div class="chunk">

<h4><img src="./handler_image.php?i=fbcaab25f6277c5e73b5ff8e5038211f" alt="Favicon" class="favicon" />
<a href="http://feedproxy.google.com/~r/TheNextWeb/~3/QgVH5ADY3nE/">Russia is building its own mobile OS based on Jolla’s Sailfish</a>&nbsp;<span class="footnote">18 May 2015, 6:26 pm</span>
</h4>
<img width="520" height="245" src="http://cdn1.tnwcdn.com/wp-content/520x245.jpg" alt="P1040853-730x547" title="Russia is building its own mobile OS based on Jolla's Sailfish" data-id="719843" />
<br />Some Description here.<br /><br />
<a href="article link">This story continues</a> at The Next Web
<div>
    <a href="http://feeds.feedburner.com/"><img src="http://feeds.feedburner.com/" border="0"></img>
    </a>
    <a href="http://feeds.feedburner.com/"><img src="http://feeds.feedburner.com/" border="0"></img>
    </a>
    <a href="http://feeds.feedburner.com/"><img src="http://feeds.feedburner.com/" border="0"></img>
    </a>
</div>
<img src="http://feeds.feedburner.com/" height="1" width="1" alt="" />
<div align="center">
    <p>
        <a href="http://cdn1.tnwcdn.com/wp-content/gitl.jpg" class="download"><img src="/images/mini_podcast.png" class="download" border="0" title="Download the Podcast (jpg; 0 MB)" />
        </a>
    </p>
    <p class="footnote" align="center">(image/jpeg; 0 MB)</p>
</div>
<p class="footnote favicons" align="center">
    <a href="share link here!" title="Title here!"><img src="blinklist.png" alt="Blinklist" /></a>
    8 more link sharing services in same format.
</p>

I only want to extract this part:

<br />Some Description here.<br /><br />
<a href="article link">This story continues</a> at The Next Web

I have tried a number of things, such as:

$crawler->filter('#sp_results .chunk > div')->each(function ($node, $i) use (&$divs) {
    $divs[] = $node->html();
});

$crawler->filter('#sp_results .chunk .favicons')->each(function ($node, $i) use (&$footnote) {
    $footnote[] = $node->html();
});

$crawler->filter('#sp_results .chunk')->each(function ($node, $i) use (&$answer, &$divs, &$footnote) {
    $html = str_replace($footnote[$i], '', $node->html());
    $html = str_replace($divs[2 * $i], '', $node->html());
    $html = str_replace($divs[2 * $i + 1], '', $node->html());
    $answer[] = $html;
});

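Is there a way to get just the part I described above? Nothing I have written so far works: either I end up with extra images and links at the bottom, or I lose part of the section I want, such as the article link.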

Any help would be appreciated.
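
To make the goal concrete, here is a minimal sketch of one possible direction, not a confirmed solution. It uses PHP's built-in DOMDocument and DOMXPath instead of DomCrawler, and it assumes that the wanted fragment always sits between the first `<img>` that is a direct child of `.chunk` and the first `<div>` that is a direct child. The sample markup and the function name `extractDescription` are hypothetical, shortened stand-ins for the real page.

```php
<?php
// Hypothetical, shortened sample of a single .chunk block.
$html = <<<'HTML'
<div class="chunk">
<h4><img src="favicon.png" alt="Favicon" class="favicon" />
<a href="http://example.com/">Title</a></h4>
<img src="lead.jpg" alt="lead" />
<br />Some Description here.<br /><br />
<a href="http://example.com/article">This story continues</a> at The Next Web
<div><a href="http://feeds.feedburner.com/">share</a></div>
<p class="footnote favicons">links</p>
</div>
HTML;

/**
 * Returns, for each .chunk, the markup between the first direct-child <img>
 * and the first direct-child <div> (an assumption about the page layout).
 */
function extractDescription(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $query = '//div[contains(concat(" ", normalize-space(@class), " "), " chunk ")]';

    $results = [];
    foreach ($xpath->query($query) as $chunk) {
        $collecting = false;
        $fragment = '';
        foreach ($chunk->childNodes as $child) {
            if (!$collecting) {
                // The favicon <img> is nested in <h4>, so it is not matched here.
                if ($child->nodeName === 'img') {
                    $collecting = true;
                }
                continue;
            }
            if ($child->nodeName === 'div') {
                break; // the share links and everything after them are dropped
            }
            // Serializes elements and text nodes alike.
            $fragment .= $doc->saveHTML($child);
        }
        $results[] = trim($fragment);
    }
    return $results;
}

print_r(extractDescription($html));
```

Walking the direct children avoids the `str_replace` bookkeeping entirely, but it only holds as long as every chunk follows the same layout. Note that `saveHTML` re-serializes the nodes, so `<br />` may come back as `<br>`.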
