2

How can I implement this solution here with DomCrawler?

<?php
use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler();
$content = file_get_contents('http://example.com/somepage.html');
$crawler->addHtmlContent($content, 'UTF-8');
$content = $crawler->filter('#main-content');

// Remove content by tag and by css selector.

?>
4 Answers

6
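    use Symfony\Component\CssSelector\CssSelector;
    use Symfony\Component\DomCrawler\Crawler;

    // $html, $url and $selectorsToRemove (an array of CSS selectors) are assumed to be defined.
    $crawler = new Crawler($html, $url);

    // Copy the crawler's first node into a fresh document under a temporary root element.
    $document = new \DOMDocument('1.0', 'UTF-8');
    $root = $document->appendChild($document->createElement('_root'));
    $crawler->rewind();
    $root->appendChild($document->importNode($crawler->current(), true));
    $domxpath = new \DOMXPath($document);

    foreach ($selectorsToRemove as $selector) {
        // CssSelector::toXPath() is the static Symfony 2.x API.
        $crawlerInverse = $domxpath->query(CssSelector::toXPath($selector));
        foreach ($crawlerInverse as $elementToRemove) {
            $parent = $elementToRemove->parentNode;
            $parent->removeChild($elementToRemove);
        }
    }

    // Replace the crawler's contents with the cleaned document.
    $crawler->clear();
    $crawler->add($document);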
answered 2013-06-14T16:35:09.207
1

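The Crawler class extends \SplObjectStorage. When the Crawler receives HTML, it uses the attach() method to add each element to the storage.

This means a detach() method is also available on the crawler object. I have not tested the following, but I think it might do the job.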

$crawlerInverse = $crawler->filter('script');

foreach ($crawlerInverse as $elementToRemove) {
    if ($crawler->contains($elementToRemove)) {
        $crawler->detach($elementToRemove);
    }
}
answered 2013-06-12T19:08:35.040
1

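As stated in the documentation:

The DomCrawler component eases DOM navigation for HTML and XML documents.

And:

While possible, the DomCrawler component is not designed for manipulation of the DOM or re-dumping HTML/XML.

DomCrawler is designed to extract details from DOM documents rather than to modify them.

However...

Since PHP passes objects by reference, and the Crawler is basically a wrapper around DOMNodes, it is technically possible to modify the underlying DOM document: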

// Removes every div with the class "toRemove".
// each() hands the closure a Crawler wrapping a single matched node.
$crawler->filter('div.toRemove')->each(function (Crawler $crawler) {
    foreach ($crawler as $node) {
        $node->parentNode->removeChild($node);
    }
});

Here is a working example: https://gist.github.com/jakzal/8dd52d3df9a49c1e5922
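
The removal itself does not depend on the Crawler, so it can be tried with PHP's bundled DOM extension alone. The following is only a sketch under stated assumptions: the markup is invented for illustration, and the two hand-written XPath expressions stand in for what a CSS-to-XPath converter would produce for `div.toRemove` and `script`.

```php
<?php
// Sketch using only ext-dom; the markup below is made up for illustration.
$html = '<html><body><div id="main-content"><p>Keep me</p>'
      . '<div class="toRemove">Drop me</div><script>var x = 1;</script></div></body></html>';

$document = new DOMDocument();
$document->loadHTML($html);

$xpath = new DOMXPath($document);

// Hand-written XPath equivalents of the CSS selectors "div.toRemove" and "script".
$expressions = [
    '//div[contains(concat(" ", normalize-space(@class), " "), " toRemove ")]',
    '//script',
];

foreach ($expressions as $expression) {
    // Copy the matches into an array first so that removal cannot disturb iteration.
    $matches = iterator_to_array($xpath->query($expression));
    foreach ($matches as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }
}

// Serialize only the #main-content element after the cleanup.
$main = $xpath->query('//*[@id="main-content"]')->item(0);
$cleaned = $document->saveHTML($main);
echo $cleaned, PHP_EOL;
```

Because the nodes are removed from the DOMDocument itself, anything that still wraps the same document sees the change as well.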

answered 2015-04-01T21:22:42.150
0

Use a common function such as:

function removeCrawlerNode($crawler_node) {

    foreach($crawler_node as $node) {
        $node->parentNode->removeChild($node);
    }
}

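Then find the part of the crawled document you want to search in (for example, the class .sample_section) and, if it exists, create a remove_tag_array containing all the tags you want to remove: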

if($crawler->filter('.sample_section')->count() > 0) {

    $remove_tag_array = array("br", "b", "img", "div", "u", "i");

    $sub_crawler = $crawler->filter('.sample_section');

    foreach ($remove_tag_array as $tag) {
        $sub_crawler->filter($tag)->each(function ($node) {
            removeCrawlerNode($node);
        });
    }
}
answered 2015-06-01T19:41:26.103