php - How to get specific data from a scraped web page in PHP

Question

Possible duplicate:
How do you parse and process HTML with PHP?

Hello, I have scraped a web page:

  <div class="col blue">
        <img  src="/en/media/Dentalscreenresized.jpg" />
        <h4>This is line i want to scrape</h4>
        <p class="date">12 Sep
            <span class="year">2012</span></p>
        <p>13 people were diagnosed with oral cancer after last year&rsquo;s Mouth Cancer Awareness Day. Ring 021-4901169 to arrange for a free screening on the 19th September.</p>
        <p class="readmore"><a href="/en/news/abcd.html">Read More</a></p>
        <p class="rightreadmore"><a href="http://www.xyz.ie/en/news/">See all News&nbsp;&nbsp;&nbsp;</a></p>
    </div>

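Now I want to display the <h4> tag inside the div with class="col blue". I read online that preg_match_all() can be used for this, but I am not familiar with regular expressions. Please help.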

score 1 · Accepted Answer

Don't use regular expressions to parse HTML. Use libraries and purpose-built solutions instead. You can find plenty of "don't use regex" answers out there.

I recommend Simple HTML DOM.

    <?php
    // Load the library first, or str_get_html() will trigger a fatal error.
    // e.g. include('simple_html_dom.php'); // assuming the file is in the same folder as this script
    $html = str_get_html('<div class="col blue">
            <img  src="/en/media/Dentalscreenresized.jpg" />
            <h4>This is line i want to scrape</h4>
            <p class="date">12 Sep
                <span class="year">2012</span></p>
            <p>13 people were diagnosed with oral cancer after last year&rsquo;s Mouth Cancer Awareness Day. Ring 021-4901169 to arrange for a free screening on the 19th September.</p>
            <p class="readmore"><a href="/en/news/abcd.html">Read More</a></p>
            <p class="rightreadmore"><a href="http://www.xyz.ie/en/news/">See all News&nbsp;&nbsp;&nbsp;</a></p>
        </div>
    ');

    // Select every <h4> nested inside a <div> that has both the "col" and "blue" classes.
    // If your version of the library does not accept chained class selectors,
    // 'div.blue h4' is a simpler alternative.
    $h4 = $html->find('div.col.blue h4');
    ?>

Now $h4 contains all <h4> elements found inside divs with the col and blue classes.

score 1 · Accepted Answer

Well, as usual in life, you have two choices (I assume the content of the scraped page is stored in the $content variable):

The ~~(Cthulhu)~~ regex way:

$matches = array();
preg_match_all('#<div class="col blue">.+?<h4>([^<]+)#is', $content, $matches);
var_dump($matches[1]);

The DOM parsing way:

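libxml_use_internal_errors(true); // scraped HTML is rarely valid; collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($content);
$xpath = new DOMXPath($dom);
// Select every <h4> that is a direct child of <div class="col blue">.
$elements = $xpath->query('//div[@class="col blue"]/h4');
foreach ($elements as $el) {
    var_dump($el->textContent);
}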

Of course, the real question is which way to choose.

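The first option is short, concise, and altogether very tempting. I admit I would use it once, twice, or (pony he comes) even more often, if and only if I knew that the HTML I work with would always be normalized and I could cope with its structure suddenly changing in unpredictable ways.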

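The second option is somewhat longer and may look too generic. In my opinion, though, it is more flexible and more resilient to changes in the source.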

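For example, consider what happens if some of the "blue" divs in the source HTML have no <h4> element at all. To keep working in that case, the regex has to become more complicated. And the XPath query? It does not change, not even a bit.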
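To make that point concrete, here is a small sketch that runs both approaches from this answer against a made-up page. The sample markup (including the "col green" block) is an assumption for illustration only, not taken from the question. The first blue div has no <h4>, so the lazy `.+?` in the regex runs on until it reaches the heading of an unrelated block, while the XPath query returns only the heading that really sits inside a blue div.

```php
<?php
// Hypothetical sample page: the first "col blue" div has no <h4>.
$content = <<<HTML
<div class="col blue">
    <p>No heading in this block.</p>
</div>
<div class="col green">
    <h4>Heading from a green block</h4>
</div>
<div class="col blue">
    <h4>Heading from a blue block</h4>
</div>
HTML;

// Regex way: the same pattern as above, unchanged.
$matches = array();
preg_match_all('#<div class="col blue">.+?<h4>([^<]+)#is', $content, $matches);
$regexResult = $matches[1];

// DOM way: the same XPath query as above, unchanged.
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($content);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$xpathResult = array();
foreach ($xpath->query('//div[@class="col blue"]/h4') as $el) {
    $xpathResult[] = $el->textContent;
}

var_dump($regexResult); // also contains the heading of the green block
var_dump($xpathResult); // contains only the heading of the blue block
```

The regex could be patched to stop at the closing </div>, but that is exactly the extra complexity mentioned above, and it breaks again as soon as divs are nested.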

score 0 · Accepted Answer

Don't use regular expressions to parse or scrape information from HTML. Try PHP's built-in DOM parser instead.
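As a rough sketch of what that could look like with nothing but the built-in DOM extension (no XPath), assuming the scraped markup is held in a variable named $content. That variable name and the shortened sample markup are illustrative assumptions, not part of this answer.

```php
<?php
// Hypothetical stand-in for the scraped page.
$content = '<div class="col blue"><h4>This is line i want to scrape</h4>'
         . '<p class="date">12 Sep</p></div>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($content);
libxml_clear_errors();

$headings = array();
foreach ($dom->getElementsByTagName('div') as $div) {
    // Split the class attribute on whitespace so the order of the classes does not matter.
    $classes = preg_split('/\s+/', trim($div->getAttribute('class')));
    if (in_array('col', $classes, true) && in_array('blue', $classes, true)) {
        // Collect the text of every <h4> below this div.
        foreach ($div->getElementsByTagName('h4') as $h4) {
            $headings[] = trim($h4->textContent);
        }
    }
}

print_r($headings);
```

Note that getElementsByTagName() returns all descendants, so a page with nested "col blue" divs would list the same heading more than once; deduplicate the result if that matters.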

score 0 · Accepted Answer

Use DOM and XPath. Put your HTML data into $html.

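$dom = new DOMDocument('1.0', 'UTF-8');
@$dom->loadHTML($html); // @ suppresses warnings about malformed markup
$xmlElements = simplexml_import_dom($dom);

$divs = $xmlElements->xpath("//div[@class='col blue']");
foreach ($divs as $div) {
    // $div->h4 is a SimpleXMLElement; cast it to a string to get the heading text.
    $heading = (string) $div->h4;
    var_dump($heading);
}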

Additional note:

Don't use regular expressions to parse or scrape info from HTML. It's a bad technique.
