php - Regex to extract data from a website

Question

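I'm looking for a job, and I'm writing a script that runs once a day from cron. It extracts text and links from a website. When it comes to regular expression patterns, I'm helpless.

Here is a sample of the markup I'm extracting data from: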

<div class="cat-list-item job-list-item">

<h3 class="expressway full-width"><a href="/about/careers/network_engineer_voip_telephony">Network Engineer - VoIP Telephony</a></h3>

<div class="career-summary">

    <p>
        Provide daily support, proactive maintenance and independent troubleshooting, and identify capacity/performance issues to ensure
    </p>

</div>

<p class="locations-heading"><b>Locations</b></p>

<ul class="locations-list normal">


    <li>
        Elizabethtown Headquarters
    </li>

</ul>

<div class="list-bottom">
    <a class="learn-more replace" href="/about/careers/network_engineer_voip_telephony">Learn More</a>
</div>

Here is what I have so far:

<?php
$url = "http://bluegrasscellular.com/about/careers/";
$input = @file_get_contents($url) or die("Could not access file: $url");
$regexp = "<h3 class=\"expressway full-width\"><a\s[^>]*href=\"\/about\/careers\/(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if (preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // $match[2] = link address
        // $match[3] = link text
        echo "<a href='http://bluegrasscellular.com/about/careers/{$match[2]}'>{$match[3]}</a><br>";
    }
}
?>

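However, all this does is grab the link text and href from the <h3> link. I'd also like to grab the following:

Provide daily support, proactive maintenance and independent troubleshooting, and identify capacity/performance issues to ensure
Elizabethtown Headquarters

Ultimately I'd like to store these in a database and be notified of any new positions. I have no idea how to go about this. Any help is greatly appreciated.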

score 2 · Accepted Answer

Use the DOMDocument class. Start with something like this:

$doc = new DOMDocument();
//load HTML string into document object
if ( ! @$doc->loadHTML($html)){
    return FALSE;
}
//create XPath object using the document object as the parameter
$xpath = new DOMXPath($doc);

Then you need to write a query for each element you want to extract. To get the text in the "career-summary" div, you could use the following XPath query:

$query = "//div[@class='career-summary']";
//XPath queries return a NodeList
$res = $xpath->query($query);
$text = trim($res->item(0)->nodeValue);

I haven't tested it, but that's the general idea. The following query should get the text from the specified list element:

$query = "//ul[@class='locations-list normal']";

Learning XPath queries is well worth the effort for this kind of task. They are far better suited than regular expressions to handling HTML or XML.
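To make the two queries above concrete, here is a minimal sketch that runs them against the sample markup from the question. The inline $html string stands in for the page you would fetch with file_get_contents(), and the use of libxml_use_internal_errors() in place of the @ operator is my own choice, not something the steps above require.

```php
<?php
// Sample markup copied from the question. In the real script this would be
// the string returned by file_get_contents($url).
$html = <<<'HTML'
<div class="cat-list-item job-list-item">
<h3 class="expressway full-width"><a href="/about/careers/network_engineer_voip_telephony">Network Engineer - VoIP Telephony</a></h3>
<div class="career-summary">
    <p>
        Provide daily support, proactive maintenance and independent troubleshooting, and identify capacity/performance issues to ensure
    </p>
</div>
<ul class="locations-list normal">
    <li>
        Elizabethtown Headquarters
    </li>
</ul>
</div>
HTML;

// Collect parser warnings internally instead of printing them
// (real-world HTML is rarely well-formed).
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// First matching summary div, or null if the page layout has changed.
$summaryNodes = $xpath->query("//div[@class='career-summary']");
$summary = $summaryNodes->length > 0 ? trim($summaryNodes->item(0)->nodeValue) : null;

// First matching locations list, or null if none was found.
$locationNodes = $xpath->query("//ul[@class='locations-list normal']");
$location = $locationNodes->length > 0 ? trim($locationNodes->item(0)->nodeValue) : null;

echo $summary, "\n", $location, "\n";
```

Checking the length before calling item(0) avoids an error on a null node if the site ever changes its class names.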

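Edit:

To access multiple items, you may have to change the query. For example, if there are multiple list items, you can change the query as follows: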

$query = "//ul[@class='locations-list normal']/li";

The "/li" means you want the list items inside the "ul" tag with the specified class. Once you have the results, you can loop over them with foreach:

$out = array();
foreach ($res as $node){
    $out[] = $node->nodeValue;
}
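Since the careers page presumably lists several jobs, the same idea can be applied once per job block by passing each block as the context node for relative queries (those starting with "./"). The sketch below is untested against the live site; extractJobs() is a hypothetical helper name, and the assumption that every posting sits in a div carrying the job-list-item class comes only from the sample markup in the question.

```php
<?php
/**
 * Hypothetical helper (not part of the original answer): returns one array
 * per job posting with its title, href, summary and locations.
 */
function extractJobs(string $html): array
{
    libxml_use_internal_errors(true); // keep parser warnings out of the output
    $doc = new DOMDocument();
    if (!$doc->loadHTML($html)) {
        return [];
    }
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // Match any div whose class list contains the token "job-list-item".
    $items = $xpath->query(
        "//div[contains(concat(' ', normalize-space(@class), ' '), ' job-list-item ')]"
    );

    $jobs = [];
    foreach ($items as $item) {
        // Relative queries only look inside the current job block.
        $link = $xpath->query(".//h3/a", $item)->item(0);
        $summary = $xpath->query(".//div[@class='career-summary']", $item)->item(0);

        $locations = [];
        foreach ($xpath->query(".//ul[@class='locations-list normal']/li", $item) as $li) {
            $locations[] = trim($li->nodeValue);
        }

        $jobs[] = [
            'title'     => $link ? trim($link->nodeValue) : '',
            'href'      => $link ? $link->getAttribute('href') : '',
            'summary'   => $summary ? trim($summary->nodeValue) : '',
            'locations' => $locations,
        ];
    }
    return $jobs;
}
```

Each returned entry keeps the href, which could serve as a unique key when comparing today's run with what is already stored in the database.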
