php - Matching a URL against a pattern with PHP

Question

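I have to use a crawler for my project.

I'm using the Simple HTML DOM class to get all the links on a page.

Now I want to keep only the links of the form "/questions/3904482/<title of the question>".

Here is my attempt: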

include_once('simple_html_dom.php');
$html = new simple_html_dom();
$html->load_file('http://stackoverflow.com/questions?sort=newest');
$pat='#^/question/([0-9]+)/#';
foreach($html->find('a') as $link)
{
    echo preg_match($pat, $link->href);
    {
        echo $link->href."<br>";
    }
}

All of the links get filtered out.

score 1 · Accepted Answer

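You say the URLs contain "questions" (with an s), but your pattern says "question", with no s.

Also, it looks like you should be using if there rather than echo. With echo, the result of preg_match is just printed and the block in braces runs for every link, whether it matched or not.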

include_once('simple_html_dom.php');
$html = new simple_html_dom();
$html->load_file('http://stackoverflow.com/questions?sort=newest');
$pat='#^/questions/([0-9]+)/#';
foreach($html->find('a') as $link)
{

    if ( preg_match($pat, $link->href) )
    {
        echo $link->href."<br>";
    }
}
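
To check the corrected pattern without fetching the page, a minimal sketch along these lines may help. The helper name filter_question_links and the sample hrefs are made up for illustration; only preg_match and the pattern come from the code above.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): keep only hrefs that
// start with /questions/<numeric id>/.
function filter_question_links(array $hrefs): array
{
    $pat = '#^/questions/([0-9]+)/#';
    $matches = [];
    foreach ($hrefs as $href) {
        // preg_match returns 1 on a match, 0 on no match, false on error.
        if (preg_match($pat, $href, $m) === 1) {
            $matches[$m[1]] = $href; // key by the captured question id
        }
    }
    return $matches;
}

// Made-up sample hrefs, for illustration only.
$sample = [
    '/questions/3904482/sample-title',
    '/questions/tagged/php',
    '/questions/ask',
    '/users/12345/someone',
];

foreach (filter_question_links($sample) as $id => $href) {
    echo $id, ' => ', $href, PHP_EOL;
}
```

Because the pattern requires digits right after /questions/, links such as /questions/tagged/php and /questions/ask are dropped without any extra checks.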

score 1 · Accepted Answer

You can make use of DOM and XPath:

<?php

$dom = new DOMDocument;
@$dom->loadHTMLFile('http://stackoverflow.com/questions?sort=newest');
$xpath = new DOMXPath($dom);
$questions = $xpath->query("//a[contains(@href, '/questions/') and not(contains(@href, '/tagged/')) and not(contains(@href, '/ask'))]");

foreach ($questions as $question) {
    // Print one link per line; without the newline the output runs together.
    print "{$question->getAttribute('href')} => {$question->nodeValue}\n";
}
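
The contains() checks above keep any href that has /questions/ somewhere in it, so the two ideas can also be combined: let XPath narrow down the anchors and let the regex confirm the numeric id. This is only a sketch under stated assumptions. The function name question_links_from_html and the HTML fragment are invented for illustration, and it parses a string with loadHTML instead of downloading the live page.

```php
<?php
// Hypothetical helper: returns href => link text for anchors whose href
// looks like /questions/<numeric id>/...
function question_links_from_html(string $html): array
{
    $dom = new DOMDocument;
    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];
    // starts-with() is an XPath 1.0 function, so DOMXPath can evaluate it.
    foreach ($xpath->query("//a[starts-with(@href, '/questions/')]") as $a) {
        $href = $a->getAttribute('href');
        if (preg_match('#^/questions/[0-9]+/#', $href) === 1) {
            $links[$href] = trim($a->nodeValue);
        }
    }
    return $links;
}

// Made-up HTML fragment standing in for the downloaded page.
$sampleHtml = '<html><body>'
    . '<a href="/questions/3904482/sample-title">Sample title</a>'
    . '<a href="/questions/tagged/php">php</a>'
    . '<a href="/questions/ask">Ask Question</a>'
    . '<a href="/users/1/someone">someone</a>'
    . '</body></html>';

foreach (question_links_from_html($sampleHtml) as $href => $title) {
    echo $href, ' => ', $title, PHP_EOL;
}
```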
