php - Regex to print only the URLs containing a specific word from any web page

Question

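I am using the code below to extract URLs from a web page. It works fine, but I want to filter the output. It currently prints every URL on the page, and I only want the ones that contain the word "super".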

$regex = '|<a.*?href="(.*?)"|';
preg_match_all($regex, $result, $parts);
$links = $parts[1];
foreach ($links as $link) {
    echo $link . "<br>";
}

So it should echo only the URLs in which "super" appears. For example, it should ignore the URL

       http://xyz.com/abc.html

but it should echo

        http://abc.superpower.com/hddll.html

because the URL contains the required word "super".

score 1 · Accepted Answer

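Keep the regex non-greedy and require the word inside the captured href. Restricting the capture to non-quote characters (and the tag part to non-">" characters) stops a match from running past one link into a later one that happens to contain "super":

$regex = '|<a\s[^>]*?href="([^"]*super[^"]*)"|i';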
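As a rough check of this kind of pattern, here is a minimal sketch that runs it against an inline HTML string instead of a fetched page. The sample markup and the `$html` variable are made up for illustration; the pattern assumes double-quoted href attributes, and `[^>]*` and `[^"]*` are used so that a match cannot leave the current tag or the current attribute value.

```php
<?php
// Hypothetical sample markup, standing in for the fetched page in $result.
$html = '<a href="http://xyz.com/abc.html">one</a> '
      . '<a class="x" href="http://abc.superpower.com/hddll.html">two</a>';

// [^>]*? stays inside one <a> tag; [^"]* keeps the capture inside one href value.
$regex = '|<a\s[^>]*?href="([^"]*super[^"]*)"|i';

preg_match_all($regex, $html, $parts);
$links = $parts[1]; // group 1 holds the captured URLs

foreach ($links as $link) {
    echo $link . "<br>";
}
```

With this sample input, only the superpower.com link is captured; the xyz.com link is skipped because its href value contains no "super".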

However, for parsing and scraping HTML it is better to use PHP's DOM parser.

Update: here is the code using the DOM parser:

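$request_url = '1900girls.blogspot.in/';

// Fetch the page body as a string instead of printing it.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $request_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$result = curl_exec($ch);
if ($result === false) {
    die('Request failed: ' . curl_error($ch));
}

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect warnings from malformed HTML instead of printing them
$doc->loadHTML($result);          // parse the fetched HTML
$xpath = new DOMXPath($doc);
$needle = 'blog';                 // word to look for; use 'super' for the case in the question

// Select every <a> element whose href attribute contains $needle.
// $needle must not contain a single quote, or the XPath string breaks.
$nodelist = $xpath->query("//a[contains(@href, '" . $needle . "')]");
for ($i = 0; $i < $nodelist->length; $i++) {
    $node = $nodelist->item($i);
    echo $node->getAttribute('href') . "\n";
}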
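
The same DOM-based idea can be tried without any network access. The sketch below is only one possible way to package it: the function name `linksContaining` and the sample `$html` string are hypothetical, and it filters with `stripos()` in PHP rather than with XPath `contains()`, which makes the match case-insensitive and avoids quoting the search word into an XPath expression.

```php
<?php
// Hypothetical helper: returns the href values of <a> elements whose URL contains $needle.
function linksContaining(string $html, string $needle): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate malformed real-world HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the earlier setting

    $matches = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        // stripos() is a case-insensitive substring search.
        if ($needle !== '' && stripos($href, $needle) !== false) {
            $matches[] = $href;
        }
    }
    return $matches;
}

// Sample markup invented for this example.
$html = '<p><a href="http://xyz.com/abc.html">one</a>'
      . '<a href="http://abc.superpower.com/hddll.html">two</a></p>';

foreach (linksContaining($html, 'super') as $link) {
    echo $link . "\n";
}
```

To use it on a real page, pass the string returned by `curl_exec()` as the first argument.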
