php - Regex expression to find all paths in an HTML string

Question

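I have a string that contains htmlentities-encoded HTML code.

What I want to do is find all the paths in the document that appear between:

href="XXX", src="XXX".

I do have a regular expression that finds all links starting with http, https, ftp and file, which lets me iterate over them: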

"/\b(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)[-A-Z0-9+&@#\/%=~_|$?!:,.]*[A-Z0-9+&@#\/%=~_|$]/i"

Any ideas?

score 5 · Accepted Answer

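Update: doing this with a regex is not reliable. A src=".." or href=".." statement can be part of a comment or a JavaScript statement. To get the links reliably, I'd suggest using XPath: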

<?php

$html = file_get_contents('http://stackoverflow.com/questions/14782334/regex-expression-to-find-all-paths-in-a-html-string/14782594#14782594');
$doc = new DOMDocument();
@$doc->loadHTML($html);
$selector = new DOMXPath($doc);

$result = $selector->query('//a/@href | //@src');
foreach($result as $link) {
    echo $link->value, PHP_EOL;
}
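The question mentions that the HTML arrives as an htmlentities-encoded string rather than as a URL. A minimal sketch of the same XPath query applied to such a string could look like the following. The sample markup is made up for illustration, and `libxml_use_internal_errors()` is used here as an alternative to the `@` operator, which is a choice of this sketch rather than part of the answer.

```php
<?php
// Hypothetical sample: an HTML fragment that was passed through htmlentities().
$encoded = htmlentities('<p><a href="/docs/intro.html">Intro</a><img src="img/logo.png"></p>');

// Undo the entity encoding so the parser sees real markup.
$html = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

// Collect libxml parse warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Same query as in the answer: every href on an <a>, plus every src attribute.
$xpath = new DOMXPath($doc);
$paths = array();
foreach ($xpath->query('//a/@href | //@src') as $attr) {
    $paths[] = $attr->value;
}
print_r($paths);
```

The decode step matters: on the still-encoded string the parser would only see text, so the query would find no attributes at all.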

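If you do use a regex, I would try to capture the content between the quotes that follow the = of the href or src attributes. Here is an example of how to get the links from this page using a regex: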

<?php

$html = file_get_contents('http://stackoverflow.com/questions/14782334/regex-expression-to-find-all-paths-in-a-html-string');

// Note the U modifier at the end, which makes the pattern ungreedy
preg_match_all('/href="(?P<href>.*)"|src="(?P<src>.*)"/U', $html, $m);
var_dump($m['href']);
var_dump($m['src']);
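Because the pattern above uses two alternatives with separate named groups, each result array also contains empty strings for the matches that belong to the other attribute. One possible variation, sketched here with a made-up input string, captures the value in a single group and also accepts single-quoted attributes. It is still a regex, so the caveat about comments and JavaScript applies unchanged.

```php
<?php
// Hypothetical sample input, not taken from the original page.
$html = '<a href="/about">About</a> <img src=\'img/logo.png\'> <script src="app.js"></script>';

// (href|src) captures the attribute name, (["\']) the opening quote,
// (.*?) lazily captures the value, and \2 requires the matching closing quote.
preg_match_all('/\b(href|src)\s*=\s*(["\'])(.*?)\2/i', $html, $matches, PREG_SET_ORDER);

$paths = array();
foreach ($matches as $match) {
    $paths[] = $match[3]; // group 3 holds the attribute value
}
print_r($paths);
```

Using a lazy quantifier (`.*?`) has the same effect as the U modifier for this one group, without inverting the greediness of the whole pattern.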

score 4 · Accepted Answer

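You can use the DOM to find all links in specific tags. For example, to get the URLs from anchor tags, do something like this (untested, but it should point you in the right direction):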

function findPaths($url)
{
   $dom = new DOMDocument();

   // $url is the page to search; the "@" is there to suppress warnings
   @$dom->loadHTMLFile($url);

   $paths = array();
   foreach($dom->getElementsByTagName('a') as $path)
   {
     $paths[] = array('url' => $path->getAttribute('href'), 'text' => $path->nodeValue);
   }
   return $paths;
}

You can also load and evaluate the DOM more easily with XPath.
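As a rough sketch of that XPath route, the function below returns the same url/text pairs but works on an HTML string and only visits anchors that actually have an href. The function name `findAnchorPaths` and the sample markup are hypothetical and not part of the answer.

```php
<?php
// Hypothetical helper name and sample markup, for illustration only.
function findAnchorPaths($html)
{
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $paths = array();
    // //a[@href] selects only anchors that carry an href attribute.
    foreach ($xpath->query('//a[@href]') as $anchor) {
        $paths[] = array(
            'url'  => $anchor->getAttribute('href'),
            'text' => trim($anchor->textContent),
        );
    }
    return $paths;
}

$paths = findAnchorPaths('<a href="/a.html">First</a><a name="x">No link</a><a href="b.html"> Second </a>');
print_r($paths);
```

Filtering in the query keeps named anchors without a link out of the result, which `getElementsByTagName('a')` would otherwise return with an empty url.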
