
Possible duplicate:
How do you parse and process HTML with PHP?

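I need to parse a block of HTML and replace some of the links with their link text, depending on whether that text matches a specific pattern.

The regex I use to identify these strings is already used elsewhere in my application: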

$regex  = "/\b[FfGg][\.][\s][0-9]{1,4}\b/";
preg_match_all($regex, $html, $matches, PREG_SET_ORDER);

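I used the following SO question as a starting point for extracting the link text:

Replace html link tags with their text description

The idea is to convert any link whose text is an identifier of the form "F.xxxx" or "G.xxxx" (either case) and leave the rest alone (for example, the Google link).

What I have so far is: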

    $html = 'Ten reports <a href="http://google.com">Google!</a> on 14 mice with ABCD 
show that low plasma BCAA, particularly ABC and to a lesser extent DEF, can result in 
severe but reversible epithelial damage to the skin, eye and gastrointestinal tract.
</li><li>Symptoms were reported in conjunction with low plasma ABC levels in 9 case 
reports. In two case reports, ABC levels were between 1.9 and 48 µmol/L (<a 
href="/docpage.php?obscure==100" target="F.100">F.100</a>, <a 
href="/docpage.php?obscure==68" target="F.68">F.68</a>, <a href="/docpage.php?obscure==67" 
target="F.67">F.67</a>, <a href="/docpage.php?obscure==71" target="F.71">F.71</a>, <a 
href="/docpage.php?obscure==122" target="F.122">F.122</a>, <a 
href="/docpage.php?obscure==92" target="F.92">F.92</a>, <a href="/docpage.php?obscure==96" 
target="F.96">F.96</a>);';

This converts every link, including the Google one:

$html = preg_replace("/<a.*?href=\"(.*?)\".*?>(.*?)<\/a>/i", "$2", $html);

This returns an empty HTML string:

$html = preg_replace("/<a.*?href=\"(.*?)\".*?>[FfGg][\.][\s][0-9]{1,4}<\/a>/i", "$2", $html);

I believe the problem lies in how I embedded this regex in the second (non-working) example above:

[FfGg][\.][\s][0-9]{1,4}

What is the correct way to embed the FfGg expression in the HTML-matching pattern from my preg_replace examples above?


3 Answers


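You should not parse HTML with regular expressions. There is simply no way to handle every case correctly. Here are a few examples of valid HTML that break your link-finding regex: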

<!-- <a href="www.blah.com">   -->    <a href="www.foo.com">F.100</a>
<area>...</area>  ...  <a href="www.foo.com">F.100</a>
<a href="www.foo.com">F.100</a >
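
To make the failure concrete, here is a small sketch that runs the link-stripping pattern from the question against simplified versions of those three snippets. The variable names and the array layout are my own, chosen only for illustration.

```php
<?php
// The "replace every link with its text" pattern from the question.
$pattern = '/<a.*?href="(.*?)".*?>(.*?)<\/a>/i';

// Simplified copies of the three snippets (keys are illustrative labels).
$samples = [
    'comment' => '<!-- <a href="www.blah.com"> --> <a href="www.foo.com">F.100</a>',
    'area'    => '<area>...</area> ... <a href="www.foo.com">F.100</a>',
    'space'   => '<a href="www.foo.com">F.100</a >',
];

$results = [];
foreach ($samples as $name => $sample) {
    $results[$name] = preg_replace($pattern, '$2', $sample);
}

// 'comment': the match starts at the commented-out <a>, so the real link's
//            opening tag is left behind without its closing </a>.
// 'area':    "<a" also matches the start of "<area>", so everything from
//            <area> up to the link is swallowed.
// 'space':   "</a >" is a valid end tag, but the pattern does not match it,
//            so the link is left untouched.
var_dump($results);
```

None of the three inputs ends up with a cleanly removed link, even though a browser would render all of them without complaint.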

I suggest looking at this question for a better approach: How do you parse and process HTML/XML in PHP?

answered 2012-09-21T14:42:07.777

Here is the DOM (that is, correct) way to do this:

Edit: improved the regex

<?php

    $html = 'Ten reports <a href="http://google.com">Google!</a> on 14 mice with ABCD show that low plasma BCAA, particularly ABC and to a lesser extent DEF, can result in severe but reversible epithelial damage to the skin, eye and gastrointestinal tract.</li><li>Symptoms were reported in conjunction with low plasma ABC levels in 9 case reports. In two case reports, ABC levels were between 1.9 and 48 µmol/L (<a href="/docpage.php?obscure==100" target="F.100">F.100</a>, <a href="/docpage.php?obscure==68" target="F.68">F.68</a>, <a href="/docpage.php?obscure==67" target="F.67">F.67</a>, <a href="/docpage.php?obscure==71" target="F.71">F.71</a>, <a href="/docpage.php?obscure==122" target="F.122">F.122</a>, <a href="/docpage.php?obscure==92" target="F.92">F.92</a>, <a href="/docpage.php?obscure==96" target="F.96">F.96</a>);';

    // Create a new DOMDocument and load the HTML string
    $dom = new DOMDocument('1.0');
    $dom->loadHTML($html);

    // Create an XPath object for this DOMDocument
    $xpath = new DOMXPath($dom);

    // Loop over all <a> elements in the document
    // Ideally we would combine the regex into the XPath query, but XPath 1.0
    // doesn't support it
    foreach ($xpath->query('//a') as $anchor) {
        // See if the link matches the pattern
        if (preg_match('/^\s*[gf]\s*\.\s*\d{1,4}\s*$/i', $anchor->nodeValue)) {
            // If it does, convert it to a text node (effectively, un-linkify it)
            $textNode = new DOMText($anchor->nodeValue);
            $anchor->parentNode->replaceChild($dom->importNode($textNode), $anchor);
        }
    }

    // Because you are working with partial HTML string, I extract just that
    // string. If you are actually working with a full document, you can
    // replace all the code below this comment with simply:
    // $result = $dom->saveHTML();

    // A string to hold the result
    $result = '';

    // Iterate all elements that are a direct child of the <body> and convert
    // them to strings
    foreach ($xpath->query('/html/body/*') as $node) {
        $result .= $node->C14N();
    }

    // $result now contains the modified HTML string

See it working (note: the error messages you see appear because the HTML string you supplied is invalid)
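
If those parser warnings get in the way, one option is to have libxml collect them instead of printing them. The following is only a minimal sketch under my own assumptions: a shortened sample string, `createTextNode()` in place of `new DOMText` plus `importNode()`, and `saveHTML()` for the whole document rather than extracting the children of `<body>`.

```php
<?php
// Shortened sample; the stray </li> is kept so the parser has something to complain about.
$html = 'Intro <a href="http://google.com">Google!</a></li><li>See '
      . '<a href="/docpage.php?obscure==100" target="F.100">F.100</a>.';

// Ask libxml to buffer parse errors instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Keep the collected errors for inspection, then restore the old setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
foreach ($xpath->query('//a') as $anchor) {
    // Same test as above: link text such as "F.100" or "g.42".
    if (preg_match('/^\s*[gf]\s*\.\s*\d{1,4}\s*$/i', $anchor->nodeValue)) {
        // Swap the <a> element for a plain text node owned by this document.
        $text = $dom->createTextNode($anchor->nodeValue);
        $anchor->parentNode->replaceChild($text, $anchor);
    }
}

// Full document, including the doctype, <html> and <body> wrappers.
$result = $dom->saveHTML();
```

`$errors` holds whatever the parser reported as an array of `LibXMLError` objects, so the problems can be logged or ignored deliberately rather than leaking into the output.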

answered 2012-09-21T15:09:53.343

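You shouldn't rely so heavily on reluctant quantifiers. They try to consume as few characters as possible, but they will consume as many as they have to in order to achieve an overall match. If the HTML is minified (specifically, if it has few or no line breaks), each of those .*?'s could end up trying to consume the whole rest of the document, and they might have to do that many, many times.

That's especially true when no match is possible: the regex has to try every possible path through the text before it admits defeat. Another problem is that reluctant quantifiers don't prevent matches that start too early. Given this string: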

<a href="www.blah.com">...</a> <a href="www.foo.com">F.100</a>

...your regex will start matching at the first <a> tag and stop at the end of the second one. In this regex:

'~<a\b[^>]*\bhref="[^"]*"[^>]*>([FG]\.\d{1,4})</a>~i'

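...I've replaced each .*? with [^>]* or [^"]* to confine those parts of the match to a single tag or a single attribute value, respectively. Although this regex works better, be aware that it isn't foolproof, far from it. But it's about as reasonable as you can get when matching HTML with regexes.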
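
As a rough illustration of the difference, the sketch below runs both styles of pattern against the two-link string from above. The "lazy" pattern is my own adaptation of the one in the question: I dropped its [\s], because that demands a whitespace character between the dot and the digits and "F.100" has none. Also note that the tightened regex has a single capture group, so the replacement is '$1' rather than '$2'. Variable names are only illustrative.

```php
<?php
$html = '<a href="www.blah.com">...</a> <a href="www.foo.com">F.100</a>';

// Lazy version: the question's pattern with the stray [\s] removed.
$lazy  = '~<a.*?href="(.*?)".*?>[FG]\.\d{1,4}</a>~i';
// Tightened version from this answer.
$tight = '~<a\b[^>]*\bhref="[^"]*"[^>]*>([FG]\.\d{1,4})</a>~i';

preg_match($lazy, $html, $lazyMatch);
preg_match($tight, $html, $tightMatch);

// $lazyMatch[0] spans both links; $tightMatch[0] is only the F.100 link.
var_dump($lazyMatch[0], $tightMatch[0]);

// Replace only the matching links with their text (capture group 1).
$stripped = preg_replace($tight, '$1', $html);
var_dump($stripped);
```

With the lazy pattern the first link would be swallowed along with the second; with the tightened one only the F.100 link is unwrapped and the first link survives.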

answered 2012-09-21T16:56:50.800