php - Parsing HTML tags from inside XML in PHP

Question

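I'm trying to build my own RSS feed reader (for learning purposes) in PHP, using simplexml_load_string to parse http://uk.news.yahoo.com/rss. I'm stuck on reading the HTML tags inside the <description> tag.

My code so far looks like this: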

$feed = file_get_contents('http://uk.news.yahoo.com/rss');
$rss = simplexml_load_string($feed);

//for each element in the feed
foreach ($rss->channel->item as $item) {
    echo '<h3>'. $item->title . '</h3>'; 

        foreach($item->description as $desc){

             //how to read the href from the a tag???

             //this does not work at all
             $tags = $item->xpath('//a');
             foreach ($tags as $tag) {
                 echo $tag['href'];
             }
       }
}

Any ideas on how to extract each HTML tag?

Thanks

score 3 · Accepted Answer

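The special characters in the description content are encoded, so the content is not treated as nodes in the XML, only as a string. You can decode the special characters, then load the HTML into a DOMDocument and do whatever you want with it. For example: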

foreach ($rss->channel->item as $item) {
    echo '<h3>'. $item->title . '</h3>'; 

        foreach($item->description as $desc){

            $dom = new DOMDocument();
            $dom->loadHTML(htmlspecialchars_decode((string)$desc));

            $anchors = $dom->getElementsByTagName('a');
            echo $anchors->item(0)->getAttribute('href');
        }
}

XPath is also available for DOMDocument; see DOMXPath.
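A minimal sketch of the DOMXPath route, in case it helps. The sample item and the helper name extractHrefs are made up for illustration; the real feed's markup may differ. It relies on the string cast of the SimpleXML element to supply the HTML, and silences libxml warnings because feed HTML is often not well-formed.

```php
<?php
// Hypothetical sample item standing in for the real feed.
$xml = <<<'XML'
<rss version="2.0"><channel><item>
<title>Sample</title>
<description>&lt;p&gt;&lt;a href="http://example.com/story.html"&gt;Story&lt;/a&gt; and &lt;a href="http://example.com/more.html"&gt;more&lt;/a&gt;&lt;/p&gt;</description>
</item></channel></rss>
XML;

// Hypothetical helper: returns every href found on an <a> element in an HTML fragment.
function extractHrefs(string $html): array
{
    if ($html === '') {
        return []; // DOMDocument::loadHTML() rejects an empty string
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $hrefs = [];
    foreach ($xpath->query('//a/@href') as $attr) {
        $hrefs[] = $attr->nodeValue;
    }
    return $hrefs;
}

$rss = simplexml_load_string($xml);

foreach ($rss->channel->item as $item) {
    foreach (extractHrefs((string) $item->description) as $href) {
        echo $href, "\n";
    }
}
```

Unlike item(0), this collects every link in the description and simply yields nothing when a description has no anchor.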

score 1 · Accepted Answer

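The <description> element of the RSS feed contains HTML. As in "How to parse CDATA HTML content of XML using SimpleXML?", you need to get the node value of that element (the HTML) and parse it with an additional parser.

The accepted answer to the linked question already shows this, although quite verbosely. For SimpleXML it makes little difference here whether the RSS feed uses CDATA or, as in your case, just entities.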

$feed = file_get_contents('http://uk.news.yahoo.com/rss');
$rss  = simplexml_load_string($feed);
$dom  = new DOMDocument(); // the HTML parser used for descriptions' HTML

foreach ($rss->channel->item as $item)
{
    echo '<h3>' . $item->title . '</h3>', "\n";

    foreach ($item->description as $desc)
    {
        $dom->loadHTML($desc);

        $html = simplexml_import_dom($dom)->body;

        echo $html->p->a['href'], "\n";
    }
}

Example output:

...
<h3>Chantal nears hurricane strength in Caribbean</h3>
http://uk.news.yahoo.com/chantal-nears-hurricane-strength-caribbean-220149771.html
<h3>Placido Domingo In Hospital With Blood Clot</h3>
http://uk.news.yahoo.com/placido-domingo-hospital-blood-clot-215427742.html
<h3>Berlusconi's final tax fraud appeal hearing set for July 30</h3>
http://uk.news.yahoo.com/berlusconis-final-tax-fraud-appeal-hearing-set-july-214714122.html
<h3>China: Men Rescued From River Amid Floods</h3>
http://uk.news.yahoo.com/china-men-rescued-river-amid-floods-213005159.html
<h3>Snowden has not yet accepted asylum in Venezuela - WikiLeaks</h3>
http://uk.news.yahoo.com/snowden-not-yet-accepted-asylum-venezuela-wikileaks-190332291.html
<h3>Three US kidnap victims break silence</h3>
http://uk.news.yahoo.com/three-us-kidnap-victims-release-thankyou-video-093832611.html
...

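Hope this helps. Contrary to the accepted answer, I see no reason to apply htmlspecialchars_decode; in fact, I'm pretty sure it breaks things. My example also shows how to stay with the SimpleXML way of accessing further children once the HTML has been parsed, by converting the DOMNode back into a SimpleXMLElement.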
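To illustrate the point about htmlspecialchars_decode, here is a small sketch with two made-up descriptions (not taken from the Yahoo feed): one entity-encoded, one wrapped in CDATA. Casting the SimpleXML element to a string already gives the decoded HTML in both cases, so decoding a second time changes content that was meant to stay escaped.

```php
<?php
// Hypothetical inputs: the same HTML, once entity-encoded and once in CDATA.
$escaped = simplexml_load_string(
    '<item><description>&lt;p&gt;Use the &amp;lt;b&amp;gt; tag&lt;/p&gt;</description></item>'
);
$cdata = simplexml_load_string(
    '<item><description><![CDATA[<p>Use the &lt;b&gt; tag</p>]]></description></item>'
);

// The XML parser has already resolved the entities (or unwrapped the CDATA).
$fromEscaped = (string) $escaped->description;
$fromCdata   = (string) $cdata->description;

var_dump($fromEscaped === $fromCdata); // bool(true)

echo $fromEscaped, "\n";                          // <p>Use the &lt;b&gt; tag</p>
echo htmlspecialchars_decode($fromEscaped), "\n"; // <p>Use the <b> tag</p>
```

In the second echo the literal text "<b>" has turned into a real tag, which is the kind of breakage the extra decode step can cause.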

score 0 · Accepted Answer

The best thing to do is to call var_dump() on $item.

$feed = file_get_contents('http://uk.news.yahoo.com/rss');
$rss = simplexml_load_string($feed);
foreach ($rss->channel->item as $item) {
    var_dump($item);
    exit;
}

Once you do that, you will see that the value you are after is called "link". So, to print out the URL, you would use the following code:

echo $item->link;
