php - Trying to extract content with tags from XML using PHP

Question

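We use Acalog at our institution and would like to use their (unsupported) API to pull catalog content from their site into ours. I can access their files and extract the information, but the formatting (paragraphs, bold, italics, line breaks) is expressed as nodes (h:p, h:b, h:i, h:br). Unfortunately, the text I get from querying a:content contains only the plain text, without the formatting nodes. How can I bring the nodes along with the text? Where am I going wrong?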

The beginning of the XML (I cut it off about halfway through):

<catalog xmlns="http://acalog.com/catalog/1.0" xmlns:h="http://www.w3.org/1999/xhtml" xmlns:a="http://www.w3.org/2005/Atom" xmlns:xi="http://www.w3.org/2001/XInclude" id="acalog-catalog-6">
<hierarchy>
    <legend>
        <key id="acalog-entity-type-5">
            <name>Department</name>
            <localname>Department</localname>
        </key>
    </legend>
    <entity id="acalog-entity-239">
        <type xmlns:xi="http://www.w3.org/2001/XInclude">
            <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" xi:xpointer="xmlns(c=http://acalog.com/catalog/1.0) xpointer((//c:key[@id='acalog-entity-type-5'])[1])"/>
        </type>
        <a:title xmlns:a="http://www.w3.org/2005/Atom">American Studies</a:title>
        <code/>
        <a:content xmlns:a="http://www.w3.org/2005/Atom" xmlns:h="http://www.w3.org/1999/xhtml">
            <h:p xmlns:h="http://www.w3.org/1999/xhtml">
                <h:span class="dept_intro">
                    <h:i>Chair of the Department of American Studies: </h:i>
                </h:span>
                <h:span class="dept_intro">John Smith</h:span>
                <h:br/>
                <h:span class="dept_intro"> 
                    <h:br/>&#xD;
                    Professors: Jane Smith; Sarah Smith, <h:i class="dept_intro">The Douglas Family Chair in American Culture, History, and Literary and Interdisciplinary Studies</h:i>
                    <h:br/><h:br/>&#xD;Associate Professor: Michael Smith
                </h:span>
                <h:span class="dept_intro"><h:br/></h:span>
            </h:p>
            <h:p xmlns:h="http://www.w3.org/1999/xhtml">
                <h:span class="dept_intro">Assistant Professor: Rebecca Smith</h:span>
            </h:p>
            <h:p xmlns:h="http://www.w3.org/1999/xhtml">
                <h:span class="dept_intro">Lecturer: * Leonard Smith</h:span></h:p>
            <h:p xmlns:h="http://www.w3.org/1999/xhtml">
                <h:span class="dept_intro">Visiting Lecturer: * Robert Smith<h:br/><h:br/><h:br/><h:br/></h:span><h:strong>Department Overview</h:strong></h:p>
            <h:p xmlns:h="http://www.w3.org/1999/xhtml" class="MsoNormal">American studies is an  interdiscipl

Here is the code I have written so far:

$xml = file_get_contents($url);
    if ($xml === false) {
        return false;
    } else {
        // Create an empty DOMDocument object to hold our service response
        $dom = new DOMDocument('1.0', 'UTF-8');
        // Load the XML
        $dom->loadXML($xml);
        // Create an XPath Object
        $xpath = new DOMXPath($dom);
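        // Register the namespaces used in the XPath queries.
        // The 'c' prefix (catalog namespace) is needed by the status check below.
        $xpath->registerNamespace('c', 'http://acalog.com/catalog/1.0');
        $xpath->registerNamespace('h', 'http://www.w3.org/1999/xhtml');
        $xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
        $xpath->registerNamespace('xi', 'http://www.w3.org/2001/XInclude');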
        // Check for error
        $status_elements = $xpath->query('//c:status[text() != "success"]');
        if ($status_elements->length > 0) {
            // An error occurred
            return false;
        }
        $x = $dom->documentElement;
        foreach ($x->childNodes AS $item)
          {
          //echo $item->nodeName . " = " . $item->nodeValue . "<br/><br />";
          }
        // Retrieve all catalogs elements
        $pageText = $xpath->query('//a:content');
        if ($pageText->length == 0) {
            // No text found
            return false;
        }

        foreach ($pageText AS $item) {
            $txt = (string) $item->nodeValue;
            $txt = str_replace('<h:i>','<i>',$txt);
            $txt = str_replace('</h:i>','</i>',$txt);
            $txt = str_replace('<h:span class="dept_intro">','<p>',$txt);
            $txt = str_replace('</h:span>','</p>',$txt);
            // strpos() returns 0 for a match at the start, so compare against false
            if (strpos($txt, 'Department Overview') !== false) {
                echo '<p>' . str_replace('Department Overview','',$txt) . '</p>';
                break;  
            } else {
                echo '<p>' . $txt . '</p>';
            }
            //echo $pageText->nodeValue;
        }
    }

The line $pageText = $xpath->query('//a:content'); pulls the content, but not the tags.
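
One likely cause, sketched here under assumptions rather than confirmed against the Acalog feed: the XPath query does return the a:content element with all of its h:* children, but reading nodeValue flattens that subtree to its concatenated text, so the str_replace() calls never see any tags. A minimal sketch of the difference, using a shortened hypothetical sample modelled on the XML above. The helper name innerXml() and the regex-based prefix stripping are illustrative assumptions, not part of any Acalog API:

```php
<?php
// Hypothetical, shortened sample modelled on the XML in the question.
$xml = <<<'XML'
<catalog xmlns="http://acalog.com/catalog/1.0"
         xmlns:h="http://www.w3.org/1999/xhtml"
         xmlns:a="http://www.w3.org/2005/Atom">
  <a:content><h:p xmlns:h="http://www.w3.org/1999/xhtml"><h:i>Chair: </h:i>John Smith<h:br/></h:p></a:content>
</catalog>
XML;

$dom = new DOMDocument();
$dom->loadXML($xml);

$xpath = new DOMXPath($dom);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');

/**
 * Hypothetical helper: serialize the children of a node, markup included.
 */
function innerXml(DOMNode $node): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        // saveXML($child) serializes that node together with its subtree.
        $out .= $node->ownerDocument->saveXML($child);
    }
    return $out;
}

$content = $xpath->query('//a:content')->item(0);

// nodeValue: only the concatenated text nodes, no element markup.
$textOnly = $content->nodeValue;

// Serialized children: the h:p, h:i and h:br elements are preserved.
$withMarkup = innerXml($content);

// Naive cleanup, assuming "h:" is only ever used for XHTML elements here:
// drop the redundant xmlns:h declarations, then the "h:" prefix on tags.
$html = preg_replace('~\sxmlns:h="[^"]*"~', '', $withMarkup);
$html = preg_replace('~<(/?)h:~', '<$1', $html);

echo $textOnly, "\n";
echo $withMarkup, "\n";
echo $html, "\n";
```

In the loop from the question, this would mean replacing (string) $item->nodeValue with a call such as innerXml($item). The regex cleanup is a shortcut; walking the nodes and rebuilding them without the namespace would be more robust if the content can contain attributes or text that resemble h: tags.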
