php - PHP: DOMDocument: Removing unwanted text from nested elements

Question

I have the following XML document:

<?xml version="1.0" encoding="UTF-8"?>
<header level="2">My Header</header>
<ul>
    <li>Bulleted style text
        <ul>
            <li>
                <paragraph>1.Sub Bulleted style text</paragraph>
            </li>
        </ul>
    </li>
</ul>
<ul>
    <li>Bulleted style text <strong>bold</strong>
        <ul>
            <li>
                <paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>
            </li>
        </ul>
    </li>
</ul>

I need to remove the numbers in front of the sub-bulleted text: the 1. and 2. in the example above.

This is the code I have so far:

<?php
class MyDocumentImporter
{
    const AWKWARD_BULLET_REGEX = '/(^[\s]?[\d]+[\.]{1})/i';

    protected $xml_string = '<some_tag><header level="2">My Header</header><ul><li>Bulleted style text<ul><li><paragraph>1.Sub Bulleted style text</paragraph></li></ul></li></ul><ul><li>Bulleted style text <strong>bold</strong><ul><li><paragraph>2.Sub Bulleted <strong>bold</strong></paragraph></li></ul></li></ul></some_tag>';

    protected $dom;

    public function processListsText( $loop = null ){

        $this->dom = new DomDocument('1.0', 'UTF-8');

        $this->dom->loadXML($this->xml_string);

        if(!$loop){
            //get all the li tags
            $li_set = $this->dom->getElementsByTagName('li');
        }
        else{
            $li_set = $loop;
        }

        foreach($li_set as $li){

            //check for child nodes
            if(! $li->hasChildNodes() ){
                continue;
            }

            foreach($li->childNodes as $child){
                if( $child->hasChildNodes() ){
                    //this li has children, maybe a <strong> tag
                    $this->processListsText( $child->childNodes );
                }
                if( ! ( $child instanceof DOMElement ) ){
                    continue;
                }
                if( ( $child->localName != 'paragraph') ||  ( $child instanceof DOMText )){
                    continue;
                }
                if( preg_match(self::AWKWARD_BULLET_REGEX, $child->textContent) == 0 ){
                    continue;
                }

                $clean_content = preg_replace(self::AWKWARD_BULLET_REGEX, '', $child->textContent);

                //set node to empty
                $child->nodeValue = '';

                //add updated content to node
                $child->appendChild($child->ownerDocument->createTextNode($clean_content));

                //$xml_output = $child->parentNode->ownerDocument->saveXML($child);
                //var_dump($xml_output);

            }
        }
    }
}

$importer = new MyDocumentImporter();
$importer->processListsText();

The problem I can see is that $child->textContent returns the node's plain-text content and strips out any child tags. So:

<paragraph>2.Sub Bulleted <strong>bold</strong></paragraph>

becomes

<paragraph>Sub Bulleted bold</paragraph>

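The <strong> tag is gone.

I'm a bit stumped... can anyone see a way to strip the unwanted characters while keeping the "inner child" <strong> tag?

The tag might not always be <strong>; it could also be a hyperlink <a href="#"> or an <emphasize>.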

score 2 · Accepted Answer

Assuming your XML actually parses, you can use XPath to make your queries easier:

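$xp = new DOMXPath($this->dom);

foreach ($xp->query('//li/paragraph') as $para) {
    // Strip a leading "1." style prefix from the first child only; the dot is escaped so it matches a literal period
    $para->firstChild->nodeValue = preg_replace('/^\s*\d+\.\s*/', '', $para->firstChild->nodeValue);
}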

This performs the text replacement on the first text node rather than on the whole content of the tag.
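As a minimal, self-contained sketch of the approach above: the sample string below is an assumption (one list from the question, wrapped in the same <some_tag> root the asker's code uses), and the `instanceof DOMText` guard is an addition for paragraphs that happen to start with an element rather than text. The dot in the pattern is escaped here so that it matches only a literal period.

```php
<?php
// Hypothetical sample input, wrapped in a single root element so that it parses.
$xml = '<some_tag><ul><li>Bulleted style text <strong>bold</strong>'
     . '<ul><li><paragraph>2.Sub Bulleted <strong>bold</strong></paragraph></li></ul>'
     . '</li></ul></some_tag>';

$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadXML($xml);

$xp = new DOMXPath($dom);

foreach ($xp->query('//li/paragraph') as $para) {
    $first = $para->firstChild;
    // Only touch the paragraph when its first child is a text node.
    if ($first instanceof DOMText) {
        $first->nodeValue = preg_replace('/^\s*\d+\.\s*/', '', $first->nodeValue);
    }
}

// Serialize from the root element, which omits the XML declaration.
$result = $dom->saveXML($dom->documentElement);
echo $result, "\n";
```

The nested paragraph should come out as <paragraph>Sub Bulleted <strong>bold</strong></paragraph>, with the <strong> child still in place.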

score 1 · Accepted Answer

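You are resetting the paragraph's entire content, but all you want is to alter the first text node (remember that text nodes are nodes too). You probably want to look for the XPath //li/paragraph/text()[position()=1] and work on or replace that DOMText node rather than the whole paragraph content.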

$d = new DOMDocument();
$d->loadXML($xml);
$p = new DOMXPath($d);
foreach($p->query('//li/paragraph/text()[position()=1]') as $text){
    // Swap the first text node for a cleaned copy; sibling elements such as <strong> are left untouched
    $text->parentNode->replaceChild(new DOMText(preg_replace(self::AWKWARD_BULLET_REGEX, '', $text->textContent)), $text);
}
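The snippet above relies on self::AWKWARD_BULLET_REGEX, so it only runs inside the asker's class. Below is a standalone sketch of the same idea. The function name stripLeadingNumbers, its default pattern (a simplified equivalent of the asker's constant), and the sample input are all assumptions made for illustration; the sample uses <a> and <emphasize> children to show that the child tag does not have to be <strong>.

```php
<?php
// Hypothetical helper; the name, signature and default pattern are illustrative only.
function stripLeadingNumbers(string $xml, string $regex = '/^\s?\d+\./'): string
{
    $d = new DOMDocument();
    $d->loadXML($xml);
    $p = new DOMXPath($d);

    // Select only the first text-node child of each <paragraph> inside an <li>.
    foreach ($p->query('//li/paragraph/text()[position()=1]') as $text) {
        $clean = preg_replace($regex, '', $text->textContent);
        // Replace just that text node; element siblings stay where they are.
        $text->parentNode->replaceChild($d->createTextNode($clean), $text);
    }

    return $d->saveXML($d->documentElement);
}

// Assumed sample input with a hyperlink and an <emphasize> child.
$out = stripLeadingNumbers(
    '<some_tag><ul><li><paragraph>1.See <a href="#">link</a> and '
    . '<emphasize>this</emphasize></paragraph></li></ul></some_tag>'
);
echo $out, "\n";
```

Only the leading "1." is removed here; the later text node " and " is a second text node, so the XPath position filter never selects it.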
