php - How to parse text and images from a complex XML

Question

I hope you can help me. The XML file looks like this:

<channel><item>
<description>
<div>  <a href="http://image.com">
<span>   
<img src="http://image.com" /> 
</span>
</a>
Lorem Ipsum is simply dummy text of the printing etc... 
</div>
</description>
</item></channel>

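I can get the content of the description tag, but when I do I get the whole structure, with a lot of CSS in it, and I don't want that. What I really need is to parse only the href link and the Lorem Ipsum text. I am trying to do this with SimpleXML, but I can't find a way and it looks too complicated. Any ideas?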

Edit: the code I use to parse the XML

$file = new SimpleXMLElement($mydata);

foreach ($file->channel->item as $post) {
    echo $post->description;
}

score 1 · Accepted Answer

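That XML looks a lot like an RSS or Atom feed (or an excerpt of one). The description node is usually escaped, or placed in a section marked <![CDATA[ ... ]]>, which indicates that its contents are to be treated as raw text, even if they contain <, > or &.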

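Your sample doesn't show that, but if your echo gives you the whole content, including the img tag and so on, then that is what is happening, and your question is similar to "Trying to Parse Only the Images from an RSS Feed": you need to grab the whole description content and parse it as a document of its own.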
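As a rough illustration of "parse it as a document of its own", here is a minimal sketch. It assumes the description is wrapped in CDATA; the `<rss>` root and the sample feed string are made up for the example and are not taken from the real feed.

```php
<?php
// Hypothetical feed: the description is wrapped in CDATA,
// so SimpleXML sees its content as plain text, not as child nodes.
$feed = <<<'XML'
<rss><channel><item>
<description><![CDATA[<div><a href="http://image.com"><span><img src="http://image.com" /></span></a>
Lorem Ipsum is simply dummy text</div>]]></description>
</item></channel></rss>
XML;

$rss = new SimpleXMLElement($feed);

foreach ($rss->channel->item as $post) {
    // The string cast yields the raw HTML fragment.
    $html = (string) $post->description;

    // Parse that fragment as its own HTML document.
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // First link in the fragment, if any.
    $link = $dom->getElementsByTagName('a')->item(0);
    $href = $link ? $link->getAttribute('href') : null;

    // All text inside the first div, with surrounding whitespace removed.
    $div  = $dom->getElementsByTagName('div')->item(0);
    $text = $div ? trim($div->textContent) : null;

    echo $href, "\n", $text, "\n";
}
```

Note that `textContent` also includes text from nested elements, which is harmless here because the link only contains an image.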

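If for some reason the HTML is not escaped, and is actually included in the XML as a set of child nodes, the URL of the link can be accessed directly (assuming the structure is always consistent):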

echo (string)$post->description->div->a['href'];

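As for the text, SimpleXML concatenates all the text content of a particular element (but not that of its children) if you cast it to a string with (string). echo casts to string automatically, but I guess you will eventually want to do something other than echo with it.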

In your example, the text you want is in the first (and only) div, so this will display it:

echo (string)$post->description->div;

However, you mentioned "a lot of CSS", which I imagine you left out of the example for simplicity, so I'm not sure how consistent your real content is.
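Putting the two direct accesses together, a small sketch for the unescaped case might look like the following. The `<rss>` root is an assumption added so that `->channel->item` resolves the same way as in the question's loop; the rest is the sample from the question.

```php
<?php
// The question's sample, wrapped in a hypothetical <rss> root.
$feed = <<<'XML'
<rss><channel><item>
<description>
<div>  <a href="http://image.com">
<span>
<img src="http://image.com" />
</span>
</a>
Lorem Ipsum is simply dummy text of the printing etc...
</div>
</description>
</item></channel></rss>
XML;

$rss = new SimpleXMLElement($feed);

foreach ($rss->channel->item as $post) {
    // Attribute access on the nested <a> element.
    $href = (string) $post->description->div->a['href'];

    // Only the div's own text nodes are concatenated, not the text of
    // its children; trim() drops the whitespace around the <a> element.
    $text = trim((string) $post->description->div);

    echo $href, "\n", $text, "\n";
}
```

The `trim()` is needed because the whitespace before and after the link counts as text belonging to the div.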

score 0 · Accepted Answer

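That is going to be complicated. ~~There is no XML there, only HTML. One difference is that in XML a tag can't contain both another tag and some text. That's why~~ I would use PHP's DOM (I haven't used it yet, but it is similar to plain JavaScript).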

This is what I hacked together (untested):

// first create our document
$doc = new DOMDocument('1.0', 'utf-8');
$doc->loadHTML("your html here"); // there is also a loadHTMLFile

// this tries to get an a element which has a href and returns that href
function getAHref ( $doc ) {
    // now get all a elements to get the one with a href
    $aElements = $doc->getElementsByTagName( "a" );
    foreach ( $aElements as $a ) {
        // does this element have an href? then return it
        if ( $a->hasAttribute( "href" ) ) {
            return $a->getAttribute( "href" );
        }
    }
    // failed? return false
    return false;
}

// tries to get the text in the node
// in your example the text isn't wrapped in anything so this is going to be difficult
function getTextFromNode ( $doc ) {
    // get and loop all divs (assuming the text is always a child of a div)
    $divs = $doc->getElementsByTagName( "div" ); // do we know it's always in that div?
    foreach ( $divs as $div ) {
        // also loop all child nodes to get the text nodes
        foreach ( $div->childNodes as $child ) {
            // is this a text node?
            if ( $child->nodeType == XML_TEXT_NODE ) {
                // is there something in it (new lines count as text nodes)
                if ( trim( $child->nodeValue ) != "" ) {
                    // *pfew* got it
                    return $child->nodeValue;
                }
            }
        }
    }
    // failed? return false
    return false;
}
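
For comparison, the same two lookups could probably be written with DOMXPath instead of manual loops. This is only a sketch of that alternative, not part of the code above; the HTML fragment is a hypothetical stand-in modelled on the question's sample.

```php
<?php
// Hypothetical HTML fragment modelled on the question's description content.
$html = '<div><a href="http://image.com"><span><img src="http://image.com" /></span></a>'
      . "\nLorem Ipsum is simply dummy text</div>";

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// href attribute of the first <a> that has one, or false if there is none.
$hrefNode = $xpath->query('//a[@href]/@href')->item(0);
$href = $hrefNode ? $hrefNode->nodeValue : false;

// First non-blank text node that is a direct child of a <div>, or false.
$textNode = $xpath->query('//div/text()[normalize-space()]')->item(0);
$text = $textNode ? trim($textNode->nodeValue) : false;

echo $href, "\n", $text, "\n";
```

Unlike getTextFromNode above, this sketch trims the returned text; drop the `trim()` call if the surrounding whitespace matters.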

score 0 · Accepted Answer

Here is the final code that answers the question.

$xml = simplexml_load_file('myfile.xml');

$descriptions = $xml->xpath('//item/description');

foreach ($descriptions as $description_node) {

    $description_dom = new DOMDocument();
    $description_dom->loadHTML((string)$description_node);

    $description_sxml = simplexml_import_dom($description_dom);

    $imgs = $description_sxml->xpath('//img');
    $text = $description_sxml->xpath('//div');

    foreach ($imgs as $image) {
        echo (string)$image['src'];
    }

    foreach ($text as $t) {
        echo (string)$t;
    }
}

This is IMSoP's code, to which I added $text = $description_sxml->xpath('//div'); to read the text inside the <div>.

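In my case, some posts in the XML have multiple <div> and <span> tags, so to parse all of them I will probably have to add another ->xpath call for the <span>, or an if... else statement, so that if the <div> has no content, the <span> content is echoed instead. Thanks for your replies.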
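
One way the <div>-then-<span> fallback described above might look is sketched below. The helper name `firstNonEmptyText` and the sample HTML are hypothetical, and the real feed may need a different order of queries.

```php
<?php
// Hypothetical helper: return the direct text of the first <div> that has
// any, otherwise fall back to the first <span> that has any.
function firstNonEmptyText(SimpleXMLElement $sxml): ?string
{
    foreach (['//div', '//span'] as $query) {
        foreach ($sxml->xpath($query) as $node) {
            // The string cast gives only the node's own text, not its children's.
            $text = trim((string) $node);
            if ($text !== '') {
                return $text;
            }
        }
    }
    return null; // neither a <div> nor a <span> carried any text
}

// Made-up fragment in which the <div> itself is empty and the text sits in a <span>.
$html = '<div><span>Text that lives in a span</span></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$sxml = simplexml_import_dom($dom);

$text = firstNonEmptyText($sxml);
echo $text, "\n";
```

Listing the queries in an array keeps the fallback order in one place, so adding another tag would only mean appending one more expression.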
