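Hi all!

How can I parse a well-formed XML file with the Symfony2 DomCrawler component?

I need to split the document into its sections and, for each section, collect the inner tags (epigraph, p, poem, etc.) that belong only to that section.

I have a standard FB2 book in the following XML format: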

<?xml version="1.0" encoding="utf-8"?>
<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0" xmlns:l="http://www.w3.org/1999/xlink">
<description></description>
<body>
<section>
    <title><p><strong>Level 1, section 1</strong></p></title>
    <section>
        <title><p><strong>Level 2, section 2</strong></p></title>
        <section>
            <title><p><strong>Level 3, section 3</strong></p></title>
            <p>Level 3, section 3, paragraph 1</p>
            <poem>
                <stanza>
                    <v>bla-bla-bla 1</v>
                    <v>bla-bla-bla 2</v>
                    <v>bla-bla-bla 3</v>
                </stanza>
            </poem>
            <p>Level3, section 3, paragraph 2</p>
            <subtitle><strong>x x x</strong></subtitle>
        </section>
        <section>
            <title><p><strong>Level 3, section 4</strong></p></title>
            <p>Level 3, section 4, paragraph 1</p>
            <p>Level 3, section 4, paragraph 2</p>
            <subtitle><strong>x x x</strong></subtitle>
        </section>
        <section>
            <title><p><strong>Level 3, section 5</strong></p></title>
            <p>Level 3, section 5, paragraph 1</p>
            <p>Level 3, section 5, paragraph 2</p>
            <p>Level 3, section 5, paragraph 3</p>
            <empty-line/>
            <subtitle>This file was created</subtitle>
            <subtitle>with BookDesigner program</subtitle>
            <subtitle>bookdesigner@the-ebook.org</subtitle>
            <subtitle>22.04.2004</subtitle>
        </section>
    </section>
</section>
</body>
</FictionBook>

The code below doesn't work. Can anyone help me solve this? By the way, the titles are parsed correctly, but the sections' inner tags are not.

private function loadBookSections(Crawler $crawler)
{
    $sections = $crawler->filter('section')->each(function(Crawler $node) {
        $c = $node->filter('section')->reduce(function(Crawler $node, $i) {
            return ($i == 0);
        });

        return array(
            'title' => $node->filter('title')->text(),
            'inner' => $c->html(),
        );
    });

    echo "*******************************************\n";

    foreach($sections as $section ) {
        echo ">>> ".$section['title']."\n";
        echo "!!! ".$section['inner']."\n";
    }
}

Thanks in advance for your help!

2 Answers

Four days later, I found a solution using XPath:

private function loadBookSections(Crawler $crawler)
{
    $sections = $crawler->filter('section')->each(function(Crawler $node) {
        return array(
            'title' => $node->filter('title')->text(),
            // The XPath matches elements that have no direct <section> child;
            // html() returns the inner markup of the first match.
            'inner' => $node->filterXPath("//*[not(section)]")->html(),
        );
    });

    foreach($sections as $section) {
        echo "TITLE: ".$section['title']."\n";
        echo "INNER: ".$section['inner']."\n";
    }
}
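
For comparison, the stated goal (for each section, only the tags that belong directly to it) can also be sketched with PHP's built-in DOM extension instead of DomCrawler. This is a minimal sketch under assumptions, not the asker's method: the function name collectSections() and the namespace prefix "fb" are illustrative choices, and the title is deliberately left out of the inner markup. The FB2 elements sit in a default namespace, so the XPath queries need a registered prefix.

```php
<?php
// Sketch using PHP's DOM extension (DOMDocument / DOMXPath), not DomCrawler.
// collectSections() and the "fb" prefix are hypothetical names for illustration.

const FB2_NS = 'http://www.gribuser.ru/xml/fictionbook/2.0';

function collectSections(string $xml): array
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $xpath = new DOMXPath($doc);
    // Bind a prefix to the document's default namespace so XPath can match its elements.
    $xpath->registerNamespace('fb', FB2_NS);

    $result = [];
    foreach ($xpath->query('//fb:section') as $section) {
        // Text of this section's own <title> child, not of nested sections' titles.
        $title = trim($xpath->evaluate('string(fb:title)', $section));

        // Direct children only, skipping the title and any nested <section>.
        $inner = '';
        $children = $xpath->query('*[not(self::fb:section) and not(self::fb:title)]', $section);
        foreach ($children as $child) {
            $inner .= $doc->saveXML($child);
        }

        $result[] = ['title' => $title, 'inner' => $inner];
    }

    return $result;
}
```

Calling collectSections() with the contents of the FB2 file returns one array entry per section, in document order. A section that only contains nested sections gets an empty 'inner' string.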
answered 2013-11-20T15:12:34.013

If you reduce your XML file to its bare structure, you get something like this:

<section>
    <section>
        <!-- ... -->
    </section>
    <section>
        <!-- ... -->
    </section>
    <section>
        <!-- ... -->
    </section>
</section>

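You want to capture the child section elements, not the parent one.

Currently, you only iterate over the list of parent section elements, which means you only get the HTML of the parent section elements.

To iterate over the children, you need to select section section instead of section.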
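
To see which elements each selector picks up, here is a small sketch with PHP's built-in DOM extension rather than DomCrawler. It assumes that the CSS selector section roughly corresponds to the XPath //section and that section section roughly corresponds to //section//section. The helper matchedTitles(), the prefix "fb", and the shortened sample document are made up for illustration.

```php
<?php
// Sketch with PHP's DOM extension; matchedTitles() and the "fb" prefix are hypothetical names.

function matchedTitles(string $xml, string $query): array
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $xpath = new DOMXPath($doc);
    // FB2 uses a default namespace, so a prefix must be registered for XPath.
    $xpath->registerNamespace('fb', 'http://www.gribuser.ru/xml/fictionbook/2.0');

    $titles = [];
    foreach ($xpath->query($query) as $section) {
        // Title text of the matched section itself.
        $titles[] = trim($xpath->evaluate('string(fb:title)', $section));
    }

    return $titles;
}

$xml = <<<'XML'
<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0">
<body>
<section>
    <title><p>Level 1</p></title>
    <section>
        <title><p>Level 2</p></title>
        <section><title><p>Level 3a</p></title><p>text</p></section>
        <section><title><p>Level 3b</p></title><p>text</p></section>
    </section>
</section>
</body>
</FictionBook>
XML;

// Every section, comparable to the CSS selector "section".
print_r(matchedTitles($xml, '//fb:section'));

// Every section nested inside another section, comparable to "section section".
print_r(matchedTitles($xml, '//fb:section//fb:section'));
```

The second query matches nested sections at any depth and skips only the outermost one, so a level-2 section is still returned together with the level-3 sections it contains.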


A side note to improve the code further: instead of the ugly reduce call, just use ->first() to get the first element of the node list.


Altogether, your code becomes:

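// 'section section' selects only sections nested inside another section.
$sections = $crawler->filter('section section')->each(function(Crawler $node) {
    // first() replaces the reduce() call that kept only index 0.
    $c = $node->filter('section')->first();

    return array(
        'title' => $node->filter('title')->text(),
        'inner' => $c->html(),
    );
});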
answered 2013-11-18T12:57:48.710