
I have an XML file with consecutive tags instead of nested tags, like the following:

<title>
    <subtitle>
        <topic att="TopicTitle">Topic title 1</topic>
        <content att="TopicSubtitle">topic subtitle 1</content>
        <content att="Paragraph">paragraph text 1</content>
        <content att="Paragraph">paragraph text 2</content>
        <content att="TopicSubtitle">topic subtitle 2</content>
        <content att="Paragraph">paragraph text 1</content>
        <content att="Paragraph">paragraph text 2</content>

        <topic att="TopicTitle">Topic title 2</topic>
        <content att="TopicSubtitle">topic subtitle 1</content>
        <content att="Paragraph">paragraph text 1</content>
        <content att="Paragraph">paragraph text 2</content>
        <content att="TopicSubtitle">topic subtitle 2</content>
        <content att="Paragraph">paragraph text 1</content>
        <content att="Paragraph">paragraph text 2</content>
    </subtitle>
</title>

I'm using XQuery in BaseX and I want to convert it to a table with the following columns:

Title      Subtitle      TopicTitle      TopicSubtitle      Paragraph
Irrelevant Irrelevant    Topic title 1   Topic Subtitle 1   paragraph text 1
Irrelevant Irrelevant    Topic title 1   Topic Subtitle 1   paragraph text 2
Irrelevant Irrelevant    Topic title 1   Topic Subtitle 2   paragraph text 1
Irrelevant Irrelevant    Topic title 1   Topic Subtitle 2   paragraph text 2
Irrelevant Irrelevant    Topic title 2   Topic Subtitle 1   paragraph text 1
Irrelevant Irrelevant    Topic title 2   Topic Subtitle 1   paragraph text 2
Irrelevant Irrelevant    Topic title 2   Topic Subtitle 2   paragraph text 1
Irrelevant Irrelevant    Topic title 2   Topic Subtitle 2   paragraph text 2

I'm new to XQuery and XPath, but I already understand the basics of navigating through nodes and selecting the ones I need. What I don't know yet is how to take flat, consecutive data like this and convert it to nested XML or to a table (CSV?). Can anyone help?


2 Answers


You can transform the flat XML into a nested one using a tumbling window (https://www.w3.org/TR/xquery-30/#id-windows). For example,

for tumbling window $w in title/subtitle/*
    start $t when $t instance of element(topic)
return
    <topic
        title="{$t/@att}">
        {
            for tumbling window $content in tail($w)
                start $c when $c/@att = 'TopicSubtitle'
            return
                <subtopic
                    title="{$c/@att}">
                    {
                        tail($content) ! <para>{node()}</para>
                    }
                </subtopic>
        }
    </topic>

gives

<topic title="TopicTitle">
    <subtopic title="TopicSubtitle">
        <para>paragraph text 1</para>
        <para>paragraph text 2</para>
    </subtopic>
    <subtopic title="TopicSubtitle">
        <para>paragraph text 1</para>
        <para>paragraph text 2</para>
    </subtopic>
</topic><topic title="TopicTitle">
    <subtopic title="TopicSubtitle">
        <para>paragraph text 1</para>
        <para>paragraph text 2</para>
    </subtopic>
    <subtopic title="TopicSubtitle">
        <para>paragraph text 1</para>
        <para>paragraph text 2</para>
    </subtopic>
</topic>

Based on that, I think you can then transform the whole thing into semicolon-separated data with

string-join(
<title>
    <subtitle>
        {
            for tumbling window $w in title/subtitle/*
                start $t when $t instance of element(topic)
            return
                <topic
                    title="{$t/@att}"
                    value="{$t}">
                    {
                        for tumbling window $content in tail($w)
                            start $c when $c/@att = 'TopicSubtitle'
                        return
                            <subtopic
                                title="{$c/@att}"
                                value="{$c}">
                                {
                                    tail($content) ! <para>{node()}</para>
                                }
                            </subtopic>
                    }
                </topic>
        }
    </subtitle>
</title>//para ! string-join(ancestor-or-self::* ! (text(), @value, 'Irrelevant')[1], ';'), '&#10;')
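
The final path expression is dense, so here is a minimal sketch of that last step on its own. It assumes a hand-written nested fragment bound to a hypothetical variable $nested (standing in for the constructed title element above) and the default boundary-space setting of strip, so the wrapper elements carry no whitespace-only text nodes.

```xquery
(: Hypothetical stand-in for the nested structure built by the window query. :)
let $nested :=
  <title>
    <subtitle>
      <topic title="TopicTitle" value="Topic title 1">
        <subtopic title="TopicSubtitle" value="topic subtitle 1">
          <para>paragraph text 1</para>
          <para>paragraph text 2</para>
        </subtopic>
      </topic>
    </subtitle>
  </title>
(: One output string per para element. :)
for $para in $nested//para
return string-join(
  (: Walk from the outermost ancestor down to the para itself. :)
  for $e in $para/ancestor-or-self::*
  (: Take the first available item: the element's own text, else its
     value attribute, else the literal placeholder. :)
  return ($e/text(), $e/@value, 'Irrelevant')[1],
  ';'
)
```

Each para yields one line with five fields: title and subtitle fall through to the placeholder, topic and subtopic contribute their value attribute, and para contributes its own text.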
Answered 2017-10-11T18:54:27.720


Although positional grouping is the most general approach to this kind of problem (that is, tumbling windows in XQuery 3.0+, or xsl:for-each-group with group-starting-with in XSLT 2.0+, as described by Martin Honnen), I don't think it's strictly necessary here, because you're not actually trying to make use of the hierarchical structure implicit in the data.

Specifically, you are converting one flat structure with implicit hierarchy into another flat structure with implicit hierarchy, and you can do that with something along the following lines:

<table>{
    for $para in title/subtitle/content[@att='Paragraph']
    return <row>
      <cell>irrelevant</cell>
      <cell>irrelevant</cell>
      <cell>{$para/preceding-sibling::topic[1]/string()}</cell>
      <cell>{$para/preceding-sibling::content[@att='TopicSubtitle'][1]/string()}</cell>
      <cell>{$para/string()}</cell>
    </row>
}</table>
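
If plain delimited text is wanted rather than a table element, the same sibling lookups can feed string-join. This is a minimal sketch, assuming a shortened inline copy of the sample data bound to a hypothetical $doc variable and a semicolon as the separator; neither comes from the original query.

```xquery
(: Hypothetical inline copy of the sample data; $doc is the title element. :)
let $doc :=
  <title>
    <subtitle>
      <topic att="TopicTitle">Topic title 1</topic>
      <content att="TopicSubtitle">topic subtitle 1</content>
      <content att="Paragraph">paragraph text 1</content>
      <content att="Paragraph">paragraph text 2</content>
      <content att="TopicSubtitle">topic subtitle 2</content>
      <content att="Paragraph">paragraph text 1</content>
    </subtitle>
  </title>
return string-join(
  (: One line per paragraph. :)
  for $para in $doc/subtitle/content[@att = 'Paragraph']
  return string-join(
    (
      'Irrelevant',
      'Irrelevant',
      (: Nearest preceding topic; string() returns '' if there is none,
         which keeps the column count stable. :)
      string($para/preceding-sibling::topic[1]),
      (: Nearest preceding subtitle among the content siblings. :)
      string($para/preceding-sibling::content[@att = 'TopicSubtitle'][1]),
      string($para)
    ),
    ';'
  ),
  '&#10;'
)
```

Wrapping each lookup in string() rather than appending /string() is a deliberate choice here: an empty lookup then becomes an empty field instead of disappearing from the sequence and shifting the later columns left.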
Answered 2017-10-12T08:30:41.550