xquery - XQuery tumbling window: how to match the first occurrence of a grouping key?

Question

I have a CSV file from a SQL dump that I am working with in BaseX 8.4. The CSV header contains a flattened representation of the SQL table structure.

The CSV with its header and first row:

country_id,country_code,country_name,publisher_id,publisher_name,country_id,year_began,year_ended,series_id,series_name,sort_name,publisher_id
2,us,United States,78,Harvard University Press,2,1950,NULL,15,A New Series,New Series,78

The BaseX CSV parser produces the following XML representation:

<csv>
  <record>
    <country_id>2</country_id>
    <country_code>us</country_code>
    <country_name>United States</country_name>
    <publisher_id>78</publisher_id>
    <publisher_name>Harvard University Press</publisher_name>
    <country_id>2</country_id>
    <year_began>1950</year_began>
    <year_ended>NULL</year_ended>
    <series_id>15</series_id>
    <series_name>A New Series</series_name>
    <sort_name>New Series</sort_name>
    <publisher_id>78</publisher_id>
  </record>
</csv>

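What I know about the original data is that each table begins with its unique ID, but those ID names also recur in other tables as foreign keys.

I want to create windows/groups that reconstruct the original table structure by matching the first occurrence of a table's unique ID while ignoring every subsequent occurrence. What I have so far does not work, because it matches every occurrence of an ID rather than only the first: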

<tables>{
    for tumbling window $w in /csv/record/*
    start $s at $p when name($s) = ("country_id", 
                                    "publisher_id", 
                                    "series_id", 
                                    "issue_id", 
                                    "id_activity_fact", 
                                    "id_person_dim", 
                                    "id_location_dim", 
                                    "id_phys_loc_dim", 
                                    "id_letter_dim")
    return <table id_name="{name($s)}">{$w}</table>
}</tables>

Output:

<tables>
  <table id_name="country_id">
    <country_id>2</country_id>
    <country_code>us</country_code>
    <country_name>United States</country_name>
  </table>
  <table id_name="publisher_id">
    <publisher_id>78</publisher_id>
    <publisher_name>Harvard University Press</publisher_name>
  </table>
  <table id_name="country_id">
    <country_id>2</country_id>
    <year_began>1950</year_began>
    <year_ended>NULL</year_ended>
  </table>
  <table id_name="series_id">
    <series_id>15</series_id>
    <series_name>A New Series</series_name>
    <sort_name>New Series</sort_name>
  </table>
  <table id_name="publisher_id">
    <publisher_id>78</publisher_id>
  </table>
</tables>

Desired output:

<tables>
  <table id_name="country_id">
    <country_id>2</country_id>
    <country_code>us</country_code>
    <country_name>United States</country_name>
  </table>
  <table id_name="publisher_id">
    <publisher_id>78</publisher_id>
    <publisher_name>Harvard University Press</publisher_name>      
    <country_id>2</country_id>
    <year_began>1950</year_began>
    <year_ended>NULL</year_ended>
  </table>
  <table id_name="series_id">
    <series_id>15</series_id>
    <series_name>A New Series</series_name>
    <sort_name>New Series</sort_name>      
    <publisher_id>78</publisher_id>
  </table>
</tables>

score 1 · Accepted Answer

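I think you may need to use a windowing solution for the initial split, and then use "group by" on the result to merge the segments that have the same key.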
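
A minimal sketch of what that two-step approach could look like, assuming the sample record from the question is inlined as `$csv` and only the three ID names that occur in it are listed in `$ids` (both names, and the intermediate `segment` element, are illustrative choices, not part of the original query):

```xquery
xquery version "3.0";

(: Assumption: only the ID names that occur in the sample record. :)
declare variable $ids := ("country_id", "publisher_id", "series_id");

(: Inline copy of the sample record so the sketch is self-contained. :)
declare variable $csv :=
  <csv>
    <record>
      <country_id>2</country_id>
      <country_code>us</country_code>
      <country_name>United States</country_name>
      <publisher_id>78</publisher_id>
      <publisher_name>Harvard University Press</publisher_name>
      <country_id>2</country_id>
      <year_began>1950</year_began>
      <year_ended>NULL</year_ended>
      <series_id>15</series_id>
      <series_name>A New Series</series_name>
      <sort_name>New Series</sort_name>
      <publisher_id>78</publisher_id>
    </record>
  </csv>;

<tables>{
  (: Step 1: initial split, one segment per occurrence of an ID element. :)
  let $segments :=
    for tumbling window $w in $csv/record/*
    start $s when name($s) = $ids
    return <segment key="{name($s)}">{$w}</segment>
  (: Step 2: merge all segments that share the same key. :)
  for $seg in $segments
  group by $key := string($seg/@key)
  return <table id_name="{$key}">{$seg ! *}</table>
}</tables>
```

Note that merging by key puts both `country_id` segments into the same table, so on the sample record `year_began` and `year_ended` would end up under `country_id` rather than under `publisher_id` as in the desired output. The order in which `group by` returns the groups is also implementation-dependent.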

score 0 · Accepted Answer

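After struggling with this for a while, I gave up and decided to simply mark the subsequent occurrences of the ID names with an underscore, like this: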

<csv>
  <record>
    <country_id>2</country_id>
    <country_code>us</country_code>
    <country_name>United States</country_name>
    <publisher_id>78</publisher_id>
    <publisher_name>Harvard University Press</publisher_name>
    <_country_id>2</_country_id>
    <year_began>1950</year_began>
    <year_ended>NULL</year_ended>
    <series_id>15</series_id>
    <series_name>A New Series</series_name>
    <sort_name>New Series</sort_name>
    <_publisher_id>78</_publisher_id>
  </record>
</csv>
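
How the underscores were added is not stated. One way to do the marking in XQuery itself might look like the following sketch, which assumes the sample record is inlined as `$csv` and that `$ids` lists only the ID names occurring in it (both are illustrative names):

```xquery
xquery version "3.0";

(: Assumption: only the ID names that occur in the sample record. :)
declare variable $ids := ("country_id", "publisher_id", "series_id");

(: Inline copy of the sample record so the sketch is self-contained. :)
declare variable $csv :=
  <csv>
    <record>
      <country_id>2</country_id>
      <country_code>us</country_code>
      <country_name>United States</country_name>
      <publisher_id>78</publisher_id>
      <publisher_name>Harvard University Press</publisher_name>
      <country_id>2</country_id>
      <year_began>1950</year_began>
      <year_ended>NULL</year_ended>
      <series_id>15</series_id>
      <series_name>A New Series</series_name>
      <sort_name>New Series</sort_name>
      <publisher_id>78</publisher_id>
    </record>
  </csv>;

<csv>{
  for $record in $csv/record
  return
    <record>{
      for $e in $record/*
      let $n := name($e)
      return
        (: An ID element that already has an earlier sibling of the same
           name is a repeat: rebuild it with a leading underscore. :)
        if ($n = $ids and exists($e/preceding-sibling::*[name() = $n]))
        then element { "_" || $n } { $e/node() }
        else $e
    }</record>
}</csv>
```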

That way the window expression works as expected; afterwards I just strip the underscore to restore the element names to their original form:

<tables>{
        for tumbling window $w in /csv/record/*
        start $s when $s/name() = ("country_id", 
                                   "publisher_id", 
                                   "series_id", 
                                   "issue_id", 
                                   "id_activity_fact", 
                                   "id_person_dim", 
                                   "id_location_dim", 
                                   "id_phys_loc_dim", 
                                   "id_letter_dim")
        return 
            <table id_name="{$s/name()}">{
                for $e in $w
                return 
                   if (starts-with($e/name(), "_")) then
                       element {$e/substring-after(name(), "_")} { $e/string() }
                   else $e
            }</table>
}</tables>

Final result:

<tables>
  <table id_name="country_id">
    <country_id>2</country_id>
    <country_code>us</country_code>
    <country_name>United States</country_name>
  </table>
  <table id_name="publisher_id">
    <publisher_id>78</publisher_id>
    <publisher_name>Harvard University Press</publisher_name>
    <country_id>2</country_id>
    <year_began>1950</year_began>
    <year_ended>NULL</year_ended>
  </table>
  <table id_name="series_id">
    <series_id>15</series_id>
    <series_name>A New Series</series_name>
    <sort_name>New Series</sort_name>
    <publisher_id>78</publisher_id>
  </table>
</tables>
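
The renaming round trip can possibly be avoided by testing for the first occurrence directly in the start condition. The following is a sketch under the assumption that "first occurrence" means first within the same `record`, so an ID element only starts a window when it has no earlier sibling of the same name. `$csv` and `$ids` are illustrative names, and only the ID names from the sample record are listed:

```xquery
xquery version "3.0";

(: Assumption: only the ID names that occur in the sample record. :)
declare variable $ids := ("country_id", "publisher_id", "series_id");

(: Inline copy of the sample record so the sketch is self-contained. :)
declare variable $csv :=
  <csv>
    <record>
      <country_id>2</country_id>
      <country_code>us</country_code>
      <country_name>United States</country_name>
      <publisher_id>78</publisher_id>
      <publisher_name>Harvard University Press</publisher_name>
      <country_id>2</country_id>
      <year_began>1950</year_began>
      <year_ended>NULL</year_ended>
      <series_id>15</series_id>
      <series_name>A New Series</series_name>
      <sort_name>New Series</sort_name>
      <publisher_id>78</publisher_id>
    </record>
  </csv>;

<tables>{
  for tumbling window $w in $csv/record/*
  (: Start a window only at the first sibling carrying this ID name;
     repeats of the same name stay inside the current window. :)
  start $s when name($s) = $ids
            and empty($s/preceding-sibling::*[name() = name($s)])
  return <table id_name="{name($s)}">{$w}</table>
}</tables>
```

On the sample record the windows start at the first `country_id`, the first `publisher_id`, and `series_id`, which gives the same three tables as the final result above. Because the sibling check is per record, each new record starts a fresh window at its leading ID element.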
