I have a CSV file from a SQL dump that I am working with in BaseX 8.4. The CSV header contains a flattened representation of the SQL table structure.

The CSV header and first row:

country_id,country_code,country_name,publisher_id,publisher_name,country id,year_began,year_ended,series_id,series_name,sort_name,publisher_id
2,us,United States,78,Harvard University Press,2,1950,NULL,15,A New Series,New Series,78

The BaseX CSV parser produces the following XML representation:

<csv>
  <record>
    <country_id>2</country_id>
    <country_code>us</country_code>
    <country_name>United States</country_name>
    <publisher_id>78</publisher_id>
    <publisher_name>Harvard University Press</publisher_name>
    <country_id>2</country_id>
    <year_began>1950</year_began>
    <year_ended>NULL</year_ended>
    <series_id>15</series_id>
    <series_name>A New Series</series_name>
    <sort_name>New Series</sort_name>
    <publisher_id>78</publisher_id>
  </record>
</csv>

What I know about the original data is that each table begins with its unique ID, but those ID names also recur as foreign keys in other tables.

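I would like to create windows/groups that reconstruct the original table structure by matching only the first occurrence of a table's unique ID and ignoring every subsequent occurrence. What I have so far does not work, because it matches every occurrence of an ID, not just the first: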

<tables>{
    for tumbling window $w in /csv/record/*
    start $s at $p when name($s) = ("country_id", 
                                    "publisher_id", 
                                    "series_id", 
                                    "issue_id", 
                                    "id_activity_fact", 
                                    "id_person_dim", 
                                    "id_location_dim", 
                                    "id_phys_loc_dim", 
                                    "id_letter_dim")
    return <table id_name="{name($s)}">{$w}</table>
}</tables>

Output:

<tables>
  <table id_name="country_id">
    <country_id>2</country_id>
    <country_code>us</country_code>
    <country_name>United States</country_name>
  </table>
  <table id_name="publisher_id">
    <publisher_id>78</publisher_id>
    <publisher_name>Harvard University Press</publisher_name>
  </table>
  <table id_name="country_id">
    <country_id>2</country_id>
    <year_began>1950</year_began>
    <year_ended>NULL</year_ended>
  </table>
  <table id_name="series_id">
    <series_id>15</series_id>
    <series_name>A New Series</series_name>
    <sort_name>New Series</sort_name>
  </table>
  <table id_name="publisher_id">
    <publisher_id>78</publisher_id>
  </table>
</tables>

Desired output:

<tables>
  <table id_name="country_id">
    <country_id>2</country_id>
    <country_code>us</country_code>
    <country_name>United States</country_name>
  </table>
  <table id_name="publisher_id">
    <publisher_id>78</publisher_id>
    <publisher_name>Harvard University Press</publisher_name>      
    <country_id>2</country_id>
    <year_began>1950</year_began>
    <year_ended>NULL</year_ended>
  </table>
  <table id_name="series_id">
    <series_id>15</series_id>
    <series_name>A New Series</series_name>
    <sort_name>New Series</sort_name>      
    <publisher_id>78</publisher_id>
  </table>
</tables>

2 Answers

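I think you may need to use the windowing solution for the initial split, and then use "group by" on the result to merge segments that have the same key.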
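
A minimal sketch of that idea, assuming XQuery 3.0 support for window clauses, "group by", and the simple map operator ("!"). The <segment> wrapper element, its key attribute, and the shortened $ids list are introduced here for illustration only:

```xquery
(: shortened ID list, for illustration only :)
let $ids := ("country_id", "publisher_id", "series_id")
return <tables>{
  for $record in /csv/record
  (: step 1: initial split, one segment per ID occurrence :)
  let $segments :=
    for tumbling window $w in $record/*
    start $s when name($s) = $ids
    return <segment key="{name($s)}">{$w}</segment>
  return
    (: step 2: merge the segments of this record that share a key :)
    for $seg in $segments
    let $key := string($seg/@key)
    group by $key
    return <table id_name="{$key}">{ $seg ! * }</table>
}</tables>
```

Note that merging by key places a repeated ID, together with the columns that follow it, in the table named after that ID. For the sample record, the second country_id, year_began and year_ended would therefore land in the country_id table rather than in the publisher_id table shown in the desired output. The order in which "group by" emits the groups is implementation-dependent.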

answered 2016-03-02T16:00:37.260

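After struggling with this for a while, I gave up and decided to simply mark subsequent occurrences of the ID names with an underscore, like so: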

<csv>
  <record>
    <country_id>2</country_id>
    <country_code>us</country_code>
    <country_name>United States</country_name>
    <publisher_id>78</publisher_id>
    <publisher_name>Harvard University Press</publisher_name>
    <_country_id>2</_country_id>
    <year_began>1950</year_began>
    <year_ended>NULL</year_ended>
    <series_id>15</series_id>
    <series_name>A New Series</series_name>
    <sort_name>New Series</sort_name>
    <_publisher_id>78</_publisher_id>
  </record>
</csv>
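
How the renaming was done is not stated here; it may simply have been an edit to the CSV header. As one possible way to produce the same document in XQuery, the following sketch prefixes every element whose name was already used by an earlier sibling in the same record. It assumes that only ID columns repeat within a record:

```xquery
<csv>{
  for $record in /csv/record
  return <record>{
    for $e in $record/*
    return
      (: a name already used by an earlier sibling gets an underscore prefix :)
      if ($e/preceding-sibling::*[name() = name($e)])
      then element { "_" || name($e) } { $e/string() }
      else $e
  }</record>
}</csv>
```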

This way, the window expression works as expected; afterwards I simply strip the underscore to restore the element names to their original form:

<tables>{
        for tumbling window $w in /csv/record/*
        start $s when $s/name() = ("country_id", 
                                   "publisher_id", 
                                   "series_id", 
                                   "issue_id", 
                                   "id_activity_fact", 
                                   "id_person_dim", 
                                   "id_location_dim", 
                                   "id_phys_loc_dim", 
                                   "id_letter_dim")
        return 
            <table id_name="{$s/name()}">{
                for $e in $w
                return 
                   if (starts-with($e/name(), "_")) then
                       element {$e/substring-after(name(), "_")} { $e/string() }
                   else $e
            }</table>
}</tables>

Final result:

<tables>
  <table id_name="country_id">
    <country_id>2</country_id>
    <country_code>us</country_code>
    <country_name>United States</country_name>
  </table>
  <table id_name="publisher_id">
    <publisher_id>78</publisher_id>
    <publisher_name>Harvard University Press</publisher_name>
    <country_id>2</country_id>
    <year_began>1950</year_began>
    <year_ended>NULL</year_ended>
  </table>
  <table id_name="series_id">
    <series_id>15</series_id>
    <series_name>A New Series</series_name>
    <sort_name>New Series</sort_name>
    <publisher_id>78</publisher_id>
  </table>
</tables>
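
As a possible alternative that avoids the renaming step, the start condition itself can be restricted to the first occurrence of each ID name within a record by checking for a preceding sibling of the same name. This is a sketch, not something tested against the full data set; the $ids variable and the per-record outer loop are additions made for illustration:

```xquery
let $ids := ("country_id", "publisher_id", "series_id", "issue_id",
             "id_activity_fact", "id_person_dim", "id_location_dim",
             "id_phys_loc_dim", "id_letter_dim")
return <tables>{
  for $record in /csv/record
  for tumbling window $w in $record/*
  (: open a window only at the first element of this name in the record :)
  start $s when name($s) = $ids
            and empty($s/preceding-sibling::*[name() = name($s)])
  return <table id_name="{name($s)}">{$w}</table>
}</tables>
```

Because a tumbling window without an end condition runs until the next item that satisfies the start condition, a repeated ID no longer opens a new window and stays inside the table that precedes it, which should give the same grouping as the result above for the sample record.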
answered 2016-03-03T00:46:32.040