sql-server - Converting raw data into relational data

Question

Introduction

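I've been handed a messy Excel dump that was loaded straight into a table. Now I need to turn this mess into something useful. The dump has duplicates and inconsistencies... good times!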

So far I've been trying all sorts of approaches :( - hopefully you can help me.

Given this sample dataset:

ExcelDump
+----+------+------+------+
| ID | Col1 | Col2 | Col3 |
+----+------+------+------+
|  1 |      |      | C    |
|  1 |      | B    | C    |
|  1 | A    | B    | D    |
|  1 | E    | B    | C    |
|  2 | A    | B    | C    |
|  2 | A    | B    | C    |
|  3 | A    | B    | C    |
|  3 | A    | B    | F    |
|  4 | A    | B    | C    |
|  4 | G    | B    | C    |
+----+------+------+------+

One possible result could be:

OutputTable
+----+------+------+------+
| ID | Col1 | Col2 | Col3 |
+----+------+------+------+
|  1 | A    | B    | C    |
|  2 | A    | B    | C    |
|  3 | A    | B    | C    |
|  4 | A    | B    | C    |
+----+------+------+------+

Nice and tidy: a unique ID key, with the data merged in a way that makes sense.

How do I choose the right data?

You may have noticed that another possible result could be:

+----+------+------+------+
| ID | Col1 | Col2 | Col3 |
+----+------+------+------+
|  1 | E    | B    | C    |
|  2 | A    | B    | C    |
|  3 | A    | B    | F    |
|  4 | G    | B    | C    |
+----+------+------+------+

This is where it gets complicated. I'd like to be able to pick the most meaningful set based on conditions that I can adjust.

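For example, I want to set a condition: "Choose the most common (non-null) value; if there is no single most common value, take the first non-null value." This condition should be applied per group of rows sharing an ID. The result of that condition would be: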

+----+------+------+------+
| ID | Col1 | Col2 | Col3 |
+----+------+------+------+
|  1 | A    | B    | C    |
|  2 | A    | B    | C    |
|  3 | A    | B    | C    |
|  4 | A    | B    | C    |
+----+------+------+------+

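If I later find out that this assumption was wrong, the condition should instead be: "Choose the most common (non-null) value; if there is no single most common value, take the last non-null value."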

+----+------+------+------+
| ID | Col1 | Col2 | Col3 |
+----+------+------+------+
|  1 | E    | B    | C    |
|  2 | A    | B    | C    |
|  3 | A    | B    | F    |
|  4 | G    | B    | C    |
+----+------+------+------+

So basically I want to select values based on a set of conditions applied to each ID group.

score 3 · Accepted Answer

As written, you can do this with a simple GROUP BY:

SELECT 
    id, 
    Col1 = MAX(Col1),
    Col2 = MAX(Col2),
    Col3 = MAX(Col3)
FROM
   ExcelDump
GROUP BY
   id

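This pattern gives you the highest non-null value of each column for each id value. MAX ignores NULLs, so a column only comes back as NULL when every row for that id is NULL.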
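
To see what "highest non-null value" means in practice, here is a minimal, self-contained sketch that runs the same pattern against the sample rows from the question. The table variable @ExcelDump is an assumption made for illustration (it stands in for the real ExcelDump table), and the multi-row VALUES syntax assumes SQL Server 2008 or later.

```sql
-- Hypothetical stand-in for the ExcelDump table, filled with the question's sample rows.
declare @ExcelDump table (id int, Col1 char(1), Col2 char(1), Col3 char(1))

insert into @ExcelDump (id, Col1, Col2, Col3) values
    (1, null, null, 'C')
,   (1, null, 'B',  'C')
,   (1, 'A',  'B',  'D')
,   (1, 'E',  'B',  'C')
,   (2, 'A',  'B',  'C')
,   (2, 'A',  'B',  'C')
,   (3, 'A',  'B',  'C')
,   (3, 'A',  'B',  'F')
,   (4, 'A',  'B',  'C')
,   (4, 'G',  'B',  'C')

-- MAX skips NULLs and keeps the highest remaining value per column and id.
select
    id,
    Col1 = max(Col1),
    Col2 = max(Col2),
    Col3 = max(Col3)
from
   @ExcelDump
group by
   id
order by
   id
```

Under a typical collation, where 'A' sorts before 'G', id 1 should come back as E, B, D. That is the highest value per column rather than the most common one, so this pattern removes the duplicates but does not implement the "most common value" rule from the question.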

score 1 · Accepted Answer

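I have revised my solution to take into account the extra information added to the question. The query below gives you the second sort priority you specified (most common value, then the last one found). To get the first one, change "max" to "min" inside each outer apply and change "sortOrder desc" to "sortOrder asc". Keep in mind what happens when several values tie for most frequent: with A, A, B, B, C, where A comes first, the code below returns B, because B shares the highest count and appears after the two A's.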

-- setup test table
create table ExcelDump(
    id int
,   Col1 char(1)
,   Col2 char(1)
,   Col3 char(1)
)

insert into ExcelDump values(1,null,null,'C')
insert into ExcelDump values(1,null,'B','C')
insert into ExcelDump values(1,'A','B','D')
insert into ExcelDump values(1,'E','B','C')
insert into ExcelDump values(2,'A','B','C')
insert into ExcelDump values(2,'A','B','C')
insert into ExcelDump values(3,'A','B','C')
insert into ExcelDump values(3,'A','B','F')
insert into ExcelDump values(4,'A','B','C')
insert into ExcelDump values(4,'G','B','C')

-- create temp tables to make it easier to debug
select distinct
    id
into #distinct
from ExcelDump

-- number order isn't guaranteed but should be sorting them as first come first serve from the original table if no indexes exist
select
    row_number() over(order by (select 1)) as numberOrder
,   ID
,   Col1
,   Col2
,   Col3
into #sorted
from ExcelDump

-- actual query
select
    ui.Id
,   col1.Col1
,   col2.Col2
,   col3.Col3
from #distinct ui
  outer apply (
        select top 1
            ed.Col1
        ,   count(*) as cnt
        ,   max(ed.numberOrder) as sortOrder
        from #sorted ed
        where ed.id = ui.id
        and ed.Col1 is not null -- ignore nulls
        group by ed.Col1
        order by cnt desc, sortOrder desc -- get most common value, then get last one found if there are multiple
    ) col1
  outer apply (
        select top 1
            ed.Col2
        ,   count(*) as cnt
        ,   max(ed.numberOrder) as sortOrder
        from #sorted ed
        where ed.id = ui.id
        and ed.Col2 is not null -- ignore nulls
        group by ed.Col2
        order by cnt desc, sortOrder desc -- get most common value, then get last one found if there are multiple
    ) col2
  outer apply (
        select top 1
            ed.Col3
        ,   count(*) as cnt
        ,   max(ed.numberOrder) as sortOrder
        from #sorted ed
        where ed.id = ui.id
        and ed.Col3 is not null -- ignore nulls
        group by ed.Col3
        order by cnt desc, sortOrder desc -- get most common value, then get last one found if there are multiple
    ) col3
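
For comparison, here is a sketch of the first rule from the question (most common value, then the first one found), shown for Col1 only. It applies the max-to-min and desc-to-asc change described above, and it assumes the #distinct and #sorted temp tables created earlier already exist in the session.

```sql
-- Variant of the Col1 lookup: most common value first, earliest occurrence breaks ties.
-- Assumes #distinct and #sorted were created by the setup statements above.
select
    ui.id
,   col1.Col1
from #distinct ui
  outer apply (
        select top 1
            ed.Col1
        ,   count(*) as cnt
        ,   min(ed.numberOrder) as sortOrder -- earliest row that carries this value
        from #sorted ed
        where ed.id = ui.id
        and ed.Col1 is not null -- ignore nulls
        group by ed.Col1
        order by cnt desc, sortOrder asc -- get most common value, then get first one found if there are multiple
    ) col1
```

If numberOrder happens to follow the insert order of the sample rows (which, as noted above, is not guaranteed), this should return A in Col1 for every id. The same change would be repeated in the Col2 and Col3 blocks.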

score 0 · Accepted Answer

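You could also use a cursor to iterate over the temporary ExcelDump table and filter each row. You can store the filtered results in another temporary table, which can have its own constraints, such as unique or not null where needed. By using a cursor you can write specialized code to handle each validation you need.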
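
A rough sketch of what such a cursor could look like is shown below. The #clean table, the cursor name, and the variables are all assumptions made for illustration, and the rule it applies is deliberately simple: the first non-null value seen for each column wins. It does not implement the "most common value" rule from the question.

```sql
-- Hypothetical target table with its own constraints (name and columns are assumptions).
create table #clean(
    id int not null primary key
,   Col1 char(1) null
,   Col2 char(1) null
,   Col3 char(1) null
)

declare @id int, @c1 char(1), @c2 char(1), @c3 char(1)

-- Without an ordering column, the order in which rows are fetched is not guaranteed.
declare dump_cursor cursor local fast_forward for
    select id, Col1, Col2, Col3 from ExcelDump

open dump_cursor
fetch next from dump_cursor into @id, @c1, @c2, @c3

while @@fetch_status = 0
begin
    if not exists (select 1 from #clean where id = @id)
        -- first row seen for this id: store it as-is
        insert into #clean (id, Col1, Col2, Col3) values (@id, @c1, @c2, @c3)
    else
        -- later rows: only fill in columns that are still null
        update #clean
        set Col1 = coalesce(Col1, @c1)
        ,   Col2 = coalesce(Col2, @c2)
        ,   Col3 = coalesce(Col3, @c3)
        where id = @id

    fetch next from dump_cursor into @id, @c1, @c2, @c3
end

close dump_cursor
deallocate dump_cursor

select id, Col1, Col2, Col3 from #clean order by id
```

The per-row branch is where custom validation would go. On a large dump, a row-by-row cursor is likely to be slower than the set-based queries in the other answers, so it is probably best kept for rules that are hard to express in a single query.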
