sql - Get distinct information across many fields, some of which are NULL

Question

I have a table with more than 65 million rows and 140 columns. The data comes from several sources and is submitted at least once a month.

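I am looking for a fast way to get specific fields from this data, but only where they are unique. I want to process all the information to link which invoice was sent with which identification numbers, and by whom, without iterating over 65 million records. If I can get the distinct values, I only need to process about 5 million records instead of 65 million. See below for a description of the data and an SQL Fiddle with a sample.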

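Say a customer submits, every month, an invoice_number linked to passport_number_1, national_identity_number_1 and driving_license_1. I want only one row where this combination appears, i.e. the 4 fields together must be unique.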

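If they submit the above for 30 months, and then in the 31st month they send the invoice_number linked to passport_number_1, national_identity_number_2 and driving_license_1, I want to select that row as well, because the national_identity field is new and so the row as a whole is unique.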

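By "linked to" I mean the values appear in the same row.
Any of the fields may be NULL at some point.
The 'pivot/composite' columns are invoice_number and submitted_by. If either one is missing, drop the row.
I also need the database_id included with the data above, i.e. the primary id generated automatically by the PostgreSQL database.
The only fields that do not need to be returned are other_column and yet_another_column. Keep in mind that the table has 140 columns, so those are not needed.
Using the result, create a new table to hold these unique records.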

See this SQL Fiddle for an attempt to recreate the scenario.

From that fiddle, I expect a result like this:

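Rows 1, 2 and 11: keep only one of them, since they are identical. Preferably the one with the smallest id.
Rows 4 and 9: one of them is dropped, since they are identical.
Rows 5, 7 and 8: dropped, because they are missing invoice_number or submitted_by.
The result will contain rows (1, 2 or 11), 3, (4 or 9), 6 and 10.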

score 2 · Accepted Answer

To get one representative row (with its additional fields) from each group that shares the same four fields:

SELECT 
distinct on (
  invoice_number
  , passport_number
  , national_id_number
  , driving_license_number
)
  * -- specify the columns you want here
FROM my_table
where invoice_number is not null
and submitted_by is not null
;

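Note that unless you specify an ordering, you cannot predict exactly which row of each group is returned (see the PostgreSQL documentation on DISTINCT).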
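One way to make that choice deterministic, as a minimal sketch: PostgreSQL requires the ORDER BY of a DISTINCT ON query to start with the DISTINCT ON expressions, so a further sort key placed after them decides which row of each group is kept. The query below assumes the key column is called id (as in the later queries) and reuses the column names from the query above; the explicit column list is only illustrative.

```sql
-- Keep the row with the smallest id in each group of identical
-- (invoice, passport, national id, driving licence) values.
SELECT DISTINCT ON (invoice_number, passport_number,
                    national_id_number, driving_license_number)
       id, invoice_number, submitted_by,
       passport_number, national_id_number, driving_license_number
FROM my_table
WHERE invoice_number IS NOT NULL
  AND submitted_by IS NOT NULL
-- The leading ORDER BY keys must match the DISTINCT ON list;
-- the trailing id picks the lowest id within each group.
ORDER BY invoice_number, passport_number,
         national_id_number, driving_license_number, id;
```

The rows come back sorted by the four grouping columns, not by id, so a final sort by id still needs an outer query.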

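Edit:

Sorting this result by id by simply appending order by id will not work, because the ORDER BY of a DISTINCT ON query has to start with the DISTINCT ON expressions. It can be done either by using a CTE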

with distinct_rows as (
    SELECT 
    distinct on (
      invoice_number
      , passport_number
      , national_id_number
      , driving_license_number
      -- ...
    )
      * -- specify the columns you want here
    FROM my_table
    where invoice_number is not null
    and submitted_by is not null
)
select *
from distinct_rows
order by id;

or by making the original query a subquery

select *
from (
    SELECT 
    distinct on (
      invoice_number
      , passport_number
      , national_id_number
      , driving_license_number
      -- ...
    )
      * -- specify the columns you want here
    FROM my_table
    where invoice_number is not null
    and submitted_by is not null
) t
order by id;
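The question also asks for the result to be stored in a new table. A minimal sketch using CREATE TABLE ... AS follows; the table name unique_invoice_links is hypothetical, and the column list assumes the names used in the queries above, leaving out other_column and yet_another_column.

```sql
-- unique_invoice_links is a made-up name for illustration.
CREATE TABLE unique_invoice_links AS
SELECT DISTINCT ON (invoice_number, passport_number,
                    national_id_number, driving_license_number)
       id, invoice_number, submitted_by,
       passport_number, national_id_number, driving_license_number
FROM my_table
WHERE invoice_number IS NOT NULL
  AND submitted_by IS NOT NULL
-- Trailing id keeps the smallest id of each duplicate group.
ORDER BY invoice_number, passport_number,
         national_id_number, driving_license_number, id;
```

The new table has no indexes or constraints of its own; if id should act as a key there, it would have to be added afterwards.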

score 0 · Accepted Answer

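"a fast way to get specific fields from this data, but only where they are unique"

I don't think so. I think you mean that you want to select a distinct set of rows from a table in which they are not unique.

As far as I can tell from your description, you just want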

SELECT distinct invoice_number, passport_number, 
                driving_license_number, national_id_number
FROM my_table
where invoice_number is not null
and submitted_by is not null;

On your SQLFiddle example, this produces 5 rows.
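Because any of these fields can be NULL, it helps to know that DISTINCT treats NULL values as equal to each other when it removes duplicates. The self-contained sketch below uses made-up values (not the fiddle data) to show two rows that differ only by a repeated NULL collapsing into one.

```sql
-- Three input rows; the first two are identical including the NULL.
SELECT DISTINCT invoice_number, passport_number
FROM (VALUES
        (1, CAST(NULL AS integer)),
        (1, CAST(NULL AS integer)),
        (1, 2)
     ) AS t(invoice_number, passport_number);
-- Two rows remain: (1, NULL) and (1, 2).
```

The same rule applies to DISTINCT ON and GROUP BY, so rows whose identifier columns are NULL in the same places are grouped together rather than each being kept as unique.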
