4

Suppose I have a customers table:

CREATE TABLE customers (
    customer_number  INTEGER,
    customer_name    VARCHAR(...),
    customer_address VARCHAR(...)
)

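This table has no keys. However, customer_name and customer_address should be unique for any given customer_number.

It is not uncommon for this table to contain many duplicate customers. To get around this duplication, the following query is used to isolate only the unique customers: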

SELECT
  DISTINCT customer_number, customer_name, customer_address
FROM customers

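Fortunately, the table has traditionally contained accurate data. That is, there has never been a conflicting customer_name or customer_address for any customer_number. However, suppose conflicting data did make it into the table. I wish to write a query that will fail, rather than returning multiple rows for the customer_number in question.

For example, I tried this query with no success: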

SELECT
  customer_number, DISTINCT(customer_name, customer_address)
FROM customers
GROUP BY customer_number

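Is there a way to write such a query using standard SQL? If not, is there a solution in Oracle-specific SQL?

EDIT: The rationale behind the bizarre query:

Truth be told, this customers table does not actually exist (thank goodness). I created it hoping it would be clear enough to demonstrate the needs of the query. However, based on that example, people have (fortunately) realized that the need for such a query is the least of my worries. Therefore, I must now peel away some of the abstraction and hopefully restore my reputation for having suggested such an abomination of a table...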

I receive a flat file containing invoices (one per line) from an external system. I read the file line by line, inserting its fields into this table:

CREATE TABLE unprocessed_invoices (
    invoice_number   INTEGER,
    invoice_date     DATE,
    ...
    -- other invoice columns
    ...
    customer_number  INTEGER,
    customer_name    VARCHAR(...),
    customer_address VARCHAR(...)
)

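As you can see, the data arriving from the external system is denormalized. That is, the external system includes both the invoice data and its associated customer data on the same line. Multiple invoices may share the same customer, so duplicate customer data is possible.

The system cannot begin processing the invoices until all customers are guaranteed to be registered with the system. Therefore, the system must identify the unique customers and register them as necessary. This is why I wanted the query: because I am working with denormalized data that I have no control over.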

SELECT
  customer_number, DISTINCT(customer_name, customer_address)
FROM unprocessed_invoices
GROUP BY customer_number

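Hopefully this helps clarify the original intent of the question.

EDIT: Examples of good/bad data

To clarify: customer_name and customer_address only have to be unique for a particular customer_number.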

 customer_number | customer_name | customer_address
----------------------------------------------------
 1               | 'Bob'         | '123 Street'
 1               | 'Bob'         | '123 Street'
 2               | 'Bob'         | '123 Street'
 2               | 'Bob'         | '123 Street'
 3               | 'Fred'        | '456 Avenue'
 3               | 'Fred'        | '789 Crescent'

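The first two rows are fine because customer_name and customer_address are the same for customer_number 1.

The middle two rows are fine because customer_name and customer_address are the same for customer_number 2 (even though another customer_number has the same customer_name and customer_address).

The last two rows are not okay because there are two different customer_address values for customer_number 3.

The query I am looking for would fail if run against all six of these rows. However, if only the first four rows actually existed, the query should return: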

 customer_number | customer_name | customer_address
----------------------------------------------------
 1               | 'Bob'         | '123 Street'
 2               | 'Bob'         | '123 Street'

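I hope this helps clarify what I meant by "conflicting customer_name and customer_address". They have to be unique per customer_number.

I appreciate those who are explaining how to properly import data from external systems. In fact, I am already doing most of that. I purposely hid the details of what I am doing so that it would be easier to focus on the question at hand. This query is not meant to be the only form of validation. I just thought it would make a nice finishing touch (a last line of defense, so to speak). This question was simply meant to investigate what is possible with SQL. :)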

8 Answers

3

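Your approach is flawed. You do not want data that was successfully stored to throw an error on a select. That is a land mine waiting to go off, and it means you never know when a select could fail.

What I recommend is that you add a unique key to the table, and slowly start modifying your application to use this key rather than relying on any combination of meaningful data.

You can then stop caring about duplicate data, which is not really duplicate in the first place. It is entirely possible for two people with the same name to share the same address.

You will also gain performance improvements from this approach.

As an aside, I highly recommend you normalize your data, that is, break up the name into FirstName and LastName (and optionally MiddleName), and break up the address field into separate fields for each component (Address1, Address2, City, State, Country, Zip, or whatever).

Update: If I understand your situation correctly (which I am not sure I do), you want to prevent duplicate combinations of name and address from ever occurring in the table (even though that is a possible occurrence in real life). This is best done with a unique constraint or index on these two fields, which prevents the data from being inserted. That is, catch the error before you insert it. That will tell you that the import file or your resulting application logic is bad, and you can choose to take the appropriate measures then.

I still maintain that throwing the error when you query is too late in the game to do anything about it.

answered 2009-06-12T17:42:01.930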
2

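A scalar subquery must return only one row (per result set row...), so you could do something like:

SELECT DISTINCT
       customer_number,
       (
       SELECT DISTINCT
              customer_address
         FROM customers c2
        WHERE c2.customer_number = c.customer_number
       ) AS customer_address
  FROM customers c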
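
The query above checks only customer_address. As a minimal sketch, the same idea can be extended to customer_name, assuming the customers table from the question. Whether a multi-row scalar subquery actually raises an error depends on the engine: Oracle reports ORA-01427, while SQLite, for example, silently uses the first row. It is worth verifying on the target database.

```sql
-- Sketch only: relies on the engine rejecting a scalar subquery
-- that returns more than one row (Oracle: ORA-01427).
SELECT DISTINCT
       c.customer_number,
       -- fails if this customer_number has more than one distinct name
       (SELECT DISTINCT c2.customer_name
          FROM customers c2
         WHERE c2.customer_number = c.customer_number) AS customer_name,
       -- fails if this customer_number has more than one distinct address
       (SELECT DISTINCT c3.customer_address
          FROM customers c3
         WHERE c3.customer_number = c.customer_number) AS customer_address
  FROM customers c
```

Checking the two columns independently is equivalent to checking the pair: the (name, address) combination is unique per customer_number exactly when each column is.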
answered 2009-06-12T18:15:50.647
0

Making the query fail may be tricky...

This will show you whether there are any duplicate records in the table:

select customer_number, customer_name, customer_address
from customers
group by customer_number, customer_name, customer_address
having count(*) > 1

If you just add a unique index on all three fields, no one can create a duplicate record in the table.
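
Note that the query above reports rows that are exact repeats, which the question treats as acceptable. As a hedged sketch of the complementary check, the following lists each customer_number that carries conflicting names or addresses. It assumes the customers table from the question and that NULL values need no special handling (COUNT(DISTINCT ...) ignores NULLs).

```sql
-- Sketch: find customer numbers with more than one distinct
-- name or more than one distinct address.
SELECT customer_number
FROM customers
GROUP BY customer_number
HAVING COUNT(DISTINCT customer_name) > 1
    OR COUNT(DISTINCT customer_address) > 1
```

Against the six sample rows in the question, only customer_number 3 qualifies, because it has two different addresses.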

answered 2009-06-12T17:44:02.407
0

The de facto key is name + address, so that is what you need to group by.

SELECT
  Customer_Name,
  Customer_Address,
  CASE WHEN Count(DISTINCT Customer_Number) > 1
    THEN 1/0 ELSE 0 END as LandMine
FROM Customers
GROUP BY Customer_Name, Customer_Address

If you want to do it from the point of view of a Customer_Number, then this works too.

SELECT *, 
CASE WHEN Exists((
  SELECT top 1 1
  FROM Customers c2
  WHERE c1.Customer_Number != c2.Customer_Number
    AND c1.Customer_Name = c2.Customer_Name
    AND c1.Customer_Address = c2.Customer_Address
)) THEN 1/0 ELSE 0 END as LandMine
FROM Customers c1
WHERE Customer_Number = @Number
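
The same division-by-zero trick can be pointed at the rule stated in the question's later edit (one name and one address per customer_number). This is a sketch under assumptions, not a definitive implementation: the land_mine alias is made up for illustration, and the behavior of 1/0 varies by engine. Some engines return NULL instead of raising an error, and some planners may evaluate the constant expression early, so test it on the target database.

```sql
-- Sketch: one row per customer_number, with a deliberate
-- division by zero when that number has conflicting data.
SELECT customer_number,
       MIN(customer_name)    AS customer_name,     -- the single name when no conflict
       MIN(customer_address) AS customer_address,  -- the single address when no conflict
       CASE WHEN COUNT(DISTINCT customer_name) > 1
              OR COUNT(DISTINCT customer_address) > 1
            THEN 1/0 ELSE 0 END AS land_mine       -- hypothetical alias
FROM customers
GROUP BY customer_number
```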
answered 2009-06-12T17:53:23.150
0

If you want it to fail, you are going to need an index. If you do not want to have an index, then you can create a temp table to do it all in.

CREATE TABLE #temp_customers 
    (customer_number int, 
    customer_name varchar(50), 
    customer_address varchar(50),
    PRIMARY KEY (customer_number),
    UNIQUE (customer_name, customer_address)
)

INSERT INTO #temp_customers
SELECT DISTINCT customer_number, customer_name, customer_address
FROM customers

SELECT customer_number, customer_name, customer_address
FROM #temp_customers

DROP TABLE #temp_customers

This will fail if there are issues, but it will keep your duplicate records from causing problems.

answered 2009-06-12T17:55:50.883
0

If you have dirty data, I would clean it up first.

Use this to find the duplicate customer records...

Select * From customers
Where customer_number in 
  (Select Customer_number from customers
  Group by customer_number Having count(*) > 1)
answered 2009-06-12T18:07:46.890
0

Let's put the data into a temp table or table variable using your distinct query:

select distinct customer_number, customer_name, customer_address, 
  IDENTITY(int, 1,1) AS ID_Num
into #temp 
from unprocessed_invoices

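Personally, I would also add an identity column to unprocessed_invoices if possible. I never do an import without creating a staging table that has an identity column, simply because it makes deleting duplicate records easier.

Now let's query the table to find your problem records. I assume you want to see what is causing the problem rather than just failing on them.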

Select t1.* from #temp t1
join #temp t2 
  on t1.customer_name = t2.customer_name and t1.customer_address = t2.customer_address 
where t1.customer_number <> t2.customer_number

select t1.* from #temp t1
join 
(select customer_number from #temp group by customer_number having count(*) >1) t2
  on t1.customer_number = t2.customer_number

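You can use variations of these queries to delete the problem records from #temp (depending on whether you choose to keep one or delete all possible problems) and then insert from #temp into your production table. You can also hand the problem records back to whoever provides your data so that they get fixed at their end.

answered 2009-06-12T20:27:38.927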
-1
Select t1.* from #temp t1
join #temp t2 
  on t1.customer_name = t2.customer_name and t1.customer_address = t2.customer_address 
where t1.customer_number <> t2.customer_number

select t1.* from #temp t1
join 
(select customer_number from #temp group by customer_number having count(*) >1) t2
  on t1.customer_number = t2.customer_number
answered 2016-01-05T12:12:00.300