sql - Is it possible to use DISTINCT with the partition function COUNT() OVER?

Question

I'm trying to write the following to get a total of distinct NumUsers:

NumUsers = COUNT(DISTINCT [UserAccountKey]) OVER (PARTITION BY [Mth])

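Management Studio doesn't seem too happy about this. The error disappears when I remove the DISTINCT keyword, but then it is no longer a distinct count.

DISTINCT does not appear to be possible within the partition functions. How do I go about finding the distinct count? Do I use a more traditional method such as a correlated subquery?

Looking into this a bit further, maybe these OVER functions work differently from Oracle's, in that they can't be used in SQL Server to calculate running totals.

I've added a live example on SQLfiddle where I attempt to use a partition function to calculate a running total.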

score 203 · Accepted Answer

There is a very simple solution using dense_rank():

dense_rank() over (partition by [Mth] order by [UserAccountKey]) 
+ dense_rank() over (partition by [Mth] order by [UserAccountKey] desc) 
- 1

This gives you exactly what you asked for: the number of distinct UserAccountKeys within each month.
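To see why the two ranks add up to the distinct count, here is a minimal self-contained sketch. The name UserActivity and its sample rows are made up for illustration; only [Mth] and [UserAccountKey] come from the question. In a partition with n distinct keys, a key with ascending rank r has descending rank n - r + 1, so the sum minus 1 equals n on every row. The sketch assumes the key column contains no NULLs, since NULL would be ranked as one extra value.

```sql
-- Hypothetical sample data; only the column names come from the question.
WITH UserActivity ([Mth], [UserAccountKey]) AS
(
    SELECT 1, 100 UNION ALL
    SELECT 1, 100 UNION ALL
    SELECT 1, 200 UNION ALL
    SELECT 2, 100 UNION ALL
    SELECT 2, 200 UNION ALL
    SELECT 2, 300
)
SELECT
    [Mth],
    [UserAccountKey],
    -- ascending rank r plus descending rank (n - r + 1), minus 1, equals n
    dense_rank() over (partition by [Mth] order by [UserAccountKey])
    + dense_rank() over (partition by [Mth] order by [UserAccountKey] desc)
    - 1 AS NumUsers
FROM UserActivity
ORDER BY [Mth], [UserAccountKey];
-- NumUsers is 2 on every [Mth] = 1 row and 3 on every [Mth] = 2 row.
```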

score 7 · Accepted Answer

Necromancing:

It's relatively simple to emulate COUNT DISTINCT over PARTITION BY by taking the MAX of a DENSE_RANK:

;WITH baseTable AS
(
    SELECT 'RM1' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM1' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM2' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM2' AS RM, 'ADR2' AS ADR
    UNION ALL SELECT 'RM2' AS RM, 'ADR2' AS ADR
    UNION ALL SELECT 'RM2' AS RM, 'ADR3' AS ADR
    UNION ALL SELECT 'RM3' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM2' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM3' AS RM, 'ADR1' AS ADR
    UNION ALL SELECT 'RM3' AS RM, 'ADR2' AS ADR
)
,CTE AS
(
    SELECT RM, ADR, DENSE_RANK() OVER(PARTITION BY RM ORDER BY ADR) AS dr 
    FROM baseTable
)
SELECT
     RM
    ,ADR

    ,COUNT(CTE.ADR) OVER (PARTITION BY CTE.RM ORDER BY ADR) AS cnt1 
    ,COUNT(CTE.ADR) OVER (PARTITION BY CTE.RM) AS cnt2 
    -- Not supported
    --,COUNT(DISTINCT CTE.ADR) OVER (PARTITION BY CTE.RM ORDER BY CTE.ADR) AS cntDist
    ,MAX(CTE.dr) OVER (PARTITION BY CTE.RM ORDER BY CTE.RM) AS cntDistEmu 
FROM CTE

Note:
This assumes the field in question is non-nullable.
If the field contains one or more NULL entries, you need to subtract 1.
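One way to apply that correction automatically is sketched below. It relies on DENSE_RANK giving all NULLs in a partition a single shared rank, which is how SQL Server behaves; the sample rows are invented for illustration. A flag that is 1 when the partition contains any NULL is subtracted from the maximum rank.

```sql
-- Invented sample data with a NULL in partition RM1.
WITH baseTable (RM, ADR) AS
(
    SELECT 'RM1', 'ADR1' UNION ALL
    SELECT 'RM1', NULL   UNION ALL
    SELECT 'RM1', 'ADR2' UNION ALL
    SELECT 'RM2', 'ADR1' UNION ALL
    SELECT 'RM2', 'ADR1'
)
,CTE AS
(
    SELECT RM, ADR,
           DENSE_RANK() OVER (PARTITION BY RM ORDER BY ADR) AS dr
    FROM baseTable
)
SELECT
     RM
    ,ADR
    -- highest rank in the partition, minus 1 if any NULL took up a rank
    ,MAX(dr) OVER (PARTITION BY RM)
     - MAX(CASE WHEN ADR IS NULL THEN 1 ELSE 0 END) OVER (PARTITION BY RM) AS cntDistEmu
FROM CTE;
-- cntDistEmu is 2 on every RM1 row and 1 on every RM2 row,
-- matching what COUNT(DISTINCT ADR) would return per RM.
```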

score 6 · Accepted Answer

I think the only way of doing this in SQL Server 2008 R2 is to use a correlated subquery or an OUTER APPLY:

SELECT  datekey,
        COALESCE(RunningTotal, 0) AS RunningTotal,
        COALESCE(RunningCount, 0) AS RunningCount,
        COALESCE(RunningDistinctCount, 0) AS RunningDistinctCount
FROM    document
        OUTER APPLY
        (   SELECT  SUM(Amount) AS RunningTotal,
                    COUNT(1) AS RunningCount,
                    COUNT(DISTINCT d2.dateKey) AS RunningDistinctCount
            FROM    Document d2
            WHERE   d2.DateKey <= document.DateKey
        ) rt;

In SQL Server 2012 the running total can be done with the syntax you suggested:

SELECT  datekey,
        SUM(Amount) OVER(ORDER BY DateKey) AS RunningTotal
FROM    document

However, DISTINCT is still not allowed there, so if DISTINCT is required and/or upgrading isn't an option, I think OUTER APPLY is your best option.
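Applied to the per-month distinct count from the question, the same OUTER APPLY idea might look like the sketch below. The name UserActivity and its rows are hypothetical. Because it uses an ordinary COUNT(DISTINCT ...) inside the applied subquery rather than a windowed aggregate, it should not depend on the 2012 window-function extensions.

```sql
-- Hypothetical sample data; only the column names come from the question.
WITH UserActivity ([Mth], [UserAccountKey]) AS
(
    SELECT 1, 100 UNION ALL
    SELECT 1, 100 UNION ALL
    SELECT 1, 200 UNION ALL
    SELECT 2, 100 UNION ALL
    SELECT 2, 200 UNION ALL
    SELECT 2, 300
)
SELECT  ua.[Mth],
        ua.[UserAccountKey],
        d.NumUsers
FROM    UserActivity ua
        OUTER APPLY
        (   -- plain aggregate, so DISTINCT is allowed here
            SELECT  COUNT(DISTINCT ua2.[UserAccountKey]) AS NumUsers
            FROM    UserActivity ua2
            WHERE   ua2.[Mth] = ua.[Mth]
        ) d;
-- NumUsers is 2 on every [Mth] = 1 row and 3 on every [Mth] = 2 row.
```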

score 6 · Accepted Answer

I use a solution similar to David's above, but with an additional twist for when some rows should be excluded from the count. This assumes that [UserAccountKey] is never null.

-- subtract an extra 1 if null was ranked within the partition,
-- which only happens if there were rows where [Include] <> 'Y'
dense_rank() over (
  partition by [Mth] 
  order by case when [Include] = 'Y' then [UserAccountKey] else null end asc
) 
+ dense_rank() over (
  partition by [Mth] 
  order by case when [Include] = 'Y' then [UserAccountKey] else null end desc
)
- max(case when [Include] = 'Y' then 0 else 1 end) over (partition by [Mth])
- 1

A SQL Fiddle with an extended example can be found here.

score 1 · Accepted Answer

There is a solution in plain SQL:

SELECT time, COUNT(DISTINCT user) OVER(ORDER BY time) AS users
FROM users

=>

SELECT time, COUNT(*) OVER(ORDER BY time) AS users
FROM (
    SELECT user, MIN(time) AS time
    FROM users
    GROUP BY user
) t
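The rewrite works because each user contributes to the running distinct count only at the first time they appear. A small sketch of the same idea follows; the names UserEvents, UserName, and EventTime are hypothetical, chosen to avoid reserved words, and the sample rows are invented. It assumes an engine that accepts ORDER BY in a windowed aggregate (SQL Server 2012 or later).

```sql
-- Hypothetical names and data, for illustration only.
WITH UserEvents (UserName, EventTime) AS
(
    SELECT 'a', 1 UNION ALL
    SELECT 'b', 2 UNION ALL
    SELECT 'a', 3 UNION ALL
    SELECT 'c', 4
)
SELECT  EventTime,
        COUNT(*) OVER (ORDER BY EventTime) AS Users
FROM (
    -- keep only the first appearance of each user
    SELECT UserName, MIN(EventTime) AS EventTime
    FROM UserEvents
    GROUP BY UserName
) FirstSeen
ORDER BY EventTime;
-- Returns (1, 1), (2, 2), (4, 3).
```

Note that the result only has rows for times at which a new user first appears; EventTime 3 is absent here because user 'a' had already been counted.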

score 1 · Accepted Answer

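I wandered in here with essentially the same question as whytheq and found David's solution, but then had to review my old self-tutorial notes on DENSE_RANK because I use it so rarely: why DENSE_RANK instead of RANK or ROW_NUMBER, and how does it actually work? In the process, I updated that tutorial to include my version of David's solution for this particular problem, and then thought it might be helpful for SQL newbies (or others like me who forget stuff).

The whole tutorial text can be copied and pasted into a query editor, and then each example query can be (separately) uncommented and run to see its results. (By default, the solution to this problem is uncommented at the bottom.) Alternatively, each example can be copied into its own query editor instance, but each one must include the TBLx CTE.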

--WITH /* DB2 version */
--TBLx (Col_A, Col_B) AS (VALUES 
--     (  7,     7  ),
--     (  7,     7  ),
--     (  7,     7  ),
--     (  7,     8  ))

WITH /* SQL-Server version */
TBLx    (Col_A, Col_B) AS
  (SELECT  7,     7    UNION ALL
   SELECT  7,     7    UNION ALL
   SELECT  7,     7    UNION ALL
   SELECT  7,     8)

/*** Example-A: demonstrates the difference between ROW_NUMBER, RANK and DENSE_RANK ***/

  --SELECT Col_A, Col_B,
  --  ROW_NUMBER() OVER(PARTITION BY Col_A ORDER BY Col_B) AS ROW_NUMBER_,
  --  RANK() OVER(PARTITION BY Col_A ORDER BY Col_B)       AS RANK_,
  --  DENSE_RANK() OVER(PARTITION BY Col_A ORDER BY Col_B) AS DENSE_RANK_
  --FROM TBLx

  /* RESULTS:
    Col_A  Col_B  ROW_NUMBER_  RANK_  DENSE_RANK_
      7      7        1          1        1
      7      7        2          1        1
      7      7        3          1        1
      7      8        4          4        2

     ROW_NUMBER: Just increments for the three identical rows and increments again for the final unique row.
                 That is, it’s an order-value (based on "sort" order) but makes no other distinction.
                 
           RANK: Assigns the same rank value to the three identical rows, then jumps to 4 for the fourth row,
                 which is *unique* with regard to the others.
                 That is, each identical row is ranked by the rank-order of the first row-instance of that
                 (identical) value-set.
                 
     DENSE_RANK: Also assigns the same rank value to the three identical rows but the fourth *unique* row is
                 assigned a value of 2.
                 That is, DENSE_RANK identifies that there are (only) two *unique* row-types in the row set.
  */

/*** Example-B: to get only the distinct resulting "count-of-each-row-type" rows ***/

--  SELECT DISTINCT -- For unique returned "count-of-each-row-type" rows, the DISTINCT operator is necessary because
--                  -- the calculated DENSE_RANK value is appended to *all* rows in the data set.  Without DISTINCT,
--                  -- its value for each original-data row-type would just be replicated for each of those rows.
--                  
--    Col_A, Col_B,                
--    DENSE_RANK() OVER(PARTITION BY Col_A ORDER BY Col_B) AS DISTINCT_ROWTYPE_COUNT_
--  FROM TBLx

  /* RESULTS:
    Col_A  Col_B  DISTINCT_ROWTYPE_COUNT_
      7      7            1
      7      8            2
  */

/*** Example-C.1: demonstrates the derivation of the "count-of-all-row-types" (finalized in Example-C.2, below) ***/

--  SELECT
--    Col_A, Col_B,
--    
--    DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B DESC) AS ROW_TYPES_COUNT_DESC_,
--    DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B ASC) AS ROW_TYPES_COUNT_ASC_,
--    
--    -- Adding the above cases together and subtracting one gives the same total count on each resulting row:
--    
--    DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B DESC)
--       +
--    DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B ASC)
--      - 1   /* (Because DENSE_RANK values are one-based) */
--      AS ROW_TYPES_COUNT_
--  FROM TBLx

  /* RESULTS:
    COL_A  COL_B  ROW_TYPES_COUNT_DESC_  ROW_TYPES_COUNT_ASC_  ROW_TYPES_COUNT_
      7      7            2                     1                    2
      7      7            2                     1                    2
      7      7            2                     1                    2
      7      8            1                     2                    2
      
  */

/*** Example-C.2: uses the above technique to get a *single* resulting "count-of-all-row-types" row ***/

  SELECT DISTINCT -- For a single returned "count-of-all-row-types" row, the DISTINCT operator is necessary because the
                  -- calculated DENSE_RANK value is appended to *all* rows in the data set.  Without DISTINCT, that
                  -- value would just be replicated for each original-data row.
                  
--    Col_A, Col_B, -- In order to get a *single* returned "count-of-all-row-types" row (and field), all other fields
                    -- must be excluded because their respective differing row-values will defeat the purpose of the
                    -- DISTINCT operator, above.
                   
    DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B DESC)
       +
    DENSE_RANK() OVER ( PARTITION BY Col_A ORDER BY Col_B ASC)
      - 1   /* (Because DENSE_RANK values are one-based) */
      AS ROW_TYPES_COUNT_
  FROM TBLx
  
  /* RESULTS:

    ROW_TYPES_COUNT_
          2
  */
