
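I want to profile the columns of a table. In this particular case I want to know what percentage of the data is date/integer/numeric/bit. The query I am using: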

SELECT 
CAST(SUM(CASE WHEN TRY_CAST([column1] AS date) IS NOT NULL AND TRY_CAST(TRY_CAST([column1] AS VARCHAR(8000)) AS date) between '1950-01-01' AND '2049-12-31' AND LEN(RTRIM(CAST([column1] AS VARCHAR(MAX)))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))/(CAST(SUM(CASE WHEN [column1] IS NOT NULL AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) not IN ('null', 'n/a') AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))+0.00000001) AS PercentDate,
    CAST(SUM(CASE WHEN TRY_CAST([column1] AS FLOAT) IS NOT NULL AND LEN(RTRIM(CAST([column1] AS VARCHAR(MAX)))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))/(CAST(SUM(CASE WHEN [column1] IS NOT NULL AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) not IN ('null', 'n/a') AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))+0.00000001) AS PercentNumeric,
    CAST(SUM(CASE WHEN TRY_CAST([column1] AS BIGINT) IS NOT NULL AND LEN(RTRIM(CAST([column1] AS VARCHAR(MAX)))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))/(CAST(SUM(CASE WHEN [column1] IS NOT NULL AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) not IN ('null', 'n/a') AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))+0.00000001) AS PercentInteger,
    CAST(SUM(CASE WHEN LOWER(TRY_CAST([column1] AS VARCHAR(MAX))) IN ('1', '0', 't', 'f', 'y', 'n', 'true', 'false', 'yes', 'no') THEN 1 ELSE 0 END) AS NUMERIC(25,2))/(CAST(SUM(CASE WHEN [column1] IS NOT NULL AND LOWER(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) not IN ('null', 'n/a') AND LEN(LTRIM(RTRIM(CAST([column1] AS VARCHAR(MAX))))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))+0.00000001) AS PercentBit
    FROM tbl

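This query runs very slowly even when I select only the TOP 1 row. In fact I cannot get any result at all, or at least I cannot wait that long. If it matters, the column I am checking is of decimal type.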


The table has 37,431,866 rows. That is why I select only the top 1000, but after more than 40 minutes no result has come back.


2 Answers


Your query can be simplified. This part:

CAST(SUM(CASE WHEN TRY_CAST([column1] AS date) IS NOT NULL AND TRY_CAST(TRY_CAST([column1] AS VARCHAR(8000)) AS date) between '1950-01-01' AND '2049-12-31' AND LEN(RTRIM(CAST([column1] AS VARCHAR(MAX)))) > 0 THEN 1 ELSE 0 END) AS NUMERIC(25,2))

can also be written as:

CAST(SUM(CASE WHEN TRY_CAST(TRY_CAST([column1] AS VARCHAR(8000)) AS date) between '1950-01-01' AND '2049-12-31' THEN 1 ELSE 0 END) AS NUMERIC(25,2))

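The second is faster than the first and gives the same result (AFAIK). BETWEEN is never true for NULL, so the separate IS NOT NULL test adds nothing, and a value that converts to a date inside that range cannot be an empty string.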

The same simplification can probably be applied to the other parts of the query.
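As a rough illustration of how that might look for the rest of the query, the sketch below computes the trimmed string and the "is filled" flag once per row and reuses them. The aliases `t`, `v`, `f` and the names `s` and `is_filled` are invented for this example. It also makes two substitutions that are mine, not the original author's: it assumes values fit in `VARCHAR(8000)`, and it uses `NULLIF` instead of the `+0.00000001` guard, so a column with no filled values yields NULL rather than 0.

```sql
-- Sketch only. tbl and [column1] come from the question; the aliases t, v, f
-- and the names s and is_filled are made up for this example.
SELECT
    -- BETWEEN is never true for NULL, so no separate IS NOT NULL test is needed.
    CAST(SUM(CASE WHEN TRY_CAST(v.s AS date) BETWEEN '1950-01-01' AND '2049-12-31'
                  THEN 1 ELSE 0 END) AS NUMERIC(25,2))
        / NULLIF(SUM(f.is_filled), 0) AS PercentDate,
    -- For a character column an empty string converts to 0,
    -- so the non-empty test has to stay in the numeric checks.
    CAST(SUM(CASE WHEN TRY_CAST(t.[column1] AS BIGINT) IS NOT NULL AND v.s <> ''
                  THEN 1 ELSE 0 END) AS NUMERIC(25,2))
        / NULLIF(SUM(f.is_filled), 0) AS PercentInteger
FROM tbl t
-- Convert and trim once per row instead of once per expression.
CROSS APPLY (SELECT LTRIM(RTRIM(CAST(t.[column1] AS VARCHAR(8000)))) AS s) v
-- Shared denominator flag: not NULL, not empty, not a 'null' / 'n/a' marker.
CROSS APPLY (SELECT CASE WHEN v.s <> '' AND LOWER(v.s) NOT IN ('null', 'n/a')
                         THEN 1 ELSE 0 END AS is_filled) f;
```

PercentNumeric and PercentBit would follow the same pattern. Whether this is measurably faster depends on the plan the optimizer picks, so it is worth comparing against the original on a small sample first.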

answered 2020-12-18T21:16:07.800

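If you want this to run faster, limiting the rows returned by the query you are using will not help. After all, an aggregation query with no GROUP BY returns only one row, so TOP applied to it restricts the output, not the rows that are read.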

Instead, use a subquery:

SELECT . . .
FROM (SELECT TOP (1000) t.*
      FROM tbl t
     ) t
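To make the difference concrete, here is a minimal filled-in version of that skeleton, using the simplified date expression from the other answer as the only metric. The alias `s` and the output names `RowsProfiled` and `DateRows` are made up for illustration, and selecting only `[column1]` instead of `t.*` is my own choice to keep the derived table narrow.

```sql
-- TOP on the aggregate itself: the whole table is still read, and TOP only
-- applies to the single row the aggregate returns.
-- SELECT TOP (1000) COUNT(*) FROM tbl;

-- TOP inside the derived table: the aggregate sees at most 1000 rows.
SELECT COUNT(*) AS RowsProfiled,
       SUM(CASE WHEN TRY_CAST(TRY_CAST(s.[column1] AS VARCHAR(8000)) AS date)
                     BETWEEN '1950-01-01' AND '2049-12-31'
                THEN 1 ELSE 0 END) AS DateRows
FROM (SELECT TOP (1000) t.[column1]
      FROM tbl t
     ) s;
```

If `RowsProfiled` comes back as 1000 quickly, the remaining percentage columns can be added to the outer SELECT one at a time.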

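Note that this is not a random sample. If you try ORDER BY newid(), you will kill performance. One alternative for getting an approximate n% sample is to use logic like this: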

SELECT . . .
FROM (SELECT TOP (1000) t.*
      FROM tbl t
      WHERE RAND(CHECKSUM(NEWID())) < 0.001
     ) t

A threshold of 0.001 gives roughly a 0.1% sample.
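One thing to keep in mind: on the 37,431,866 rows mentioned in the question, 0.1% is about 37,000 rows, so `TOP (1000)` will usually be satisfied long before the scan reaches the end of the table. If the goal is roughly 1000 rows spread over the whole table, one possible variation is to derive the threshold from the target size and drop the TOP. This is a sketch under that assumption, not part of the original answer; the variable names are hypothetical, and it reads every row once, although the profiling expressions only run on the sampled rows.

```sql
-- Sketch only: variable names are made up; the row count is the one
-- stated in the question.
DECLARE @total_rows  FLOAT = 37431866;  -- rows in tbl, from the question
DECLARE @target_rows FLOAT = 1000;      -- desired sample size (assumption)
DECLARE @rate        FLOAT = @target_rows / @total_rows;  -- roughly 0.0000267

-- Each row draws its own random number, so each row is kept with
-- probability @rate, independent of its position in the table.
SELECT COUNT(*) AS RowsProfiled
FROM (SELECT t.[column1]
      FROM tbl t
      WHERE RAND(CHECKSUM(NEWID())) < @rate
     ) s;
```

The sample size will vary from run to run around the target, since every row is accepted or rejected independently.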

answered 2020-12-18T14:17:06.883