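I am looking for a way to profile a BigQuery table, producing statistics for every column in the table. Some of the columns are ARRAY and STRUCT, as shown below.
I have tried several approaches to generate a dynamic query that covers the scenarios below, but with no luck. I would really appreciate your help/input.
The metrics I want to calculate as part of this solution are:
- Minimum value
- Maximum value
- Minimum field length
- Maximum field length
- Number of unique records per field
- Number of null values in the field
- Number of non-null values in the field
- Minimum date in date or datetime fields
- Maximum date in date or datetime fields
Sample table data: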
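This query returns all columns, including nested field paths, of a table in the dataset. I excluded the STRUCT entries themselves because you only need the value columns.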
SELECT CONCAT('`', table_catalog, '.', table_schema, '.', table_name, '`') as table_name, field_path, data_type
FROM project.dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE table_name = 'table_name'
AND data_type NOT LIKE 'STRUCT%'
Using that list, we generate a SQL query that covers all of these columns. Here I only added MIN, MAX and COUNT DISTINCT, but you can add more metrics by adding new lines to the SELECT part.
-- "columns" is the result of the previous query (for example, wrapped in a CTE).
SELECT
  STRING_AGG(
    CONCAT(
      'SELECT "', field_path, '" as field_path, ',
      'CAST(MIN(', field_path, ') as string) as min, ',
      'CAST(MAX(', field_path, ') as string) as max, ',
      'COUNT(DISTINCT ', field_path, ') as count_distinct ',
      'FROM ', table_name),
    ' UNION ALL \n'
  ) as query
FROM columns
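As a rough sketch of what "adding new lines to the SELECT part" could look like for the other metrics in the question (null count, non-null count, minimum and maximum length), something along these lines might work. The `columns` CTE here is a hypothetical stand-in with two made-up field paths, and the output column names are my own choice. Lengths are taken on the value cast to STRING, which is an assumption about what "field length" means. This statement only builds the query string; it does not run it.

```sql
-- Hypothetical stand-in for the column list; in practice this would be the
-- INFORMATION_SCHEMA.COLUMN_FIELD_PATHS query shown earlier.
WITH columns AS (
  SELECT '`project.dataset.table_name`' AS table_name, 'first_name' AS field_path
  UNION ALL
  SELECT '`project.dataset.table_name`', 'dob'
)
SELECT
  STRING_AGG(
    CONCAT(
      'SELECT "', field_path, '" AS field_path, ',
      'CAST(MIN(', field_path, ') AS STRING) AS min_value, ',
      'CAST(MAX(', field_path, ') AS STRING) AS max_value, ',
      -- Length of the value once rendered as a string (assumed meaning of "field length").
      'MIN(LENGTH(CAST(', field_path, ' AS STRING))) AS min_length, ',
      'MAX(LENGTH(CAST(', field_path, ' AS STRING))) AS max_length, ',
      'COUNT(DISTINCT ', field_path, ') AS count_distinct, ',
      'COUNTIF(', field_path, ' IS NULL) AS count_null, ',
      -- COUNT(expr) only counts rows where expr is not NULL.
      'COUNT(', field_path, ') AS count_not_null ',
      'FROM ', table_name),
    ' UNION ALL \n'
  ) AS query
FROM columns
```

For DATE and DATETIME columns, the min_value and max_value entries would already give the earliest and latest date as strings, so no separate date metric should be needed.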
Finally, because the generated query is a string, we run it with EXECUTE IMMEDIATE:
EXECUTE IMMEDIATE (
query
)
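If you prefer to keep the generated string in a variable, for instance to inspect it before running it, a BigQuery script can hold it with DECLARE and SET. This is only a minimal sketch: the literal assigned to `query` is a trivial placeholder I made up, standing in for the generated UNION ALL query.

```sql
DECLARE query STRING;

-- Assumption: a trivial statement stands in for the generated profiling query.
SET query = 'SELECT "example" AS field_path, CAST(MIN(x) AS STRING) AS min_value FROM UNNEST([3, 1, 2]) AS x';

-- Runs the string held in the variable as a SQL statement.
EXECUTE IMMEDIATE query;
```

In a real script you would replace the literal with the STRING_AGG subquery in parentheses.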
Putting all of these queries together, it looks like this:
EXECUTE IMMEDIATE (
  SELECT
    STRING_AGG(
      CONCAT(
        'SELECT "', field_path, '" as field_path, ',
        'CAST(MIN(', field_path, ') as string) as min, ',
        'CAST(MAX(', field_path, ') as string) as max, ',
        'COUNT(DISTINCT ', field_path, ') as count_distinct ',
        'FROM ', table_name),
      ' UNION ALL \n'
    ) as query
  FROM (
    SELECT CONCAT('`', table_catalog, '.', table_schema, '.', table_name, '`') as table_name, field_path, data_type
    FROM project.dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
    WHERE table_name = 'table_name'
    AND data_type NOT LIKE 'STRUCT%'
  )
)
PS: For now this only handles STRUCTs. Could you show me an example of your ARRAY columns?
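I don't understand what you mean by Min Length and Max Length, but given the data provided you can do the following.
This query basically has two steps:
- WITH: use this clause to create a temporary table containing the flattened data.
- UNION ALL: combine everything into a single table.
Query: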
WITH
t AS(
SELECT
first_name,
dob,
last_name,
a.zip addresses_zip,
a.state addresses_state,
a.city addresses_city,
a.numberOfYears addresses_numberOfYears,
a.status addresses_status,
a.phone.primarynumber addresses_phone_primarynumber,
a.phone.secondary addresses_phone_secondary
FROM
<your-table> t,
t.addresses a
)
SELECT
"first_name" AS column,
COUNT(first_name) total_count,
COUNT(DISTINCT first_name) total_distinct,
SUM(
IF
(first_name IS NULL,
1,
0)) total_null,
CAST(MIN(first_name) AS string) min_value,
CAST(MAX(first_name) AS string) max_value
FROM
t
UNION ALL
SELECT
"dob" AS column,
COUNT(dob) total_count,
COUNT(DISTINCT dob) total_distinct,
SUM(
IF
(dob IS NULL,
1,
0)) total_null,
CAST(MIN(dob) AS string) min_value,
CAST(MAX(dob) AS string) max_value
FROM
t
UNION ALL
SELECT
"last_name" AS column,
COUNT(last_name) total_count,
COUNT(DISTINCT last_name) total_distinct,
SUM(
IF
(last_name IS NULL,
1,
0)) total_null,
CAST(MIN(last_name) AS string) min_value,
CAST(MAX(last_name) AS string) max_value
FROM
t
UNION ALL
SELECT
"addresses.zip" AS column,
COUNT(addresses_zip) total_count,
COUNT(DISTINCT addresses_zip) total_distinct,
SUM(
IF
(addresses_zip IS NULL,
1,
0)) total_null,
CAST(MIN(addresses_zip) AS string) min_value,
CAST(MAX(addresses_zip) AS string) max_value
FROM
t
UNION ALL
SELECT
"addresses.state" AS column,
COUNT(addresses_state) total_count,
COUNT(DISTINCT addresses_state) total_distinct,
SUM(
IF
(addresses_state IS NULL,
1,
0)) total_null,
CAST(MIN(addresses_state) AS string) min_value,
CAST(MAX(addresses_state) AS string) max_value
FROM
t
UNION ALL
SELECT
"addresses.city" AS column,
COUNT(addresses_city) total_count,
COUNT(DISTINCT addresses_city) total_distinct,
SUM(
IF
(addresses_city IS NULL,
1,
0)) total_null,
CAST(MIN(addresses_city) AS string) min_value,
CAST(MAX(addresses_city) AS string) max_value
FROM
t
UNION ALL
SELECT
"addresses.numberOfYears" AS column,
COUNT(addresses_numberOfYears) total_count,
COUNT(DISTINCT addresses_numberOfYears) total_distinct,
SUM(
IF
(addresses_numberOfYears IS NULL,
1,
0)) total_null,
CAST(MIN(addresses_numberOfYears) AS string) min_value,
CAST(MAX(addresses_numberOfYears) AS string) max_value
FROM
t
UNION ALL
SELECT
"addresses.status" AS column,
COUNT(addresses_status) total_count,
COUNT(DISTINCT addresses_status) total_distinct,
SUM(
IF
(addresses_status IS NULL,
1,
0)) total_null,
CAST(MIN(addresses_status) AS string) min_value,
CAST(MAX(addresses_status) AS string) max_value
FROM
t
UNION ALL
SELECT
"addresses.phone.primarynumber" AS column,
COUNT(addresses_phone_primarynumber) total_count,
COUNT(DISTINCT addresses_phone_primarynumber) total_distinct,
SUM(
IF
(addresses_phone_primarynumber IS NULL,
1,
0)) total_null,
CAST(MIN(addresses_phone_primarynumber) AS string) min_value,
CAST(MAX(addresses_phone_primarynumber) AS string) max_value
FROM
t
UNION ALL
SELECT
"addresses.phone.secondary" AS column,
COUNT(addresses_phone_secondary) total_count,
COUNT(DISTINCT addresses_phone_secondary) total_distinct,
SUM(
IF
(addresses_phone_secondary IS NULL,
1,
0)) total_null,
CAST(MIN(addresses_phone_secondary) AS string) min_value,
CAST(MAX(addresses_phone_secondary) AS string) max_value
FROM
t
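If "Min Length" and "Max Length" mean the character length of the values, the same pattern could be extended with LENGTH, roughly as sketched below. The `src` CTE and its two rows are hypothetical inline data standing in for <your-table>, and only two columns are shown. The sketch also uses LEFT JOIN UNNEST instead of the comma join, on the assumption that rows with an empty addresses array should still count toward the top-level columns.

```sql
WITH
  -- Hypothetical inline rows standing in for <your-table>.
  src AS (
    SELECT 'Ann' AS first_name,
           [STRUCT('10001' AS zip, 'NY' AS state)] AS addresses
    UNION ALL
    SELECT 'Roberto', ARRAY<STRUCT<zip STRING, state STRING>>[]
  ),
  t AS (
    SELECT
      first_name,
      a.zip AS addresses_zip
    FROM src
    -- LEFT JOIN keeps rows whose array is empty; the address fields are NULL there.
    LEFT JOIN UNNEST(src.addresses) AS a
  )
SELECT
  "first_name" AS column,
  MIN(LENGTH(first_name)) AS min_length,
  MAX(LENGTH(first_name)) AS max_length
FROM t
UNION ALL
SELECT
  "addresses.zip" AS column,
  MIN(LENGTH(addresses_zip)) AS min_length,
  MAX(LENGTH(addresses_zip)) AS max_length
FROM t
```

With this sample data, first_name should come out as 3 and 7, and addresses.zip as 5 and 5. With a comma join the 'Roberto' row would be dropped because its array is empty, so first_name would show 3 and 3.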