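I have millions of JSON documents stored in a single VARIANT column of a table in Snowflake. They follow the format below, but the number of lines varies from one JSON to the next.

Could anyone give me some guidance on how to extract the data into a flat table? I'm new to working with JSON, and between the inconsistent number of lines and the lack of an indicator that identifies the object name, I'm getting confused.

Here is a sample JSON: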

{
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AB2 Weight on Bit": 0.2714572,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AB2 Weight on Bit unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD Diff Press Gain SP": 0,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD Diff Press Gain SP unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD ROP": 0,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD ROP unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Calculated Pipe Displacement": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Calculated Pipe Displacement unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Cumulative Delta Displacement": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Cumulative Delta Displacement unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.FD Svy Quality": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.FD Svy Quality unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.GWEX SampleFlow": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.GWEX SampleFlow unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.MP3_STK": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.MP3_STK unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.PT Correction": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.PT Correction unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Pit 11 Jumps": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Pit 11 Jumps unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.ROP - #1 Ref Time": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.ROP - #1 Ref Time unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK2_VOL": 8.732743,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK2_VOL unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK4_VOL": 16.13105,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK4_VOL unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Time On Slip": 1.3,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Time On Slip unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.WPDA - Mud Motor Torque": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.WPDA - Mud Motor Torque unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Washout Factor": 4.167005,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Washout Factor unit": "",
  "DeviceId": "streamingdevice",
  "EventEnqueuedUtcTime": "2020-05-04T22:12:21.5310000Z",
  "EventProcessedUtcTime": "2020-05-04T22:12:35.6868329Z",
  "IoTHub": {
    "ConnectionDeviceGenerationId": "637199801617320690",
    "ConnectionDeviceId": "streamingdevice",
    "CorrelationId": null,
    "EnqueuedTime": "2020-05-04T22:12:21.0000000",
    "MessageId": null,
    "StreamId": null
  },
  "PartitionId": 1,
  "Timestamp": "2019-10-30 13:48:05.000000"
}

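"Edge 93 Belgium 43-23-19 1932" is the object name; each JSON is for a single object.

"Time_1_Avg.AB2 Weight on Bit" is the reading type, which essentially consists of Tag1.Tag2.

The last part of each line is the reading value.

The Timestamp at the bottom of the JSON is the time of the reading.

This section is not needed: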

  "DeviceId": "streamingdevice",
  "EventEnqueuedUtcTime": "2020-05-04T22:12:21.5310000Z",
  "EventProcessedUtcTime": "2020-05-04T22:12:35.6868329Z",
  "IoTHub": {
    "ConnectionDeviceGenerationId": "637199801617320690",
    "ConnectionDeviceId": "streamingdevice",
    "CorrelationId": null,
    "EnqueuedTime": "2020-05-04T22:12:21.0000000",
    "MessageId": null,
    "StreamId": null
  },
  "PartitionId": 1,

The ideal output for this data would be:

[image]

But just getting something like this would be very helpful:

[image]

Thanks for your help!

1 Answer

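Assuming the desired keys always have 3 components separated by periods, the following could be one solution:

  • Use the FLATTEN table function to take a VARIANT column from any table (a 1-row constant in this example) and explode it into multiple rows
  • Rely on the THIS column produced by FLATTEN to emit a per-row constant value (Timestamp) for each exploded row
  • Use a NOT IN filter to exclude unwanted key names
  • Use the SPLIT function with indexes to break the extracted key into multiple columns

SELECT
  -- SPLIT returns an array of VARIANT elements; cast each one to a plain string
  SPLIT(KEY, '.')[0]::STRING AS "Object Name"
, SPLIT(KEY, '.')[1]::STRING AS "Tag 1"
, SPLIT(KEY, '.')[2]::STRING AS "Tag 2"
, VALUE AS "Value"
  -- THIS is the whole flattened object, so the shared Timestamp is repeated on every row
, THIS:Timestamp::TIMESTAMP AS "Timestamp"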
FROM TABLE(FLATTEN(PARSE_JSON('
{
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AB2 Weight on Bit": 0.2714572,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AB2 Weight on Bit unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD Diff Press Gain SP": 0,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD Diff Press Gain SP unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD ROP": 0,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.AD ROP unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Calculated Pipe Displacement": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Calculated Pipe Displacement unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Cumulative Delta Displacement": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Cumulative Delta Displacement unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.FD Svy Quality": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.FD Svy Quality unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.GWEX SampleFlow": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.GWEX SampleFlow unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.MP3_STK": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.MP3_STK unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.PT Correction": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.PT Correction unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Pit 11 Jumps": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Pit 11 Jumps unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.ROP - #1 Ref Time": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.ROP - #1 Ref Time unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK2_VOL": 8.732743,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK2_VOL unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK4_VOL": 16.13105,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.TANK4_VOL unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Time On Slip": 1.3,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Time On Slip unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.WPDA - Mud Motor Torque": -999.25,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.WPDA - Mud Motor Torque unit": "",
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Washout Factor": 4.167005,
  "Edge 93 Belgium 43-23-19 1932.Time_1_Avg.Washout Factor unit": "",
  "DeviceId": "streamingdevice",
  "EventEnqueuedUtcTime": "2020-05-04T22:12:21.5310000Z",
  "EventProcessedUtcTime": "2020-05-04T22:12:35.6868329Z",
  "IoTHub": {
    "ConnectionDeviceGenerationId": "637199801617320690",
    "ConnectionDeviceId": "streamingdevice",
    "CorrelationId": null,
    "EnqueuedTime": "2020-05-04T22:12:21.0000000",
    "MessageId": null,
    "StreamId": null
  },
  "PartitionId": 1,
  "Timestamp": "2019-10-30 13:48:05.000000"
}
')))
WHERE
  KEY NOT IN ('DeviceId', 'IoTHub', 'PartitionId', 'Timestamp', 'EventEnqueuedUtcTime', 'EventProcessedUtcTime');

This should produce a result similar to your first screenshot:

[image: keys split into separate columns on the period delimiter]
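
Since the data in the question sits in a table rather than in a literal, the same query can be pointed at the VARIANT column with a lateral FLATTEN. This is only a minimal sketch: the table name readings and the column name v are hypothetical placeholders that do not come from the question, so substitute your own.

```sql
-- Hypothetical names: table "readings" with a VARIANT column "v".
SELECT
  SPLIT(f.KEY, '.')[0]::STRING AS "Object Name"
, SPLIT(f.KEY, '.')[1]::STRING AS "Tag 1"
, SPLIT(f.KEY, '.')[2]::STRING AS "Tag 2"
, f.VALUE AS "Value"
  -- f.THIS is the object being flattened, i.e. the JSON of the current source row
, f.THIS:Timestamp::TIMESTAMP AS "Timestamp"
FROM readings r,
     LATERAL FLATTEN(INPUT => r.v) f
WHERE
  f.KEY NOT IN ('DeviceId', 'IoTHub', 'PartitionId', 'Timestamp', 'EventEnqueuedUtcTime', 'EventProcessedUtcTime');
```

Because FLATTEN emits one row per key, JSON documents with different numbers of keys need no special handling. The keys ending in " unit" come through as rows of their own; if they are not wanted, adding AND f.KEY NOT LIKE '% unit' to the WHERE clause should drop them.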

Answered 2020-05-20T17:23:53.943