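I am trying to write a nested query that extracts no more than one sample per log. I think I know how to implement each of its components separately: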

  1. Query the set of logs that contain data relevant to my analysis:
    SELECT  
      runs.object_type as object_type,  
      runs.name as log_name,  
    from project.runs.latest_runs  
    WHERE object_type = "ROCKET"  
    group by object_type, log_name

This produces a list of log names, e.g. "log_name_2021_09_01", "log_name_2021_09_03", and so on.

  2. Query no more than one event matching specific conditions from a single, known log:
    SELECT  
      object.path_meters as pos,  
      object.speed as speed,
      log.run as run_name,  
    FROM project.events.last30days  
    WHERE log.run = "log_name_2021_10_01"  
      AND object.speed > 0.0  
    LIMIT 1  

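The query above returns no more than one sample for the specified log.

How can I combine these queries to extract samples from the set of logs returned by query 1, with no more than one sample per log?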

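Update
Suppose a database contains three logs:

  1. log_name_2021_09_01. The object_type associated with this log is ROCKET. The log contains 100k data samples: 90k of them have object.speed = 0.0 and 10k have speed > 0.0.
  2. log_name_2021_09_02. The object_type associated with this log is CAR. The log also contains 100k samples, in proportions similar to log 1.
  3. log_name_2021_09_03. The object_type associated with this log is ROCKET. The log also contains 100k samples, in proportions similar to log 1.

I am only interested in logs whose object type is ROCKET. Two logs meet this condition: log_name_2021_09_01 and log_name_2021_09_03. These log names can be obtained with query 1 described above. I want to extract just one sample point (with speed > 0) from each of the two logs. In other words, I want a query that returns two samples in the end: one from log_name_2021_09_01 and one from log_name_2021_09_03.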

1 Answer

Your question omits actual example data, so we're forced to infer a lot from your description. It is strongly recommended that questions include sample data and the results you'd want from that sample data. This gives us both a concrete example to base our understanding on and a test set to use when developing an answer.

  • Would you trust any code you've written without having run it against test data?

That said, the following should be close to what you're looking for. (For each log, it selects only the one row with the highest pos among the rows where speed > 0.)

WITH
  rocket_logs AS
(
  SELECT DISTINCT
    runs.object_type AS object_type,
    runs.name        AS log_name
  FROM
    project.runs.latest_runs  
  WHERE
    object_type = "ROCKET"  
),
  sorted_logs AS
(
  SELECT  
    log.run              AS run_name,
    object.path_meters   AS pos,
    object.speed         AS speed,
    ROW_NUMBER()
      OVER (
        PARTITION BY log.run
            ORDER BY object.path_meters DESC
      )
                         AS seq_num
  FROM
    project.events.last30days  
  WHERE
    object.speed > 0.0
)
SELECT
  *
FROM
  rocket_logs   r
INNER JOIN
  sorted_logs   s
    ON s.run_name = r.log_name
WHERE
  s.seq_num = 1
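
As a rough way to check the query above against test data, here is a minimal sketch that runs the same ROW_NUMBER() pattern with Python's built-in sqlite3 module instead of BigQuery. Everything about the schema is an assumption: the nested fields are flattened into plain columns, the table names latest_runs and events are hypothetical stand-ins, and the event rows are invented to mirror the three logs described in the question's update. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical flattened schema: nested fields such as object.speed and
# log.run become plain columns here (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE latest_runs (name TEXT, object_type TEXT);
CREATE TABLE events (run TEXT, path_meters REAL, speed REAL);
""")

# Three logs, as in the question's update: two ROCKET logs and one CAR log.
conn.executemany("INSERT INTO latest_runs VALUES (?, ?)", [
    ("log_name_2021_09_01", "ROCKET"),
    ("log_name_2021_09_02", "CAR"),
    ("log_name_2021_09_03", "ROCKET"),
])

# A few made-up events per log: some stationary, some moving.
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    ("log_name_2021_09_01", 0.0, 0.0),
    ("log_name_2021_09_01", 5.0, 1.5),
    ("log_name_2021_09_01", 9.0, 2.0),
    ("log_name_2021_09_02", 3.0, 4.0),
    ("log_name_2021_09_03", 0.0, 0.0),
    ("log_name_2021_09_03", 7.0, 0.5),
])

QUERY = """
WITH rocket_logs AS (
  SELECT DISTINCT name AS log_name
  FROM latest_runs
  WHERE object_type = 'ROCKET'
),
sorted_logs AS (
  SELECT
    run         AS run_name,
    path_meters AS pos,
    speed,
    ROW_NUMBER() OVER (PARTITION BY run ORDER BY path_meters DESC) AS seq_num
  FROM events
  WHERE speed > 0.0
)
SELECT s.run_name, s.pos, s.speed
FROM rocket_logs r
INNER JOIN sorted_logs s ON s.run_name = r.log_name
WHERE s.seq_num = 1
ORDER BY s.run_name
"""

# One row per ROCKET log: the moving sample with the highest pos.
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

With this toy data the CAR log is filtered out by the join, and each ROCKET log contributes exactly one row. If any moving sample will do, the ORDER BY inside the window can be replaced with whatever ordering is cheapest or most meaningful for your data.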

For a more exact answer, please give:

  • example data, for both tables
  • example results for that data
  • both being sufficient to demonstrate all necessary behaviours
Answered 2021-10-01T18:46:53.493