I am trying to query the earliest timestamp for each candidate, project, and contract in Spark SQL.

spark.sql(
      """
        |SELECT
        | DISTINCT
        | timestamp,
        | candidate_id,
        | project_id,
        | contract_id
        |FROM candidatesHistory
        |GROUP BY timestamp, candidate_id, project_id, contract_id
        |ORDER BY timestamp DESC
        |LIMIT 1
        |""".stripMargin)

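This code doesn't do that; it only returns a single record. How do I get the oldest timestamp for each candidate, project, and contract combination?

Any help is appreciated.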

2 Answers

If the table has only these 4 columns, you can use aggregation:

select candidate_id, project_id, contract_id, min(timestamp) first_timestamp
from candidateshistory
group by candidate_id, project_id, contract_id
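
For context, here is a rough sketch of how the aggregation above could be plugged into the `spark.sql(...)` call from the question. The object name `EarliestTimestampSql`, the helper `earliestPerKey`, the local `SparkSession` settings, and the sample rows are all assumptions made up for illustration; only the table and column names come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object EarliestTimestampSql {
  // Runs the aggregation against a temp view named candidatesHistory.
  // MIN(timestamp) with GROUP BY on the three ids yields one row per combination.
  def earliestPerKey(spark: SparkSession): DataFrame =
    spark.sql(
      """
        |SELECT candidate_id, project_id, contract_id, MIN(timestamp) AS first_timestamp
        |FROM candidatesHistory
        |GROUP BY candidate_id, project_id, contract_id
        |""".stripMargin)

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed setup, not from the question).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("earliest-timestamp")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows, invented for illustration.
    Seq(
      (300L, 1, 10, 1000),
      (100L, 1, 10, 1000),
      (200L, 2, 20, 2000)
    ).toDF("timestamp", "candidate_id", "project_id", "contract_id")
      .createOrReplaceTempView("candidatesHistory")

    earliestPerKey(spark).show()
    spark.stop()
  }
}
```

Unlike `ORDER BY ... LIMIT 1`, which keeps a single row for the whole table, the `GROUP BY` keeps one row per combination of ids.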

If there are more columns and you want to bring all of them along, you can filter the table with row_number():

select ch.*
from (
    select ch.*,
        row_number() over(partition by candidate_id, project_id, contract_id order by timestamp) rn
    from candidateshistory ch
) ch
where rn = 1

For each (candidate_id, project_id, contract_id) tuple, this gives you the row with the earliest timestamp.
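
If you prefer the DataFrame API over a SQL string, the same row_number() filter can be sketched roughly as follows. The object name `EarliestRowPerKey`, the helper `earliestRows`, the extra `status` column, and the sample rows are hypothetical and only serve the illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object EarliestRowPerKey {
  // Keeps, for each (candidate_id, project_id, contract_id), the full row
  // with the smallest timestamp.
  def earliestRows(history: DataFrame): DataFrame = {
    val byKey = Window
      .partitionBy("candidate_id", "project_id", "contract_id")
      .orderBy("timestamp")

    history
      .withColumn("rn", row_number().over(byKey)) // 1 = earliest row in its group
      .where(col("rn") === 1)
      .drop("rn")                                 // helper column no longer needed
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed setup).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("earliest-row")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical rows; "status" stands in for any additional columns.
    val history = Seq(
      (300L, 1, 10, 1000, "signed"),
      (100L, 1, 10, 1000, "draft"),
      (200L, 2, 20, 2000, "draft")
    ).toDF("timestamp", "candidate_id", "project_id", "contract_id", "status")

    earliestRows(history).show()
    spark.stop()
  }
}
```

Because the filter works on whole rows, any extra columns travel along with the earliest row of each group.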

answered 2020-10-17T22:46:48.790

This should work, but I don't know whether it is the best approach:

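SELECT candidate_id
, project_id
, contract_id
, timestamp
FROM (
    -- Rank rows within each (candidate_id, project_id, contract_id) group, earliest first
    SELECT RANK() OVER (PARTITION BY candidate_id, project_id, contract_id ORDER BY timestamp) AS RNK
    , candidate_id
    , project_id
    , contract_id
    , timestamp  -- selected here so the outer query can reference it
    FROM candidatesHistory
    ) as CH
WHERE CH.RNK = 1;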
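
One thing to keep in mind with RANK(): tied rows share the same rank, so filtering on rank 1 can return several rows per group when the earliest timestamp occurs more than once, while ROW_NUMBER() returns exactly one. A small sketch of the difference, where the object name `RankVsRowNumber`, the helper `firstRowCounts`, the `note` column, and the sample rows are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession

object RankVsRowNumber {
  // Returns (rows kept by RANK() = 1, rows kept by ROW_NUMBER() = 1)
  // for the temp view candidatesHistory.
  def firstRowCounts(spark: SparkSession): (Long, Long) = {
    val ranked = spark.sql(
      """
        |SELECT ch.*,
        |  RANK() OVER (PARTITION BY candidate_id, project_id, contract_id ORDER BY timestamp) AS rnk,
        |  ROW_NUMBER() OVER (PARTITION BY candidate_id, project_id, contract_id ORDER BY timestamp) AS rn
        |FROM candidatesHistory ch
        |""".stripMargin)

    (ranked.where("rnk = 1").count(), ranked.where("rn = 1").count())
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed setup).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rank-vs-row-number")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical rows: two of them tie on the earliest timestamp.
    Seq(
      (100L, 1, 10, 1000, "a"),
      (100L, 1, 10, 1000, "b"),
      (300L, 1, 10, 1000, "c")
    ).toDF("timestamp", "candidate_id", "project_id", "contract_id", "note")
      .createOrReplaceTempView("candidatesHistory")

    val (byRank, byRowNumber) = firstRowCounts(spark)
    println(s"RANK keeps $byRank row(s), ROW_NUMBER keeps $byRowNumber row(s)")
    spark.stop()
  }
}
```

Which behavior you want depends on whether tied earliest rows should all be reported or collapsed to one.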
answered 2020-10-17T22:51:41.237