google-cloud-dataflow - BeamSQL Group By query fails on a float column

Question

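I am trying to get unique rows from a BigQuery table using BeamSQL in Google Dataflow. I implemented the condition with a Group By clause (sample query below). One of the columns has a float data type. When the job runs, it fails with the following exception:

Caused by: org.apache.beam.sdk.coders.Coder$NonDeterministicException: org.apache.beam.sdk.coders.RowCoder@81d6d10 is not deterministic because: All fields must have deterministic encoding. Caused by: org.apache.beam.sdk.coders.Coder$NonDeterministicException: FloatCoder is not deterministic because: Floating point encodings are not guaranteed to be deterministic.

BeamSQL query: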

PCollection<Row> ST = mainColl.apply(SqlTransform.query("SELECT ID,ITEM,UNITPRICE FROM PCOLLECTION GROUP BY ID,ITEM,UNITPRICE"));

It would be great if someone could help me resolve this.

Note that if we remove the float column, the BeamSQL query works fine.

score 3 · Accepted Answer

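The exception is telling you not to use float values (most likely UNITPRICE here) as a key in an aggregation (group by), because their output is not deterministic: it can vary with changes in precision. For example, consider this query: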

WITH
  data AS (
  SELECT 100 AS id, 'abc' as item, 0.3448473362800000001 AS unitprice
  UNION ALL
  SELECT 200 AS id, 'xyz' as item, 0.49300013 AS unitprice
  UNION ALL
  SELECT 500 AS id, 'pqr' as item, 0.67322332200000212 AS unitprice
)
select id, item, unitprice from data
group by id, item, unitprice

The output is:

100 abc 0.34484733628    
200 xyz 0.49300013   
500 pqr 0.6732233220000021

Here the unitprice values in the first and third rows no longer match the literals written in the query.
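As a rough, general Java illustration of why a floating-point value makes a shaky grouping key (this is not taken from Beam's source, and the class and method names are made up for the example): arithmetic can produce a value that differs from the literal you expect, and two values that compare equal can still have different bit patterns.

```java
final class FloatKeyDemo {
    // True when the two doubles compare equal with ==.
    static boolean numericallyEqual(double a, double b) {
        return a == b;
    }

    // True when the two doubles share the same 64-bit IEEE 754 bit pattern.
    static boolean sameBits(double a, double b) {
        return Double.doubleToLongBits(a) == Double.doubleToLongBits(b);
    }

    public static void main(String[] args) {
        // Rounding: the computed sum is not the same double as the literal 0.3.
        System.out.println(numericallyEqual(0.1 + 0.2, 0.3)); // false
        // Signed zero: equal as numbers...
        System.out.println(numericallyEqual(0.0, -0.0));      // true
        // ...but encoded with different bits.
        System.out.println(sameBits(0.0, -0.0));              // false
    }
}
```

A grouping step that compares encoded bytes rather than numeric values could therefore put "equal" floats into different groups, which is the kind of risk the exception is guarding against.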

To avoid this, you have two options:

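You can cast unitprice to a string and then group on it, with something like cast(unitprice as string) as unitprice in your query.
You can keep unitprice out of the grouping key altogether (the logical choice in most cases) and use max(unitprice) as unitprice or avg(unitprice) as unitprice in your query, while grouping by id, item.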
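For the aggregate option, the rewritten query would presumably look like SELECT ID, ITEM, MAX(UNITPRICE) AS UNITPRICE FROM PCOLLECTION GROUP BY ID, ITEM (untested here). The sketch below is only a plain-Java analogue of that idea, with no Beam involved. The Item class, the "|" key separator, and the method name are hypothetical and exist just for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GroupWithoutFloatKey {
    // Hypothetical row type standing in for one record of the question's table.
    static final class Item {
        final int id;
        final String item;
        final double unitPrice;

        Item(int id, String item, double unitPrice) {
            this.id = id;
            this.item = item;
            this.unitPrice = unitPrice;
        }
    }

    // Groups by (id, item) only and keeps the largest unit price per group,
    // mirroring MAX(unitprice) ... GROUP BY id, item.
    static Map<String, Double> maxPriceByIdAndItem(List<Item> rows) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (Item row : rows) {
            // The key contains no floating-point field.
            String key = row.id + "|" + row.item;
            result.merge(key, row.unitPrice, Math::max);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Item> rows = Arrays.asList(
            new Item(100, "abc", 0.3448473362800000001),
            new Item(100, "abc", 0.3448473362800000001),
            new Item(200, "xyz", 0.49300013),
            new Item(500, "pqr", 0.67322332200000212));
        // Duplicate (id, item) pairs collapse into a single entry.
        System.out.println(maxPriceByIdAndItem(rows).size()); // 3
    }
}
```

Because the float is only aggregated and never compared as a key, duplicate rows still collapse to one row per (id, item). If the same id and item can legitimately carry different prices, MAX or AVG will hide that, so the string-cast option may fit better in that case.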

Hope this helps.
