python - Pyspark: Is there an efficient way to exclude rows that have only null values apart from the pk?

Question

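I have an sdf with an id (PK) and several columns, and the latter may contain nulls. I want an efficient way to filter for rows that have at least one value in those columns.

Suppose this is the table: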

+-----------+-------+-------+-------+
|         id| clm_01| clm_02| clm_03|...
+-----------+-------+-------+-------+-
|    10001  |   null|  null |      5|...
|    10002  |      1|     3 |      2|...
|    10003  |   null|  null |   null|...
        ...
+-----------+-------+-------+-------+

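From the table above, I want to pick out the row with id 10003, the one whose non-key columns are all null, so that it can be filtered out. This can be done easily with the script below: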

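from pyspark.sql.functions import col, when

# Flag rows whose value columns are all null, then drop the flagged rows.
sdf.withColumn(
  'flg',
  when(
    col('clm_01').isNull() & col('clm_02').isNull() & col('clm_03').isNull(), 1
  ).otherwise(0)
).filter(col('flg') != 1)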

But how can I apply the condition to many more columns without repeating the isNull() chain a hundred times?

Thanks in advance for your help.

score 1 · Accepted Answer

You can use the least, greatest, or coalesce function. They return null if all columns are null:

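from pyspark.sql import functions as F

columns = list(set(sdf.columns) - {'id'})
# Keeps the rows whose non-id columns are all null (id 10003 above);
# use isNotNull() instead to keep the rows that have at least one value.
sdf = sdf.filter(F.coalesce(*columns).isNull())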

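Or, using only coalesce, with id placed last so that it is reached only when every other column is null (this assumes no value column can hold the same value as id and that all columns share a compatible type):

# id goes last: coalesce yields the first non-null value, so it falls through to id only when all other columns are null.
sdf = sdf.filter(F.coalesce(*columns, F.col('id')) == F.col('id'))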
