filter - Drop columns that contain a certain value in multiple rows in PySpark

Question

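I have a PySpark DataFrame with 12 rows and 50 columns. I want to drop every column that contains 0 in more than 4 rows.

However, the answers to the question above only apply to pandas. Is there a solution for PySpark DataFrames?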

score 1 · Accepted Answer

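In PySpark, you have to bring the zero count of each column to the driver with collect(). Memory-wise this should not be a big problem, since only one value per column is collected. Try this: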

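from pyspark.sql import functions as F

# Sample data. sqlContext is predefined in the pyspark shell;
# with a SparkSession, spark.createDataFrame works the same way.
tst = sqlContext.createDataFrame([(1,0,0),(1,0,4),(1,0,10),(2,1,90),(7,2,0),(0,3,11)], schema=['group','order','value'])
# Count the zeros per column: when() yields null for non-zero values and count() ignores nulls
expr = [F.count(F.when(F.col(coln) == 0, 1)).alias(coln) for coln in tst.columns]
# A single row comes back to the driver, holding one zero count per column
tst_cnt = tst.select(*expr).collect()[0].asDict()
# Keep the columns with at most 2 zeros (use 4 for the rule in the question)
sel_coln = [x for x in tst_cnt.keys() if tst_cnt[x] <= 2]
tst_final = tst.select(sel_coln)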
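
The sample keeps columns with at most 2 zeros, while the question asks to drop columns with more than 4. As a minimal sketch, the same approach could be wrapped in a function with the threshold as a parameter. The names columns_to_keep, drop_zero_heavy_columns and max_zeros are made up for illustration and are not part of PySpark.

```python
def columns_to_keep(zero_counts, max_zeros=4):
    """Return the column names whose zero count does not exceed max_zeros."""
    return [name for name, n in zero_counts.items() if n <= max_zeros]


def drop_zero_heavy_columns(df, max_zeros=4):
    """Drop every column of a PySpark DataFrame that has more than max_zeros zeros."""
    # Imported here so that columns_to_keep can be used without Spark installed.
    from pyspark.sql import functions as F

    # One aggregation over all columns: when() yields null for non-zero
    # values and count() skips nulls, so only the zeros are counted.
    exprs = [F.count(F.when(F.col(c) == 0, 1)).alias(c) for c in df.columns]
    zero_counts = df.select(*exprs).collect()[0].asDict()
    return df.select(*columns_to_keep(zero_counts, max_zeros))
```

For the 12 x 50 DataFrame in the question, drop_zero_heavy_columns(df) would then return the DataFrame without the columns that hold 0 in more than 4 rows.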

I think that in SQL syntax you could do this in a subquery.
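
The answer does not show the SQL version. One possible reading, sketched here under assumptions, builds a single aggregate query (not a subquery) that counts the zeros of all columns at once, and still picks the surviving columns in Python. The helpers zero_count_query and select_by_zero_count, the view name "df" and the max_zeros parameter are hypothetical.

```python
def zero_count_query(columns, table):
    """Build one SQL statement that counts the zeros in each column."""
    # COUNT ignores the NULL produced by CASE for non-zero values.
    parts = [
        "COUNT(CASE WHEN `{0}` = 0 THEN 1 END) AS `{0}`".format(c)
        for c in columns
    ]
    return "SELECT " + ", ".join(parts) + " FROM " + table


def select_by_zero_count(spark, df, table="df", max_zeros=4):
    """Keep the columns of df that contain at most max_zeros zeros."""
    # Register the DataFrame so that it can be queried by name.
    df.createOrReplaceTempView(table)
    counts = spark.sql(zero_count_query(df.columns, table)).collect()[0].asDict()
    keep = [c for c in df.columns if counts[c] <= max_zeros]
    return df.select(*keep)
```

Only one row of counts reaches the driver, as in the DataFrame version above.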

score 1 · Accepted Answer

You can do the following:

# Creates test data. Field called "col5" won't match 
# the criteria set on the function "check_number"
df1 = spark.sql("select 1 col1, 4 col2, 0 col3, 1 col4, 0 col5")
df2 = spark.sql("select 2 col1, 9 col2, 5 col3, 7 col4, 0 col5")
df3 = spark.sql("select 3 col1, 2 col2, 6 col3, 5 col4, 0 col5")
df4 = spark.sql("select 4 col1, 7 col2, 7 col3, 3 col4, 1 col5")

df = df1.union(df2).union(df3).union(df4)
df.createOrReplaceTempView("df")

print("Original dataframe")
df.show()

# Please change the criteria to filter whatever you need. In this case this sample 
# returns true on the columns that have less than 2 zeros
def check_number(column_name):
    return spark.sql("select count(" + column_name + ") from df where " + column_name + " = 0").take(1)[0][0] < 2

fields = [x.name for x in df.schema.fields if check_number(x.name)]

print("After filtering dataframe")
df.select(fields).show()

In the function check_number, you can put whatever criteria you need.

The output is:

Original dataframe
+----+----+----+----+----+
|col1|col2|col3|col4|col5|
+----+----+----+----+----+
|   1|   4|   0|   1|   0|
|   2|   9|   5|   7|   0|
|   3|   2|   6|   5|   0|
|   4|   7|   7|   3|   1|
+----+----+----+----+----+

After filtering dataframe
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   1|   4|   0|   1|
|   2|   9|   5|   7|
|   3|   2|   6|   5|
|   4|   7|   7|   3|
+----+----+----+----+

As you can see, I combined PySpark with SQL.
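
To change the criterion without editing the query each time, the threshold could be passed in as a parameter. This is only a sketch: make_zero_check and max_zeros are illustrative names, and the default of 4 follows the rule in the question rather than this answer's sample.

```python
def make_zero_check(spark, table, max_zeros=4):
    """Return a predicate that is True when a column has at most max_zeros zeros."""
    def check(column_name):
        # Same query shape as check_number, with the column name quoted in backticks.
        query = "select count(`{0}`) from {1} where `{0}` = 0".format(column_name, table)
        return spark.sql(query).take(1)[0][0] <= max_zeros
    return check
```

It would be used like check_number: create check = make_zero_check(spark, "df") once, then build fields = [x.name for x in df.schema.fields if check(x.name)]. Like the original, this runs one query per column, so the 50-column DataFrame from the question would trigger 50 queries.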
