I'm trying to build a custom describe. To do that, I combine functions from pyspark.sql.functions with a custom user-defined aggregate function (UDAF). The code looks like this:
from pyspark.sql.functions import count
from pyspark.sql.functions import pandas_udf, PandasUDFType
from scipy.stats import entropy

# Define a UDAF
@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def my_entropy(data):
    p_data = data.value_counts()  # count occurrences of each value
    s = entropy(p_data)  # entropy computed from the counts
    return s

# Perform a groupby-agg (df is an existing Spark DataFrame)
groupby_col = "a_column"
agg_col = "another_column"
df2return = df\
    .groupBy(groupby_col)\
    .agg(count(agg_col).alias("count"),
         my_entropy(agg_col).alias("s"))
df2return.show()
The error it throws is very long, so I'm only copying the last exception that appears.
Does anyone know how to fix this?
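For reference, this is roughly what I expect my_entropy to return for each group. It is only a minimal Spark-free sketch using the standard library. The helper name entropy_of_values and the sample groups are made up for illustration, and it assumes the scipy.stats.entropy defaults (counts normalized to probabilities, natural logarithm).

```python
import math
from collections import Counter


def entropy_of_values(values):
    """Shannon entropy (natural log) of the empirical distribution of `values`.

    Hypothetical helper: mirrors value_counts() followed by entropy(),
    assuming entropy() normalizes the counts and uses the natural log.
    """
    counts = Counter(values)  # occurrences of each distinct value
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Turn counts into probabilities p and sum p * ln(1 / p)
    return sum((c / total) * math.log(total / c) for c in counts.values())


# Illustrative data: group key -> values of the aggregated column
groups = {"x": ["a", "a", "b", "b"], "y": ["a", "a", "a", "a"]}
result = {key: entropy_of_values(vals) for key, vals in groups.items()}
print(result)
```

A group split evenly between two values should give ln(2), and a group with a single distinct value should give 0.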
