
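I am trying to develop a custom describe. To do so, I combine functions from pyspark.sql.functions with a custom user-defined aggregate function (UDAF). The code looks like this: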

from pyspark.sql.functions import count
from pyspark.sql.functions import pandas_udf, PandasUDFType
from scipy.stats import entropy



# Define a UDAF
@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def my_entropy(data):
    p_data = data.value_counts()  # count occurrences of each value
    s = entropy(p_data)           # entropy computed from the counts
    return s


# Perform a groupby-agg
groupby_col = "a_column"
agg_col = "another_column"
df2return = df\
    .groupBy(groupby_col)\
    .agg(count(agg_col).alias("count"),
         my_entropy(agg_col).alias("s"))

df2return.show()

The error thrown is very long, so I am only copying the last exception that appears.


Does anyone know how to solve this?
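
For reference, here is a minimal sketch of the value I expect my_entropy to return for a single group, written without Spark so it can be checked on its own. It assumes the default behaviour of scipy.stats.entropy (counts are normalized to probabilities and the natural logarithm is used). The function name entropy_from_values and the sample values are hypothetical and only serve as an illustration.

```python
from collections import Counter
from math import log


def entropy_from_values(values):
    """Shannon entropy (natural log) of the empirical distribution of `values`."""
    # Counter plays the same role as Series.value_counts() in the UDAF.
    counts = Counter(values)
    total = sum(counts.values())
    # Normalize counts to probabilities and compute -sum(p * ln p),
    # which is what scipy.stats.entropy does with its default arguments.
    return -sum((c / total) * log(c / total) for c in counts.values())


# Hypothetical contents of "another_column" within one group.
group_values = ["x", "x", "y", "y"]
print(entropy_from_values(group_values))
```

Comparing this number against the "s" column for a small group should help tell apart a problem in the entropy logic from a problem in how the UDAF is registered or called.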
