Not sure if I have understood the question correctly, but if you want to operate on rows based on any column, you can do that with DataFrame functions. Example:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql import Window

# Build a SparkSession (spark.cores.max caps the total cores used by the app)
sc = SparkSession.builder.appName("example") \
    .config("spark.driver.memory", "1g") \
    .config("spark.executor.cores", 2) \
    .config("spark.cores.max", 4) \
    .getOrCreate()

# inferSchema makes "marks" numeric; without it max/sum would operate on strings
df1 = sc.read.format("csv").option("header", "true") \
    .option("inferSchema", "true").load("test.csv")

# Window over each student's rows
w = Window.partitionBy("student_id")

# Total marks per student via a plain groupBy aggregation
df2 = df1.groupBy("student_id").agg(f.sum(df1["marks"]).alias("total"))

# Attach each student's maximum marks to every row, then keep only the matching rows
df3 = df1.withColumn("max_marks_inanysub", f.max(df1["marks"]).over(w))
df3 = df3.filter(df3["marks"] == df3["max_marks_inanysub"])

df1.show()
df3.show()
Sample data
student_id,subject,marks
1,maths,3
1,science,6
2,maths,4
2,science,7
Output
+----------+-------+-----+
|student_id|subject|marks|
+----------+-------+-----+
|         1|  maths|    3|
|         1|science|    6|
|         2|  maths|    4|
|         2|science|    7|
+----------+-------+-----+

+----------+-------+-----+------------------+
|student_id|subject|marks|max_marks_inanysub|
+----------+-------+-----+------------------+
|         1|science|    6|                 6|
|         2|science|    7|                 7|
+----------+-------+-----+------------------+
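As a side note, the rows with each student's maximum marks can also be picked out without a window, by aggregating the maximum per student and joining it back. This is just a minimal sketch under the same assumptions as above (a test.csv with student_id, subject, marks columns); max_marks and top_rows are names introduced here for illustration.

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.appName("example-join").getOrCreate()
df1 = spark.read.format("csv").option("header", "true") \
    .option("inferSchema", "true").load("test.csv")

# Maximum marks per student, computed with a plain groupBy instead of a window
max_marks = df1.groupBy("student_id").agg(f.max("marks").alias("max_marks_inanysub"))

# Join back and keep only the rows whose marks equal the per-student maximum
top_rows = df1.join(max_marks, on="student_id", how="inner") \
    .filter(f.col("marks") == f.col("max_marks_inanysub"))

top_rows.show()

The window version above avoids the extra join and keeps the original columns in a single pass, so it is usually the more convenient option.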