python - Remove duplicates from a dataframe in PySpark

Question

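I am working with dataframes in pyspark 1.4 locally and am having trouble getting the dropDuplicates method to work. It keeps returning the error:

"AttributeError: 'list' object has no attribute 'dropDuplicates'"

Not quite sure why, as I seem to be following the syntax in the latest documentation.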

#loading the CSV file into an RDD in order to start working with the data
rdd1 = sc.textFile("C:\myfilename.csv").map(lambda line: (line.split(",")[0], line.split(",")[1], line.split(",")[2], line.split(",")[3])).collect()

#loading the RDD object into a dataframe and assigning column names
df1 = sqlContext.createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4']).collect()

#dropping duplicates from the dataframe
df1.dropDuplicates().show()

score 44 · Accepted Answer

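It is not an import problem. You are simply calling .dropDuplicates() on the wrong object. The class of sqlContext.createDataFrame(rdd1, ...) is pyspark.sql.dataframe.DataFrame, but once you apply .collect() you have a plain Python list, and lists do not provide a dropDuplicates method. What you want is something like this: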

 df1 = (sqlContext
     .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
     .dropDuplicates())

 df1.collect()

score 20 · Accepted Answer

If you have a data frame and want to remove all duplicates, judged by duplicates in a specific column (called 'colName'):

Count before dedupe:

df.count()

Do the dedupe (convert the column you are deduping on to string type):

from pyspark.sql.functions import col
df = df.withColumn('colName',col('colName').cast('string'))

df.drop_duplicates(subset=['colName']).count()

You can use a sorted groupby to check that the duplicates have been removed:

df.groupBy('colName').count().toPandas().set_index("count").sort_index(ascending=False)
