python - Problem converting a pandas DataFrame to a PySpark RDD?

Question

Using the pandas read_csv() function, I read an iso-8859-1 file as follows:

df = pd.read_csv('path/file', \
                   sep = '|',names =['A','B'], encoding='iso-8859-1')

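Next, I would like to use MLlib's Word2Vec. However, it only accepts RDDs as input. So I tried to convert the pandas DataFrame to a Spark DataFrame, as a step towards an RDD, as follows: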

from pyspark.sql import SQLContext
spDF = sqlContext.createDataFrame(df['A'])
spDF.show()

However, I got the following exception:

TypeError: Can not infer schema for type: <type 'unicode'>

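I went through the PySpark documentation to see whether there is something like an encoding parameter, but I did not find anything. Any idea how to convert a specific pandas DataFrame column to a PySpark RDD?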

Update:

Following the answer from @zeros, this is how I tried to keep the column as a DataFrame:

new_dataframe = df_3.loc[:,'A']
new_dataframe.head()

Then:

from pyspark.sql import SQLContext
spDF = sqlContext.createDataFrame(new_dataframe)
spDF.show()

I got the same exception:

TypeError: Can not infer schema for type: <type 'unicode'>

score 2 · Accepted Answer

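When you use df['A'], the result is not a pandas.DataFrame but a pandas.Series, so when you pass it to SQLContext.createDataFrame it is treated like any other Iterable, and PySpark does not support converting simple types (here, bare unicode strings) to a DataFrame.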

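If you want to keep the data as a pandas DataFrame, use the loc method with a list of column labels (a single scalar label, as in df.loc[:,'A'], still returns a Series):

df.loc[:, ['A']]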
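
The Series versus DataFrame distinction can be checked in pandas alone, without Spark. The following is a minimal sketch on a made-up two-row frame (the sample values are an assumption, not the asker's data). It shows which selections keep a column as a DataFrame and which collapse it to a Series:

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's file.
df = pd.DataFrame({"A": ["foo", "bar"], "B": [1, 2]})

# A scalar column label selects a one-dimensional Series.
as_series = df["A"]
also_series = df.loc[:, "A"]

# A list of labels (even a single one) keeps a two-dimensional DataFrame.
as_frame = df[["A"]]
also_frame = df.loc[:, ["A"]]

# Series.to_frame() wraps an existing Series into a one-column DataFrame.
via_to_frame = df["A"].to_frame()

for name, obj in [
    ("df['A']", as_series),
    ("df.loc[:, 'A']", also_series),
    ("df[['A']]", as_frame),
    ("df.loc[:, ['A']]", also_frame),
    ("df['A'].to_frame()", via_to_frame),
]:
    print(name, "->", type(obj).__name__)
```

Only the DataFrame results carry a column name and a tabular shape, and that is what createDataFrame needs in order to infer a schema.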

score 0 · Accepted Answer

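From the answer by @zeros323 I noticed that df['A'] is actually not a pandas DataFrame. I checked the pandas documentation and found that to_frame() converts a specific column into a pandas DataFrame. So I did the following: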

new_dataframe = df['A'].to_frame()
new_dataframe.head()
from pyspark.sql import SQLContext
spDF = sqlContext.createDataFrame(new_dataframe)
spDF.show()
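
Since the original goal was an RDD for MLlib's Word2Vec, which is fitted on an RDD whose elements are lists of words, here is a rough sketch of one way to get there from the one-column pandas DataFrame. The helpers tokenize_column and to_token_rdd are hypothetical names, the sample strings are made up, splitting on whitespace is an assumed tokenization, and sc is assumed to be an already running SparkContext:

```python
import pandas as pd


def tokenize_column(frame, column):
    """Return one list of whitespace-separated tokens per non-null row."""
    return [text.split() for text in frame[column].dropna()]


def to_token_rdd(sc, frame, column):
    """Distribute the token lists as an RDD.

    `sc` is assumed to be an existing pyspark.SparkContext.
    """
    return sc.parallelize(tokenize_column(frame, column))


# Hypothetical sample data; the None row is dropped before tokenizing.
new_dataframe = pd.DataFrame({"A": ["café au lait", "señor", None]})
tokens = tokenize_column(new_dataframe, "A")
print(tokens)
```

With a live SparkContext, to_token_rdd(sc, new_dataframe, 'A') would give an RDD that could be passed to Word2Vec().fit(...) from pyspark.mllib.feature. Starting from spDF instead, spDF.rdd yields an RDD of Row objects, which could be mapped to token lists in the same way.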
