When converting an RDD of numpy arrays to a Spark DataFrame using DenseVector, the following code works as expected.
import numpy as np
from pyspark.ml.linalg import DenseVector

rdd = spark.sparkContext.parallelize([
    np.array([u'5.0', u'0.0', u'0.0', u'0.0', u'1.0']),
    np.array([u'6.0', u'0.0', u'0.0', u'0.0', u'1.0'])
])

# Each record maps to a two-element tuple, so toDF() infers two columns:
# the first array element as a string and the whole array as a DenseVector.
(rdd.map(lambda x: (x[0].tolist(), DenseVector(x)))
 .toDF()
 .show(2, False))
# +---+---------------------+
# |_1 |_2                   |
# +---+---------------------+
# |5.0|[5.0,0.0,0.0,0.0,1.0]|
# |6.0|[6.0,0.0,0.0,0.0,1.0]|
# +---+---------------------+
However, I don't want the first column above, i.e. my target output is:
# +---------------------+
# |_1                   |
# +---------------------+
# |[5.0,0.0,0.0,0.0,1.0]|
# |[6.0,0.0,0.0,0.0,1.0]|
# +---------------------+
I tried the following, and each attempt results in TypeError: not supported type: <type 'numpy.ndarray'>. How can I get the expected output above? Any help is greatly appreciated!
rdd.map(lambda x: DenseVector(x[0:])).toDF()    # only DenseVector
rdd.map(lambda x: (DenseVector(x[0:]))).toDF()  # with parentheses
rdd.map(lambda x: DenseVector(x[1:])).toDF()    # only from element 1
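The TypeError arises because toDF() can only infer a schema from struct-like records (a tuple, list, dict, or Row). A bare DenseVector is none of these, so schema inference falls back to inspecting the object's attributes and trips over the numpy array it holds internally, hence the error message. Note also that (DenseVector(x[0:])) is just a parenthesized expression, not a one-element tuple. A minimal sketch of the usual fix, reusing the rdd defined above, is to wrap each vector in a one-element tuple (mind the trailing comma):

from pyspark.ml.linalg import DenseVector

# The trailing comma makes (DenseVector(x),) a one-element tuple, so
# toDF() can infer a single vector column (named _1 by default).
(rdd.map(lambda x: (DenseVector(x),))
 .toDF()
 .show(2, False))

This should produce the single-column output shown above. If you prefer a named column over _1, RDD.toDF() also accepts a list of column names, e.g. .toDF(["features"]).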