
When converting an RDD of numpy arrays to a Spark DataFrame using DenseVector, the following code works fine.

import numpy as np
from pyspark.ml.linalg import DenseVector

rdd = spark.sparkContext.parallelize([
        np.array([u'5.0', u'0.0', u'0.0', u'0.0', u'1.0']),
        np.array([u'6.0', u'0.0', u'0.0', u'0.0', u'1.0'])
    ])

(rdd.map(lambda x: (x[0].tolist(), DenseVector(x[1:])))
       .toDF()
       .show(2, False))

# +---+---------------------+
# | _1|                   _2|
# +---+---------------------+
# |5.0|[5.0,0.0,0.0,0.0,1.0]|
# |6.0|[6.0,0.0,0.0,0.0,1.0]|
# +---+---------------------+

However, I don't want the first column above, i.e. my target output is:

# +---------------------+
# |                   _1|
# +---------------------+
# |[5.0,0.0,0.0,0.0,1.0]|
# |[6.0,0.0,0.0,0.0,1.0]|
# +---------------------+

I tried the following, and they all result in `TypeError: not supported type: <type 'numpy.ndarray'>`. How can I get the expected result above? Any help is greatly appreciated!

rdd.map(lambda x: DenseVector(x[0:])).toDF()     # only DenseVector
rdd.map(lambda x: (DenseVector(x[0:]))).toDF()   # with parentheses
rdd.map(lambda x: DenseVector(x[1:])).toDF()     # only from element 1
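For context, `toDF()` expects each RDD element to be a tuple or `Row` (a struct), so a bare `DenseVector` per element is rejected. A minimal sketch of the workaround, assuming an active `SparkSession` named `spark`, is to wrap each vector in a one-element tuple (note that `(DenseVector(x),)` with a trailing comma is a tuple, while `(DenseVector(x))` is just a parenthesized expression, which is why the second attempt above behaves like the first):

```python
import numpy as np
from pyspark.ml.linalg import DenseVector

rdd = spark.sparkContext.parallelize([
    np.array([u'5.0', u'0.0', u'0.0', u'0.0', u'1.0']),
    np.array([u'6.0', u'0.0', u'0.0', u'0.0', u'1.0'])
])

# Wrap each DenseVector in a one-element tuple so each RDD
# element becomes a single-column row rather than a bare vector.
(rdd.map(lambda x: (DenseVector(x),))
    .toDF()
    .show(2, False))
# prints a single `_1` column containing the full vectors
```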
