python - flatMap over a list of custom objects in pyspark

Question

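I get an error when running flatMap() on a list of objects of a class. It works fine for regular Python data types such as int, list, etc., but I hit an error when the list contains objects of my own class. Here is the entire code: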

from pyspark import SparkContext 

sc = SparkContext("local","WordCountBySparkKeyword")

def func(x):
    if x==2:
        return [2, 3, 4]
    return [1]

rdd = sc.parallelize([2])
rdd = rdd.flatMap(func) # rdd.collect() now has [2, 3, 4]
rdd = rdd.flatMap(func) # rdd.collect() now has [2, 3, 4, 1, 1]

print rdd.collect() # marker 1: gives expected output

# Class I'm defining
class node(object):
    def __init__(self, value):
        self.value = value

    # Representation, for printing node
    def __repr__(self):
        return self.value


def foo(x):
    if x.value==2:
        return [node(2), node(3), node(4)]
    return [node(1)]

rdd = sc.parallelize([node(2)])
rdd = rdd.flatMap(foo)  #marker 2

print rdd.collect() # rdd.collect should contain nodes with values [2, 3, 4]

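The code works fine up to marker 1 (commented in the code). The problem arises after marker 2. The specific error message I get is AttributeError: 'module' object has no attribute 'node'. How do I resolve this error?

I am on Ubuntu, running pyspark 1.4.1.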

score 4 · Accepted Answer

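The error you get is unrelated to flatMap. If you define the node class in your main script, it is accessible on the driver but it is not distributed to the workers. To make it work, place the node definition in a separate module and make sure that module is distributed to the workers.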

Create a separate module containing the node definition; let's call it node.py
Import the node class in your main script:
```
from node import node
```
Make sure the module is distributed to the workers:
```
sc.addPyFile("node.py")
```

Now everything should work as expected.
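To see why a separate module helps, here is a minimal sketch that runs without Spark. It assumes the RDD elements are shipped with pickle-style serialization, which stores a reference to the class (module name plus class name) rather than the class body, so the receiving side must be able to import that module. The temporary directory and the `NODE_SOURCE` string are hypothetical stand-ins for a real node.py, and the `sys.path` change only roughly imitates what `sc.addPyFile` arranges on the workers. The sketch is written for Python 3.

```python
import importlib
import os
import pickle
import sys
import tempfile

# Hypothetical stand-in for the contents of node.py.
NODE_SOURCE = '''
class node(object):
    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return "node({0})".format(repr(self.value))
'''

# Write node.py into a scratch directory.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "node.py"), "w") as handle:
    handle.write(NODE_SOURCE)

# Make the module importable in this process.
sys.path.insert(0, module_dir)
importlib.invalidate_caches()
node_module = importlib.import_module("node")

original = node_module.node(2)
payload = pickle.dumps(original)

# The payload names the class but does not carry its code.
stores_reference = b"node" in payload
stores_class_body = b"__repr__" in payload

# Unpickling succeeds only because "node" can be imported by name here.
restored = pickle.loads(payload)
print(repr(restored), stores_reference, stores_class_body)
```

A class defined in the main script is recorded as belonging to `__main__`, and on a worker `__main__` is a different module that has no `node` attribute, which matches the AttributeError in the question.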

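On a side note:

PEP 8 recommends CapWords for class names. It is not a hard requirement, but it makes life easier.
A __repr__ method should return a string representation of an object. At the very least make sure it returns a string, but a proper representation is even better: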
```
def __repr__(self):
    return "node({0})".format(repr(self.value))
```
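Both side notes can be sketched together as follows. `Node` and `BrokenNode` are illustrative names, not part of the original code: the first follows the CapWords convention and returns a string from `__repr__`, while the second mirrors the question's version, which returns the raw value.

```python
class Node(object):
    """CapWords class name with a string-returning __repr__."""

    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return "Node({0})".format(repr(self.value))


class BrokenNode(object):
    """Mirrors the question's class: __repr__ returns the raw value."""

    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return self.value  # an int here, not a string


nodes = [Node(2), Node(3), Node(4)]
printed = repr(nodes)
print(printed)

# repr() rejects a non-string result from __repr__ with a TypeError.
try:
    repr(BrokenNode(2))
    broken_error = None
except TypeError as exc:
    broken_error = exc
print(type(broken_error).__name__)
```

With the question's `__repr__`, printing the collected list would therefore fail even after the import problem is fixed, as long as the values are integers.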
