I am trying to run a job in YARN mode that processes a large amount of data (2TB) read from Google Cloud Storage.


.map(lambda row: json.loads(row))\

 [...] later processing on collections and output to GCS.
  This computation over the elements of collections is not associative,
  each element is sorted in its keyspace.


15/11/04 16:08:07 WARN org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: xxxxxxxxxxx
15/11/04 16:08:07 ERROR org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend: Yarn application has already exited with state FINISHED!
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1299, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "/usr/lib/spark/python/pyspark/context.py", line 916, in runJob
15/11/04 16:08:07 WARN org.apache.spark.ExecutorAllocationManager: No stages are running, but numRunningTasks != 0
    port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "/usr/lib/spark/python/lib/py4j-", line 538, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 36, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job cancelled because SparkContext was shut down

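I tried to investigate by connecting to the master and launching each operation one by one, and it seems to fail on the groupBy. I also tried to rescale the cluster by adding nodes and upgrading their memory and number of CPUs, but I still have the same issue.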

120 nodes + 1 master with the same specs: 8 vCPUs - 52GB memory


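The primary key is a required value for every record, and we need all the keys with no filter, which represents roughly 600k keys. Is it really possible to perform this operation without scaling the cluster to something massive? I just read that Databricks sorted 100TB of data (https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html), which also involves a massive shuffle. They succeeded by replacing multiple in-memory buffers with a single one, resulting in a lot of disk IO? Is it possible to perform such an operation at my cluster scale?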


1 Answer





 * Group the values for each key in the RDD into a single sequence. Allows controlling the
 * partitioning of the resulting key-value pair RDD by passing a Partitioner.
 * The ordering of elements within each group is not guaranteed, and may even differ
 * each time the resulting RDD is evaluated.
 * Note: This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
 * or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
 * Note: As currently implemented, groupByKey must be able to hold all the key-value pairs for any
 * key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]].



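In general, the "in memory" constraint means you can't really get around significantly skewed keys by adding more nodes, since the fix requires scaling "in place" on the hot node. For this specific case, you may be able to set spark.executor.memory as a --conf, or in Dataproc with gcloud beta dataproc jobs submit spark [other flags] --properties spark.executor.memory=30g, as long as the values for the largest key can all fit in that 30g (with some headroom/overhead as well). But that tops out at whatever the largest available machine is, so if there's any chance that the size of the largest key will grow as your overall dataset grows, it's better to change the key distribution itself rather than trying to crank up single-executor memory.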
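One common way to change the key distribution is to "salt" the keys, so that a single hot key is split across several smaller groups. The following is only a minimal local sketch in plain Python, not the asker's pipeline: the names salt, group_sorted, merge_salted and NUM_SALTS are hypothetical, and it assumes the per-key work can be done on sorted sub-groups that are merged afterwards, which may not hold for every non-associative computation.

```python
import heapq
import zlib
from collections import defaultdict

NUM_SALTS = 4  # hypothetical fan-out per key; tune to the observed skew


def salt(key, value, num_salts=NUM_SALTS):
    """Return ((key, bucket), value), with the bucket derived from the value.

    crc32 is used instead of hash() because Python randomizes string
    hashes per process, which would scatter equal values inconsistently.
    """
    bucket = zlib.crc32(repr(value).encode("utf-8")) % num_salts
    return (key, bucket), value


def group_sorted(pairs):
    """Local stand-in for a groupByKey followed by sorting each group."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return {k: sorted(vs) for k, vs in groups.items()}


def merge_salted(salted_groups):
    """Merge the sorted sub-groups of each original key into one sorted list."""
    by_key = defaultdict(list)
    for (key, _bucket), values in salted_groups.items():
        by_key[key].append(values)
    return {key: list(heapq.merge(*runs)) for key, runs in by_key.items()}


# A skewed toy dataset: one "hot" key with 1000 values, one small "cold" key.
records = [("hot", i) for i in range(1000, 0, -1)] + [("cold", 3), ("cold", 1)]

salted = group_sorted(salt(k, v) for k, v in records)
largest_group = max(len(vs) for vs in salted.values())
merged = merge_salted(salted)
```

In PySpark the same idea would roughly be rdd.map(lambda kv: salt(*kv)).groupByKey(), so that no single group has to hold every value of the hot key. Note that merge_salted materializes a full list per key here only for illustration; on a real cluster the merged output would need to be streamed (for example, written out per sub-group), otherwise the original memory problem comes back.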

answered 2015-11-05T00:41:41.667