I am running Spark 0.7.2 in standalone mode with the following driver to process ~90 GB (compressed: 19 GB) of log data, using 7 workers and 1 separate master:
System.setProperty("spark.default.parallelism", "32")
val sc = new SparkContext("spark://10.111.1.30:7077", "MRTest", System.getenv("SPARK_HOME"), Seq(System.getenv("NM_JAR_PATH")))
val logData = sc.textFile("hdfs://10.111.1.30:54310/logs/")
val dcxMap = logData.map { line =>
  // Split each pipe-delimited log line once and keep fields 0 and 9
  val fields = line.split("\\|")
  (fields(0), fields(9))
}.reduceByKey(_ + " || " + _)
dcxMap.saveAsTextFile("hdfs://10.111.1.30:54310/out")
After all ShuffleMapTasks of stage 1 have completed:
Stage 1 (reduceByKey at DcxMap.scala:31) finished in 111.312 s
it submits stage 0:
Submitting Stage 0 (MappedRDD[6] at saveAsTextFile at DcxMap.scala:38), which is now runnable
After some serialization, it prints:
spark.MapOutputTrackerActor - Asked to send map output locations for shuffle 0 to host23
spark.MapOutputTracker - Size of output statuses for shuffle 0 is 2008 bytes
spark.MapOutputTrackerActor - Asked to send map output locations for shuffle 0 to host21
spark.MapOutputTrackerActor - Asked to send map output locations for shuffle 0 to host22
spark.MapOutputTrackerActor - Asked to send map output locations for shuffle 0 to host26
spark.MapOutputTrackerActor - Asked to send map output locations for shuffle 0 to host24
spark.MapOutputTrackerActor - Asked to send map output locations for shuffle 0 to host27
spark.MapOutputTrackerActor - Asked to send map output locations for shuffle 0 to host28
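After this, nothing more happens, and top shows that the workers are all idle. If I look at the logs on the worker machines, the same thing happens on every one of them: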
13/06/21 07:32:25 INFO network.SendingConnection: Initiating connection to [host27/127.0.1.1:34288]
13/06/21 07:32:25 INFO network.SendingConnection: Initiating connection to [host27/127.0.1.1:36040]
13/06/21 07:32:25 INFO network.SendingConnection: Initiating connection to [host27/127.0.1.1:50467]
13/06/21 07:32:25 INFO network.SendingConnection: Initiating connection to [host27/127.0.1.1:60833]
13/06/21 07:32:25 INFO network.SendingConnection: Initiating connection to [host27/127.0.1.1:49893]
13/06/21 07:32:25 INFO network.SendingConnection: Initiating connection to [host27/127.0.1.1:39907]
Then, for each of these "Initiating connection" attempts, every worker throws the same error (shown here from host27's log, first occurrence only):
13/06/21 07:32:25 WARN network.SendingConnection: Error finishing connection to host27/127.0.1.1:49893
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701)
    at spark.network.SendingConnection.finishConnect(Connection.scala:221)
    at spark.network.ConnectionManager.spark$network$ConnectionManager$$run(ConnectionManager.scala:127)
    at spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:70)
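Why does this happen? The workers seem to communicate with each other without problems; the only issue appears when a worker tries to send messages to itself. In the example above, host27 tries to send 6 messages to itself and fails all 6 times. Sending messages to other workers works fine. Does anyone have an idea?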
EDIT: Maybe it has to do with Spark using 127.0.1.1 instead of 127.0.0.1? /etc/hosts looks as follows:
127.0.0.1 localhost
127.0.1.1 host27.<ourdomain> host27
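To check this suspicion, a small diagnostic along the following lines could be run on a worker. It is only a sketch: `LocalAddressCheck` is a hypothetical helper of mine, not part of Spark, and it assumes that Spark derives the address it advertises from something like `InetAddress.getLocalHost`, which I have not verified for 0.7.2. It prints what the JVM resolves the machine's own host name to and whether that address is a loopback address.

```scala
import java.net.InetAddress

// Hypothetical diagnostic helper, not part of Spark.
object LocalAddressCheck {
  // True when the literal IP lies in 127.0.0.0/8 (e.g. 127.0.0.1 or 127.0.1.1).
  // A literal IP string is parsed directly, so no DNS lookup is involved here.
  def isLoopback(ip: String): Boolean =
    InetAddress.getByName(ip).isLoopbackAddress

  def main(args: Array[String]): Unit = {
    // Resolve this machine's own host name; on a typical Linux setup
    // /etc/hosts is consulted before DNS, but that depends on the resolver config.
    val local = InetAddress.getLocalHost
    println("host name: " + local.getHostName)
    println("address:   " + local.getHostAddress)
    println("loopback:  " + local.isLoopbackAddress)
  }
}
```

If this reports a loopback address on host27, that would at least be consistent with the `host27/127.0.1.1` entries in the worker log above.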