Suppose I have several machines, each running both a Spark worker and a Cassandra node. Is it possible to require each Spark worker to query only its local Cassandra node (on the same machine), so that no network operation is involved when I call joinWithCassandraTable after repartitionByCassandraReplica with spark-cassandra-connector, and each Spark worker fetches data from its local storage?
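To make the question concrete, here is roughly the flow I mean (a minimal sketch; the keyspace, table, column, and class names are placeholders, not my real schema):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical table: my_keyspace.events(user_id text PRIMARY KEY, payload text)
case class UserKey(user_id: String)

object LocalJoinSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("local-join-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder contact point
    val sc = new SparkContext(conf)

    val keys = sc.parallelize(Seq(UserKey("a"), UserKey("b")))

    // Group keys so each partition ends up on a Spark worker that is a replica
    // for those keys, then join against the Cassandra table.
    val joined = keys
      .repartitionByCassandraReplica("my_keyspace", "events")
      .joinWithCassandraTable("my_keyspace", "events")

    joined.collect().foreach(println)
    sc.stop()
  }
}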
Inside the Spark-Cassandra connector, LocalNodeFirstLoadBalancingPolicy handles this. It prefers the local node first, and then nodes in the same DC. Specifically, the local node is determined via java.net.NetworkInterface, by finding a host whose address matches one of the machine's own local addresses, like this:
// Excerpt from LocalNodeFirstLoadBalancingPolicy:
// the set of addresses bound to this machine's network interfaces
private val localAddresses =
  NetworkInterface.getNetworkInterfaces.flatMap(_.getInetAddresses).toSet

/** Returns true if given host is local host */
def isLocalHost(host: Host): Boolean = {
  val hostAddress = host.getAddress
  hostAddress.isLoopbackAddress || localAddresses.contains(hostAddress)
}
This logic is used when building the query plan, which returns the list of candidate hosts for a query. Regardless of the plan type (token-aware or not), the first host in the list is always the local host, if one exists. In practice, this means that after repartitionByCassandraReplica, each Spark worker's queries are directed to its co-located Cassandra node whenever that node is a replica for the requested data.
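To illustrate the ordering idea, here is a simplified sketch (not the connector's actual query-plan code; the Host case class and function name are stand-ins):

import java.net.InetAddress

// Simplified stand-in for a Cassandra host known to the driver.
case class Host(address: InetAddress, dc: String)

// Conceptually what the policy does: local host first,
// then the remaining hosts in the same data center.
def orderHosts(localDc: String,
               isLocal: Host => Boolean,
               hosts: Seq[Host]): Seq[Host] = {
  val (local, remote) = hosts.partition(isLocal)
  local ++ remote.filter(_.dc == localDc)
}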
Answered on 2015-11-03.