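I work at a small software company, and I've been tasked with researching a distributed lock manager for us to use. It must interface with both Java and C++.
I've been working with ZooKeeper for a few weeks, and I've implemented shared locks (read and write locks) according to the documentation. I now need to implement deadlock detection. If each client could maintain a graph of the locks, that would be quick and easy. However, you cannot reliably see every change that happens to a node in ZooKeeper, so maintaining an accurate graph is impossible. This means that every time I check for a deadlock, I would need to download many locks, which seems impractical.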
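Another solution would be to implement the deadlock detection inside the ZooKeeper server, which is what I'm looking into now. Each client would create a node in "/waiting" named after its session ID, and the node's data would be the lock it is waiting for. Since every lock has an ephemeral owner, I would have enough information to detect a deadlock.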
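As a rough illustration of why that information is enough, here is a minimal in-memory sketch of the wait-for check. It does not talk to ZooKeeper at all: the class `WaitForGraph` and its methods are hypothetical names, with one map standing in for the ephemeral owners of each lock and another standing in for the "/waiting" nodes.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory model of the znode layout described above.
final class WaitForGraph {
    // resource path -> session IDs that currently hold a lock on it
    private final Map<String, Set<String>> holders = new HashMap<>();
    // session ID -> the single resource it is waiting for (mirrors /waiting/<sessionID>)
    private final Map<String, String> waitingFor = new HashMap<>();

    void grant(String resource, String session) {
        holders.computeIfAbsent(resource, r -> new HashSet<>()).add(session);
    }

    void waitOn(String session, String resource) {
        waitingFor.put(session, resource);
    }

    // True if `session` waiting on `resource` would close a cycle back to `session`.
    boolean wouldDeadlock(String session, String resource) {
        return reaches(resource, session, new HashSet<>());
    }

    // Depth-first search: follow resource -> holder -> resource the holder waits for.
    private boolean reaches(String resource, String target, Set<String> visited) {
        if (!visited.add(resource)) {
            return false; // already explored this resource
        }
        for (String owner : holders.getOrDefault(resource, Collections.<String>emptySet())) {
            if (owner.equals(target)) {
                return true;
            }
            String next = waitingFor.get(owner);
            if (next != null && reaches(next, target, visited)) {
                return true;
            }
        }
        return false;
    }
}
```

For example, if s1 holds /locks/a, s2 holds /locks/b, and s2 is already waiting on /locks/a, then `wouldDeadlock("s1", "/locks/b")` reports a cycle. The hard part in practice is getting a consistent view of these two maps, which this sketch simply assumes.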
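The problem I'm running into is that the ZooKeeper server does not have the synchronization guarantees that the ZooKeeper client has. Also, the ZooKeeper server is not as well documented as the client, because you normally aren't supposed to touch it.
So my question is this: how should deadlock detection be implemented with Apache ZooKeeper? I see a lot of people here recommending ZooKeeper as a distributed lock manager, but if it cannot support deadlock detection, then no one should use it for that purpose.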
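EDIT:
I have a working solution. I can't guarantee that it is correct, but it has passed all of my tests.
I'm sharing my checkForDeadlock method, which is the heart of the deadlock detection algorithm. Here is the other information you need to know: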
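- Only one client may run deadlock detection at a time.
- First, the client attempts to acquire the lock on a resource. If the resource is already locked and the client wants to wait until it becomes available, the client next checks for a deadlock. If waiting for the resource would not result in a deadlock, the client then creates a znode in a special directory that identifies this client as waiting for that resource. That line looks like this: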
waitNode = zooKeeper.create(waitingPath + "/" + sessionID, resource.getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
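- No other client should start checking for a deadlock until this client has created its wait node.
- If two clients try to acquire locks at nearly the same time, and granting both would result in a deadlock, it is possible that the first client is rejected and the second client gets the lock, rather than the first client getting the lock and the second being rejected. This should not be a problem.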
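- checkForDeadlock throws a DeadlockException if it finds a deadlock. Otherwise, it returns normally.
- Locks are granted strictly in order. If a resource has a granted read lock and a waiting write lock, and another client wants a read lock, that client must wait until the write lock has been granted and then released.
- bySequenceNumber is a comparator that sorts znodes by the sequence number that ZooKeeper appends to the end of sequential znodes.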
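For readers who want something concrete, a comparator like bySequenceNumber could be sketched as follows. This is not the code from my project: the class name `SequenceOrder` and the example node names are made up for illustration. It relies on the fact that ZooKeeper appends a 10-digit, zero-padded counter to the name of a sequential znode, and it assumes every name in the list is such a node.

```java
import java.util.Comparator;

// Hypothetical helper: orders sequential znode names by their numeric suffix.
final class SequenceOrder {
    private static final int SUFFIX_LENGTH = 10; // width of the counter ZooKeeper appends

    // Extracts the counter from a name such as "read-0000000007".
    // Throws if the name is shorter than the suffix or the suffix is not numeric.
    static long sequenceOf(String znodeName) {
        return Long.parseLong(znodeName.substring(znodeName.length() - SUFFIX_LENGTH));
    }

    // Compares by counter only, so read and write nodes interleave in creation order.
    static final Comparator<String> BY_SEQUENCE_NUMBER =
            Comparator.comparingLong(SequenceOrder::sequenceOf);
}
```

Sorting by the suffix rather than by the whole name matters because read and write nodes carry different prefixes, so a plain string sort would group them by type instead of by arrival order.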
Code:
private void checkForDeadlock(String pathToResource) throws DeadlockException {
    // Algorithm:
    // For each client who holds a lock on this resource:
    //   If this client is me, announce deadlock.
    //   Otherwise, if this client is waiting for a reserved resource, recursively check for deadlock on that resource.
    try {
        List<String> lockQueue = zooKeeper.getChildren(pathToResource, false); // Last I checked, children is implemented as an ArrayList.
        // lockQueue is the list of locks on this resource.
        // FIXME There is a slight chance that lockQueue could be empty.
        Collections.sort(lockQueue, bySequenceNumber);
        ListIterator<String> lockQueueIterator = lockQueue.listIterator();
        String grantedLock = lockQueueIterator.next(); // grantedLock is one lock on this resource.
        do {
            // lockQueue must contain a write lock, because there is a lock waiting.
            // exists() returns null when the znode is gone, so test for that instead of catching NullPointerException.
            Stat lockStat = zooKeeper.exists(pathToResource + "/" + grantedLock, false);
            if (lockStat == null) {
                // Locks may be released while I'm running deadlock detection. The lock I was looking at
                // was deleted. Since the lock was deleted, its owner was obviously not part of a deadlock.
                // Therefore I can ignore this lock and move on to the next one.
                // (Note that a lock can be deleted if and only if its owner is not part of a deadlock.)
                continue; // Jumps to the loop condition, which advances the iterator.
            }
            // lockOwner is one client who holds a lock on this resource.
            String lockOwner = Long.toString(lockStat.getEphemeralOwner());
            if (lockOwner.equals(sessionID)) { // If this client is me.
                throw new DeadlockException("Waiting for this resource would result in a deadlock.");
            }
            try {
                // XXX: Is it possible that reservedResource could be null?
                String reservedResource = new String(zooKeeper.getData(waitingPath + "/" + lockOwner, false, new Stat()));
                // reservedResource is the resource that this client is waiting for. If this client is not waiting for a resource, see exception.
                // I only recursively check the next reservedResource if I haven't checked it before.
                // I need to do this because, while I'm running my deadlock detection, another client may attempt to acquire
                // a lock that would cause a deadlock. Without this check, I would loop in that deadlock cycle indefinitely.
                if (checkedResources.add(reservedResource)) {
                    checkForDeadlock(reservedResource); // Depth-first search
                }
            }
            catch (KeeperException.NoNodeException e) {
                // lockOwner is not waiting for a resource.
            }
            catch (KeeperException e) {
                e.printStackTrace(syncOut);
            }
            // This loop needs to run for each lock that is currently being held on the resource. There are two possibilities:
            // A. There is exactly one write lock on this resource. (Any other locks would be waiting locks.)
            //    In this case, the do-while loop ensures that the write lock has been checked.
            //    The condition that requires that the current lock is a read lock ensures that no locks after the write lock will be checked.
            // B. There are one or more read locks on this resource.
            //    In this case, I just check that the next lock is a read lock before moving on.
        } while (grantedLock.startsWith(readPrefix) && (grantedLock = lockQueueIterator.next()).startsWith(readPrefix));
    }
    catch (NoSuchElementException e) {
        // The condition for the do-while loop assumes that there is a lock waiting on the resource.
        // This assumption was made because a client just reported that it was waiting on the resource.
        // However, there is a small chance that the client has since gotten the lock, or even released it before
        // we check the locks on the resource.
        // FIXME (This may be a problem.)
        // In such a case, the lockQueueIterator.next() call could throw a NoSuchElementException.
        // We can safely assume that we are finished searching this branch, and therefore return.
    }
    catch (KeeperException e) {
        e.printStackTrace(syncOut);
    }
    catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // Restore the interrupt flag so callers can see it.
        e.printStackTrace(syncOut);
    }
}
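The do-while condition above is the trickiest part to read, so here is a small stand-alone sketch of the same selection rule. It is only an illustration: `GrantedLocks` and `grantedPrefix` are hypothetical names, the node names in the example are invented, and the input list is assumed to be sorted by sequence number already. Given the queue on one resource, it returns the locks that count as currently held: either the single write lock at the head, or the leading run of read locks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper mirroring the loop condition in checkForDeadlock.
final class GrantedLocks {
    // sortedQueue: lock znode names ordered by sequence number.
    // readPrefix: the name prefix that marks a read lock.
    static List<String> grantedPrefix(List<String> sortedQueue, String readPrefix) {
        List<String> granted = new ArrayList<>();
        for (String lock : sortedQueue) {
            if (!lock.startsWith(readPrefix)) {
                // A write lock is held only if it is at the head of the queue.
                if (granted.isEmpty()) {
                    granted.add(lock);
                }
                break; // Everything after a write lock is still waiting.
            }
            granted.add(lock); // Leading read locks are all held together.
        }
        return granted;
    }
}
```

With a queue of read-1, read-2, write-3, read-4 this yields the first two entries, and with write-1, read-2 it yields only write-1. An empty queue yields an empty list, which sidesteps the NoSuchElementException case, although it does not by itself resolve the race noted in the FIXME.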