azure - Windows Azure staging <--> production swap causes Table Storage conflicts and errors

Question

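Yesterday we ran into a nasty problem while trying to swap our staging <--> production roles.

Here is our setup:

We have a worker role that pulls messages from a queue. The messages are processed on the role (Table Storage inserts, database selects, etc.). Each queue message can take 1-3 seconds, depending on how many Table Storage entities it needs to create. Once everything is done, the role deletes the message.

The problem when swapping:

When our staging project went live, our production worker started throwing errors.

Whenever the role tried to process a queue message, it produced a constant stream of "EntityAlreadyExists" errors. Because of these errors the queue messages were not deleted, so they went back onto the queue, were picked up for processing again, and so on.

When we looked at these queue messages and analyzed what had happened to them, we saw that they had actually been processed but not deleted.

Deleting the faulty messages did not end the problem. New queue messages were not processed either, even though they had never been processed before and no Table Storage records had been added for them, which sounds very strange.

After we deleted both staging and production and published to production again, everything started working fine.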

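Possible causes?

We have little to no idea what actually happened.

Maybe both roles picked up the same messages, and one did the inserts while the other errored out?
...???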

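Possible solution(s)?

We have a few ideas on how to solve this "problem".

Build a poison-message failover system? When the dequeue count exceeds X, we delete the queue message or move it to a separate "poison queue".
Catch the EntityAlreadyExists error and delete the queue message or move it to a separate queue.
...????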

Multiple roles

I suppose we would run into the same problem when running multiple roles?

Many thanks.

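EDIT 24/02/2012 - Extra info

We actually use GetMessage().
Every item in the queue is unique and results in unique entities in Table Storage. A bit more about the process: a user posts something, and it has to be distributed to certain other users. The message generated by that user has a unique ID (GUID). This message is put on the queue and picked up by the worker role. The message is then distributed across several other tables (PartitionKey -> UserId, RowKey -> a timestamp plus the unique message ID). So under normal circumstances it is almost impossible for the same message to be written twice.
The invisibility timeout could be a logical explanation, because some messages are distributed to 10-20 tables. That means 10-20 inserts with no batching option. Can this invisibility timeout be set or extended?
Not deleting the queue message because of an exception could also be an explanation, since we have not implemented any poison-message failover yet ;).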

score 2 · Accepted Answer

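Regardless of the staging vs. production issue, having a mechanism for handling poison messages is critical. We built an abstraction layer over Azure queues that automatically moves a message to a poison queue once processing it has been attempted a configurable number of times.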
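To make the idea concrete, here is a minimal sketch of such a guard. It is not our actual abstraction layer and it does not touch the Azure SDK: `Message`, `PoisonGuard` and `maxAttempts` are made-up names, and the poison queue is just an in-memory collection. Only `DequeueCount` corresponds to a property on the real queue message.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for a queue message. Only DequeueCount
// mirrors a property of the real Azure queue message.
public sealed class Message
{
    public string Id { get; set; }
    public string Body { get; set; }
    public int DequeueCount { get; set; }
}

public static class PoisonGuard
{
    // Returns true when the caller may delete the message from the main queue
    // (it was processed, or it was parked in the poison queue).
    // Returns false when processing failed and the message should be left
    // alone so that it reappears after its visibility timeout.
    public static bool Handle(Message message, int maxAttempts,
                              Action<Message> process,
                              ICollection<Message> poisonQueue)
    {
        // Too many attempts already: park the message instead of retrying forever.
        if (message.DequeueCount > maxAttempts)
        {
            poisonQueue.Add(message);
            return true;
        }

        try
        {
            process(message);
            return true;
        }
        catch (Exception)
        {
            // Assumption: any exception counts as a failed attempt.
            // Real code would log it before giving up on this attempt.
            return false;
        }
    }
}
```

In a worker loop you would call `Handle` for each dequeued message and delete the message only when it returns true. With `maxAttempts` set to 3, for example, a message that keeps throwing is parked on its fourth dequeue instead of cycling forever.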

score 1 · Accepted Answer

With queues you need to code with idempotency in mind, and expect and handle ‘EntityAlreadyExists’ as a viable response.

As others have suggested, the causes could be:

1. Multiple messages in the queue with the same identifier.
2. You are peeking at the message rather than reading it from the queue, so it is never made invisible.
3. The message is not deleted because an exception is thrown before you can delete it.
4. Processing takes too long, so the invisibility timeout expires; the message reappears on the queue and can no longer be deleted by the original reader.

Without looking at the code, I am guessing that it is either the third or the fourth option.

If you cannot detect the issue with a code review, consider adding time-based logging and try/catch wrappers to get a better understanding of what is happening.

Using queues effectively, in a multi-role environment, requires a slightly different mindset and running into such issues early is actually a blessing in disguise.

Appended 2/24

Just to clarify, modifying the invisibility timeout is not a generic solution to this type of problem. Also note that this feature, although available in the REST API, may not be available in the queue client.

Other options involve writing to table storage asynchronously to speed up your processing time, but again this is a stopgap measure that does not really address the underlying paradigm of working with queues.

So, the bottom line is to be idempotent. You can try using the table storage upsert (update or insert) feature to avoid getting the ‘EntityAlreadyExists’ error, if that works for your code. If all you are doing is inserting new entities into Azure table storage, then the upsert should solve your problem with minimal code change.
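To show why upsert semantics make a retry harmless, here is a small in-memory sketch. It does not call the table storage client: the dictionary stands in for a table keyed by (PartitionKey, RowKey), and `IdempotentFanOut` and `Distribute` are hypothetical names. It also assumes the timestamp used in the RowKey travels with the message rather than being generated at processing time, otherwise a retry would produce different keys.

```csharp
using System;
using System.Collections.Generic;

public static class IdempotentFanOut
{
    // Writes one entity per recipient. Running it twice for the same message
    // leaves the table in the same state as running it once.
    public static void Distribute(
        IDictionary<(string PartitionKey, string RowKey), string> table,
        string messageId, string timestamp, string body,
        IEnumerable<string> recipientUserIds)
    {
        foreach (var userId in recipientUserIds)
        {
            // Key scheme from the question: PartitionKey = user id,
            // RowKey = timestamp plus unique message id.
            var key = (userId, timestamp + "_" + messageId);

            // Indexer assignment inserts or overwrites, like an upsert.
            // table.Add(key, body) would throw on the second run instead,
            // which is the in-memory analogue of EntityAlreadyExists.
            table[key] = body;
        }
    }
}
```

If a message is half processed, reappears on the queue and is processed again, the entities already written are simply overwritten with the same content and the remaining ones are added, so the worker can reach the delete step.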

If you are doing updates, then it is a different ball game altogether. One pattern is to pair each update with a dummy insert in the same table and the same partition key, so that the operation errors out if the update has already happened and the update is skipped. Later, after the message is deleted, you can delete the dummy inserts. However, all of this adds complexity, so it is much better to revisit the architecture of the product; for example, do you really need to insert/update into so many tables?

score 1 · Accepted Answer

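Without knowing what your worker role actually does, I am guessing here, but it sounds like you get conflicts when writing to Azure tables while two instances of the worker role are running. That is probably because your code looks something like this: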

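// Take the next message off the queue (it becomes invisible to other readers).
var queueMessage = GetNextMessageFromQueue();

// Check-then-insert: another instance can run the same check at the same time.
Foo myFoo = GetFooFromTableStorage(queueMessage.FooId);

if (myFoo == null)
{
    myFoo = new Foo
    {
        PartitionKey = queueMessage.FooId
    };

    // Fails with "entity already exists" if the other instance inserted first.
    AddFooToTableStorage(myFoo);
}

// Never reached when the insert above throws, so the message reappears later.
DeleteMessageFromQueue(queueMessage);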

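If the queue contains two adjacent messages with the same FooId, there is a good chance that both instances check whether the Foo exists, fail to find it, and then both try to create it. Whichever instance saves the item last gets an "entity already exists" error. Because it errors out, it never reaches the delete-message part of the code, so the message reappears on the queue after a while.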
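One way out of that race is to drop the separate existence check and let a single atomic insert decide who wins, treating "already there" as a normal outcome. The sketch below shows the shape of that with an in-memory `ConcurrentDictionary` standing in for the table; `CreateIfMissing` and `TryCreateFoo` are hypothetical names, and against real table storage the equivalent would be handling the conflict error from the insert rather than a `false` return value.

```csharp
using System.Collections.Concurrent;

public static class CreateIfMissing
{
    // Returns true if this caller created the Foo, false if it already existed.
    // TryAdd is atomic, so when several callers race on the same fooId exactly
    // one of them gets true and none of them gets an exception.
    public static bool TryCreateFoo(
        ConcurrentDictionary<string, string> table, string fooId)
    {
        return table.TryAdd(fooId, "Foo for " + fooId);
    }
}
```

Whether `TryCreateFoo` returns true or false, the Foo exists afterwards, so in both cases the worker can go on and delete the queue message.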

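As others have said, handling poison messages is a very good idea.

Update 27/02: If it is not consecutive messages (which, given your partition/row key scheme, I would say is unlikely), then my next bet is the same message reappearing on the queue after its visibility timeout. By default, .GetMessage() uses a timeout of 30 seconds. It has an overload that lets you specify how long that window should be. There is also the .UpdateMessage() function, which lets you update the timeout while you are processing the message. For example, you could set the initial visibility to 1 minute and then, if you are still processing the message after 50 seconds, extend it by another minute.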
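The renewal rule from that example can be sketched as a small helper. This is only the bookkeeping, under a few assumptions: `VisibilityLease` is a made-up class, the clock is passed in so the logic can be checked without waiting, and the `renew` callback is where you would wrap the actual .UpdateMessage() call. It also assumes the renewal happens before the window has expired; once the message is visible again, another instance may already have taken it.

```csharp
using System;

// Hypothetical helper that tracks the visibility window of one queue message.
public sealed class VisibilityLease
{
    private readonly TimeSpan _window;
    private readonly TimeSpan _safetyMargin;
    private DateTime _expiresAtUtc;

    public VisibilityLease(DateTime startUtc, TimeSpan window, TimeSpan safetyMargin)
    {
        _window = window;
        _safetyMargin = safetyMargin;
        _expiresAtUtc = startUtc + window;
    }

    public DateTime ExpiresAtUtc
    {
        get { return _expiresAtUtc; }
    }

    // Calls renew(window) once the remaining time drops to the safety margin.
    // Returns true if a renewal was requested.
    public bool RenewIfNeeded(DateTime nowUtc, Action<TimeSpan> renew)
    {
        if (nowUtc < _expiresAtUtc - _safetyMargin)
        {
            return false; // still plenty of time left
        }

        renew(_window);                    // wrap the .UpdateMessage() call here
        _expiresAtUtc = nowUtc + _window;  // the new window counts from now
        return true;
    }
}
```

With a 1-minute window and a 10-second margin this reproduces the example above: a check at 30 seconds does nothing, and a check at 50 seconds extends the message for another minute. You would call `RenewIfNeeded` between the individual table inserts of a long-running message.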

score 1 · Accepted Answer

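There are a few possible causes:

How are you reading the queue messages? If you are doing a Peek Message, the message stays visible and can be picked up by another role instance (or your staging environment) until it is deleted. Make sure you are using Get Message, so that the message is invisible to other readers until you delete it or its visibility timeout expires.

Is it possible that your first role instance crashed after finishing the work for a message but before deleting it? That would make the message visible again, to be picked up by another role instance. At that point the message becomes a poison message that causes your instances to fail over and over.

This problem almost certainly has nothing to do with staging vs. production; it is most likely caused by multiple instances reading from the same queue. You could probably reproduce the same problem by specifying 2 instances, by deploying the same code to 2 different production services, or by running the code locally on your dev machine (still pointed at Azure storage) with 2 instances.

In general you do need to handle poison messages, so you will have to implement that logic anyway, but I suggest finding the root cause of this problem first, or you will only run into more problems later.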

score 1 · Accepted Answer

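You clearly have a bug in how you handle duplicate messages. The fact that your IDs are unique does not mean that a message will never be processed twice, for example when:

The role dies with the work only partially done, so the message reappears on the queue for processing
The role crashes unexpectedly, so the message ends up back on the queue
The Fabric Controller (FC) moves your role and you have no code to handle that situation, so the message ends up back on the queue

In all of these cases you need code that deals with the fact that the message will reappear. One way is to use the DequeueCount property and check how many times the message has been dequeued and handed out for processing. Make sure you have code that handles partially processed messages.

What may have happened during the swap is this: when production became staging and staging became production, both were trying to receive the same messages, so they were essentially competing for them. That by itself is probably not bad, because competing consumers is a known pattern that works. But when you killed the old production (now staging), every message it had received for processing but not yet finished ended up back on the queue, and your new production environment picked it up for processing again. With no code logic to handle this situation and the messages partially processed, some records already existed in the tables, and that started causing the behavior you noticed.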
