azure - High throughput send to EventHubs resulting into MessagingException / TimeoutException / Server was unable to process the request errors

Question

We are experiencing lots of these exceptions sending events to EventHubs during peak traffic:

"Failed to send event to EventHub. Exception : Microsoft.ServiceBus.Messaging.MessagingException: The server was unable to process the request; please retry the operation. If the problem persists, please contact your Service Bus administrator and provide the tracking id." or "Failed to send event to EventHub. Exception : System.TimeoutException: The operation did not complete within the allocated time"

You can see it clearly here:

As you can see, we got lots of Internal Errors, Server Busy Errors, Failed Request when Incoming messages are over 400K events/hour (or ~270 MB/hour). This is not just a transient issue. It's clearly related to throughput.

Our EH has 32 partitions, message retention of 7 days, and 5 throughput units assigned. OperationTimeout is set to 5 mins, and we are using the default RetryPolicy.

Is there anything we still need to tweak here? We are really concerned about the scalability of EH.

Thanks

score 7 · Accepted Answer

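Send throughput can be tuned with an effective partition distribution strategy. There is no single knob that does this. Below is the basic information you need to design for high-throughput scenarios.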

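1) Let's start with the namespace: throughput units (TUs) are configured at the namespace level. Bear in mind that the configured TUs apply to the aggregate of all EventHubs under that namespace. If you have 5 TUs on your namespace and 5 EventHubs under it, those 5 TUs are shared among all 5 EventHubs.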

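2) Now let's look at the EventHub level: if an EventHub is allocated 5 TUs and has 32 partitions, no single partition can use all 5 TUs. For example, sending 5 TUs' worth of data to 1 partition and zero to the other 31 partitions is not possible. The most you should plan for per partition is 1 TU. In general, you need to make sure the data is distributed evenly across all partitions. EventHubs supports 3 types of send, which give users different levels of control over partition distribution: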

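EventHubClient.Send(EventDataWithoutPartitionKey) -> If you send with this API, EventHubs takes care of distributing the data evenly across all partitions. The EventHubs service gateway round-robins the data to all partitions. When a specific partition is down, the gateway detects it automatically and makes sure the client sees no impact. This is the most recommended way to send to EventHubs.
EventHubClient.Send(EventDataWithPartitionKey) -> If you send to EventHubs with this API, the partitionKey determines the distribution of your data. The PartitionKey is used to hash the EventData to the appropriate partition (the hashing algorithm is Microsoft-proprietary and not shared). Typically, users who need to correlate a group of messages use this Send variant.
EventHubSender.Send(EventData) -> In this variant the sender is already attached to a partition, so this gives the client complete control over distribution across partitions.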
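
To make the effect of the three send styles concrete, here is a small, self-contained Python simulation (it is not EventHubs client code). It assumes the 32 partitions from the question; all function names are hypothetical, and MD5 is only a stand-in for the service's unpublished key hash.

```python
import hashlib
from collections import Counter

PARTITIONS = 32  # partition count from the question


def round_robin(n_events):
    """Send without a partition key: events rotate over all partitions."""
    return Counter(i % PARTITIONS for i in range(n_events))


def by_partition_key(keys):
    """Send with a partition key: a hash of the key selects the partition.

    MD5 is an assumption used for illustration; the real hash is not public.
    """
    counts = Counter()
    for key in keys:
        digest = hashlib.md5(key.encode("utf-8")).digest()
        counts[int.from_bytes(digest[:4], "big") % PARTITIONS] += 1
    return counts


def pinned_sender(n_events, partition_id):
    """Partition sender: every event goes to the one attached partition."""
    return Counter({partition_id: n_events})


def hottest_share(counts):
    """Fraction of all events that landed on the busiest partition."""
    return max(counts.values()) / sum(counts.values())


if __name__ == "__main__":
    print("round robin  :", hottest_share(round_robin(3200)))
    print("two hot keys :", hottest_share(by_partition_key(["dev-a", "dev-b"] * 1600)))
    print("pinned sender:", hottest_share(pinned_sender(3200, 7)))
```

The closer the hottest share is to 1/32, the more evenly the load is spread. A share near 1.0 means a single partition is carrying almost everything and will run into the per-partition ceiling first.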

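To measure your current data distribution, use the EventHubClient.GetPartitionRuntimeInfo API to estimate which partition is overloaded. The difference between BeginSequenceNumber and LastEnqueuedSequenceNumber should give an estimate of that partition's load compared to the other partitions.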
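
As a rough sketch of that comparison, the Python below takes per-partition sequence-number readings and flags partitions whose span is well above the average. The sample numbers, the function names, and the 1.5x threshold are all assumptions made for illustration; real readings would come from the runtime-info call mentioned above.

```python
from statistics import mean

# Hypothetical (BeginSequenceNumber, LastEnqueuedSequenceNumber) readings,
# keyed by partition id. These values are made up for the example.
SAMPLE = {
    "0": (1_000, 51_000),
    "1": (2_000, 50_000),
    "2": (500, 250_500),
    "3": (1_500, 53_500),
}


def retained_events(readings):
    """Sequence-number span per partition: last enqueued minus begin."""
    return {pid: last - begin for pid, (begin, last) in readings.items()}


def overloaded(readings, factor=1.5):
    """Partitions whose span exceeds `factor` times the mean span."""
    spans = retained_events(readings)
    threshold = factor * mean(spans.values())
    return sorted(pid for pid, span in spans.items() if span > threshold)


if __name__ == "__main__":
    print(retained_events(SAMPLE))
    print("overloaded partitions:", overloaded(SAMPLE))
```

With the sample data, partition "2" holds far more events than the others and is the only one flagged.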

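3) Last but not least, you can tune performance (not throughput) at the send-operation level by using the SendBatch API. 1 TU buys a maximum of 1000 msgs/sec or 1 MB/sec. You will be throttled by whichever limit is hit first, and this cannot be changed. If your messages are small, say 100 bytes, and you can send only 1000 msgs/sec (per the TU limit), you will hit the 1000 events/sec limit first. However, with the SendBatch API you can batch, say, 10 of the 100-byte messages and push at the same rate, 1000 msgs/sec, with just 100 API calls per second, and improve the end-to-end latency of the system (because batching also helps the service persist messages efficiently). Remember, the only limitation here is the maximum message size that can be sent, which is 256 KB (this limit applies to your batch size if you use the SendBatch API).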
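
The arithmetic in this point can be checked with a few lines of Python. This is a sketch under stated assumptions: "1 MB/sec" is taken as 10**6 bytes, "256 KB" as 256 * 1024 bytes, per-event overhead inside a batch is ignored, and the function names are hypothetical.

```python
import math

EVENTS_PER_SEC_PER_TU = 1000       # per-TU event limit quoted in the answer
BYTES_PER_SEC_PER_TU = 1_000_000   # "1 MB/sec" taken as 10**6 bytes (assumption)
MAX_SEND_BYTES = 256 * 1024        # "256 KB" taken as 256 * 1024 bytes (assumption)


def first_limit(event_bytes, tus=1):
    """Return which per-TU limit is hit first and the resulting events/sec."""
    by_count = EVENTS_PER_SEC_PER_TU * tus
    by_bytes = (BYTES_PER_SEC_PER_TU * tus) // event_bytes
    if by_count <= by_bytes:
        return ("events/sec", by_count)
    return ("bytes/sec", by_bytes)


def api_calls_per_sec(events_per_sec, batch_size):
    """Send calls needed per second when events are grouped into batches."""
    return math.ceil(events_per_sec / batch_size)


def max_events_per_batch(event_bytes):
    """Upper bound from payload size only; per-event overhead is ignored."""
    return MAX_SEND_BYTES // event_bytes


if __name__ == "__main__":
    print(first_limit(100))             # small events: the count limit binds
    print(first_limit(4096))            # larger events: the byte limit binds
    print(api_calls_per_sec(1000, 10))  # batches of 10 at 1000 events/sec
    print(max_events_per_batch(100))
```

For 100-byte events the event-count limit binds long before the byte limit, which is why batching lowers the number of calls without raising the ceiling itself.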

Given that background, in your case, with 32 partitions and 5 TUs, I would really double-check the partition distribution strategy.
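
As a rough sanity check on the figures from the question, the Python below compares the reported hourly volume with the aggregate capacity of 5 TUs. It assumes "MB" means 10**6 bytes and uses hourly averages, which hide short bursts, so treat it as an order-of-magnitude estimate only.

```python
EVENTS_PER_HOUR = 400_000       # from the question
BYTES_PER_HOUR = 270 * 10**6    # "~270 MB/hour", MB taken as 10**6 bytes (assumption)
TUS = 5                         # throughput units from the question

avg_events_per_sec = EVENTS_PER_HOUR / 3600
avg_bytes_per_sec = BYTES_PER_HOUR / 3600

event_capacity = TUS * 1000        # 1000 events/sec per TU
byte_capacity = TUS * 1_000_000    # 1 MB/sec per TU

print(f"avg events/sec: {avg_events_per_sec:.1f} "
      f"({avg_events_per_sec / event_capacity:.1%} of 5 TUs)")
print(f"avg bytes/sec:  {avg_bytes_per_sec:.0f} "
      f"({avg_bytes_per_sec / byte_capacity:.1%} of 5 TUs)")
```

On average the reported load is a small fraction of what 5 TUs allow in aggregate, which is consistent with looking at how the traffic is distributed rather than at the TU count alone.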

Here is some more general reading on Event Hubs...

score 1 · Accepted Answer

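After a lot of digging, we decided to stop setting the PK (partition key) on the messages we publish, and the problem went away! We had been using a GUID as the PK. We now see very few errors in the Azure portal and no more exceptions. Hope this helps someone else.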
