appfabric - Troubleshooting AppFabric scaling issues (intermittent ErrorCode:SubStatus errors)

Question

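We have implemented AppFabric Windows Server Caching for our web application. Initially we were able to use the cache without any problems. We then increased traffic roughly 100-fold and started seeing intermittent exceptions. The exceptions occur about once every 2 days and last about one minute each time.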

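Our configuration:

9 web servers inserting/retrieving objects in the cache:
- Mostly short-lived action-type objects of about 500 bytes
- Using 1 named region
- Objects stored with tags
- Bulk retrieval for a given tag

Cache cluster:
- 1 host (lead host) running AppFabric 1.1 (the version reported by get-cachehost is 3)
- SQL configuration provider
- 96GB of RAM on the host, with the default 50% (48GB) allocated to AppFabric
- Cache host configuration
- Cache client configuration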

Sequence of errors (within the 1-minute window, the exceptions occur on each of the nine web servers):

System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host

Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0016>:SubStatus<ES0001>:The connection was terminated, possibly due to server or network problems or serialized Object size is greater than MaxBufferSize on server. Result of the request is unknown.
 ---> System.ServiceModel.CommunicationException: The socket connection was aborted. This could be caused by an error processing your message or a receive timeout being exceeded by the remote host, or an underlying network resource issue. Local socket timeout was '00:15:00'.
 ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
   --- End of inner exception stack trace ---
   at System.Runtime.AsyncResult.End[TAsyncResult](IAsyncResult result)
   at System.ServiceModel.Channels.FramingDuplexSessionChannel.EndReceive(IAsyncResult result)
   at Microsoft.ApplicationServer.Caching.WcfClientChannel.CompleteProcessing(IAsyncResult result)
   --- End of inner exception stack trace ---
   at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ResponseBody respBody)
   at Microsoft.ApplicationServer.Caching.DataCache.GetNextBatch(String region, DataCacheTag[] tags, GetByTagsOperation op, IMonitoringListener listener, Byte[][]& state, Boolean& more)
   at Microsoft.ApplicationServer.Caching.CacheEnumerator.MoveNext()
   at System.Linq.Enumerable.WhereSelectEnumerableIterator'2.MoveNext()
   at System.Linq.Enumerable.<ExceptIterator>d__99'1.MoveNext()
   at System.Collections.Generic.List'1..ctor(IEnumerable'1 collection)
   at System.Linq.Enumerable.ToList[TSource](IEnumerable'1 source)

Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0017>:SubStatus<ES0006>:There is a temporary failure. Please retry later. (One or more specified cache servers are unavailable, which could be caused by busy network or servers. For on-premises cache clusters, also verify the following conditions. Ensure that security permission has been granted for this client account, and check that the AppFabric Caching Service is allowed through the firewall on all cache hosts. Also the MaxBufferSize on the server must be greater than or equal to the serialized object size sent from the client.)
   at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ResponseBody respBody)
   at Microsoft.ApplicationServer.Caching.DataCache.GetNextBatch(String region, DataCacheTag[] tags, GetByTagsOperation op, IMonitoringListener listener, Byte[][]& state, Boolean& more)
   at Microsoft.ApplicationServer.Caching.CacheEnumerator.MoveNext()
   at System.Linq.Enumerable.WhereSelectEnumerableIterator'2.MoveNext()
   at System.Linq.Enumerable.<ExceptIterator>d__99'1.MoveNext()
   at System.Collections.Generic.List'1..ctor(IEnumerable'1 collection)
   at System.Linq.Enumerable.ToList[TSource](IEnumerable'1 source)

Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0018>:SubStatus<ES0001>:The request timed out.
   at Microsoft.ApplicationServer.Caching.DataCache.ThrowException(ResponseBody respBody)
   at Microsoft.ApplicationServer.Caching.DataCache.GetNextBatch(String region, DataCacheTag[] tags, GetByTagsOperation op, IMonitoringListener listener, Byte[][]& state, Boolean& more)
   at Microsoft.ApplicationServer.Caching.CacheEnumerator.MoveNext()
   at System.Linq.Enumerable.WhereSelectEnumerableIterator'2.MoveNext()
   at System.Linq.Enumerable.<ExceptIterator>d__99'1.MoveNext()
   at System.Collections.Generic.List'1..ctor(IEnumerable'1 collection)
   at System.Linq.Enumerable.ToList[TSource](IEnumerable'1 source)

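We have also created a trace log session on the cache server to capture more information for diagnosing the problem. Any advice on how to analyze it would be greatly appreciated (I can provide the log if needed).

We also monitored various AppFabric, CLR, and network performance counters. Below is a screenshot from the time of the incident:

[Image: AppFabric performance capture]

Thanks in advance for any ideas or suggestions you can share for resolving this issue.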

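Update 1

Below are the exceptions that occurred in succession on the AppFabric cache server during the intermittent errors (extracted from the trace log):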

System.ServiceModel.CommunicationException: The socket connection was aborted because an asynchronous send to the socket did not complete within the allotted timeout of 00:00:00.0082078. The time allotted to this operation may have been a portion of a longer timeout.
 ---> System.ObjectDisposedException: The socket connection has been disposed. Object name: 'System.ServiceModel.Channels.SocketConnection'.
   --- End of inner exception stack trace ---
   at System.ServiceModel.Channels.SocketConnection.ThrowIfNotOpen()
   at System.ServiceModel.Channels.SocketConnection.BeginRead(Int32 offset, Int32 size, TimeSpan timeout, WaitCallback callback, Object state)
   at System.ServiceModel.Channels.SessionConnectionReader.BeginReceive(TimeSpan timeout, WaitCallback callback, Object state)
   at System.ServiceModel.Channels.SynchronizedMessageSource.ReceiveAsyncResult.PerformOperation(TimeSpan timeout)
   at System.ServiceModel.Channels.SynchronizedMessageSource.SynchronizedAsyncResult'1..ctor(SynchronizedMessageSource syncSource, TimeSpan timeout, AsyncCallback callback, Object state)
   at System.ServiceModel.Channels.FramingDuplexSessionChannel.BeginReceive(TimeSpan timeout, AsyncCallback callback, Object state)
   at Microsoft.ApplicationServer.Caching.WcfServerChannel.CompleteProcessing(IAsyncResult result)

System.ServiceModel.CommunicationObjectAbortedException: The communication object, System.ServiceModel.Channels.ServerSessionPreambleConnectionReader+ServerFramingDuplexSessionChannel, cannot be used for communication because it has been Aborted.
   at System.Runtime.AsyncResult.End[TAsyncResult](IAsyncResult result)
   at System.ServiceModel.Channels.FramingDuplexSessionChannel.OnEndSend(IAsyncResult result)
   at Microsoft.ApplicationServer.Caching.ReplyContext.EndSend(IAsyncResult result)

System.ServiceModel.CommunicationObjectFaultedException: The communication object, System.ServiceModel.Channels.ServerSessionPreambleConnectionReader+ServerFramingDuplexSessionChannel, cannot be used for communication because it is in the Faulted state.
   at System.ServiceModel.Channels.CommunicationObject.ThrowIfDisposedOrNotOpen()
   at System.ServiceModel.Channels.OutputChannel.Send(Message message, TimeSpan timeout)
   at Microsoft.ApplicationServer.Caching.ReplyContext.Reply(Message message, TimeSpan timeout)

System.TimeoutException: Sending to via http://www.w3.org/2005/08/addressing/anonymous timed out after 00:00:15. The time allotted to this operation may have been a portion of a longer timeout.
 ---> System.TimeoutException: Cannot claim lock within the allotted timeout of 00:00:15. The time allotted to this operation may have been a portion of a longer timeout.
   --- End of inner exception stack trace ---
   at System.ServiceModel.Channels.FramingDuplexSessionChannel.OnSend(Message message, TimeSpan timeout)
   at System.ServiceModel.Channels.OutputChannel.Send(Message message, TimeSpan timeout)
   at Microsoft.ApplicationServer.Caching.ReplyContext.Reply(Message message, TimeSpan timeout)

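Update 2

After a day of troubleshooting, we took the following measures and saw some improvement:

- Based on this, we increased maxConnectionsToServer to 3. As a result, client requests/sec as recorded by the AppFabric Caching:Cache performance counter increased by 50%, but the intermittent errors did not stop.
- We increased maxBufferSize and maxBufferPoolSize in the cache server configuration to 2147483647 (Int32.MaxValue). So far we have been able to handle 300x the traffic without errors. We will continue to increase traffic and monitor. More updates to follow.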

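Update 3

We added two more hosts with 16GB of RAM each to the cluster and enabled high-availability mode (via Secondaries=1). For now, the original host remains in the cluster with its 96GB of RAM; all hosts have cacheSize = 12GB. On the cache clients we increased maxConnectionsToServer to 12 (1 per core). Here are our findings: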

Occasionally (once or twice every 10 minutes) we get:
- ErrorCode<ERRCA0017>:SubStatus<ES0005>:There is a temporary failure. Please retry later. (There was a contention on the store.)
- ErrorCode<ERRCA0017>:SubStatus<ES0004>:There is a temporary failure. Please retry later. (Replication queue was full. This may happen during reconfiguration of cluster hosts.)

The original 96GB cache host still experiences the 1-minute outages described above. The new cache hosts have not experienced any outages.

We plan to remove 80GB of RAM from the original cache host. More updates to follow.

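Update 4

Reducing the RAM in the cache host to 16GB appears to have resolved the issue. We no longer see the intermittent errors, even with traffic increased to 400x. This one looks closed. Now on to the next problem: high availability.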

score 3 · Accepted Answer

Have you installed http://support.microsoft.com/kb/983182 and http://support.microsoft.com/kb/2527387?
In your code, are you catching the exception and checking it for the RetryLater error code?
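```
try { /* cache call, e.g. the bulk get-by-tag */ }
catch (DataCacheException ex2)
{
    if (ex2.ErrorCode == DataCacheErrorCode.RetryLater)
    {
        // Transient failure: back off briefly, then retry.
    }
}
```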
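One way to act on that check is a small retry wrapper with backoff. The following is only a sketch, not part of the AppFabric API: `CacheRetry`, `Execute`, and their parameters are hypothetical names, and the helper deliberately has no AppFabric dependency so the caller decides which exceptions count as transient.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not part of AppFabric): retries an operation with
// exponential backoff while the caller-supplied predicate says the failure
// is transient.
public static class CacheRetry
{
    public static T Execute<T>(Func<T> operation, Func<Exception, bool> isTransient,
                               int maxAttempts, TimeSpan initialDelay)
    {
        if (maxAttempts < 1)
            throw new ArgumentOutOfRangeException("maxAttempts");

        TimeSpan delay = initialDelay;
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (Exception ex)
            {
                // Give up on the last attempt or on a non-transient error.
                if (attempt >= maxAttempts || !isTransient(ex))
                    throw; // rethrow, preserving the original stack trace

                Thread.Sleep(delay);
                delay = TimeSpan.FromTicks(delay.Ticks * 2); // double the wait
            }
        }
    }
}
```

With AppFabric, the predicate would test for a `DataCacheException` whose `ErrorCode` equals `DataCacheErrorCode.RetryLater`. The stack traces in the question show the exception surfacing from `CacheEnumerator.MoveNext()` inside `ToList()`, which suggests the tag query is enumerated lazily, so the `ToList()` call should sit inside the delegate passed to `Execute` rather than after it.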
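Using a named region forces the server to place that region's values on a single server rather than spreading them by hash across all cache servers. ("To provide this added search capability, objects within a region are limited to a single cache host." http://msdn.microsoft.com/en-us/library/ee790985(v=azure.10).aspx)

I suggest you shard the named regions across 2 more servers and put them in one cluster. That way you can confine the exceptions to a smaller server when it runs GC and tries to find more memory in which to place and store objects and tags.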

score 3 · Accepted Answer

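Reposting the answer given by Jeff-ITGuy on social.msdn.microsoft.com:

The problem you are experiencing seems almost identical to one I am currently working on with Microsoft. If it is the same problem, it is most likely caused by GC taking too long, which delays AppFabric's responses. Judging from your performance counters, GC time appears to increase when you start having problems, so it may well be the same issue.

Microsoft is actively investigating this. In the meantime, to mitigate it (at least based on our findings), you can run more servers with less memory each (shrinking the memory space the GC has to work through), and you can increase RequestTimeout on the client. The default is 15000 (15 seconds), but we tried raising it to 30000, which helped eliminate some of the problems. In my opinion this is not a good long-term solution; I am just passing the information along. I have seen the problem on servers with only 24GB of RAM (12GB for cache), and it only became noticeably better when we tried 8GB servers with 4GB set aside for cache, which let the GC do a better job.

Hope this helps, but if this is the problem I think it is, there is no solution yet.

It did help: after we reduced the cache host RAM to 16GB, the intermittent errors stopped.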
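For reference, the client-side settings mentioned in this thread (the request timeout raised to 30000 and maxConnectionsToServer set to 3) could be expressed in the client's application configuration roughly as below. This is a fragment under stated assumptions: the host name `cachehost1` is hypothetical, 22233 is the usual default cache port, and the surrounding `configSections` declaration is omitted.

```xml
<!-- Fragment only: the dataCacheClient section must also be declared
     in <configSections> of the same config file. -->
<dataCacheClient requestTimeout="30000" maxConnectionsToServer="3">
  <!-- requestTimeout is in milliseconds (the answer above cites 15000 as the default) -->
  <hosts>
    <!-- "cachehost1" is a hypothetical host name -->
    <host name="cachehost1" cachePort="22233" />
  </hosts>
</dataCacheClient>
```

As the answer notes, a longer timeout only hides long GC pauses on the cache host; it does not remove them.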

score 2 · Accepted Answer


The fix for this issue is currently available here: http://support.microsoft.com/kb/2787717

Answered 2013-09-17T09:19:42.853
