我有一个有趣的问题,我有两个进程的两个转储显示托管堆损坏。我在 Windows 7 x64 上的 x64 中使用 clr.dll 4.0.30319.1008 (RTMGDR.030319-1000)。使用 VerifyHeap 我知道我有损坏:
0:016> !VerifyHeap
object 000000000367ec60: bad member 0000000004fba740 at 000000000367ec78
curr_object: 000000000528CF90
Last good object: 000000000367ec40
该对象是一个包含两个元素的数组
0:016> !DumpObj /d 000000000367ec60
Name: System.Object[]
MethodTable: 000007feedf6adf8
EEClass: 000007feedaefc68
Size: 48(0x30) bytes
Array: Rank 1, Number of elements 2, Type CLASS (Print Array)
Element Type:System.Object
Fields:
None
0:016> !DumpArray /d 000000000367ec60
Name: System.Object[]
MethodTable: 000007feedf6adf8
EEClass: 000007feedaefc68
Size: 48(0x30) bytes
Array: Rank 1, Number of elements 2, Type CLASS
Element Methodtable: 000007feedf65a48
[0] 0000000004fba740
[1] 000000000367ec90
第一个指针是损坏的值,它确实指向值为 1 的布尔值,该值不是托管对象。这就是 GC 确实退出的原因。
0:016> db 0000000004fba740-10
00000000`04fba730 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 ................
00000000`04fba740 **01 00** 00 00 00 00 00 00-00 00 00 00 00 00 00 00 ................
00000000`04fba750 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 ................
00000000`04fba760 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 ................
00000000`04fba770 00 00 00 00 00 00 00 00-b8 1b f7 ed fe 07 00 00 ................
00000000`04fba780 d0 a7 fb 04 00 00 00 00-00 00 00 00 00 00 00 00 ................
00000000`04fba790 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 ................
00000000`04fba7a0 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 ................
0:016> !lno 04fba740
Before: 0000000004fba718 System.Collections.Hashtable+bucket[]
After: 0000000004fba778 System.Collections.Hashtable
Heap local consistency confirmed.
周围的物体并不重要,因为它们会根据转储随机变化。
!GCRoot 0000000367ec60
Scan Thread 16 OSTHread 5fd0
r10:Root: 000000000367ec60(System.Object[])
Scan Thread 17 OSTHread 10cc
RSP:1de4cd58:Root: 000000000367ec60(System.Object[])
数组本身没有根,这表明它可以被收集。有趣的是,数组中的第二个对象是来自已退出线程的 ThreadLocal 数据。看起来 CLR 确实将 ThreadLocal 对象存储在每个线程的对象数组中,当它退出时可以收集这些对象。线程号 17 执行确实引发 ExecutionEngineException 的实际收集。但是 Thread 16 似乎确实将 Thread Local 数据保存到一个应该被固定(它不是)它应该无权访问的数组中。
线程 nr 16 似乎保存了已退出线程的 TLS 数据,并且可能会写入它。
OS Thread Id: 0x5fd0 (16)
Child SP IP Call Site
000000001dffdfe8 0000000076eb135a [NDirectMethodFrameStandalone: 000000001dffdfe8] MS.Win32.UnsafeNativeMethods.MsgWaitForMultipleObjects(Int32, IntPtr[], Boolean, Int32, Int32)
000000001dffdfa0 000007fecfa7e1bd DomainBoundILStubClass.IL_STUB_PInvoke(Int32, IntPtr[], Boolean, Int32, Int32)*** WARNING: Unable to verify checksum for UIAutomationClientsideProviders.ni.dll
000000001dffe090 000007fecfa7b28d MS.Internal.AutomationProxies.Misc.MsgWaitForMultipleObjects(Microsoft.Win32.SafeHandles.SafeWaitHandle, Boolean, Int32, Int32)
000000001dffe110 000007fecfab5cdd MS.Internal.AutomationProxies.QueueProcessor.WaitForWork()
000000001dffe1b0 000007feede22f78 System.Threading.ExecutionContext.runTryCode(System.Object)*** WARNING: Unable to verify checksum for mscorlib.ni.dll
000000001dffe8d8 000007fef08044c4 [HelperMethodFrame_PROTECTOBJ: 000000001dffe8d8] System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(TryCode, CleanupCode, System.Object)
000000001dffea00 000007feede11661 System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
000000001dffea60 000007feede115ab System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
000000001dffeab0 000007feedea6d8d System.Threading.ThreadHelper.ThreadStart()
000000001dffef08 000007fef08044c4 [GCFrame: 000000001dffef08]
000000001dfff2f0 000007fef08044c4 [DebuggerU2MCatchHandlerFrame: 000000001dfff2f0]
这是 GC 收集的堆栈:
0:017> !DumpStack
OS Thread Id: 0x10cc (17)
Current frame: clr!WKS::gc_heap::mark_object_simple+0x75
Child-SP RetAddr Caller, Callee
000000001de4cce0 000007fef0877fb2 clr!WKS::gc_heap::mark_through_cards_for_segments+0x36b
000000001de4ce50 000007fef0873980 clr!WKS::gc_heap::mark_phase+0x160, calling clr!WKS::gc_heap::mark_through_cards_for_segments
000000001de4ce80 000007fef086fce7 clr!EEJitManager::CleanupCodeHeaps+0x57, calling clr!CrstBase::Leave
000000001de4cea0 000007fef07e3dc1 clr!CrstBase::Leave+0x31, calling clr!GetThread
000000001de4ced0 000007fef0873f3d clr!WKS::gc_heap::gc1+0xae, calling clr!WKS::gc_heap::mark_phase
000000001de4cef0 000007fef0874786 clr!WKS::gc_heap::update_collection_counts+0x16, calling 000000000065006e
000000001de4cf20 000007fef0a1fa56 clr!WKS::gc_heap::garbage_collect+0x42e, calling clr!WKS::gc_heap::gc1
000000001de4cf60 000007feede2d774 (MethodDesc 000007feedaa93b8 +0x124 System.TimeZoneInfo.GetDateTimeNowUtcOffsetFromUtc(System.DateTime, Boolean ByRef)), calling (MethodDesc 000007feedaa8708 +0 System.TimeSpan.Add(System.TimeSpan))
000000001de4cfa0 000007fef07fd4ff clr!SystemNative::__GetSystemTimeAsFileTime+0xf, calling kernel32!GetSystemTimeAsFileTimeStub
000000001de4cff0 000007fef087452e clr!WKS::GCHeap::GarbageCollectGeneration+0x14e, calling clr!WKS::gc_heap::garbage_collect
000000001de4d040 000007fef08734ce clr!WKS::gc_heap::try_allocate_more_space+0x25f, calling clr!WKS::GCHeap::GarbageCollectGeneration
000000001de4d080 000007fef0872f43 clr!WKS::gc_heap::allocate_small+0x158, calling clr!WKS::gc_heap::a_fit_segment_end_p
000000001de4d110 000007fef08731fe clr!FastAllocateObject+0x73e, calling clr!WKS::gc_heap::try_allocate_more_space
000000001de4d1f0 000007fef07fc8b8 clr!JIT_NewFast+0xb8, calling clr!FastAllocateObject
000000001de4d2c8 000007feede3fa80 (MethodDesc 000007feedaaa8e8 +0x40 System.Text.StringBuilder.ExpandByABlock(Int32)), calling clr!JIT_TrialAllocSFastMP_InlineGetThread
0:016> !Threads
ThreadCount: 17
UnstartedThread: 0
BackgroundThread: 13
PendingThread: 0
DeadThread: 1
Hosted Runtime: no
PreEmptive Lock
ID OSID ThreadOBJ State GC GC Alloc Context Domain Count APT Exception
0 1 58e4 0000000000498ba0 2006020 Enabled 0000000000000000:0000000000000000 0000000000481df0 0 STA
2 2 4190 000000000049ee80 b220 Enabled 0000000000000000:0000000000000000 0000000000481df0 0 MTA (Finalizer)
6 3 48d4 000000001ac8bb60 1000220 Enabled 0000000000000000:0000000000000000 0000000000481df0 0 Ukn (Threadpool Worker)
8 5 5fbc 000000001aca1970 a009220 Enabled 0000000000000000:0000000000000000 0000000000481df0 0 MTA (Threadpool Completion Port)
9 6 615c 000000001c4b2880 b020 Enabled 0000000000000000:0000000000000000 0000000000481df0 0 MTA
10 7 5818 000000001c4e7bd0 200b220 Enabled 0000000000000000:0000000000000000 0000000000481df0 0 MTA
11 8 6e14 000000001c4f0850 7020 Enabled 0000000000000000:0000000000000000 0000000000481df0 2 STA
12 a 683c 000000001c512610 7220 Enabled 0000000000000000:0000000000000000 0000000000481df0 0 STA
14 b 6f40 000000001c521120 7220 Enabled 0000000000000000:0000000000000000 0000000000481df0 0 STA
15 c 5070 000000001c564760 100a220 Enabled 0000000000000000:0000000000000000 0000000000481df0 0 MTA (Threadpool Worker)
16 d 5fd0 000000000049bc10 b220 Enabled 0000000000000000:0000000000000000 0000000000481df0 0 MTA
17 e 10cc 000000001c62e370 b220 Enabled 0000000000000000:0000000000000000 0000000000481df0 2 MTA (GC) System.ExecutionEngineException (0000000002441228)
XXXX f 000000001e102c80 15820 Enabled 0000000000000000:0000000000000000 0000000000481df0 0 Ukn
22 10 158c 000000001e103aa0 1009220 Enabled 0000000000000000:0000000000000000 0000000000481df0 0 MTA (Threadpool Worker)
23 12 47e8 000000001e1048c0 8019220 Enabled 0000000000000000:0000000000000000 0000000000481df0 0 Ukn (Threadpool Completion Port)
24 4 58a8 000000001e103390 8019220 Enabled 0000000000000000:0000000000000000 0000000000481df0 0 Ukn (Threadpool Completion Port)
25 9 2874 000000001e102570 8009220 Enabled 0000000000000000:0000000000000000 0000000000481df0 0 MTA (Threadpool Completion Port)
这一切都很好而且很有趣,但我不知道如何进一步进行。由于错误确实发生在自动化测试机器上,其中测试控制器进程每天大约死亡 1-2 次,因此我不能简单地将调试器附加到进程并设置一些断点来保护对特定内存位置的写入。非常欢迎任何其他提示如何解决这个问题。我将获得更多转储,以便至少能够进行差异分析以检查哪些测试可能导致这种情况。
在我看来,确实保存线程静态的 CLR 数组是未固定的,并且有人确实将未装箱的 bool 值写入第一个数组元素。CLR 数组不包含值,而是包含通常托管对象应位于的地址,但只有 bool 值(一),而不包含通常的 CLR 对象及其对象头。
错误的 PInvoke 签名会导致这种行为吗?我见过一些像
[DllImport( "kernel32.dll" )]
public static extern bool Beep( int frequeny_in, int time_in );
它确实返回一个字节的布尔值,但 Beep 方法确实返回一个 4 字节的布尔值。来自 PInvoke 的错误返回类型(bool 而不是 int)会导致此类问题吗?