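One day at the Zhejiang University national laboratory, Lao Fang, Xiao Cui and I were debugging a deadlock in the monitoring system. The deadlock showed up on a row of rack-mounted servers in the cabinet. Looking at it in WinDbg, we found that the LockCount of the critical section behind the deadlock was less than -1!!
We reproduced the problem many times and found that LockCount was often around negative two or three hundred.
In a spirit that was not terribly rigorous, yet still slightly rigorous, we made a show of looking things up and found the following:
What LockCount means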
ms-help://MS.MSDNQTR.v80.en/MS.MSDN.v80/dnmag03/html/CriticalSections1203default.htm or http://msdn.microsoft.com/zh-cn/magazine/cc164040(en-us).aspx

    struct RTL_CRITICAL_SECTION
    {
        PRTL_CRITICAL_SECTION_DEBUG DebugInfo;
        LONG LockCount;
        LONG RecursionCount;
        HANDLE OwningThread;
        HANDLE LockSemaphore;
        ULONG_PTR SpinCount;
    };

LockCount
This is the most important field in a critical section. It is initialized to a value of -1; a value of 0 or greater indicates that the critical section is held or owned. When it's not equal to -1, the OwningThread field (this field is incorrectly defined in WINNT.H; it should be a DWORD instead of a HANDLE) contains the thread ID that owns this critical section. The delta between this field and the value of (RecursionCount - 1) indicates how many additional threads are waiting to acquire the critical section.
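The subtraction described in the quoted passage can be written out as a tiny helper. This is only a sketch for the layout the article describes: `legacy_waiting_threads` is a hypothetical name, not a Windows API, and it takes plain integers rather than reading a live RTL_CRITICAL_SECTION.

```c
/* Hypothetical helper (not a Windows API): number of threads waiting on a
   critical section under the layout described in the quoted article. */
long legacy_waiting_threads(long lock_count, long recursion_count)
{
    if (lock_count == -1) {
        return 0;  /* -1 means the critical section is free */
    }
    /* LockCount minus (RecursionCount - 1) leaves only the waiters. */
    return lock_count - (recursion_count - 1);
}
```

For example, an owner that has entered twice (RecursionCount 2) with two other threads blocked would show LockCount 3, and the helper gives 3 - (2 - 1) = 2.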
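How the value of LockCount changes
Many articles online, working from how critical sections operate, summarize the two functions that change LockCount with pseudocode like this: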
    _RtlTryEnterCriticalSection
    if (CriticalSection->LockCount == -1)
    {
        // The critical section is free
        CriticalSection->LockCount = 0;
        CriticalSection->OwningThread = TEB->ClientID;
        CriticalSection->RecursionCount = 1;
        return TRUE;
    }
    else
    {
        if (CriticalSection->OwningThread == TEB->ClientID)
        {
            // The current thread already owns the critical section
            CriticalSection->LockCount++;
            CriticalSection->RecursionCount++;
            return TRUE;
        }
        else
        {
            // Another thread owns the critical section
            return FALSE;
        }
    }
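    _RtlLeaveCriticalSection
    if (--CriticalSection->RecursionCount == 0)
    {
        // The owner has fully released the critical section
        CriticalSection->OwningThread = 0;
        // The commonly circulated version tests "if (--LockCount)", which is
        // also true for -1; the ">= 0" test matches the comment's intent.
        if (--CriticalSection->LockCount >= 0)
        {
            // Other threads are still blocked on the critical section
            _RtlpUnWaitCriticalSection(CriticalSection);
        }
    }
    else
    {
        --CriticalSection->LockCount;
    }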
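From the material above we can infer fairly clearly:
1. RecursionCount can drop below its initial value of 0 because of surplus LeaveCriticalSection calls (we have confirmed this in practice).
2. As long as Enter and Leave calls are balanced, LockCount can only be greater than or equal to its initial value of -1.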
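To check these inferences against the pseudocode, it can be transcribed into a small portable model. This is a single-threaded sketch under stated assumptions: `ModelCs`, `model_init`, `model_try_enter` and `model_leave` are hypothetical names, the thread ID is passed in explicitly, there are no atomic operations, waking a waiter is reduced to a counter, and the leave path uses a `>= 0` test on LockCount. It is not the real ntdll implementation.

```c
#include <stdbool.h>

/* Hypothetical model of the legacy fields; not the real RTL_CRITICAL_SECTION. */
typedef struct {
    long lock_count;       /* -1 when free */
    long recursion_count;  /* 0 when free */
    unsigned long owner;   /* owning thread ID, 0 when free */
    int wake_calls;        /* how many times a waiter would have been woken */
} ModelCs;

void model_init(ModelCs *cs)
{
    cs->lock_count = -1;
    cs->recursion_count = 0;
    cs->owner = 0;
    cs->wake_calls = 0;
}

/* Mirrors the _RtlTryEnterCriticalSection pseudocode. */
bool model_try_enter(ModelCs *cs, unsigned long tid)
{
    if (cs->lock_count == -1) {   /* free: take ownership */
        cs->lock_count = 0;
        cs->owner = tid;
        cs->recursion_count = 1;
        return true;
    }
    if (cs->owner == tid) {       /* recursive entry by the owner */
        cs->lock_count++;
        cs->recursion_count++;
        return true;
    }
    return false;                 /* owned by another thread */
}

/* Mirrors the _RtlLeaveCriticalSection pseudocode. */
void model_leave(ModelCs *cs)
{
    if (--cs->recursion_count == 0) {
        cs->owner = 0;
        if (--cs->lock_count >= 0) {
            cs->wake_calls++;     /* stands in for _RtlpUnWaitCriticalSection */
        }
    } else {
        --cs->lock_count;
    }
}
```

With balanced calls from one thread (enter, enter, leave, leave), the model's LockCount goes -1, 0, 1, 0, -1 and never drops below -1, which is the behavior the inference relies on.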
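Once again, theory seemed to disagree with the facts!
We started speculating wildly and came up with these possibilities:
1. EnterCriticalSection was aborted by an exception halfway through.
The odds of this are small, and even if it happened, there is no obvious reason for LockCount to end up at something as absurd as negative two or three hundred.
2. Memory corruption overwrote the RTL_CRITICAL_SECTION structure.
But we could not confirm any of these guesses.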
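Then, by sheer chance -_-!!! , while experimenting on my own computer, I also saw LockCount drop below -1! And it happened every single time!
My computer runs Windows Vista, which naturally led us to this guess:
On some operating system versions, the LockCount mechanism is simply different!!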
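This guess looked plausible, so we set out to verify it right away. The machines in the lab that showed the problem all ran Windows 2003 + SP1. We immediately tested on a Windows 2003 + SP1 system with a very simple program that creates a critical section and then calls EnterCriticalSection, and sure enough, LockCount became -2! In a multithreaded test, values around negative two or three hundred did show up as well.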
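So the meaning of LockCount really does differ between Windows versions.
After that we searched online many times for how the meaning of LockCount changed across Windows versions, but got nowhere.
Then, by another stroke of luck -_-!!! , Lao Fang found a passage in the WinDbg help documentation describing the change to LockCount. The full text follows (truly a case of wearing out iron shoes in a fruitless search, only to stumble on the answer without any effort):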
Interpreting Critical Section Fields in Windows Server 2003 SP1 and Later

In Microsoft Windows Server 2003 Service Pack 1 and later versions of Windows, the LockCount field is parsed as follows:

The lowest bit shows the lock status. If this bit is 0, the critical section is locked; if it is 1, the critical section is not locked.
The next bit shows whether a thread has been woken for this lock. If this bit is 0, then a thread has been woken for this lock; if it is 1, no thread has been woken.
The remaining bits are the ones-complement of the number of threads waiting for the lock.

As an example, suppose the LockCount is -22. The lowest bit can be determined in this way:

    0:009> ? 0x1 & (-0n22)
    Evaluate expression: 0 = 00000000

The next-lowest bit can be determined in this way:

    0:009> ? (0x2 & (-0n22)) >> 1
    Evaluate expression: 1 = 00000001

The ones-complement of the remaining bits can be determined in this way:

    0:009> ? ((-1) - (-0n22)) >> 2
    Evaluate expression: 5 = 00000005

In this example, the first bit is 0 and therefore the critical section is locked. The second bit is 1, and so no thread has been woken for this lock. The complement of the remaining bits is 5, and so there are five threads waiting for this lock.
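The three debugger expressions in the help text translate directly into C. The following is a minimal sketch of that decoding; `LockCountInfo` and `decode_lock_count` are hypothetical names, not part of any Windows header, and the function assumes a 32-bit LockCount in the Windows Server 2003 SP1 and later format (so the value is at most -1).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type for the decoded LockCount field. */
typedef struct {
    bool locked;              /* lowest bit is 0 */
    bool thread_woken;        /* next bit is 0 */
    int32_t waiting_threads;  /* ones-complement of the remaining bits */
} LockCountInfo;

/* Decodes a LockCount value as described in the WinDbg help text.
   Assumes lock_count <= -1, which always holds in the newer format. */
LockCountInfo decode_lock_count(int32_t lock_count)
{
    uint32_t bits = (uint32_t)lock_count;
    LockCountInfo info;

    info.locked = (bits & 0x1u) == 0;        /* ? 0x1 & LockCount */
    info.thread_woken = (bits & 0x2u) == 0;  /* ? (0x2 & LockCount) >> 1 */
    /* ? ((-1) - LockCount) >> 2; the operand is non-negative here. */
    info.waiting_threads = (int32_t)((-1 - lock_count) >> 2);
    return info;
}
```

Decoding -22 reproduces the worked example from the help text (locked, no thread woken, five waiters). Decoding -2, the value we saw right after a single EnterCriticalSection, gives a locked critical section with no waiters, and -1 decodes as unlocked.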
With that, the mystery was finally solved!