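One day at the Zhejiang University national lab, Lao Fang, Xiao Cui and I were debugging a deadlock in the monitoring system. It showed up on a whole row of rack-mounted servers in the cabinet. Looking at it in WinDbg, we found that the LockCount of the critical section causing the deadlock was a number less than -1!!
After reproducing the problem many times, we found that LockCount was often around negative two or three hundred.
In a spirit that was not terribly scientific or rigorous, but still a little bit so, we went through the motions of looking things up, and found the following: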
What LockCount means
ms-help://MS.MSDNQTR.v80.en/MS.MSDN.v80/dnmag03/html/CriticalSections1203default.htm or http://msdn.microsoft.com/zh-cn/magazine/cc164040(en-us).aspx

    struct RTL_CRITICAL_SECTION
    {
        PRTL_CRITICAL_SECTION_DEBUG DebugInfo;
        LONG LockCount;
        LONG RecursionCount;
        HANDLE OwningThread;
        HANDLE LockSemaphore;
        ULONG_PTR SpinCount;
    };

LockCount
This is the most important field in a critical section. It is initialized to a value of -1; a value of 0 or greater indicates that the critical section is held or owned. When it's not equal to -1, the OwningThread field (this field is incorrectly defined in WINNT.H; it should be a DWORD instead of a HANDLE) contains the thread ID that owns this critical section. The delta between this field and the value of (RecursionCount -1) indicates how many additional threads are waiting to acquire the critical section.

In other words, LockCount - (RecursionCount - 1) is the number of other threads waiting to acquire the critical section.
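How the value of LockCount changes
Many articles online, working from the principles of critical sections, summarize the two functions that change LockCount with pseudocode like the following: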
_RtlTryEnterCriticalSection

    if (CriticalSection->LockCount == -1)
    {
        // The critical section is free
        CriticalSection->LockCount = 0;
        CriticalSection->OwningThread = TEB->ClientID;
        CriticalSection->RecursionCount = 1;
        return TRUE;
    }
    else
    {
        if (CriticalSection->OwningThread == TEB->ClientID)
        {
            // The critical section is already owned by the current thread
            CriticalSection->LockCount++;
            CriticalSection->RecursionCount++;
            return TRUE;
        }
        else
        {
            // The critical section is owned by another thread
            return FALSE;
        }
    }
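_RtlLeaveCriticalSection

    if (--CriticalSection->RecursionCount == 0)
    {
        // The owner has fully released the critical section
        CriticalSection->OwningThread = 0;
        if (--CriticalSection->LockCount >= 0)
        {
            // Other threads are still blocked on the critical section
            _RtlpUnWaitCriticalSection(CriticalSection);
        }
    }
    else
    {
        // Recursive release: the current thread still owns it
        --CriticalSection->LockCount;
    }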
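From the above, two things can be inferred fairly clearly:
1. RecursionCount can drop below its initial value of 0 because of extra calls to LeaveCriticalSection (confirmed in practice)
2. LockCount can only be greater than or equal to its initial value of -1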
Once again, theory did not seem to match the facts!
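We started speculating wildly and came up with the following possibilities:
1. EnterCriticalSection was aborted halfway through
The chance of this is small, and even if it happened, there is no reason it would push LockCount to something as outlandish as negative two or three hundred.
2. Memory corruption overwrote the RTL_CRITICAL_SECTION structure.
But we could not confirm any of these guesses.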
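By pure chance -_-!!!, while experimenting on my own computer, I also saw a LockCount less than -1! And it happened every single time!
My computer runs Windows Vista, so we naturally arrived at the following conjecture:
On some operating system versions, the LockCount mechanism simply works differently!!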
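This conjecture seemed plausible, so we set about verifying it right away. The machines in the lab that showed the problem were all running Windows2003+SP1. We immediately ran a test on a Windows2003+SP1 system: a very simple program that creates a critical section and then calls EnterCriticalSection. Sure enough, LockCount became -2! And in a multithreaded test, values around negative two or three hundred did indeed appear.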
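So the meaning of LockCount really does differ between Windows versions.
After that we searched online many times for how the meaning of LockCount had changed across Windows versions, but got nowhere.
By another stroke of luck -_-!!!, Lao Fang found a passage in the WinDbg help documentation describing the change to LockCount. The full text follows (after searching high and low, it simply fell into our laps):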
Interpreting Critical Section Fields in Windows Server 2003 SP1 and Later

In Microsoft Windows Server 2003 Service Pack 1 and later versions of Windows, the LockCount field is parsed as follows:

The lowest bit shows the lock status. If this bit is 0, the critical section is locked; if it is 1, the critical section is not locked.
The next bit shows whether a thread has been woken for this lock. If this bit is 0, then a thread has been woken for this lock; if it is 1, no thread has been woken.
The remaining bits are the ones-complement of the number of threads waiting for the lock.

As an example, suppose the LockCount is -22. The lowest bit can be determined in this way:

    0:009> ? 0x1 & (-0n22)
    Evaluate expression: 0 = 00000000

The next-lowest bit can be determined in this way:

    0:009> ? (0x2 & (-0n22)) >> 1
    Evaluate expression: 1 = 00000001

The ones-complement of the remaining bits can be determined in this way:

    0:009> ? ((-1) - (-0n22)) >> 2
    Evaluate expression: 5 = 00000005

In this example, the first bit is 0 and therefore the critical section is locked. The second bit is 1, and so no thread has been woken for this lock. The complement of the remaining bits is 5, and so there are five threads waiting for this lock.
With that, the mystery was finally solved!