學(xué)習(xí)高效編程的有效途徑之一就是閱讀高手寫的源代碼,CRT(C/C++ Runtime Library)作為底層的函數(shù)庫,實(shí)現(xiàn)必然高效。恰好手中就有g(shù)libc和VC的CRT源代碼,于是挑了一個相對簡單的函數(shù)strlen研究了一下,并對各種實(shí)現(xiàn)作了簡單的效率測試。
strlen的函數(shù)原形如下:
??????size_t strlen(const char *str);
strlen返回str中字符的個數(shù),其中str為一個以'\0'結(jié)尾的字符串(a null-terminated string)。
1. 簡單實(shí)現(xiàn)
如果不管效率,最簡單的實(shí)現(xiàn)只需要4行代碼:
1?size_t?strlen_a(const?char?*
str)?{
2?????size_t?length?=?0
;
3?????while?(*str++
)
4?????????++
length;
5?????return
?length;
6?}
也許可以稍加改進(jìn)如下:
1?size_t?strlen_b(const?char?*
str)?{
2?????const?char?*cp?=
?str;
3?????while?(*cp++
)
4?
????????;
5?????return?(cp?-?str?-?1
);
6?}
2. 高效實(shí)現(xiàn)
很顯然,標(biāo)準(zhǔn)庫的實(shí)現(xiàn)肯定不會如此簡單,上面的strlen_a以及strlen_b都是一次判斷一個字符直到發(fā)現(xiàn)'\0'為止,這是非常低效的。比較高效的實(shí)現(xiàn)如下(在這里WORD表示計算機(jī)中的一個字,不是WORD類型):
(1) 一次判斷一個字符直到內(nèi)存對齊,如果在內(nèi)存對齊之前就遇到'\0'則直接return,否則到(2);
(2) 一次讀入并判斷一個WORD,如果此WORD中沒有為0的字節(jié),則繼續(xù)下一個WORD,否則到(3);
(3) 到這里則說明WORD中至少有一個字節(jié)為0,剩下的就是找出第一個為0的字節(jié)的位置然后return。
NOTE:
數(shù)據(jù)對齊(data alignment),是指數(shù)據(jù)所在的內(nèi)存地址必須是該數(shù)據(jù)長度的整數(shù)倍,這樣CPU的存取速度最快。比如在32位的計算機(jī)中,一個WORD為4 byte,則WORD數(shù)據(jù)的起始地址能被4整除的時候CPU的存取效率比較高。CPU的優(yōu)化規(guī)則大概如下:對于n字節(jié)(n = 2,4,8...)的元素,它的首地址能被n整除才能獲得最好的性能。
為了便于下面的討論,這里假設(shè)所用的計算機(jī)為32位,即一個WORD為4個字節(jié)。下面給出在32位計算機(jī)上的C語言實(shí)現(xiàn)(假設(shè)unsigned long為4個字節(jié)):
?1?typedef?unsigned?long
?ulong;
?2?
?3?size_t?strlen_c(const?char?*
str)?{
?4?
?5?????const?char?*
char_ptr;
?6?????const?ulong?*
longword_ptr;
?7?
????register?ulong?longword,?magic_bits;
?8?
?9?????for?(char_ptr?=
?str;?((ulong)char_ptr?
10?????????&?(sizeof(ulong)?-?1))?!=?0
;
11?????????++
char_ptr)?{
12?????????if?(*char_ptr?==?'\0'
)
13?????????????return?char_ptr?-
?str;
14?
????}
15?
16?????longword_ptr?=?(ulong*
)char_ptr;
17?
18?????magic_bits?=?0x7efefeffL
;
19?
20?????while?(1
)?{
21?
22?????????longword?=?*longword_ptr++
;
23?
24?????????if?((((longword?+?magic_bits)?^?~longword)?&?~magic_bits)?!=?0
)?{
25?
26?????????????const?char?*cp?=?(const?char*)(longword_ptr?-?1
);
27?
????????????
28?????????????if?(cp[0]?==?0
)
29?????????????????return?cp?-
?str;
30?????????????if?(cp[1]?==?0
)
31?????????????????return?cp?-?str?+?1
;
32?????????????if?(cp[2]?==?0
)
33?????????????????return?cp?-?str?+?2
;
34?????????????if?(cp[3]?==?0
)
35?????????????????return?cp?-?str?+?3
;
36?
????????}
37?
????}
38?}
3. 源碼剖析
上面給出的C語言實(shí)現(xiàn)雖然不算特別復(fù)雜,但也值得花點(diǎn)時間來弄清楚,先看9-14行:
for?(char_ptr?=?str;?((ulong)char_ptr?&?(sizeof(ulong)?-?1))?!=?0;?++char_ptr)?{
????if?(*char_ptr?==?'\0')
????????return?char_ptr?-?str;
}
上面的代碼實(shí)現(xiàn)了數(shù)據(jù)對齊,如果在對齊之前就遇到'\0'則可以直接return char_ptr - str;
第16行將longword_ptr指向數(shù)據(jù)對齊后的首地址
longword_ptr?=?(ulong*)char_ptr;
第18行給magic_bits賦值(在后面會解釋這個值的意義)
magic_bits?=?0x7efefeffL;
第22行讀入一個WORD到longword并將longword_ptr指向下一個WORD
longword?=?*longword_ptr++;
第24行的if語句是整個算法的核心,該語句判斷22行讀入的WORD中是否有為0的字節(jié)
if?((((longword?+?magic_bits)?^?~longword)?&?~magic_bits)?!=?0)
if語句中的計算可以分為如下3步:
(1) longword + magic_bits
其中magic_bits的二進(jìn)制表示如下:
??????????????????b3??????b2?????? b1?????? b0
??????????????31------------------------------->0
??magic_bits:?01111110?11111110?11111110?11111111
magic_bits中的31,24,16,8這些bits都為0,我們把這幾個bits稱為holes,注意在每個byte的左邊都有一個hole。
檢測0字節(jié):如果longword 中有一個字節(jié)的所有bit都為0,則進(jìn)行加法后,從這個字節(jié)的右邊的字節(jié)傳遞來的進(jìn)位都會落到這個字節(jié)的最低位所在的hole上,而從這個字節(jié)的最高位則永遠(yuǎn)不會產(chǎn)生向左邊字節(jié)的hole的進(jìn)位。則這個字節(jié)左邊的hole在進(jìn)行加法后不會改變,由此可以檢測出0字節(jié);相反,如果longword中所有字節(jié)都不為0,則每個字節(jié)中至少有1位為1,進(jìn)行加法后所有的hole都會被改變。
為了便于理解,請看下面的例子:
??????????????????b3??????b2?????? b1?????? b0
??????????????31------------------------------->0
??longword:???XXXXXXXX?XXXXXXXX?00000000?XXXXXXXX
+?magic_bits:?01111110?11111110?11111110?11111111
上面longword中的b1為0,X可能為0也可能為1。因?yàn)閎1的所有bit都為0,而從b0傳遞過來的進(jìn)位只可能是0或1,很顯然b1永遠(yuǎn)也不會產(chǎn)生進(jìn)位,所以加法后longword的第16 bit這個hole不會變。
(2)? ^ ~longword
這一步取出加法后longword中所有未改變的bit。
(3)? & ~magic_bits
最后取出longword中未改變的hole,如果有任何hole未改變則說明longword中有為0的字節(jié)。
根據(jù)上面的描述,如果longword中有為0的字節(jié),則if中的表達(dá)式結(jié)果為非0,否則為0。
NOTE:
如果b3為10000000,則進(jìn)行加法后第31 bit這個hole不會變,這說明我們無法檢測出b3為10000000的所有WORD。值得慶幸的是用于strlen的字符串都是ASCII標(biāo)準(zhǔn)字符,其值在0-127之間,這意味著每一個字節(jié)的第一個bit都為0。因此上面的算法是安全的。
一旦檢測出longword中有為0的字節(jié),后面的代碼只需要找到第一個為0的字節(jié)并返回相應(yīng)的長度就OK:
const?char?*cp?=?(const?char*)(longword_ptr?-?1);
if?(cp[0]?==?0)
????return?cp?-?str;
if?(cp[1]?==?0)
????return?cp?-?str?+?1;
if?(cp[2]?==?0)
????return?cp?-?str?+?2;
if?(cp[3]?==?0)
????return?cp?-?str?+?3;
4. 另一種實(shí)現(xiàn)?1?size_t?strlen_d(const?char?*str)?{
?2?
?3?????const?char?*char_ptr;
?4?????const?ulong?*longword_ptr;
?5?????register?ulong?longword,?himagic,?lomagic;
?6?
?7?????for?(char_ptr?=?str;?((ulong)char_ptr?
?8?????????&?(sizeof(ulong)?-?1))?!=?0;
?9?????????++char_ptr)?{
10?????????if?(*char_ptr?==?'\0')
11?????????????return?char_ptr?-?str;
12?????}
13?
14?????longword_ptr?=?(ulong*)char_ptr;
15?
16?????himagic?=?0x80808080L;
17?????lomagic?=?0x01010101L;
18?
19?????while?(1)?{
20?
21?????????longword?=?*longword_ptr++;
22?
23?????????if?(((longword?-?lomagic)?&?himagic)?!=?0)?{
24?
25?????????????const?char?*cp?=?(const?char*)(longword_ptr?-?1);
26?????????????
27?????????????if?(cp[0]?==?0)
28?????????????????return?cp?-?str;
29?????????????if?(cp[1]?==?0)
30?????????????????return?cp?-?str?+?1;
31?????????????if?(cp[2]?==?0)
32?????????????????return?cp?-?str?+?2;
33?????????????if?(cp[3]?==?0)
34?????????????????return?cp?-?str?+?3;
35?????????}
36?????}
37?}
上面的代碼與strlen_c基本一樣,不同的是:
magic_bits換成了himagic和lomagic
himagic?=?0x80808080L;
lomagic?=?0x01010101L;
以及?if語句變得比較簡單了
if?(((longword?-?lomagic)?&?himagic)?!=?0)
if語句中的計算可以分為如下2步:
(1) longword - lomagic
himagic和lomagic的二進(jìn)制表示如下:
????????????????b3??????b2???????b1???????b0
????????????31------------------------------->0
??himagic:??10000000?10000000?10000000?10000000
? lomagic:??00000001 00000001?00000001?00000001
在這種方法中假設(shè)所有字符都是ASCII標(biāo)準(zhǔn)字符,其值在0-127之間,因此longword總是如下形式:
????????????????b3??????b2???????b1???????b0
????????????31------------------------------->0
??longword:?0XXXXXXX 0XXXXXXX?0XXXXXXX?0XXXXXXX
檢測0字節(jié):
如果longword 中有一個字節(jié)的所有bit都為0,則進(jìn)行減法后,這個字節(jié)的最高位一定會從0變?yōu)?;相反,如果longword中所有字節(jié)都不為0,則每個字節(jié)中至少有1位為1,進(jìn)行減法后這個字節(jié)的最高位依然為0。
?(2)? & himagic
這一步取出每個字節(jié)最高位的1,如果有任意字節(jié)最高位為1則說明longword中有為0的字節(jié)。
根據(jù)上面的描述,如果longword中有為0的字節(jié),則if中的表達(dá)式結(jié)果為非0,否則為0。
5. 匯編實(shí)現(xiàn)
VC CRT的匯編實(shí)現(xiàn)與前面strlen_c算法類似
??1?????????page????,132
??2?????????title???strlen?-?return?the?length?of?a?null-terminated?string
??3?;***
??4?;strlen.asm?-?contains?strlen()?routine
??5?;
??6?;???????Copyright?(c)?Microsoft?Corporation.?All?rights?reserved.
??7?;
??8?;Purpose:
??9?;???????strlen?returns?the?length?of?a?null-terminated?string,
?10?;???????not?including?the?null?byte?itself.
?11?;
?12?;*******************************************************************************
?13?
?14?????????.xlist
?15?????????include?cruntime.inc
?16?????????.list
?17?
?18?page
?19?;***
?20?;strlen?-?return?the?length?of?a?null-terminated?string
?21?;
?22?;Purpose:
?23?;???????Finds?the?length?in?bytes?of?the?given?string,?not?including
?24?;???????the?final?null?character.
?25?;
?26?;???????Algorithm:
?27?;???????int?strlen?(const?char?*?str)
?28?;???????{
?29?;???????????int?length?=?0;
?30?;
?31?;???????????while(?*str++?)
?32?;???????????????????++length;
?33?;
?34?;???????????return(?length?);
?35?;???????}
?36?;
?37?;Entry:
?38?;???????const?char?*?str?-?string?whose?length?is?to?be?computed
?39?;
?40?;Exit:
?41?;???????EAX?=?length?of?the?string?"str",?exclusive?of?the?final?null?byte
?42?;
?43?;Uses:
?44?;???????EAX,?ECX,?EDX
?45?;
?46?;Exceptions:
?47?;
?48?;*******************************************************************************
?49?
?50?????????CODESEG
?51?
?52?????????public??strlen
?53?
?54?strlen??proc?\
?55?????????buf:ptr?byte
?56?
?57?????????OPTION?PROLOGUE:NONE,?EPILOGUE:NONE
?58?
?59?????????.FPO????(?0,?1,?0,?0,?0,?0?)
?60?
?61?string??equ?????[esp?+?4]
?62?
?63?????????mov?????ecx,string??????????????;?ecx?->?string
?64?????????test????ecx,3???????????????????;?test?if?string?is?aligned?on?32?bits
?65?????????je??????short?main_loop
?66?
?67?str_misaligned:
?68?????????;?simple?byte?loop?until?string?is?aligned
?69?????????mov?????al,byte?ptr?[ecx]
?70?????????add?????ecx,1
?71?????????test????al,al
?72?????????je??????short?byte_3
?73?????????test????ecx,3
?74?????????jne?????short?str_misaligned
?75?
?76?????????add?????eax,dword?ptr?0?????????;?5?byte?nop?to?align?label?below
?77?
?78?????????align???16??????????????????????;?should?be?redundant
?79?
?80?main_loop:
?81?????????mov?????eax,dword?ptr?[ecx]?????;?read?4?bytes
?82?????????mov?????edx,7efefeffh
?83?????????add?????edx,eax
?84?????????xor?????eax,-1
?85?????????xor?????eax,edx
?86?????????add?????ecx,4
?87?????????test????eax,81010100h
?88?????????je??????short?main_loop
?89?????????;?found?zero?byte?in?the?loop
?90?????????mov?????eax,[ecx?-?4]
?91?????????test????al,al???????????????????;?is?it?byte?0
?92?????????je??????short?byte_0
?93?????????test????ah,ah???????????????????;?is?it?byte?1
?94?????????je??????short?byte_1
?95?????????test????eax,00ff0000h???????????;?is?it?byte?2
?96?????????je??????short?byte_2
?97?????????test????eax,0ff000000h??????????;?is?it?byte?3
?98?????????je??????short?byte_3
?99?????????jmp?????short?main_loop?????????;?taken?if?bits?24-30?are?clear?and?bit
100?????????????????????????????????????????;?31?is?set
101?
102?byte_3:
103?????????lea?????eax,[ecx?-?1]
104?????????mov?????ecx,string
105?????????sub?????eax,ecx
106?????????ret
107?byte_2:
108?????????lea?????eax,[ecx?-?2]
109?????????mov?????ecx,string
110?????????sub?????eax,ecx
111?????????ret
112?byte_1:
113?????????lea?????eax,[ecx?-?3]
114?????????mov?????ecx,string
115?????????sub?????eax,ecx
116?????????ret
117?byte_0:
118?????????lea?????eax,[ecx?-?4]
119?????????mov?????ecx,string
120?????????sub?????eax,ecx
121?????????ret
122?
123?strlen??endp
124?
125?????????end
6. 測試結(jié)果為了對上述各種實(shí)現(xiàn)的效率有一個大概的認(rèn)識,我在VC8和GCC下分別進(jìn)行了測試,測試時均采用默認(rèn)優(yōu)化方式。下面是在GCC下運(yùn)行幾百萬次后的結(jié)果(在VC8下的運(yùn)行結(jié)果與此相似):
strlen_a
--------------------------------------------------
???????1:????????515?ticks?????????0.515?seconds
???????2:????????375?ticks?????????0.375?seconds
???????3:????????375?ticks?????????0.375?seconds
???????4:????????375?ticks?????????0.375?seconds
???????5:????????375?ticks?????????0.375?seconds
???total:???????2015?ticks?????????2.015?seconds
?average:????????403?ticks?????????0.403?seconds
--------------------------------------------------
strlen_b
--------------------------------------------------
???????1:????????360?ticks??????????0.36?seconds
???????2:????????390?ticks??????????0.39?seconds
???????3:????????375?ticks?????????0.375?seconds
???????4:????????360?ticks??????????0.36?seconds
???????5:????????375?ticks?????????0.375?seconds
???total:???????1860?ticks??????????1.86?seconds
?average:????????372?ticks?????????0.372?seconds
--------------------------------------------------
strlen_c
--------------------------------------------------
???????1:????????187?ticks?????????0.187?seconds
???????2:????????172?ticks?????????0.172?seconds
???????3:????????187?ticks?????????0.187?seconds
???????4:????????187?ticks?????????0.187?seconds
???????5:????????188?ticks?????????0.188?seconds
???total:????????921?ticks?????????0.921?seconds
?average:????????184?ticks????????0.1842?seconds
--------------------------------------------------
strlen_d
--------------------------------------------------
???????1:????????172?ticks?????????0.172?seconds
???????2:????????187?ticks?????????0.187?seconds
???????3:????????172?ticks?????????0.172?seconds
???????4:????????187?ticks?????????0.187?seconds
???????5:????????188?ticks?????????0.188?seconds
???total:????????906?ticks?????????0.906?seconds
?average:????????181?ticks????????0.1812?seconds
--------------------------------------------------
strlen
--------------------------------------------------
???????1:????????187?ticks?????????0.187?seconds
???????2:????????172?ticks?????????0.172?seconds
???????3:????????188?ticks?????????0.188?seconds
???????4:????????172?ticks?????????0.172?seconds
???????5:????????187?ticks?????????0.187?seconds
???total:????????906?ticks?????????0.906?seconds
?average:????????181?ticks????????0.1812?seconds
--------------------------------------------------
源代碼:
點(diǎn)擊下載