Evaluating String Hash Functions

Posted on 2011-11-04 14:21 Shuffy

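Hash lookup is known for its O(1) lookup performance and is widely used by applications that demand fast lookups. The basic idea is:
(1) Create a fixed-length linear hash table; the length can usually be specified at initialization.

(2) Design a hash function that scatters each key into the hash table. The hash function is the most critical part: uniform distribution and a low collision probability both depend on it.

(3) Collisions are usually resolved by chaining: keys that hash to the same table entry are stored in a linked list (also called a bucket).

(4) Given a key, the target can be located in O(1) + O(m) time, where m is the chain length, that is, the bucket depth.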
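
The four steps above can be sketched as a small chained hash table. This is a minimal illustration rather than the code used in the evaluation: the names `struct node`, `struct table`, `str_hash`, `table_init`, `table_insert`, `table_contains` and `table_free` are all hypothetical, and the multiplier 131 is just one common choice.

```c
#include <stdlib.h>
#include <string.h>

/* One link in a bucket's chain. All names here are hypothetical. */
struct node {
    char *key;
    struct node *next;
};

/* Step (1): a fixed-length table, i.e. an array of chain heads. */
struct table {
    size_t length;          /* number of buckets, must be > 0 */
    struct node **buckets;
};

/* Step (2): a multiplicative string hash (multiplier 131 is an assumption). */
static unsigned int str_hash(const char *s)
{
    unsigned int h = 0;
    while (*s)
        h = h * 131u + (unsigned char)*s++;
    return h;
}

/* Allocates `length` empty buckets. Returns 0 on success, -1 on failure. */
int table_init(struct table *t, size_t length)
{
    t->length = length;
    t->buckets = calloc(length, sizeof *t->buckets);
    return t->buckets ? 0 : -1;
}

/* Step (4): O(1) to pick the bucket, then O(m) to walk its chain. */
int table_contains(const struct table *t, const char *key)
{
    const struct node *n = t->buckets[str_hash(key) % t->length];
    for (; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0)
            return 1;
    return 0;
}

/* Step (3): colliding keys are pushed onto the same bucket's list.
 * Returns 0 on success (or if the key is already present), -1 on failure. */
int table_insert(struct table *t, const char *key)
{
    size_t i = str_hash(key) % t->length;
    size_t len = strlen(key) + 1;
    struct node *n;

    if (table_contains(t, key))
        return 0;
    n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->key = malloc(len);
    if (n->key == NULL) {
        free(n);
        return -1;
    }
    memcpy(n->key, key, len);
    n->next = t->buckets[i];
    t->buckets[i] = n;
    return 0;
}

/* Releases every chain and the bucket array. */
void table_free(struct table *t)
{
    size_t i;
    for (i = 0; i < t->length; i++) {
        struct node *n = t->buckets[i];
        while (n != NULL) {
            struct node *next = n->next;
            free(n->key);
            free(n);
            n = next;
        }
    }
    free(t->buckets);
    t->buckets = NULL;
}
```

A caller would run `table_init(&t, 10240)`, insert keys with `table_insert`, query with `table_contains`, and finish with `table_free`. With a single bucket every key lands in the same chain, which makes the O(m) term easy to see.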

In hash applications, strings are the most common kind of key, and they are used so widely that nearly all modern programming languages provide string hash tables. There are many string hash functions; common ones include Simple_hash, RS_hash, JS_hash, PJW_hash, ELF_hash, BKDR_hash, SDBM_hash, DJB_hash, AP_hash and CRC_hash. Their C implementations are in the appendix code: hash.h, hash.c. So which of these string hash functions is better? Hash functions are mainly judged on the following two metrics:

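(1) Hash distribution

This is the bucket usage: bucket usage = (number of used buckets) / (total number of buckets). The higher the ratio, the more evenly the keys are spread and the better the hash design.

(2) Average bucket length

This is the average length of all used buckets. Ideally it equals 1 (which is only possible when there are no more keys than buckets); the smaller it is, the fewer collisions occur and the better the hash design.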
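
As a rough sketch, both metrics can be computed from an array holding the chain length of every bucket. The struct `bucket_stats` and the function `compute_bucket_stats` are hypothetical names introduced for illustration; the appendix program gathers the same figures in its own way.

```c
#include <stddef.h>

/* Summary of a hash table's bucket occupancy (hypothetical helper type). */
struct bucket_stats {
    size_t used;      /* buckets holding at least one key */
    size_t longest;   /* length of the longest chain */
    double usage;     /* used / total buckets, in [0, 1] */
    double avg_len;   /* keys / used buckets, 0 if no bucket is used */
};

/* chain_len[i] is the number of keys stored in bucket i. */
struct bucket_stats compute_bucket_stats(const size_t *chain_len,
                                         size_t nbuckets)
{
    struct bucket_stats s = {0, 0, 0.0, 0.0};
    size_t keys = 0;
    size_t i;

    for (i = 0; i < nbuckets; i++) {
        if (chain_len[i] == 0)
            continue;               /* empty buckets count toward neither metric */
        s.used++;
        keys += chain_len[i];
        if (chain_len[i] > s.longest)
            s.longest = chain_len[i];
    }
    if (nbuckets > 0)
        s.usage = (double)s.used / (double)nbuckets;
    if (s.used > 0)
        s.avg_len = (double)keys / (double)s.used;
    return s;
}
```

For the chain lengths {0, 2, 1, 0}, two of the four buckets are used and they hold three keys, so the usage is 0.5 and the average bucket length is 1.5.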

Hash function computations are generally very simple, so they differ little in computation time; that aspect is not compared here.

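The evaluation was designed as follows:

(1) Take a 200 MB video file as the input source, compute the MD5 value of each 4 KB block, and use that value as the hash key.

(2) Apply each of the string hash functions listed above and simulate hashing the keys into the table.

(3) Collect the results and analyze them using the two metrics, hash distribution and average bucket length.

The test program is in the appendix code hashtest.c, and the results are shown in the table below. As the results show, these string hash functions are evenly matched and it is hard to rank them, so in practice you can pick one by preference. It is still best to test with your own data, since applications differ in their characteristics. Several other test runs gave similar results and are not listed here.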

Hash function	Buckets	Total hash calls	Max bucket length	Avg bucket length	Bucket usage %
simple_hash	10240	47198	16	4.63	99.00%
RS_hash	10240	47198	16	4.63	98.91%
JS_hash	10240	47198	15	4.64	98.87%
PJW_hash	10240	47198	16	4.63	99.00%
ELF_hash	10240	47198	16	4.63	99.00%
BKDR_hash	10240	47198	16	4.63	99.00%
SDBM_hash	10240	47198	16	4.63	98.90%
DJB_hash	10240	47198	15	4.64	98.85%
AP_hash	10240	47198	16	4.63	98.96%
CRC_hash	10240	47198	16	4.64	98.77%
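
One way to read the table is to compare it with an ideal hash that places every key into a uniformly random bucket. For n distinct keys and m buckets, the expected fraction of non-empty buckets is 1 - (1 - 1/m)^n. The sketch below evaluates this; the function names are hypothetical, and it assumes every hash call inserts a distinct key, which may not hold exactly for the measured data.

```c
/* Expected fraction of non-empty buckets when n distinct keys are placed
 * uniformly at random into m buckets: 1 - (1 - 1/m)^n.
 * The power is computed with a plain loop to avoid depending on libm. */
double ideal_bucket_usage(unsigned long n, unsigned long m)
{
    double q = 1.0 - 1.0 / (double)m;   /* chance one key misses a given bucket */
    double p_empty = 1.0;
    unsigned long i;

    for (i = 0; i < n; i++)
        p_empty *= q;
    return 1.0 - p_empty;
}

/* Approximate average length of the non-empty buckets:
 * n keys spread over the expected number of used buckets. */
double ideal_avg_bucket_len(unsigned long n, unsigned long m)
{
    return (double)n / ((double)m * ideal_bucket_usage(n, m));
}
```

For n = 47198 and m = 10240 this gives a usage of roughly 99.0% and an average bucket length of roughly 4.66. The measured usages sit at or just below that figure, which supports the conclusion that the functions are hard to tell apart on this input. The measured average of about 4.63 is slightly lower, which would be consistent with a small number of repeated keys.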

Appendix: source code

hash.h

#ifndef _HASH_H
#define _HASH_H
#ifdef __cplusplus
extern "C" {
#endif
/* A Simple Hash Function */
unsigned int simple_hash(char *str);
/* RS Hash Function */
unsigned int RS_hash(char *str);
/* JS Hash Function */
unsigned int JS_hash(char *str);
/* P. J. Weinberger Hash Function */
unsigned int PJW_hash(char *str);
/* ELF Hash Function */
unsigned int ELF_hash(char *str);
/* BKDR Hash Function */
unsigned int BKDR_hash(char *str);
/* SDBM Hash Function */
unsigned int SDBM_hash(char *str);
/* DJB Hash Function */
unsigned int DJB_hash(char *str);
/* AP Hash Function */
unsigned int AP_hash(char *str);
/* CRC Hash Function */
unsigned int CRC_hash(char *str);
#ifdef __cplusplus
}
#endif
#endif

hash.c

#include <string.h>
#include "hash.h"

/* A Simple Hash Function */
unsigned int simple_hash(char *str)
{
    register unsigned int hash;
    register unsigned char *p;

    for (hash = 0, p = (unsigned char *)str; *p; p++)
        hash = 31 * hash + *p;
    return (hash & 0x7FFFFFFF);
}

/* RS Hash Function */
unsigned int RS_hash(char *str)
{
    unsigned int b = 378551;
    unsigned int a = 63689;
    unsigned int hash = 0;

    while (*str)
    {
        hash = hash * a + (*str++);
        a *= b;
    }
    return (hash & 0x7FFFFFFF);
}

/* JS Hash Function */
unsigned int JS_hash(char *str)
{
    unsigned int hash = 1315423911;

    while (*str)
    {
        hash ^= ((hash << 5) + (*str++) + (hash >> 2));
    }
    return (hash & 0x7FFFFFFF);
}

/* P. J. Weinberger Hash Function */
unsigned int PJW_hash(char *str)
{
    unsigned int BitsInUnsignedInt = (unsigned int)(sizeof(unsigned int) * 8);
    unsigned int ThreeQuarters = (unsigned int)((BitsInUnsignedInt * 3) / 4);
    unsigned int OneEighth = (unsigned int)(BitsInUnsignedInt / 8);
    unsigned int HighBits = (unsigned int)(0xFFFFFFFF) << (BitsInUnsignedInt - OneEighth);
    unsigned int hash = 0;
    unsigned int test = 0;

    while (*str)
    {
        hash = (hash << OneEighth) + (*str++);
        /* fold the top bits back in once they become non-zero */
        if ((test = hash & HighBits) != 0)
        {
            hash = ((hash ^ (test >> ThreeQuarters)) & (~HighBits));
        }
    }
    return (hash & 0x7FFFFFFF);
}

/* ELF Hash Function */
unsigned int ELF_hash(char *str)
{
    unsigned int hash = 0;
    unsigned int x = 0;

    while (*str)
    {
        hash = (hash << 4) + (*str++);
        if ((x = hash & 0xF0000000L) != 0)
        {
            hash ^= (x >> 24);
            hash &= ~x;
        }
    }
    return (hash & 0x7FFFFFFF);
}

/* BKDR Hash Function */
unsigned int BKDR_hash(char *str)
{
    unsigned int seed = 131; // 31 131 1313 13131 131313 etc..
    unsigned int hash = 0;

    while (*str)
    {
        hash = hash * seed + (*str++);
    }
    return (hash & 0x7FFFFFFF);
}

/* SDBM Hash Function */
unsigned int SDBM_hash(char *str)
{
    unsigned int hash = 0;

    while (*str)
    {
        hash = (*str++) + (hash << 6) + (hash << 16) - hash;
    }
    return (hash & 0x7FFFFFFF);
}

/* DJB Hash Function */
unsigned int DJB_hash(char *str)
{
    unsigned int hash = 5381;

    while (*str)
    {
        hash += (hash << 5) + (*str++);
    }
    return (hash & 0x7FFFFFFF);
}

/* AP Hash Function */
unsigned int AP_hash(char *str)
{
    unsigned int hash = 0;
    int i;

    for (i = 0; *str; i++)
    {
        if ((i & 1) == 0)
        {
            hash ^= ((hash << 7) ^ (*str++) ^ (hash >> 3));
        }
        else
        {
            hash ^= (~((hash << 11) ^ (*str++) ^ (hash >> 5)));
        }
    }
    return (hash & 0x7FFFFFFF);
}

/* CRC Hash Function
 * Despite the name, this is a 16-bit one's-complement checksum (the kind
 * used for IP headers), not a true CRC, so it yields at most 65536 values. */
unsigned int CRC_hash(char *str)
{
    size_t nleft = strlen(str);
    unsigned long long sum = 0;
    unsigned short int word;
    unsigned short int answer = 0;

    /*
     * Our algorithm is simple: using a wide accumulator (sum), we add
     * sequential 16 bit words to it, and at the end, fold back all the
     * carry bits from the top 16 bits into the lower 16 bits.
     */
    while (nleft > 1) {
        /* memcpy avoids an unaligned 16-bit load through a cast pointer */
        memcpy(&word, str, sizeof(word));
        sum += word;
        str += 2;
        nleft -= 2;
    }
    /* mop up an odd byte, if necessary */
    if (1 == nleft) {
        *(unsigned char *)(&answer) = *(unsigned char *)str;
        sum += answer;
    }
    /*
     * add back carry outs from top 16 bits to low 16 bits
     * add hi 16 to low 16
     */
    sum = (sum >> 16) + (sum & 0xFFFF);
    /* add carry */
    sum += (sum >> 16);
    /* complement and truncate to 16 bits */
    answer = (unsigned short int)~sum;
    return (answer & 0xFFFFFFFF);
}

hashtest.c

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>   /* read(), close() */
#include <errno.h>
#include <string.h>
#include "hash.h"
#include "md5.h"

/* One key in a bucket's chain */
struct hash_key {
    unsigned char *key;
    struct hash_key *next;
};

/* One bucket of the simulated hash table */
struct hash_counter_entry {
    unsigned int hit_count;    /* insert calls that landed in this bucket */
    unsigned int entry_count;  /* distinct keys stored in this bucket */
    struct hash_key *keys;
};

#define BLOCK_LEN 4096

static int backet_len = 10240;
static int hash_call_count = 0;
static struct hash_counter_entry *hlist = NULL;
unsigned int (*hash_func)(char *str);

/* Select the hash function by name; hash_func is NULL if the name is unknown */
void choose_hash_func(char *hash_func_name)
{
    if (0 == strcmp(hash_func_name, "simple_hash"))
        hash_func = simple_hash;
    else if (0 == strcmp(hash_func_name, "RS_hash"))
        hash_func = RS_hash;
    else if (0 == strcmp(hash_func_name, "JS_hash"))
        hash_func = JS_hash;
    else if (0 == strcmp(hash_func_name, "PJW_hash"))
        hash_func = PJW_hash;
    else if (0 == strcmp(hash_func_name, "ELF_hash"))
        hash_func = ELF_hash;
    else if (0 == strcmp(hash_func_name, "BKDR_hash"))
        hash_func = BKDR_hash;
    else if (0 == strcmp(hash_func_name, "SDBM_hash"))
        hash_func = SDBM_hash;
    else if (0 == strcmp(hash_func_name, "DJB_hash"))
        hash_func = DJB_hash;
    else if (0 == strcmp(hash_func_name, "AP_hash"))
        hash_func = AP_hash;
    else if (0 == strcmp(hash_func_name, "CRC_hash"))
        hash_func = CRC_hash;
    else
        hash_func = NULL;
}
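
/* Insert key into its bucket, counting the hit and, if new, the entry */
void insert_hash_entry(unsigned char *key, struct hash_counter_entry *hlist)
{
    unsigned int hash_value = hash_func((char *)key) % backet_len;
    struct hash_key *p;

    /* look for the key in the bucket's chain */
    p = hlist[hash_value].keys;
    while (p) {
        if (0 == strcmp((char *)key, (char *)p->key))
            break;
        p = p->next;
    }
    if (p == NULL)
    {
        /* new key: push it onto the head of the chain */
        p = (struct hash_key *)malloc(sizeof(struct hash_key));
        if (p == NULL)
        {
            perror("malloc in insert_hash_entry");
            return;
        }
        p->key = (unsigned char *)strdup((char *)key);
        if (p->key == NULL)
        {
            perror("strdup in insert_hash_entry");
            free(p);
            return;
        }
        p->next = hlist[hash_value].keys;
        hlist[hash_value].keys = p;
        hlist[hash_value].entry_count++;
    }
    hlist[hash_value].hit_count++;
}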

/* Allocate the table and reset all counters */
void hashtest_init()
{
    int i;

    hash_call_count = 0;
    hlist = (struct hash_counter_entry *) malloc (sizeof(struct hash_counter_entry) * backet_len);
    if (NULL == hlist)
    {
        /* nothing can be measured without the table, so stop here */
        perror("malloc in hashtest_init");
        exit(EXIT_FAILURE);
    }
    for (i = 0; i < backet_len; i++)
    {
        hlist[i].hit_count = 0;
        hlist[i].entry_count = 0;
        hlist[i].keys = NULL;
    }
}

/* Free every chain and the table itself */
void hashtest_clean()
{
    int i;
    struct hash_key *pentry, *p;

    if (NULL == hlist)
        return;
    for (i = 0; i < backet_len; ++i)
    {
        pentry = hlist[i].keys;
        while (pentry)
        {
            p = pentry->next;
            if (pentry->key) free(pentry->key);
            free(pentry);
            pentry = p;
        }
    }
    free(hlist);
    hlist = NULL;
}

/* Print bucket usage, chain lengths and collision counts */
void show_hashtest_result()
{
    int i, backet = 0, sum = 0;
    unsigned int max_link = 0;
    int conflict_count = 0, hit_count = 0;
    float avg_link, backet_usage;

    for (i = 0; i < backet_len; i++)
    {
        if (hlist[i].hit_count > 0)
        {
            backet++;
            sum += hlist[i].entry_count;
            if (hlist[i].entry_count > max_link)
            {
                max_link = hlist[i].entry_count;
            }
            if (hlist[i].entry_count > 1)
            {
                conflict_count++;
            }
            hit_count += hlist[i].hit_count;
        }
    }
    backet_usage = backet / 1.0 / backet_len * 100;
    /* guard against an empty table (no input blocks) */
    avg_link = backet ? sum / 1.0 / backet : 0;
    printf("backet_len = %d\n", backet_len);
    printf("hash_call_count = %d\n", hash_call_count);
    printf("hit_count = %d\n", hit_count);
    printf("conflict count = %d\n", conflict_count);
    printf("longest hash entry = %u\n", max_link);
    printf("average hash entry length = %.2f\n", avg_link);
    printf("backet usage = %.2f%%\n", backet_usage);
}

void usage()
{
    printf("Usage: hashtest filename hash_func_name [backet_len]\n");
    printf("hash_func_name:\n");
    printf("\tsimple_hash\n");
    printf("\tRS_hash\n");
    printf("\tJS_hash\n");
    printf("\tPJW_hash\n");
    printf("\tELF_hash\n");
    printf("\tBKDR_hash\n");
    printf("\tSDBM_hash\n");
    printf("\tDJB_hash\n");
    printf("\tAP_hash\n");
    printf("\tCRC_hash\n");
}
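
/* Convert a 16-byte MD5 digest into a 32-character hex string.
 * md5_32 must have room for 33 bytes (32 digits plus the terminating NUL). */
void md5_to_32(unsigned char *md5_16, unsigned char *md5_32)
{
    int i;

    for (i = 0; i < 16; ++i)
    {
        sprintf((char *)md5_32 + i * 2, "%02x", md5_16[i]);
    }
}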
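
int main(int argc, char *argv[])
{
    int fd = -1, rwsize = 0;
    unsigned char md5_checksum[16 + 1] = {0};
    unsigned char md5_hex[32 + 1] = {0};
    unsigned char buf[BLOCK_LEN] = {0};

    if (argc < 3)
    {
        usage();
        return -1;
    }
    choose_hash_func(argv[2]);
    if (NULL == hash_func)
    {
        /* unknown hash function name */
        usage();
        return -1;
    }
    if (argc == 4)
    {
        backet_len = atoi(argv[3]);
    }
    if (backet_len <= 0)
    {
        usage();
        return -1;
    }
    if (-1 == (fd = open(argv[1], O_RDONLY)))
    {
        perror("open source file");
        return errno;
    }
    hashtest_init();
    /* read() returns 0 at end of file and -1 on error; stop on both */
    while ((rwsize = read(fd, buf, BLOCK_LEN)) > 0)
    {
        md5(buf, rwsize, md5_checksum);
        /* Hash the hex form of the digest: the raw 16 bytes may contain
         * 0x00, which would cut the key short when treated as a C string. */
        md5_to_32(md5_checksum, md5_hex);
        insert_hash_entry(md5_hex, hlist);
        hash_call_count++;
        memset(buf, 0, BLOCK_LEN);
        memset(md5_checksum, 0, 16 + 1);
    }
    close(fd);
    show_hashtest_result();
    hashtest_clean();
    return 0;
}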

Original article: http://blog.csdn.net/liuben/article/details/5050697
