
March 15, 2010 #

Setting up a Python, Django, MySQL, and memcache development environment on Windows

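A note for the record: the [distributed cross-platform monitoring system] inevitably needs a Windows environment too. On Linux, package tools such as apt, zypper, and yum make installation a one-step job; on Windows it can turn into a real problem.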

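1. Install the Windows build of Python 2.5 into D:\Python25 and add D:\Python25 to the PATH environment variable.

2. Download Django and, inside the Django directory, run python setup.py install. The interpreter found on PATH is used, and the Django library is placed under D:\Python25\Lib (in site-packages).

3. Install setuptools-0.6c11.win32-py2.5.rar; easy_install.exe then appears in D:\Python25\Scripts.

4. Install the MySQL API and the memcache API. easy_install takes the package name directly, with no install subcommand, so in D:\Python25\Scripts run easy_install.exe MySQL-python or easy_install.exe python-memcached. If it reports that you need to look for the package at http://pypi.python.org/simple/, open that address, find the download URL of the package, and run easy_install.exe followed by that URL.

5. If the setuptools-0.6c11.win32-py2.5 installer is unavailable or will not install, copy the MySQLdb folder from an earlier D:\Python25\Lib\site-packages into the current D:\Python25\Lib\site-packages. As long as the versions match it runs normally, with no installer involved.

6. If installing from a URL fails and there is no earlier copy to reuse, download the source package. For example, the memcache API can be downloaded from ftp://ftp.tummy.com/pub/python-memcached/old-releases/python-memcached-1.45.tar.gz

Then unpack it, enter the directory, and run python setup.py install.

7. Create a new Django project, or in an existing project directory run python manage.py syncdb to synchronize the database tables. (This only checks which tables exist: a missing table is created, but an existing table is left untouched even if its structure has changed.) The database named in settings.py must be created in MySQL beforehand.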
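After the steps above, a quick import check can confirm that each library is visible to the interpreter. The snippet below is only a sketch, written for Python 3 rather than the Python 2.5 used in this post. The helper name `check_modules` is invented for the illustration; `django`, `MySQLdb`, and `memcache` are the import names these packages normally expose.

```python
import importlib


def check_modules(names):
    """Return a dict mapping each module name to True if it can be imported."""
    result = {}
    for name in names:
        try:
            importlib.import_module(name)
            result[name] = True
        except ImportError:
            result[name] = False
    return result


if __name__ == "__main__":
    for name, ok in check_modules(["django", "MySQLdb", "memcache"]).items():
        print("%-10s %s" % (name, "OK" if ok else "MISSING"))
```

A module reported as MISSING usually means the package was installed for a different interpreter than the one on PATH.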

posted @ 2010-03-15 19:25 學者站在巨人的肩膀上 Views (655) | Comments (0)

[Distributed cross-platform monitoring system] Sending email with one command on Linux and Windows: a Python script

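A while ago I spent some time learning Python, and recently I finished a project that monitors basic server information: a distributed system that collects special logs and raises alarms, built to satisfy everyone's appetite for monitoring. Each server is sampled once per minute, which is about 1440 records per day. More than 20 servers are monitored at the moment, giving about 31680 log records per day. The single central monitoring server still has plenty of performance headroom. It would be good to have more servers to test with; I estimate it can support monitoring of 100 or more servers.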
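The volume figures can be checked with a few lines of arithmetic. This is a sketch in Python 3 and the function name is illustrative only. Note that 31680 records per day corresponds to exactly 22 servers at one sample per minute, which is consistent with "more than 20".

```python
SAMPLES_PER_MINUTE = 1
MINUTES_PER_DAY = 24 * 60


def records_per_day(server_count, samples_per_minute=SAMPLES_PER_MINUTE):
    """Number of log records produced per day by `server_count` servers."""
    return server_count * samples_per_minute * MINUTES_PER_DAY


if __name__ == "__main__":
    print(records_per_day(1))    # one server
    print(records_per_day(22))   # fleet size implied by 31680 records/day
    print(records_per_day(100))  # the estimated capacity target
```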

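A new requirement is to notify the relevant people in real time when an alarm is detected. The company's SMS gateway was bought for Shanghai Telecom subscribers only, and none of us has a Shanghai Telecom number (sigh), so the notification has to be sent by email instead.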

The script supports text content encoded in GB18030 and attachments in any encoding, and with small changes it can send to multiple recipients.


#!/usr/lib/python2.5/bin/python
#coding=utf-8
import os
import sys
import time
from smtplib import SMTP
from email.MIMEMultipart import MIMEMultipart
from email.mime.application import MIMEApplication
from email.MIMEText import MIMEText

STMP_SERVER = "mail.×××.com"
STMP_PORT = "25"
USERNAME = "×××@×××.com"
USERPASSWORD = "×××"
FROM = "MonitorCenterWarning@×××.com"
TO = "×××@gmail.com"

def sendFildByMail(config):
    print 'Preparing...'
    message = MIMEMultipart()
    message['from'] = config['from']
    message['to'] = config['to']
    message['Reply-To'] = config['from']
    message['Subject'] = config['subject']
    message['Date'] = time.ctime(time.time())
    message['X-Priority'] = '3'
    message['X-MSMail-Priority'] = 'Normal'
    message['X-Mailer'] = 'Microsoft Outlook Express 6.00.2900.2180'
    message['X-MimeOLE'] = 'Produced By Microsoft MimeOLE V6.00.2900.2180'

    if 'file' in config:
        # add the attachment
        f = open(config['file'], 'rb')
        file = MIMEApplication(f.read())
        f.close()
        file.add_header('Content-Disposition', 'attachment', filename=os.path.basename(config['file']))
        message.attach(file)

    if 'content' in config:
        # add the text content, read from the file named on the command line
        f = open(config['content'], 'rb')
        content = f.read()
        f.close()
        # the second argument is the MIME subtype; the charset argument
        # makes the email package encode the body for transfer
        body = MIMEText(content, 'plain', 'gb2312')
        message.attach(body)

    print 'OKay'
    print 'Logging...'
    smtp = SMTP(config['server'], config['port'])
    # comment out the next line if the SMTP server does not require login
    smtp.login(config['username'], config['password'])
    print 'OK'

    print 'Sending...',
    # a copy is also sent to the sender's own address
    smtp.sendmail(config['from'], [config['from'], config['to']], message.as_string())
    print 'OK'
    smtp.close()
    time.sleep(1)

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print 'Usage: python %s contentfilename' % os.path.basename(sys.argv[0])
        print 'OR Usage: python %s contentfilename attachfilename' % os.path.basename(sys.argv[0])
        wait = raw_input("quit.")
        sys.exit(-1)
    elif len(sys.argv) == 2:
        sendFildByMail({
            'from': FROM,
            'to': TO,
            'subject': '[MonitorCenter]Send Msg %s' % sys.argv[1],
            'content': sys.argv[1],
            'server': STMP_SERVER,
            'port': STMP_PORT,
            'username': USERNAME,
            'password': USERPASSWORD})
    elif len(sys.argv) == 3:
        sendFildByMail({
            'from': FROM,
            'to': TO,
            'subject': '[MonitorCenter]Send Msg and File %s %s' % (sys.argv[1], sys.argv[2]),
            'content': sys.argv[1],
            'file': sys.argv[2],
            'server': STMP_SERVER,
            'port': STMP_PORT,
            'username': USERNAME,
            'password': USERPASSWORD})
    wait = raw_input("end.")

On Windows XP:
<img title=例子 height=166 alt=例子 src="http://hi.csdn.net/attachment/201003/12/6723_126837066066C3.png" width=657>
On Linux (Ubuntu, SUSE):
<img title=1 height=128 alt=1 src="http://hi.csdn.net/attachment/201003/12/6723_1268371549tzVm.png" width=634>
The message as received:
<img title=2 height=66 alt=2 src="http://hi.csdn.net/attachment/201003/12/6723_1268371596HISQ.png" width=1237>
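For readers on a current interpreter, the same idea (GB18030 text, a binary attachment, several recipients) can be sketched with the Python 3 standard library. This is not the author's script: the function name `build_alert_message`, the example addresses, and the choice to pass the attachment as a (filename, bytes) pair are assumptions made for the illustration. The sketch only builds the message; sending is left as a comment.

```python
from email.message import EmailMessage


def build_alert_message(sender, recipients, subject, body, attachment=None):
    """Build (but do not send) an alert email.

    attachment, if given, is a (filename, bytes) pair.
    """
    message = EmailMessage()
    message["From"] = sender
    message["To"] = ", ".join(recipients)  # several recipients for group mail
    message["Subject"] = subject
    # Encode the text part as GB18030 and transfer it as base64.
    message.set_content(body, charset="gb18030", cte="base64")
    if attachment is not None:
        filename, data = attachment
        message.add_attachment(data, maintype="application",
                               subtype="octet-stream", filename=filename)
    return message


if __name__ == "__main__":
    msg = build_alert_message(
        "monitor@example.com",                    # hypothetical addresses
        ["ops1@example.com", "ops2@example.com"],
        "[MonitorCenter] disk usage warning",
        "磁盘使用率超过 90%",
        attachment=("warning.log", b"raw log bytes"),
    )
    print(msg["To"])
    print(msg.is_multipart())
    # To send: smtplib.SMTP(host, port).send_message(msg)
```

Keeping message construction separate from the SMTP call makes the encoding logic testable without a mail server.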

posted @ 2010-03-15 19:24 學者站在巨人的肩膀上 Views (670) | Comments (0)

[Distributed cross-platform monitoring system] Monitoring network traffic and speed on Linux: a Python script

The level-1 and level-2 market data servers for the Shanghai and Shenzhen stock exchanges carry heavy traffic from 9:00 to 11:30 and from 13:00 to 15:30, about 5 hours in total, so the network throughput of a monitored server is an important metric. The average speed over a period can be computed by summing the upstream and downstream traffic of each NIC over that period and dividing by the length of the period. I currently sample once per minute. During real trading hours the measured speeds have proved fairly accurate, staying at around 5M/s. On occasions when an internal NIC showed 5M/s outside service hours, it did turn out that someone was doing a large transfer.
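The calculation described above can be written out as a small function. This is a sketch in Python 3, not an excerpt from the script; the name `average_rate` and the (timestamp, rx_bytes, tx_bytes) sample layout are assumptions for the illustration. It does not handle counter wrap-around or a NIC reset between samples.

```python
def average_rate(old_sample, new_sample):
    """Average throughput in bytes/s between two counter samples.

    Each sample is a (timestamp_seconds, rx_bytes, tx_bytes) tuple.
    """
    old_time, old_rx, old_tx = old_sample
    new_time, new_rx, new_tx = new_sample
    transferred = (new_rx - old_rx) + (new_tx - old_tx)
    elapsed = new_time - old_time
    if elapsed == 0:
        # Same choice as the script: with a zero interval, skip the division.
        return float(transferred)
    return transferred / elapsed


if __name__ == "__main__":
    previous = (0, 1000, 2000)         # hypothetical counters at t = 0 s
    current = (60, 181000, 122000)     # hypothetical counters at t = 60 s
    print(average_rate(previous, current))
```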
Each standalone monitoring script returns a list of tuples, and the results are finally assembled into one complete XML data island. To make debugging easier, every intermediate result of a script is also written to a temporary text file.
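The post does not show how the list of tuples becomes XML, so the following is only a guess at the shape of that step, sketched in Python 3. The element names `monitor` and `item`, the `host` attribute, and the function name are all invented for the illustration.

```python
import xml.etree.ElementTree as ET


def results_to_xml(host, results):
    """Wrap a list of (name, value) tuples in a small XML document."""
    root = ET.Element("monitor", {"host": host})
    for name, value in results:
        item = ET.SubElement(root, "item", {"name": name})
        item.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = [("order", "48"), ("eth0", "0 192.168.0.2 runing! 00:11:22:33:44:55")]
    print(results_to_xml("srv01", sample))
```

Putting the collector key in an attribute rather than in the tag name avoids problems with keys that are not valid XML element names.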
Before running the script below, make sure the ethtool utility is installed on your Linux system. It was tested on ubuntu 2.6.27-7-server, ubuntu 2.6.27.19-5-default, and suse 2.6.27.19-5-default.
Code:
 
#!/usr/bin/python
#coding=utf-8
import re
import os
import time

import utils  # the author's own helper module (provides isLinux()); not shown in this post

def sortedDictValues3(adict):
    keys = adict.keys()
    keys.sort()
    return map(adict.get, keys)

def run():
    if utils.isLinux() == False:
        return [('ifconfig_collect os type error','this is windows')]
    #not first run
    if os.path.isfile('./oldifconfig'):
        fileold = open('./oldifconfig', 'r')
        fileold.seek(0)
        # read the traffic data and the timestamp saved by the previous run
        (oldtime, fileoldcontent) = fileold.read().split('#', 1)
        fileold.close()
        netcard = {}
        tempstr = ''
        key = ''
        for strline in fileoldcontent.split('\n'):
            # stop at the loopback interface
            reobj = re.compile('^lo')
            if reobj.search(strline):
                break
            # a line starting with eth begins a new NIC section
            reobj = re.compile('^eth')
            if reobj.search(strline):
                key = strline.split()[0]
            tempstr = tempstr + strline + '\n'
            netcard[key] = tempstr
        RXold = {}
        TXold = {}
        for key,value in netcard.items():
            tempsplit = value.split('\n')
            netcard[key] = ''
            for item in tempsplit:
                item = item + ' '
                netcard[key] = netcard[key] + item
                tempcount = 1
                # first match on the line is RX bytes, second is TX bytes
                for match in re.finditer(r"(bytes:)(.*?)( \()", item):
                    if tempcount == 1:
                        RXold[key] = match.group(2)
                        tempcount = tempcount + 1
                    elif tempcount == 2:
                        TXold[key] = match.group(2)
                        netcard[key] = netcard[key] + 'net io percent(bytes/s): 0  '

        # save the current NIC information to a temporary file
        os.system('ifconfig > ifconfigtemp')
        file = open('./ifconfigtemp','r')
        fileold = open('./oldifconfig', 'w')
        temptimestr = str(int(time.time()))
        fileold.write(temptimestr)
        fileold.write('#')
        file.seek(0)
        fileold.write(file.read())
        fileold.close()
        returnkeys = []
        returnvalues = []
        netcard = {}
        tempcountcard = 0
        file.seek(0)
        key = ''
        for strline in file.readlines():
            reobj = re.compile('^lo')
            if reobj.search(strline):
                break
            reobj = re.compile('^eth')
            if reobj.search(strline):
                key = strline.split()[0]
                netcard[key] = ''
            netcard[key] = netcard[key] + strline
        newnetcard = {}
        file.seek(0)
        key = ''
        for strline in file.readlines():
            reobj = re.compile('^lo')
            if reobj.search(strline):
                break
            if re.search("^eth", strline):
                templist = strline.split()
                key = templist[0]
                newnetcard[key] = ''
                # templist[4] is the hardware (MAC) address
                newnetcard[key] = templist[4] + newnetcard[key] + ' '
            if re.search("^ *inet ", strline):
                templist = strline.split()
                # templist[1] looks like addr:x.x.x.x; drop the addr: prefix
                newnetcard[key] = templist[1][5:] + ' ' + newnetcard[key] + ' '
        for key,value in newnetcard.items():
            # write each NIC's link status to a temporary file
            os.system('ethtool %s > ethtooltemp'%(key))
            file = open('./ethtooltemp','r')
            tempethtooltemplist = file.read().split('\n\t')
            file.close()
            # the last field of the ethtool output is "Link detected: yes/no"
            if re.search("yes", tempethtooltemplist[-1]):
                templist = newnetcard[key].split()
                newnetcard[key] = templist[0] + ' runing! ' + templist[1]
            else:
                templist = newnetcard[key].split()
                if len(templist) > 1:
                    newnetcard[key] = templist[0] + ' stop! ' + templist[1]
                else:
                    newnetcard[key] =  'stop! ' + templist[0]
        file.close()
        RX = {}
        TX = {}
        for key,value in netcard.items():
            tempsplit = value.split('\n')
            netcard[key] = ''
            for item in tempsplit:
                item = item + ' '
                netcard[key] = netcard[key] + item
                tempcount = 1
                for match in re.finditer(r"(bytes:)(.*?)( \()", item):
                    if tempcount == 1:
                        RX[key] = str(int(match.group(2)) - int(RXold[key]))
                        tempcount = tempcount + 1
                    elif tempcount == 2:
                        TX[key] = str(int(match.group(2)) - int(TXold[key]))
                        # traffic since the previous run divided by the elapsed seconds
                        divtime = float(int(time.time()) - int(oldtime))
                        if divtime == 0:
                            rate = (float(TX[key]) + float(RX[key]))
                        else:
                            rate = (float(TX[key]) + float(RX[key]))/(divtime)
                        if rate == 0:
                            newnetcard[key] = '0' + ' ' + newnetcard[key]
                        else:
                            newnetcard[key] = '%.2f'%rate + ' ' + newnetcard[key]
        return zip(['order'], ['48']) + newnetcard.items()
    else:
        # first run: only record the current counters, report a speed of 0
        os.system('ifconfig > ifconfigtemp')
        file = open('./ifconfigtemp','r')
        fileold = open('./oldifconfig', 'w')
        temptimestr = str(int(time.time()))
        fileold.write(temptimestr)
        fileold.write('#')
        file.seek(0)
        fileold.write(file.read())
        fileold.close()

        netcard = {}
        file.seek(0)
        key = ''
        for strline in file.readlines():
            reobj = re.compile('^lo')
            if reobj.search(strline):
                break
            reobj = re.compile('^eth')
            if reobj.search(strline):
                key = strline.split()[0]
                netcard[key] = ''
            netcard[key] = netcard[key] + strline
        RX = {}
        TX = {}

        key = ''
        newnetcard = {}
        file.seek(0)
        for strline in file.readlines():
            reobj = re.compile('^lo')
            if reobj.search(strline):
                break
            if re.search("^eth", strline):
                templist = strline.split()
                key = templist[0]
                newnetcard[key] = templist[4] + ' '
            if re.search("^ *inet ", strline):
                templist = strline.split()
                newnetcard[key] = newnetcard[key] + templist[1][5:] + ' '
        for key,value in newnetcard.items():
            os.system('ethtool %s > ethtooltemp'%(key))
            file = open('./ethtooltemp','r')
            # split on '\n\t' as in the branch above, so that the last
            # element is the "Link detected" field rather than an empty string
            tempethtooltemplist = file.read().split('\n\t')
            file.close()
            if re.search("yes", tempethtooltemplist[-1]):
                newnetcard[key] = newnetcard[key] + 'runing!'
            else:
                newnetcard[key] = newnetcard[key] + 'stop!'
        file.close()
        for key,value in netcard.items():
            tempsplit = value.split('\n')
            netcard[key] = ''
            for item in tempsplit:
                item = item + ' '
                #print item
                netcard[key] = netcard[key] + item
                tempcount = 1
                for match in re.finditer(r"(bytes:)(.*?)( \()", item):
                    if tempcount == 1:
                        RX[key] = match.group(2)
                        tempcount = tempcount + 1
                    elif tempcount == 2:
                        TX[key] = match.group(2)
                        netcard[key] = netcard[key] + 'net io percent(bytes/s): 0  '
                        newnetcard[key] = newnetcard[key] + ' ' + '0  '
        return zip(['order'], ['48']) + newnetcard.items()

if __name__ == '__main__':
    print run()

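Usage example:

In each tuple of the returned list, the first field of the second element is the network speed in bytes/s. For example, the eth1 NIC is running at 3.3 KB/s and eth0 at 2.9 KB/s. Today is Saturday, so this level of traffic is normal.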
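Reading the speed back out of such a result can be sketched as follows, in Python 3. The sample values are invented to match the figures quoted above, and the function name is illustrative. On the script's very first run the value starts with the MAC address instead of a number, so entries whose first field is not numeric are skipped.

```python
def speeds_kb_per_s(results):
    """Map interface name -> speed in KB/s from the collector's result list."""
    speeds = {}
    for name, value in results:
        if not name.startswith("eth"):
            continue  # skip entries such as ('order', '48')
        try:
            rate_bytes = float(value.split()[0])  # first field is bytes/s
        except (ValueError, IndexError):
            continue  # first-run output has no leading speed field
        speeds[name] = rate_bytes / 1024.0
    return speeds


if __name__ == "__main__":
    sample = [
        ("order", "48"),
        ("eth1", "3379.20 192.168.1.11 runing! 00:11:22:33:44:55"),  # hypothetical
        ("eth0", "2969.60 192.168.1.10 runing! 00:11:22:33:44:66"),  # hypothetical
    ]
    for nic, speed in sorted(speeds_kb_per_s(sample).items()):
        print("%s %.1f KB/s" % (nic, speed))
```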

posted @ 2010-03-15 19:22 學者站在巨人的肩膀上 Views (584) | Comments (0)

December 10, 2009 #

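Learning search engines top-down: analysis and full annotation of the Peking University Tianwang search engine TSE [6] Program analysis of building the inverted index (4)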

Below is the annotated program that builds the inverted index from the forward index.

#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main(int argc, char* argv[])    // ./CrtInvertedIdx moon.fidx.sort > sun.iidx
{
    ifstream ifsImgInfo(argv[1]);
    if (!ifsImgInfo)
    {
        cerr << "Cannot open " << argv[1] << " for input\n";
        return -1;
    }

    string strLine, strDocNum, tmp1 = "";
    int cnt = 0;
    while (getline(ifsImgInfo, strLine))
    {
        string::size_type idx;
        string tmp;

        idx = strLine.find("\t");
        tmp = strLine.substr(0, idx);    // the term is everything before the tab

        // skip terms shorter than 2 bytes or longer than 8 bytes
        if (tmp.size() < 2 || tmp.size() > 8) continue;

        if (tmp1.empty()) tmp1 = tmp;

        if (tmp == tmp1)
        {
            // same term as the previous line: append this document ID
            strDocNum = strDocNum + " " + strLine.substr(idx + 1);
        }
        else
        {
            if (strDocNum.empty())
                strDocNum = strDocNum + " " + strLine.substr(idx + 1);

            // a new term begins: print the finished term and start a new list
            cout << tmp1 << "\t" << strDocNum << endl;
            tmp1 = tmp;
            strDocNum.clear();
            strDocNum = strDocNum + " " + strLine.substr(idx + 1);
        }

        cnt++;
        //if (cnt==100) break;
    }
    // in the inverted index, each dictionary term is separated from its document IDs by a tab
    cout << tmp1 << "\t" << strDocNum << endl;

    return 0;
}
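The grouping performed by the C++ loop can also be expressed compactly. The following Python 3 sketch is an illustration, not TSE code: the function name is invented, the length filter counts characters rather than bytes, and the output omits the leading space that the C++ version leaves before the first document ID. As in the original, the input must already be sorted by term.

```python
from itertools import groupby


def build_inverted_index(sorted_lines, min_len=2, max_len=8):
    """Yield 'term<TAB>doc doc ...' lines from sorted 'term<TAB>doc' lines."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    # Keep well-formed lines whose term length is within the allowed range.
    pairs = (p for p in pairs
             if len(p) == 2 and min_len <= len(p[0]) <= max_len)
    # Consecutive lines with the same term form one posting list.
    for term, group in groupby(pairs, key=lambda p: p[0]):
        yield term + "\t" + " ".join(doc for _, doc in group)


if __name__ == "__main__":
    forward = ["apple\t1\n", "apple\t3\n", "pear\t2\n"]  # hypothetical input
    for line in build_inverted_index(forward):
        print(line)
```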

posted @ 2009-12-10 23:03 學者站在巨人的肩膀上 Views (1587) | Comments (3)

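Learning search engines top-down: analysis and full annotation of the Peking University Tianwang search engine TSE [6] Program analysis of building the inverted index (3)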

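This part covers building the forward index. Building the inverted index directly could be very inefficient, so a forward index is generated first as the basis for the inverted index that follows.

The files and their roles are described in detail in part [5] of this series, on building the inverted index and the files involved.

The CrtForwardIdx.cpp file: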

int main(int argc, char* argv[]) //./CrtForwardIdx Tianwang.raw.***.seg > moon.fidx
{
 ifstream ifsImgInfo(argv[1]);
 if (!ifsImgInfo)
 {
 cerr << "Cannot open " << argv[1] << " for input\n";
 return -1;
 }

    string strLine,strDocNum;
    int cnt = 0;
    while (getline(ifsImgInfo, strLine))
    {
        string::size_type idx;

        cnt++;
        if (cnt%2 == 1) // odd-numbered lines hold the document ID
        {
            strDocNum = strLine.substr(0,strLine.size());
            continue;
        }
        if (strLine[0]=='\0' || strLine[0]=='#' || strLine[0]=='\n')
        {
            continue;
        }

while ( (idx = strLine.find(SEPARATOR)) != string::npos ) // look for the next term separator
 {
 string tmp1 = strLine.substr(0,idx);
 cout << tmp1 << "\t" << strDocNum << endl;
 strLine = strLine.substr(idx + SEPARATOR.size());
 }

//if (cnt==100) break;
}

return 0;
}

author:http://hi.baidu.com/jrckkyy

author:http://blog.csdn.net/jrckkyy

posted @ 2009-12-10 23:02 學(xué)者站在巨人的肩膀上閱讀(1195) | 評(píng)論 (1) | 編輯收藏

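Learning Search Engines Top-Down: Analysis and Full Annotation of Peking University's Tianwang Search Engine TSE [6] Program Analysis of Building the Inverted Index (2)

The DocIndex program from the previous post takes a Tianwang.raw.***** file as input and produces three files: Doc.idx, Url.idx and DocId2Url.idx. Here we analyze the DocSegment program.

Its inputs are three files, Tianwang.raw.*****, Doc.idx and Url.idx.sort_uniq, and its output is the fully segmented file Tianwang.raw.***.seg.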

int main(int argc, char* argv[])
{
 string strLine, strFileName=argv[1];
 CUrl iUrl;
 vector<CUrl> vecCUrl;
 CDocument iDocument;
 vector<CDocument> vecCDocument;
 unsigned int docId = 0;

//ifstream ifs("Tianwang.raw.2559638448");
 ifstream ifs(strFileName.c_str()); //DocSegment Tianwang.raw.****
 if (!ifs)
 {
 cerr << "Cannot open " << strFileName << " for input\n";
 return -1;
 }

ifstream ifsUrl("Url.idx.sort_uniq"); // the URL dictionary, sorted and deduplicated
 if (!ifsUrl)
 {
 cerr << "Cannot open Url.idx.sort_uniq for input\n";
 return -1;
 }
 ifstream ifsDoc("Doc.idx"); // the document index file
 if (!ifsDoc)
 {
 cerr << "Cannot open Doc.idx for input\n";
 return -1;
 }

    while (getline(ifsUrl,strLine)) // walk the URL dictionary and load it into an in-memory vector
    {
        char chksum[33];
        int docid;

        memset(chksum, 0, 33);
        sscanf( strLine.c_str(), "%s%d", chksum, &docid );
        iUrl.m_sChecksum = chksum;
        iUrl.m_nDocId = docid;
        vecCUrl.push_back(iUrl);
    }

    while (getline(ifsDoc,strLine))     // walk the document index and load it into an in-memory vector
    {
        int docid,pos,length;
        char chksum[33];

        memset(chksum, 0, 33);
        sscanf( strLine.c_str(), "%d%d%d%s", &docid, &pos, &length,chksum );
        iDocument.m_nDocId = docid;
        iDocument.m_nPos = pos;
        iDocument.m_nLength = length;
        iDocument.m_sChecksum = chksum;
        vecCDocument.push_back(iDocument);
    }

strFileName += ".seg";
 ofstream fout(strFileName.c_str(), ios::in|ios::out|ios::trunc|ios::binary); // output file for the segmented data
 for ( docId=0; docId<MAX_DOC_ID; docId++ )
 {

        // find document according to docId
        int length = vecCDocument[docId+1].m_nPos - vecCDocument[docId].m_nPos -1;
        char *pContent = new char[length+1];
        memset(pContent, 0, length+1);
        ifs.seekg(vecCDocument[docId].m_nPos);
        ifs.read(pContent, length);

char *s;
s = pContent;

        // skip Head
        int bytesRead = 0,newlines = 0;
        while (newlines != 2 && bytesRead != HEADER_BUF_SIZE-1)
        {
            if (*s == '\n')
                newlines++;
            else
                newlines = 0;
            s++;
            bytesRead++;
        }
        if (bytesRead == HEADER_BUF_SIZE-1) { delete[] pContent; continue; } // free the buffer before skipping this document

        // skip header
        bytesRead = 0,newlines = 0;
        while (newlines != 2 && bytesRead != HEADER_BUF_SIZE-1)
        {
            if (*s == '\n')
                newlines++;
            else
                newlines = 0;
            s++;
            bytesRead++;
        }
        if (bytesRead == HEADER_BUF_SIZE-1) { delete[] pContent; continue; } // free the buffer before skipping this document

//iDocument.m_sBody = s;
 iDocument.RemoveTags(s); // strip <...> tags
 iDocument.m_sBodyNoTags = s;

delete[] pContent;
string strLine = iDocument.m_sBodyNoTags;

CStrFun::ReplaceStr(strLine, "&nbsp;", " ");
CStrFun::EmptyStr(strLine); // set " \t\r\n" to " "

// segment the document
 CHzSeg iHzSeg;
 strLine = iHzSeg.SegmentSentenceMM(iDict,strLine);
 fout << docId << endl << strLine;
 fout << endl;

 }

return(0);
}
This is only a quick skim through the code; later posts will cover techniques such as parsing HTML and segmenting documents in detail.

posted @ 2009-12-10 23:02 學(xué)者站在巨人的肩膀上閱讀(1157) | 評(píng)論 (1) | 編輯收藏

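Learning Search Engines Top-Down: Analysis and Full Annotation of Peking University's Tianwang Search Engine TSE [6] Program Analysis of Building the Inverted Index (1)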

author:http://hi.baidu.com/jrckkyy

author:http://blog.csdn.net/jrckkyy

The previous post introduced the files and intermediate files involved in building the inverted index.
In terms of the programs to run, TSE's index construction can be simplified into the following steps:

1. Run #./DocIndex
Uses one file, tianwang.raw.520 // the crawled raw file. It holds all the data for many pages, so it is large. This is still an open problem: one large file can exceed the 2 GB or 4 GB file size limit and makes indexing slow, while many small files make opening and closing file handles expensive. In the end the storage has to be distributed, because the total volume will certainly reach terabytes; TSE only supports small search engine workloads.
Produces three files: Doc.idx, Url.idx, DocId2Url.idx // Doc.idx, DocId2Url.idx and Url.idx in the Data folder

2. Run #sort Url.idx|uniq > Url.idx.sort_uniq // Url.idx.sort_uniq in the Data folder
Uses one file, Url.idx // pairs of the MD5 hash of each full URL and its document ID
Produces one file, Url.idx.sort_uniq // URLs deduplicated and sorted by MD5 hash to speed up lookups

3. Run #./DocSegment Tianwang.raw.2559638448
Uses one file, Tianwang.raw.2559638448 // the crawled file; each page includes its HTTP header. Segmentation prepares for building the inverted index
Produces one file, Tianwang.raw.2559638448.seg // the segmented file: alternating lines of a document ID and that document's segmented terms (only the text inside markup such as <head></head> and <body></body> within each document's <html></html> is segmented)

4. Run #./CrtForwardIdx Tianwang.raw.2559638448.seg > moon.fidx // build the standalone forward index

5. Run
#set | grep "LANG"
#LANG=en; export LANG;
#sort moon.fidx > moon.fidx.sort

6. Run #./CrtInvertedIdx moon.fidx.sort > sun.iidx // build the inverted index
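Step 2 relies on the shell's `sort` and `uniq`. For readers who want the same effect inside a program, here is a minimal C++ sketch; `sortUniq` is a hypothetical helper, not part of TSE. It compares lines byte by byte, which is an assumption about the ordering wanted; the shell's `sort` may order lines differently depending on the locale, which is presumably why step 5 sets `LANG` before sorting.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper (not part of TSE): sorts lines and removes exact
// duplicates, like `sort file | uniq` with byte-wise comparison.
std::vector<std::string> sortUniq(std::vector<std::string> lines)
{
    std::sort(lines.begin(), lines.end());
    // std::unique only removes adjacent duplicates, hence the sort first.
    lines.erase(std::unique(lines.begin(), lines.end()), lines.end());
    return lines;
}
```

Sorting the checksum lines is what later makes a binary search over Url.idx.sort_uniq possible.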

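We start the analysis with the first indexing program, DocIndex.cpp. (Convention used in the comments: Tianwang.raw.2559638448 is the large file merged from the crawled pages, called "the large file" from here on. It contains many HTML documents separated in a regular way, and each of them is called a document.)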

//DocIndex.h start-------------------------------------------------------------

#ifndef _COMM_H_040708_
#define _COMM_H_040708_

#include

#include
#include
#include
#include
#include
#include
#include

using namespace std;

const unsigned HEADER_BUF_SIZE = 1024;
const unsigned RstPerPage = 20; // number of results returned per page by the front-end search

//iceway
//const unsigned MAX_DOC_IDX_ID = 21312; // used in DocSegment.cpp
const unsigned MAX_DOC_IDX_ID = 22104;

//const string IMG_INFO_NAME("./Data/s1.1");
const string INF_INFO_NAME("./Data/sun.iidx"); // inverted index file
//朱德 14383 16151 16151 16151 1683 207 6302 7889 8218 8218 8637
//朱古力 1085 1222

// over 90,000 entries, including special symbols, punctuation and Chinese characters
const string DOC_IDX_NAME("./Data/Doc.idx"); // document index file
const string RAWPAGE_FILE_NAME("./Data/Tianwang.swu.iceway.1.0");

//iceway
const string DOC_FILE_NAME = "Tianwang.swu.iceway.1.0"; // used in DocIndex.cpp
const string Data_DOC_FILE_NAME = "./Data/Tianwang.swu.iceway.1.0"; // used in Snapshot.cpp

//const string RM_THUMBNAIL_FILES("rm -f ~/public_html/ImgSE/timg/*");

//const string THUMBNAIL_DIR("/ImgSE/timg/");

#endif // _COMM_H_040708_
//DocIndex.h end--------------------------------------------------------------

//DocIndex.cpp start-----------------------------------------------------------

#include
#include
#include "Md5.h"
#include "Url.h"
#include "Document.h"

//iceway(mnsc)
#include "Comm.h"
#include

using namespace std;

int main(int argc, char* argv[])
{
 //ifstream ifs("Tianwang.raw.2559638448");
//ifstream ifs("Tianwang.raw.3023555472");
//iceway(mnsc)
ifstream ifs(DOC_FILE_NAME.c_str()); // open the raw crawled file (e.g. Tianwang.raw.3023555472)
if (!ifs)
{
 cerr << "Cannot open " << DOC_FILE_NAME << " for input\n";
 return -1;
}
ofstream ofsUrl("Url.idx", ios::in|ios::out|ios::trunc|ios::binary); // create and open Url.idx
if( !ofsUrl )
{
 cout << "error open file " << endl;
}

ofstream ofsDoc("Doc.idx", ios::in|ios::out|ios::trunc|ios::binary); // create and open Doc.idx
if( !ofsDoc )
{
 cout << "error open file " << endl;
}

ofstream ofsDocId2Url("DocId2Url.idx", ios::in|ios::out|ios::trunc|ios::binary); // create and open DocId2Url.idx
if( !ofsDocId2Url )
{
 cout << "error open file " << endl;
}

int cnt=0; // document IDs start at 0
string strLine,strPage;
CUrl iUrl;
CDocument iDocument;
CMD5 iMD5;

int nOffset = ifs.tellg();
while (getline(ifs, strLine))
{
  if (strLine[0]=='\0' || strLine[0]=='#' || strLine[0]=='\n')
  {
   nOffset = ifs.tellg();
   continue;
  }

  if (!strncmp(strLine.c_str(), "version: 1.0", 12)) // if the first line is "version: 1.0", keep parsing
  {
   if(!getline(ifs, strLine)) break;
   if (!strncmp(strLine.c_str(), "url: ", 5)) // if the second line starts with "url: ", keep parsing
   {
    iUrl.m_sUrl = strLine.substr(5); // the URL follows the five characters of "url: "
    iMD5.GenerateMD5( (unsigned char*)iUrl.m_sUrl.c_str(), iUrl.m_sUrl.size() ); // MD5-hash the URL
    iUrl.m_sChecksum = iMD5.ToString(); // join the digest bytes into a string; implemented in Md5.h

   } else
   {
    continue;
   }

   while (getline(ifs, strLine))
   {
    if (!strncmp(strLine.c_str(), "length: ", 8)) // keep reading until the "length: " line (the fifth line of the record)
    {
     sscanf(strLine.substr(8).c_str(), "%d", &(iDocument.m_nLength)); // store the actual content length of this page in iDocument
     break;
    }
   }

   getline(ifs, strLine); // skip the deliberately blank sixth line of the record

   iDocument.m_nDocId = cnt; // assign the document ID
   iDocument.m_nPos = nOffset; // byte offset in the large file where this record starts (where the previous one ends)
   char *pContent = new char[iDocument.m_nLength+1]; // buffer sized to the document length

   memset(pContent, 0, iDocument.m_nLength+1); // zero-fill the buffer
   ifs.read(pContent, iDocument.m_nLength); // read the content (HTTP header included) using the length just parsed
   iMD5.GenerateMD5( (unsigned char*)pContent, iDocument.m_nLength );
   iDocument.m_sChecksum = iMD5.ToString(); // join the digest bytes into a string; implemented in Md5.h

   delete[] pContent;

   ofsUrl << iUrl.m_sChecksum ; // write the MD5 hash of the URL to Url.idx
   ofsUrl << "\t" << iDocument.m_nDocId << endl; // then a tab and the document ID

   ofsDoc << iDocument.m_nDocId ; // write the document ID to Doc.idx
   ofsDoc << "\t" << iDocument.m_nPos ; // then a tab and the record's start offset (also the end of the previous document)
   //ofsDoc << "\t" << iDocument.m_nLength ;
   ofsDoc << "\t" << iDocument.m_sChecksum << endl; // then a tab and the MD5 checksum of the document content

   ofsDocId2Url << iDocument.m_nDocId ; // write the document ID to DocId2Url.idx
   ofsDocId2Url << "\t" << iUrl.m_sUrl << endl; // then the document's full URL

   cnt++; // this document is done; move on to the next document ID
}

nOffset = ifs.tellg();

}

// the final line holds only a document ID and the end offset of the last document
ofsDoc << cnt ;
ofsDoc << "\t" << nOffset << endl;

return(0);
}

//DocIndex.cpp end-----------------------------------------------------------author:http://hi.baidu.com/jrckkyy

author:http://blog.csdn.net/jrckkyy
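The record layout that DocIndex.cpp walks through ("version: 1.0", then "url: ", more header lines, "length: N", a blank line, then N bytes of content) can be read with a small helper. This is a minimal sketch under stated assumptions: `RawRecord` and `readRecord` are hypothetical names, not TSE code, and the MD5 step is left out.

```cpp
#include <cstdlib>
#include <istream>
#include <string>

// Hypothetical struct (not part of TSE) holding one record of the large file.
struct RawRecord {
    std::string url;
    long offset;            // byte offset where the record starts
    long length;            // value of the "length: " header
    std::string content;    // HTTP header plus page body
};

// Reads the next record; returns false at end of input or on a malformed record.
bool readRecord(std::istream& in, RawRecord& rec)
{
    std::string line;
    long offset = static_cast<long>(in.tellg());
    while (std::getline(in, line)) {
        if (line.compare(0, 12, "version: 1.0") != 0) {
            offset = static_cast<long>(in.tellg());   // not a record start: remember where the next line begins
            continue;
        }
        rec.offset = offset;
        if (!std::getline(in, line) || line.compare(0, 5, "url: ") != 0) return false;
        rec.url = line.substr(5);                     // text after the 5 characters of "url: "
        rec.length = -1;
        while (std::getline(in, line)) {              // scan forward to the "length: " line
            if (line.compare(0, 8, "length: ") == 0) {
                rec.length = std::atol(line.c_str() + 8);
                break;
            }
        }
        if (rec.length < 0) return false;
        std::getline(in, line);                       // skip the blank separator line
        rec.content.assign(static_cast<std::size_t>(rec.length), '\0');
        in.read(&rec.content[0], rec.length);
        return in.gcount() == rec.length;             // false if the content was truncated
    }
    return false;
}
```

Storing the content in a `std::string` avoids the manual `new[]`/`delete[]` pair of the original listing.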

posted @ 2009-12-10 23:00 學(xué)者站在巨人的肩膀上閱讀(1330) | 評(píng)論 (1) | 編輯收藏

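Learning Search Engines Top-Down: Analysis and Full Annotation of Peking University's Tianwang Search Engine TSE [5] Building the Inverted Index and the Files Involved

Sorry to keep everyone waiting. I have been busy with exams, which are finally over. Enough chatter, let's begin!

TSE loads all the crawled page documents into one large file and then builds the index over the data in that file as a whole. This involves several steps.
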
1. The document index (Doc.idx) keeps information about each document.

It is a fixed width ISAM (Index sequential access mode) index, ordered by docID.

The information stored in each entry includes a pointer into the repository,

a document length, a document checksum.

//Doc.idx: document ID, document length, checksum hash

0 0 bc9ce846d7987c4534f53d423380ba70

1 76760 4f47a3cad91f7d35f4bb6b2a638420e5

2 141624 d019433008538f65329ae8e39b86026c

3 142350 5705b8f58110f9ad61b1321c52605795

//Doc.idx end

The url index (url.idx) is used to convert URLs into docIDs.

//url.idx

5c36868a9c5117eadbda747cbdb0725f 0

3272e136dd90263ee306a835c6c70d77 1

6b8601bb3bb9ab80f868d549b5c5a5f3 2

3f9eba99fa788954b5ff7f35a5db6e1f 3

//url.idx end

It is a list of URL checksums with their corresponding docIDs and is sorted by

checksum. In order to find the docID of a particular URL, the URL's checksum

is computed and a binary search is performed on the checksums file to find its

docID.
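The binary search described above can be sketched with `std::lower_bound`. This is a minimal sketch, assuming the sorted checksum file has already been loaded into a vector of (checksum, docID) pairs; `UrlEntry`, `checksumLess` and `findDocId` are hypothetical names, not TSE code.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> UrlEntry;   // (checksum, docID)

// Ordering used by lower_bound: compares an entry's checksum with the key.
static bool checksumLess(const UrlEntry& e, const std::string& key)
{
    return e.first < key;
}

// Returns the docID stored for `checksum`, or -1 if it is not present.
// `sortedEntries` must be sorted by checksum.
int findDocId(const std::vector<UrlEntry>& sortedEntries, const std::string& checksum)
{
    std::vector<UrlEntry>::const_iterator it =
        std::lower_bound(sortedEntries.begin(), sortedEntries.end(), checksum, checksumLess);
    if (it != sortedEntries.end() && it->first == checksum) return it->second;
    return -1;
}
```

The lookup takes a logarithmic number of comparisons, which is the reason for sorting the checksums in the first place.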

 ./DocIndex

 got Doc.idx, Url.idx, DocId2Url.idx // Doc.idx, DocId2Url.idx and Url.idx in the Data folder

//DocId2Url.idx

0 http://*.*.edu.cn/index.aspx

1 http://*.*.edu.cn/showcontent1.jsp?NewsID=118

2 http://*.*.edu.cn/0102.html

3 http://*.*.edu.cn/0103.html

//DocId2Url.idx end

2. sort Url.idx|uniq > Url.idx.sort_uniq // Url.idx.sort_uniq in the Data folder

//Url.idx.sort_uniq

// sorted by hash value

000bfdfd8b2dedd926b58ba00d40986b 1111

000c7e34b653b5135a2361c6818e48dc 1831

0019d12f438eec910a06a606f570fde8 366

0033f7c005ec776f67f496cd8bc4ae0d 2103

3. Segment document to terms, (with finding document according to the url)

 ./DocSegment Tianwang.raw.2559638448 // Tianwang.raw.2559638448 is the crawled file; each page includes its HTTP header

 got Tianwang.raw.2559638448.seg

//Tianwang.raw.2559638448: the crawled raw page file. Inside it, documents appear to be delimited by the version line, </html> and newlines

version: 1.0

url: http://***.105.138.175/Default2.asp?lang=gb

origin: http://***.105.138.175/

date: Fri, 23 May 2008 20:01:36 GMT

ip: 162.105.138.175

length: 38413

HTTP/1.1 200 OK

Server: Microsoft-IIS/5.0

Date: Fri, 23 May 2008 11:17:49 GMT

Connection: keep-alive

Connection: Keep-Alive

Content-Length: 38088

Content-Type: text/html; Charset=gb2312

Expires: Fri, 23 May 2008 11:17:49 GMT

Set-Cookie: ASPSESSIONIDSSTRDCAB=IMEOMBIAIPDFCKPAEDJFHOIH; path=/

Cache-control: private

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"

"
<html>

<head>

<title>Apabi數(shù)字資源平臺(tái)</title>

<meta http-equiv="Content-Type" content="text/html; charset=gb2312">

<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">

<META NAME="DESCRIPTION" CONTENT="數(shù)字圖書(shū)館方正數(shù)字圖書(shū)館電子圖書(shū) 電子書(shū) ebook e書(shū) Apabi 數(shù)字資源平臺(tái)">

<link rel="stylesheet" type="text/css" href="css\common.css">

<style type="text/css">



</style>

<script LANGUAGE="vbscript">

...

</script>

<Script Language="javascript">

...

</Script>

</head>

<body leftmargin="0" topmargin="0">

</body>

</html>

//Tianwang.raw.2559638448 end

//Tianwang.raw.2559638448.seg: each page is segmented into one line, as follows (note that there are no newlines inside a page)

1

...

...

...

2

...

...

...

//Tianwang.raw.2559638448.seg end

// The steps below are not strictly required for Tiny search

4. Create forward index (docid-->termid) // build the forward index

 ./CrtForwardIdx Tianwang.raw.2559638448.seg > moon.fidx

//Tianwang.raw.2559638448.seg: each page is segmented into one line, as follows // DocID, then the segmented terms: 1 三星/ s/ 手機/ 論壇/ ,/ 手機/ 鈴聲/ 下載/ ,/ 手機/ 圖片/ 下載/ ,/ 手機/ 2 ... ... ...

//Tianwang.raw.2559638448.seg end

//moon.fidx

// each line: a term segmented out of a document, then that document's DocID

都會 2391
使 2391
那些 2391
擁有 2391
它 2391
的 2391
人 2391
的 2391
視野 2391
變 2391
窄 2391
在 2180
研究生部 2180
主頁 2180
培養 2180
管理 2180
欄目 2180
下載 2180
） 2180
、 2180
關于 2180
做好 2180
年 2180
國家 2180
公派 2180
研究生 2180
項目 2180

//moon.fidx end

5.# set | grep "LANG"

LANG=en; export LANG;

sort moon.fidx > moon.fidx.sort

6. Create inverted index (termid-->docid) // build the inverted index

 ./CrtInvertedIdx moon.fidx.sort > sun.iidx

//sun.iidx // the file shrinks to roughly half the size

花工 236
花海 2103
花卉 1018 1061 1061 1061 1730 1730 1730 1730 1730 1852 949 949
花蕾 447 447
花木 1061
花呢 1430
花期 447 447 447 447 447 525
花錢 174 236
花色 1730 1730
花色品種 1660
花生 450 526
花式 1428 1430 1430 1430
花紋 1430 1430
花序 447 447 447 447 447 450
花絮 136 137
花芽 450 450

//sun.iidx end

TSESearch CGI program for query

Snapshot CGI program for page snapshot


author:http://hi.baidu.com/jrckkyy

author:http://blog.csdn.net/jrckkyy

posted @ 2009-12-10 22:55 學(xué)者站在巨人的肩膀上閱讀(1305) | 評(píng)論 (1) | 編輯收藏

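Learning Search Engines Top-Down: Analysis and Full Annotation of Peking University's Tianwang Search Engine TSE [4] Summary

After the previous three posts you should have an intuitive feel for the seemingly mysterious search engine. Much like a server running a scripting language such as PHP, it takes the keywords from the front end, segments them with a dictionary, runs a relevance analysis against a prebuilt inverted index, and formats the resulting set for output. The technical difficulties lie in:

1. Choosing the dictionary (language habits differ across eras and regions, so the smallest unit of the dictionary differs too)

2. Building the inverted index (this involves crawling and index construction, which later posts cover in depth; the bottleneck for a search engine's efficiency, service quality and freshness is here)

3. Relevance analysis (the algorithm that segments crawled documents for indexing must match the one that segments the user's keywords)

Later posts will focus on crawling and on building the index.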

posted @ 2009-12-10 22:54 學(xué)者站在巨人的肩膀上閱讀(996) | 評(píng)論 (0) | 編輯收藏

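Learning Search Engines Top-Down: Analysis and Full Annotation of Peking University's Tianwang Search Engine TSE [3] On to Keyword Segmentation and Relevance Analysis

From the earlier annotations we know that once the query keywords and the dictionary file are ready, the program enters the stage of segmenting the user's keywords.

//In TSESearch.cpp:
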
CHzSeg iHzSeg; //include ChSeg/HzSeg.h

//
iQuery.m_sSegQuery = iHzSeg.SegmentSentenceMM(iDict, iQuery.m_sQuery); // segment the query string received via GET into the form "我/ 愛/ 你們/ 的/"

vector<string> vecTerm;
iQuery.ParseQuery(vecTerm); // put the "/"-separated keywords into a vector, one by one and in order

set<string> setRelevantRst;
iQuery.GetRelevantRst(vecTerm, mapBuckets, setRelevantRst);

gettimeofday(&end_tv,&tz);
// search end

Now look at this method in CHzSeg:

//ChSeg/HzSeg.h
/**
* Translation notes
* Further cleans the data and converts Chinese characters
* @access public
* @param CDict, string  the dictionary and the query string
* @return string  the segmented sentence
*/
// process a sentence before segmentation
string CHzSeg::SegmentSentenceMM (CDict &dict, string s1) const
{
 string s2="";
 unsigned int i,len;

 while (!s1.empty())
 {
  unsigned char ch=(unsigned char) s1[0];
  if(ch<128)
  { // deal with ASCII
   i=1;
   len = s1.size();
   while (i<len && ((unsigned char)s1[i]<128)
    && (s1[i]!=10) && (s1[i]!=13))
   { // LF, CR
    i++;
   }

   if ((ch!=32) && (ch!=10) && (ch!=13))
   { // SP, LF, CR
    s2 += s1.substr(0,i) + SEPARATOR;
   }
   else
   {
    //if (ch==10 || ch==13)
    if (ch==10 || ch==32) // added by yhf
    {
     s2+=s1.substr(0,i);
    }
   }

   if (i <= s1.size()) // added by yhf
    s1=s1.substr(i);
   else break; //yhf

   continue;
  }
  else
  {
   if (ch<176)
   { // Chinese punctuation and other non-hanzi characters
    i = 0;
    len = s1.length();

    while(i<len && ((unsigned char)s1[i]<176) && ((unsigned char)s1[i]>=161)
     && (!((unsigned char)s1[i]==161 && ((unsigned char)s1[i+1]>=162 && (unsigned char)s1[i+1]<=168)))
     && (!((unsigned char)s1[i]==161 && ((unsigned char)s1[i+1]>=171 && (unsigned char)s1[i+1]<=191)))
     && (!((unsigned char)s1[i]==163 && ((unsigned char)s1[i+1]==172 || (unsigned char)s1[i+1]==161)
     || (unsigned char)s1[i+1]==168 || (unsigned char)s1[i+1]==169 || (unsigned char)s1[i+1]==186
     || (unsigned char)s1[i+1]==187 || (unsigned char)s1[i+1]==191)))
    {
     i=i+2; // assume there are no half characters
    }

    if (i==0) i=i+2;

    // do not emit the full-width Chinese space
    if (!(ch==161 && (unsigned char)s1[1]==161))
    {
     if (i <= s1.size()) // yhf
      // other double-byte non-hanzi characters may be output consecutively
      s2 += s1.substr(0, i) + SEPARATOR;
     else break; // yhf
    }

    if (i <= s1.size()) // yhf
     s1=s1.substr(i);
    else break; //yhf

    continue;
   }
  }

  // the code below handles a run of Chinese characters

  i = 2;
  len = s1.length();

  while(i<len && (unsigned char)s1[i]>=176)
//  while(i<len && (unsigned char)s1[i]>=128 && (unsigned char)s1[i]!=161)
   i+=2;

  s2+=SegmentHzStrMM(dict, s1.substr(0,i));

  if (i <= len) // yhf
   s1=s1.substr(i);
  else break; // yhf
 }

 return s2;
}
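The branching in SegmentSentenceMM depends only on the value of the leading byte: below 128 is ASCII, 128 to 175 is treated as Chinese punctuation or other non-hanzi symbols, and 176 and above starts a Chinese character. A minimal sketch of that classification follows; `ByteClass`, `classifyLead` and `hanziRunLength` are hypothetical names, and the two-bytes-per-character assumption matches the gb2312 charset seen in the crawled page headers.

```cpp
#include <string>

// Hypothetical names (not part of TSE) for the three cases the code distinguishes.
enum ByteClass { ASCII_BYTE, NON_HANZI_LEAD, HANZI_LEAD };

// Classifies a leading byte using the same thresholds as SegmentSentenceMM.
ByteClass classifyLead(unsigned char ch)
{
    if (ch < 128) return ASCII_BYTE;
    if (ch < 176) return NON_HANZI_LEAD;   // punctuation and other double-byte symbols
    return HANZI_LEAD;                      // lead byte of a Chinese character
}

// Length in bytes of the run of Chinese characters at the start of s,
// assuming two bytes per character and never reading past the end.
std::string::size_type hanziRunLength(const std::string& s)
{
    std::string::size_type i = 0;
    while (i + 1 < s.size() && static_cast<unsigned char>(s[i]) >= 176) i += 2;
    return i;
}
```

Unlike the original loop, this sketch checks `i + 1 < s.size()` so that a truncated trailing byte is never counted as half of a character.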

//Query.cpp

/**
* Translation notes
* Puts the keywords separated by "/" into a vector, one by one and in order
*
* @access public
* @param vector<string>  the vector that receives the terms
* @return void
*/
void CQuery::ParseQuery(vector<string> &vecTerm)
{
 string::size_type idx;
 while ( (idx = m_sSegQuery.find("/ ")) != string::npos ) {
  vecTerm.push_back(m_sSegQuery.substr(0,idx));
  m_sSegQuery = m_sSegQuery.substr(idx+3);
 }
}
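ParseQuery advances by a hard-coded 3 characters after each match, which is only correct when the separator is exactly three characters long. A minimal sketch that derives the step from the separator itself is shown below; `splitTerms` is a hypothetical free function, not the TSE method, and the separator is passed in as a parameter.

```cpp
#include <string>
#include <vector>

// Hypothetical variant of ParseQuery (not part of TSE): splits a segmented
// query on `sep` and returns the non-empty terms in order.
std::vector<std::string> splitTerms(std::string segQuery, const std::string& sep)
{
    std::vector<std::string> terms;
    if (sep.empty()) return terms;               // an empty separator would never advance
    std::string::size_type idx;
    while ((idx = segQuery.find(sep)) != std::string::npos) {
        if (idx > 0) terms.push_back(segQuery.substr(0, idx));
        segQuery = segQuery.substr(idx + sep.size());   // step by the real separator length
    }
    return terms;
}
```

Tying the step to `sep.size()` keeps the search pattern and the skip distance from drifting apart if the separator is ever changed.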
/**
* Translation notes
* Relevance query: builds the result set setRelevantRst // this is the bottleneck
*
* @access public
* @param vector<string> map<string,string> set<string>  the segmented terms of the user's query, the inverted index map, the relevance result set
* @return bool  always true
*/
bool CQuery::GetRelevantRst
(
 vector<string> &vecTerm,
 map<string,string> &mapBuckets,
 set<string> &setRelevantRst
) const
{
 set<string> setSRst;

 bool bFirst=true;
 vector<string>::iterator itTerm = vecTerm.begin();

 for ( ; itTerm != vecTerm.end(); ++itTerm )
 {

  setSRst.clear();
  copy(setRelevantRst.begin(), setRelevantRst.end(), inserter(setSRst,setSRst.begin()));

  map<string,int> mapRstDoc;
  string docid;
  int doccnt;

  map<string,string>::iterator itBuckets = mapBuckets.find(*itTerm);
  if (itBuckets != mapBuckets.end())
  {
   string strBucket = (*itBuckets).second;
   string::size_type idx;
   idx = strBucket.find_first_not_of(" ");
   strBucket = strBucket.substr(idx);

   while ( (idx = strBucket.find(" ")) != string::npos )
   {
    docid = strBucket.substr(0,idx);
    doccnt = 0;

    if (docid.empty()) continue;

    map<string,int>::iterator it = mapRstDoc.find(docid);
    if ( it != mapRstDoc.end() )
    {
     doccnt = (*it).second + 1;
     mapRstDoc.erase(it);
    }
    mapRstDoc.insert( pair<string,int>(docid,doccnt) );

    strBucket = strBucket.substr(idx+1);
   }

   // remember the last one
   docid = strBucket;
   doccnt = 0;
   map<string,int>::iterator it = mapRstDoc.find(docid);
   if ( it != mapRstDoc.end() )
   {
    doccnt = (*it).second + 1;
    mapRstDoc.erase(it);
   }
   mapRstDoc.insert( pair<string,int>(docid,doccnt) );
  }

  // sort by term frequency
  multimap<int, string, greater<int> > newRstDoc;
  map<string,int>::iterator it0 = mapRstDoc.begin();
  for ( ; it0 != mapRstDoc.end(); ++it0 ){
   newRstDoc.insert( pair<int,string>((*it0).second,(*it0).first) );
  }

  multimap<int, string, greater<int> >::iterator itNewRstDoc = newRstDoc.begin();
  setRelevantRst.clear();
  for ( ; itNewRstDoc != newRstDoc.end(); ++itNewRstDoc ){
   string docid = (*itNewRstDoc).second;

   if (bFirst==true) {
    setRelevantRst.insert(docid);
    continue;
   }

   if ( setSRst.find(docid) != setSRst.end() ){
    setRelevantRst.insert(docid);
   }
  }

  //cout << "setRelevantRst.size(): " << setRelevantRst.size() << " ";
  bFirst = false;
 }
 return true;
}
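Stripped of bookkeeping, GetRelevantRst keeps the documents that occur in the posting list of every query term. The following is a minimal sketch of that idea, not the TSE implementation: `countDocs` and `relevantDocs` are hypothetical names, posting lists are assumed to be space-separated docid strings as in sun.iidx, and counts start at 1 rather than 0. Since the result is a set of docid strings, the frequency ordering is not visible in it, so the sketch skips the sort.

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

typedef std::map<std::string, std::string> Buckets;   // term -> "docid docid ..."

// Counts how often each docid occurs in one posting list.
std::map<std::string, int> countDocs(const std::string& bucket)
{
    std::map<std::string, int> counts;
    std::istringstream in(bucket);
    std::string docid;
    while (in >> docid) ++counts[docid];
    return counts;
}

// Returns the docids that appear in the posting list of every term.
std::set<std::string> relevantDocs(const std::vector<std::string>& terms,
                                   const Buckets& buckets)
{
    std::set<std::string> result;
    for (std::size_t t = 0; t < terms.size(); ++t) {
        std::set<std::string> current;
        Buckets::const_iterator b = buckets.find(terms[t]);
        if (b != buckets.end()) {
            std::map<std::string, int> counts = countDocs(b->second);
            std::map<std::string, int>::const_iterator it = counts.begin();
            for ( ; it != counts.end(); ++it) {
                // The first term seeds the result; later terms intersect with it.
                if (t == 0 || result.count(it->first)) current.insert(it->first);
            }
        }
        result.swap(current);   // a term missing from the index empties the result
    }
    return result;
}
```

Re-parsing the posting-list string for every query term is the costly part, which fits the "bottleneck" remark in the original comment.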
What follows is the display step. Everything so far only processed data to obtain the result set setRelevantRst, so I won't say more about it here. The rest is much like a scripting language such as PHP: format the result set and display it.

view plaincopy to clipboardprint?/** * 程序翻譯說(shuō)明 * 將以"/"劃分開(kāi)的關(guān)鍵字一一順序放入一個(gè)向量容器中 * * @access public * @param vector<STRING></STRING> 參數(shù)的漢字說(shuō)明：向量容器 * @return void */ void CQuery::ParseQuery(vector<STRING></STRING> &vecTerm) { string::size_type idx; while ( (idx = m_sSegQuery.find("/ ")) != string::npos ) { vecTerm.push_back(m_sSegQuery.substr(0,idx)); m_sSegQuery = m_sSegQuery.substr(idx+3); } } /**
* 程序翻譯說(shuō)明
* 將以"/"劃分開(kāi)的關(guān)鍵字一一順序放入一個(gè)向量容器中
*
* @access public
* @param vector 參數(shù)的漢字說(shuō)明：向量容器
* @return void
*/
void CQuery::ParseQuery(vector &vecTerm)
{
string::size_type idx;
while ( (idx = m_sSegQuery.find("/ ")) != string::npos ) {
 vecTerm.push_back(m_sSegQuery.substr(0,idx));
 m_sSegQuery = m_sSegQuery.substr(idx+3);
}
}

view plaincopy to clipboardprint?
view plaincopy to clipboardprint?<PRE class=csharp name="code">/** * 程序翻譯說(shuō)明 * 相關(guān)性分析查詢，構(gòu)造結(jié)果集合setRelevantRst //瓶頸所在 * * @access public * @param vector<STRING></STRING> map set<STRING></STRING> 參數(shù)的漢字說(shuō)明：用戶提交關(guān)鍵字的分詞組，倒排索引映射，相關(guān)性結(jié)果集合 * @return string 0 */ bool CQuery::GetRelevantRst ( vector<STRING></STRING> &vecTerm, map &mapBuckets, set<STRING></STRING> &setRelevantRst ) const { set<STRING></STRING> setSRst; bool bFirst=true; vector<STRING></STRING>::iterator itTerm = vecTerm.begin(); for ( ; itTerm != vecTerm.end(); ++itTerm ) { setSRst.clear(); copy(setRelevantRst.begin(), setRelevantRst.end(), inserter(setSRst,setSRst.begin())); map mapRstDoc; string docid; int doccnt; map::iterator itBuckets = mapBuckets.find(*itTerm); if (itBuckets != mapBuckets.end()) { string strBucket = (*itBuckets).second; string::size_type idx; idx = strBucket.find_first_not_of(" "); strBucket = strBucket.substr(idx); while ( (idx = strBucket.find(" ")) != string::npos ) { docid = strBucket.substr(0,idx); doccnt = 0; if (docid.empty()) continue; map::iterator it = mapRstDoc.find(docid); if ( it != mapRstDoc.end() ) { doccnt = (*it).second + 1; mapRstDoc.erase(it); } mapRstDoc.insert( pair(docid,doccnt) ); strBucket = strBucket.substr(idx+1); } // remember the last one docid = strBucket; doccnt = 0; map::iterator it = mapRstDoc.find(docid); if ( it != mapRstDoc.end() ) { doccnt = (*it).second + 1; mapRstDoc.erase(it); } mapRstDoc.insert( pair(docid,doccnt) ); } // sort by term frequencty multimap > newRstDoc; map::iterator it0 = mapRstDoc.begin(); for ( ; it0 != mapRstDoc.end(); ++it0 ){ newRstDoc.insert( pair((*it0).second,(*it0).first) ); } multimap::iterator itNewRstDoc = newRstDoc.begin(); setRelevantRst.clear(); for ( ; itNewRstDoc != newRstDoc.end(); ++itNewRstDoc ){ string docid = (*itNewRstDoc).second; if (bFirst==true) { setRelevantRst.insert(docid); continue; } if ( setSRst.find(docid) != setSRst.end() ){ setRelevantRst.insert(docid); } 
} //cout << "setRelevantRst.size(): " << setRelevantRst.size() << " "; bFirst = false; } return true; }</PRE> view plaincopy to clipboardprint?/** * 程序翻譯說(shuō)明 * 相關(guān)性分析查詢，構(gòu)造結(jié)果集合setRelevantRst //瓶頸所在 * * @access public * @param vector<STRING></STRING> map set<STRING></STRING> 參數(shù)的漢字說(shuō)明：用戶提交關(guān)鍵字的分詞組，倒排索引映射，相關(guān)性結(jié)果集合 * @return string 0 */ bool CQuery::GetRelevantRst ( vector<STRING></STRING> &vecTerm, map &mapBuckets, set<STRING></STRING> &setRelevantRst ) const { set<STRING></STRING> setSRst; bool bFirst=true; vector<STRING></STRING>::iterator itTerm = vecTerm.begin(); for ( ; itTerm != vecTerm.end(); ++itTerm ) { setSRst.clear(); copy(setRelevantRst.begin(), setRelevantRst.end(), inserter(setSRst,setSRst.begin())); map mapRstDoc; string docid; int doccnt; map::iterator itBuckets = mapBuckets.find(*itTerm); if (itBuckets != mapBuckets.end()) { string strBucket = (*itBuckets).second; string::size_type idx; idx = strBucket.find_first_not_of(" "); strBucket = strBucket.substr(idx); while ( (idx = strBucket.find(" ")) != string::npos ) { docid = strBucket.substr(0,idx); doccnt = 0; if (docid.empty()) continue; map::iterator it = mapRstDoc.find(docid); if ( it != mapRstDoc.end() ) { doccnt = (*it).second + 1; mapRstDoc.erase(it); } mapRstDoc.insert( pair(docid,doccnt) ); strBucket = strBucket.substr(idx+1); } // remember the last one docid = strBucket; doccnt = 0; map::iterator it = mapRstDoc.find(docid); if ( it != mapRstDoc.end() ) { doccnt = (*it).second + 1; mapRstDoc.erase(it); } mapRstDoc.insert( pair(docid,doccnt) ); } // sort by term frequencty multimap > newRstDoc; map::iterator it0 = mapRstDoc.begin(); for ( ; it0 != mapRstDoc.end(); ++it0 ){ newRstDoc.insert( pair((*it0).second,(*it0).first) ); } multimap::iterator itNewRstDoc = newRstDoc.begin(); setRelevantRst.clear(); for ( ; itNewRstDoc != newRstDoc.end(); ++itNewRstDoc ){ string docid = (*itNewRstDoc).second; if (bFirst==true) { setRelevantRst.insert(docid); 
continue; } if ( setSRst.find(docid) != setSRst.end() ){ setRelevantRst.insert(docid); } } //cout << "setRelevantRst.size(): " << setRelevantRst.size() << " "; bFirst = false; } return true; } /**
* 程序翻譯說(shuō)明
* 相關(guān)性分析查詢，構(gòu)造結(jié)果集合setRelevantRst //瓶頸所在
*
* @access public
* @param vector map set 參數(shù)的漢字說(shuō)明：用戶提交關(guān)鍵字的分詞組，倒排索引映射，相關(guān)性結(jié)果集合
* @return string 0
*/
bool CQuery::GetRelevantRst
(
    vector<string> &vecTerm,
    map<string,string> &mapBuckets,
    set<string> &setRelevantRst
) const
{
    set<string> setSRst;    // results accumulated from the previous terms

    bool bFirst = true;
    vector<string>::iterator itTerm = vecTerm.begin();

    for ( ; itTerm != vecTerm.end(); ++itTerm )
    {
        // snapshot the current results before rebuilding them for this term
        setSRst.clear();
        copy(setRelevantRst.begin(), setRelevantRst.end(), inserter(setSRst,setSRst.begin()));

        map<string,int> mapRstDoc;    // docid -> occurrence count (first hit counts as 0)
        string docid;
        int doccnt;

        map<string,string>::iterator itBuckets = mapBuckets.find(*itTerm);
        if (itBuckets != mapBuckets.end())
        {
            // the bucket is a space-separated list of doc ids
            string strBucket = (*itBuckets).second;
            string::size_type idx;
            idx = strBucket.find_first_not_of(" ");
            // guard: substr(npos) would throw on an empty or all-space bucket
            strBucket = (idx == string::npos) ? string() : strBucket.substr(idx);

            while ( (idx = strBucket.find(" ")) != string::npos )
            {
                docid = strBucket.substr(0,idx);
                doccnt = 0;

                // consume the token first, so an empty docid cannot loop forever
                strBucket = strBucket.substr(idx+1);
                if (docid.empty()) continue;

                map<string,int>::iterator it = mapRstDoc.find(docid);
                if ( it != mapRstDoc.end() )
                {
                    doccnt = (*it).second + 1;
                    mapRstDoc.erase(it);
                }
                mapRstDoc.insert( pair<string,int>(docid,doccnt) );
            }

            // remember the last one (skipped when the bucket ends in a space)
            docid = strBucket;
            doccnt = 0;
            if (!docid.empty())
            {
                map<string,int>::iterator it = mapRstDoc.find(docid);
                if ( it != mapRstDoc.end() )
                {
                    doccnt = (*it).second + 1;
                    mapRstDoc.erase(it);
                }
                mapRstDoc.insert( pair<string,int>(docid,doccnt) );
            }
        }

        // sort by term frequency, highest first (greater<int> comparator assumed)
        multimap<int, string, greater<int> > newRstDoc;
        map<string,int>::iterator it0 = mapRstDoc.begin();
        for ( ; it0 != mapRstDoc.end(); ++it0 ){
            newRstDoc.insert( pair<int,string>((*it0).second,(*it0).first) );
        }

        multimap<int, string, greater<int> >::iterator itNewRstDoc = newRstDoc.begin();
        setRelevantRst.clear();
        for ( ; itNewRstDoc != newRstDoc.end(); ++itNewRstDoc ){
            string docid = (*itNewRstDoc).second;

            // the first term seeds the result set
            if (bFirst==true) {
                setRelevantRst.insert(docid);
                continue;
            }

            // later terms keep only the docs that every earlier term also matched
            if ( setSRst.find(docid) != setSRst.end() ){
                setRelevantRst.insert(docid);
            }
        }

        //cout << "setRelevantRst.size(): " << setRelevantRst.size() << "";
        bFirst = false;
    }
    return true;
}
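
To make the idea behind GetRelevantRst easier to see on its own, here is a minimal standalone sketch of the same two steps: count the doc ids in one space-separated bucket, then keep only the doc ids that appear for every term. CountDocIds and IntersectTerms are hypothetical helper names and are not part of TSE. The sketch tokenizes with std::istringstream instead of find/substr, and it counts occurrences from 1 rather than from 0, so treat it as an illustration under those assumptions rather than a drop-in replacement.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of TSE): count how often each doc id
// occurs in a space-separated posting list such as "d1 d2 d1".
std::map<std::string, int> CountDocIds(const std::string &bucket)
{
    std::map<std::string, int> counts;
    std::istringstream in(bucket);
    std::string docid;
    while (in >> docid) {   // operator>> skips runs of whitespace
        ++counts[docid];    // operator[] starts a new entry at 0
    }
    return counts;
}

// Hypothetical helper: doc ids that occur in the bucket of every term.
// A term that is missing from the index empties the result, which is
// also what GetRelevantRst does.
std::set<std::string> IntersectTerms(
    const std::vector<std::string> &terms,
    const std::map<std::string, std::string> &buckets)
{
    std::set<std::string> result;
    bool first = true;
    for (const std::string &term : terms) {
        std::map<std::string, int> counts;
        auto it = buckets.find(term);
        if (it != buckets.end()) counts = CountDocIds(it->second);

        std::set<std::string> next;
        for (const auto &entry : counts) {
            // the first term seeds the set, later terms can only narrow it
            if (first || result.count(entry.first)) next.insert(entry.first);
        }
        result.swap(next);
        first = false;
    }
    return result;
}
```

Because the final container is a set of doc ids, the frequency ordering is lost once the results are merged, which is why the sketch does not bother with the multimap sort step.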

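Next comes the display step. Everything up to this point only processed data to produce the query result set setRelevantRst, so I won't go over it again. What follows is much like what you would write in a scripting language such as PHP: format the result set and print it.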
//TSESearch.cpp

//Display starts here
    CDisplayRst iDisplayRst;
    iDisplayRst.ShowTop();

    float used_msec = (end_tv.tv_sec-begin_tv.tv_sec)*1000
        +((float)(end_tv.tv_usec-begin_tv.tv_usec))/(float)1000;

    iDisplayRst.ShowMiddle(iQuery.m_sQuery,used_msec,
            setRelevantRst.size(), iQuery.m_iStart);

    iDisplayRst.ShowBelow(vecTerm,setRelevantRst,vecDocIdx,iQuery.m_iStart);
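
The used_msec arithmetic above can be checked in isolation. The sketch below assumes begin_tv and end_tv are POSIX struct timeval values (for example filled in by gettimeofday); ElapsedMsec is a hypothetical helper name, not something TSE defines. It also shows that a negative microsecond difference is fine, since the seconds term makes up for it.

```cpp
#include <cassert>
#include <sys/time.h>

// Hypothetical helper: milliseconds elapsed between two timeval samples.
// Seconds are scaled up to ms and microseconds are scaled down to ms.
double ElapsedMsec(const timeval &begin, const timeval &end)
{
    return (end.tv_sec - begin.tv_sec) * 1000.0
         + (end.tv_usec - begin.tv_usec) / 1000.0;
}
```

For instance, going from 1 s + 900000 us to 2 s + 100000 us gives 1000 ms from the seconds term and -800 ms from the microseconds term, 200 ms in total.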

posted @ 2009-12-10 22:53 學者站在巨人的肩膀上 Views (996) | Comments (0)
