Sending alert e-mail from Python
http://www.shnenglu.com/jrckkyy/archive/2010/03/15/109755.html
Author: 学者站在巨人的肩膀上 — Mon, 15 Mar 2010 11:24:00 GMT

A while ago I spent some time learning Python, and recently I finished a project that monitors basic server metrics — a distributed system that logs the data and raises alarms. Each server is sampled once per minute, about 1,440 records per server per day; with the few dozen servers currently monitored that comes to tens of thousands of log records a day, and the single monitoring-center server still has performance headroom to spare. I would need more servers to test against, but I estimate it can scale to monitoring 100+ servers.

Now a new requirement has come up: when an alarm fires, notify the people concerned in real time. Since the company's SMS gateway cannot reach Shanghai Telecom numbers (sigh), I had to fall back on sending e-mail instead.

The script supports sending GB18030-encoded text content and attachments of any encoding; with small modifications it can be adapted to send other kinds of payloads.

 

#!/usr/lib/python2.5/bin/python
#coding=utf-8
import os
import sys
import time
from smtplib import SMTP
from email.MIMEMultipart import MIMEMultipart
from email.mime.application import MIMEApplication
from email.MIMEText import MIMEText

SMTP_SERVER = "mail.×××.com"
SMTP_PORT = "25"
USERNAME = "×××@×××.com"
USERPASSWORD = "×××"
FROM = "MonitorCenterWarning@×××.com"
TO = "×××@gmail.com"

def sendFileByMail(config):
    print 'Preparing...'
    message = MIMEMultipart()
    message['from'] = config['from']
    message['to'] = config['to']
    message['Reply-To'] = config['from']
    message['Subject'] = config['subject']
    message['Date'] = time.ctime(time.time())
    message['X-Priority'] = '3'
    message['X-MSMail-Priority'] = 'Normal'
    message['X-Mailer'] = 'Microsoft Outlook Express 6.00.2900.2180'
    message['X-MimeOLE'] = 'Produced By Microsoft MimeOLE V6.00.2900.2180'

    if 'file' in config:
        # attach the file
        f = open(config['file'], 'rb')
        attachment = MIMEApplication(f.read())
        f.close()
        attachment.add_header('Content-Disposition', 'attachment',
                              filename=os.path.basename(config['file']))
        message.attach(attachment)

    if 'content' in config:
        # attach the text body
        f = open(config['content'], 'rb')
        content = f.read()
        f.close()
        body = MIMEText(content, 'base64', 'gb2312')
        message.attach(body)

    print 'OK'
    print 'Logging in...'
    smtp = SMTP(config['server'], config['port'])
    # comment out the next line if your SMTP server sends without login
    smtp.login(config['username'], config['password'])
    print 'OK'

    print 'Sending...',
    smtp.sendmail(config['from'], [config['from'], config['to']],
                  message.as_string())
    print 'OK'
    smtp.close()
    time.sleep(1)

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print 'Usage: python %s contentfilename' % os.path.basename(sys.argv[0])
        print 'OR Usage: python %s contentfilename attachfilename' % os.path.basename(sys.argv[0])
        wait = raw_input("quit.")
        sys.exit(-1)
    elif len(sys.argv) == 2:
        sendFileByMail({
            'from': FROM,
            'to': TO,
            'subject': '[MonitorCenter]Send Msg %s' % sys.argv[1],
            'content': sys.argv[1],
            'server': SMTP_SERVER,
            'port': SMTP_PORT,
            'username': USERNAME,
            'password': USERPASSWORD})
    elif len(sys.argv) == 3:
        sendFileByMail({
            'from': FROM,
            'to': TO,
            'subject': '[MonitorCenter]Send Msg and File %s %s' % (sys.argv[1], sys.argv[2]),
            'content': sys.argv[1],
            'file': sys.argv[2],
            'server': SMTP_SERVER,
            'port': SMTP_PORT,
            'username': USERNAME,
            'password': USERPASSWORD})
    wait = raw_input("end.")

 

On Windows XP:

[screenshot: example run]

On Linux (Ubuntu, SUSE):

[screenshot: example run]

The received result:

[screenshot: the mail as received]



[Distributed cross-platform monitoring system] Monitoring network traffic and speed on Linux with a Python script
http://www.shnenglu.com/jrckkyy/archive/2010/03/15/109754.html
Author: 学者站在巨人的肩膀上 — Mon, 15 Mar 2010 11:22:00 GMT

Our SSE/SZSE level-1/level-2 market-data servers see heavy traffic during trading hours — roughly four hours a day, from 9:30 to 11:30 in the morning and 13:00 to 15:00 in the afternoon — so network speed is an important metric for the monitored servers. Over a sampling interval you can sum the upstream and downstream byte counts of each NIC and divide by the interval length to get the average speed for that period. My collector currently samples once a minute, and during actual trading hours the readings it produces are fairly accurate, holding at around 5 MB/s. Once, outside service hours, I saw a server's internal NIC running at several MB/s — sure enough, someone was doing a big manual file transfer.

The standalone monitoring script returns a list of tuples; the results are later aggregated into a complete XML data island. For easier debugging, every intermediate result of the script is also dumped to a temporary text file.

To run the script below, make sure ethtool is installed on your Linux box. It has been tested on Ubuntu (2.6.27-7-server) and SUSE (2.6.27.19-5-default) kernels.

Code:

 

#!/usr/bin/python
#coding=utf-8
import re
import os
import time

import utils

def sortedDictValues3(adict):
    keys = adict.keys()
    keys.sort()
    return map(adict.get, keys)

def run():
    if utils.isLinux() == False:
        return [('ifconfig_collect os type error', 'this is windows')]
    # not the first run: a snapshot from the previous run exists
    if os.path.isfile('./oldifconfig'):
        fileold = open('./oldifconfig', 'r')
        fileold.seek(0)
        # read back the timestamp and the ifconfig output recorded last time
        (oldtime, fileoldcontent) = fileold.read().split('#')
        fileold.close()
        netcard = {}
        key = ''
        for strline in fileoldcontent.split('\n'):
            if re.search('^lo', strline):
                break
            if re.search('^eth', strline):
                key = strline.split()[0]
                netcard[key] = ''
            netcard[key] = netcard[key] + strline + '\n'
        RXold = {}
        TXold = {}
        for key, value in netcard.items():
            tempsplit = value.split('\n')
            netcard[key] = ''
            for item in tempsplit:
                item = item + '<br>'
                netcard[key] = netcard[key] + item
                tempcount = 1
                for match in re.finditer("(bytes:)(.*?)( \()", item):
                    if tempcount == 1:
                        RXold[key] = match.group(2)
                        tempcount = tempcount + 1
                    elif tempcount == 2:
                        TXold[key] = match.group(2)
                        netcard[key] = netcard[key] + 'net io percent(bytes/s): 0 <br>'

        # record the current ifconfig output into the snapshot file
        os.system('ifconfig > ifconfigtemp')
        file = open('./ifconfigtemp', 'r')
        fileold = open('./oldifconfig', 'w')
        fileold.write(str(int(time.time())))
        fileold.write('#')
        file.seek(0)
        fileold.write(file.read())
        fileold.close()
        netcard = {}
        file.seek(0)
        key = ''
        for strline in file.readlines():
            if re.search('^lo', strline):
                break
            if re.search('^eth', strline):
                key = strline.split()[0]
                netcard[key] = ''
            netcard[key] = netcard[key] + strline
        newnetcard = {}
        file.seek(0)
        key = ''
        for strline in file.readlines():
            if re.search('^lo', strline):
                break
            if re.search('^eth', strline):
                templist = strline.split()
                key = templist[0]
                newnetcard[key] = templist[4] + ' '
            if re.search('^ *inet ', strline):
                templist = strline.split()
                newnetcard[key] = templist[1][5:] + ' ' + newnetcard[key] + ' '
        for key, value in newnetcard.items():
            # record whether each NIC has a link, via ethtool
            os.system('ethtool %s > ethtooltemp' % (key))
            file = open('./ethtooltemp', 'r')
            tempethtooltemplist = file.read().split('\n\t')
            file.close()
            if re.search('yes', tempethtooltemplist[-1]):
                templist = newnetcard[key].split()
                newnetcard[key] = templist[0] + ' running! ' + templist[1]
            else:
                templist = newnetcard[key].split()
                if len(templist) > 1:
                    newnetcard[key] = templist[0] + ' stop! ' + templist[1]
                else:
                    newnetcard[key] = 'stop! ' + templist[0]
        RX = {}
        TX = {}
        for key, value in netcard.items():
            tempsplit = value.split('\n')
            netcard[key] = ''
            for item in tempsplit:
                item = item + '<br>'
                netcard[key] = netcard[key] + item
                tempcount = 1
                for match in re.finditer("(bytes:)(.*?)( \()", item):
                    if tempcount == 1:
                        RX[key] = str(int(match.group(2)) - int(RXold[key]))
                        tempcount = tempcount + 1
                    elif tempcount == 2:
                        TX[key] = str(int(match.group(2)) - int(TXold[key]))
                        divtime = float(int(time.time()) - int(oldtime))
                        if divtime == 0:
                            rate = float(TX[key]) + float(RX[key])
                        else:
                            rate = (float(TX[key]) + float(RX[key])) / divtime
                        if rate == 0:
                            newnetcard[key] = '0' + ' ' + newnetcard[key]
                        else:
                            newnetcard[key] = '%.2f' % rate + ' ' + newnetcard[key]
        return zip(['order'], ['48']) + newnetcard.items()
    else:
        # first run: just take a snapshot; all rates are reported as 0
        os.system('ifconfig > ifconfigtemp')
        file = open('./ifconfigtemp', 'r')
        fileold = open('./oldifconfig', 'w')
        fileold.write(str(int(time.time())))
        fileold.write('#')
        file.seek(0)
        fileold.write(file.read())
        fileold.close()

        netcard = {}
        file.seek(0)
        key = ''
        for strline in file.readlines():
            if re.search('^lo', strline):
                break
            if re.search('^eth', strline):
                key = strline.split()[0]
                netcard[key] = ''
            netcard[key] = netcard[key] + strline
        RX = {}
        TX = {}

        key = ''
        newnetcard = {}
        file.seek(0)
        for strline in file.readlines():
            if re.search('^lo', strline):
                break
            if re.search('^eth', strline):
                templist = strline.split()
                key = templist[0]
                newnetcard[key] = templist[4] + ' '
            if re.search('^ *inet ', strline):
                templist = strline.split()
                newnetcard[key] = newnetcard[key] + templist[1][5:] + ' '
        for key, value in newnetcard.items():
            os.system('ethtool %s > ethtooltemp' % (key))
            file = open('./ethtooltemp', 'r')
            tempethtooltemplist = file.read().split('\n')
            file.close()
            if re.search('yes', tempethtooltemplist[-1]):
                newnetcard[key] = newnetcard[key] + 'running!'
            else:
                newnetcard[key] = newnetcard[key] + 'stop!'
        for key, value in netcard.items():
            tempsplit = value.split('\n')
            netcard[key] = ''
            for item in tempsplit:
                item = item + '<br>'
                netcard[key] = netcard[key] + item
                tempcount = 1
                for match in re.finditer("(bytes:)(.*?)( \()", item):
                    if tempcount == 1:
                        RX[key] = match.group(2)
                        tempcount = tempcount + 1
                    elif tempcount == 2:
                        TX[key] = match.group(2)
                        netcard[key] = netcard[key] + 'net io percent(bytes/s): 0 <br>'
                        newnetcard[key] = newnetcard[key] + ' ' + '0 <br>'
        return zip(['order'], ['48']) + newnetcard.items()

if __name__ == '__main__':
    print run()

 

Usage example:

[screenshot: sample output of the script]

In each list element (a tuple), the first field of the second item is the network speed in bytes/s — for example, eth1 is running at about 1.3 KB/s and eth0 at about 2.9 KB/s. Today is Saturday, so traffic this low is normal.

Top-down into search engines — PKU Tianwang (TSE) analyzed and fully annotated [6]: building the inverted index, program analysis (4)
http://www.shnenglu.com/jrckkyy/archive/2009/12/10/102949.html
Author: 学者站在巨人的肩膀上 — Thu, 10 Dec 2009 15:03:00 GMT

Below is the annotated code that builds the inverted index from the forward index.

 

// CrtInvertedIdx.cpp — usage: ./CrtInvertedIdx moon.fidx.sort > sun.iidx
#include <iostream>
#include <fstream>
#include <string>

using namespace std;

int main(int argc, char* argv[])
{
    ifstream ifsImgInfo(argv[1]);
    if (!ifsImgInfo) 
    {
        cerr << "Cannot open " << argv[1] << " for input\n";
        return -1;
    }

    string strLine, strDocNum, tmp1 = "";
    int cnt = 0;
    while (getline(ifsImgInfo, strLine)) 
    {
        string::size_type idx;
        string tmp;

        idx = strLine.find("\t");
        tmp = strLine.substr(0, idx);      // the term of this "term\tdocid" line

        if (tmp.size() < 2 || tmp.size() > 8) continue;

        if (tmp1.empty()) tmp1 = tmp;

        if (tmp == tmp1)   // same term as the previous line: append this docid
        {
            strDocNum = strDocNum + " " + strLine.substr(idx+1);
        }
        else               // new term: flush the previous term's posting list
        {
            if ( strDocNum.empty() )
                strDocNum = strDocNum + " " + strLine.substr(idx+1);

            cout << tmp1 << "\t" << strDocNum << endl;
            tmp1 = tmp;
            strDocNum.clear();
            strDocNum = strDocNum + " " + strLine.substr(idx+1);
        }

        cnt++;
        //if (cnt==100) break;
    }
    // in the inverted index, each dictionary term is separated from its
    // list of document numbers by a tab
    cout << tmp1 << "\t" << strDocNum << endl;

    return 0;
}

 

 



Top-down into search engines — PKU Tianwang (TSE) analyzed and fully annotated [6]: building the inverted index, program analysis (3)
http://www.shnenglu.com/jrckkyy/archive/2009/12/10/102948.html
Author: 学者站在巨人的肩膀上 — Thu, 10 Dec 2009 15:02:00 GMT

This part covers building the forward index. Building the inverted index directly would be quite inefficient, so we first produce a forward index as the foundation for the inverted index that follows.

 

Detailed descriptions of the files involved were already given in part [5], "Building the inverted index and the files involved".

 

The CrtForwardIdx.cpp file:

 

// CrtForwardIdx.cpp — usage: ./CrtForwardIdx Tianwang.raw.***.seg > moon.fidx
#include <iostream>
#include <fstream>
#include <string>

using namespace std;

// SEPARATOR comes from TSE's common header; judging from the .seg samples
// it is the "/  " sequence the segmenter emits between terms (assumed here)
const string SEPARATOR = "/  ";

int main(int argc, char* argv[])
{
    ifstream ifsImgInfo(argv[1]);
    if (!ifsImgInfo) 
    {
        cerr << "Cannot open " << argv[1] << " for input\n";
        return -1;
    }

    string strLine, strDocNum;
    int cnt = 0;
    while (getline(ifsImgInfo, strLine)) 
    {
        string::size_type idx;

        cnt++;
        if (cnt%2 == 1)   // odd lines carry the document number
        {
            strDocNum = strLine.substr(0, strLine.size());
            continue;
        }
        if (strLine[0]=='\0' || strLine[0]=='#' || strLine[0]=='\n')
        {
            continue;
        }

        while ( (idx = strLine.find(SEPARATOR)) != string::npos )  // split on the term separator
        {
            string tmp1 = strLine.substr(0, idx);
            cout << tmp1 << "\t" << strDocNum << endl;
            strLine = strLine.substr(idx + SEPARATOR.size());
        }

        //if (cnt==100) break;
    }

    return 0;
}

 

author:http://hi.baidu.com/jrckkyy

author:http://blog.csdn.net/jrckkyy

 

 



Top-down into search engines — PKU Tianwang (TSE) analyzed and fully annotated [6]: building the inverted index, program analysis (2)
http://www.shnenglu.com/jrckkyy/archive/2009/12/10/102947.html
Author: 学者站在巨人的肩膀上 — Thu, 10 Dec 2009 15:02:00 GMT

The DocIndex program described earlier takes a Tianwang.raw.***** file as input and produces three files: Doc.idx, Url.idx and DocId2Url.idx. Here we analyze the DocSegment program.

DocSegment takes three inputs — Tianwang.raw.*****, Doc.idx and Url.idx.sort_uniq — and outputs Tianwang.raw.***.seg, the fully segmented document file.

// DocSegment.cpp — usage: ./DocSegment Tianwang.raw.****
// (TSE's own headers — Url.h, Document.h, HzSeg.h, StrFun.h, etc. — are
// assumed to be included above, as in the original source.)
int main(int argc, char* argv[])
{
    string strLine, strFileName = argv[1];
    CUrl iUrl;
    vector<CUrl> vecCUrl;
    CDocument iDocument;
    vector<CDocument> vecCDocument;
    unsigned int docId = 0;

    //ifstream ifs("Tianwang.raw.2559638448");
    ifstream ifs(strFileName.c_str());
    if (!ifs) 
    {
        cerr << "Cannot open " << strFileName << " for input\n";
        return -1;
    }

    ifstream ifsUrl("Url.idx.sort_uniq");   // the sorted, de-duplicated url dictionary
    if (!ifsUrl) 
    {
        cerr << "Cannot open Url.idx.sort_uniq for input\n";
        return -1;
    }
    ifstream ifsDoc("Doc.idx");             // the document dictionary file
    if (!ifsDoc) 
    {
        cerr << "Cannot open Doc.idx for input\n";
        return -1;
    }

    while (getline(ifsUrl, strLine))   // load the url dictionary into a vector in memory
    {
        char chksum[33];
        int  docid;

        memset(chksum, 0, 33);
        sscanf( strLine.c_str(), "%s%d", chksum, &docid );
        iUrl.m_sChecksum = chksum;
        iUrl.m_nDocId = docid;
        vecCUrl.push_back(iUrl);
    }

    while (getline(ifsDoc, strLine))   // load the document dictionary into a vector in memory
    {
        int docid, pos, length;
        char chksum[33];

        memset(chksum, 0, 33);
        sscanf( strLine.c_str(), "%d%d%d%s", &docid, &pos, &length, chksum );
        iDocument.m_nDocId = docid;
        iDocument.m_nPos = pos;
        iDocument.m_nLength = length;
        iDocument.m_sChecksum = chksum;
        vecCDocument.push_back(iDocument);
    }

    strFileName += ".seg";
    ofstream fout(strFileName.c_str(), ios::in|ios::out|ios::trunc|ios::binary);  // output file for the segmented data
    for ( docId=0; docId<MAX_DOC_ID; docId++ )
    {
        // find the document's extent in the big file according to docId
        int length = vecCDocument[docId+1].m_nPos - vecCDocument[docId].m_nPos - 1;
        char *pContent = new char[length+1];
        memset(pContent, 0, length+1);
        ifs.seekg(vecCDocument[docId].m_nPos);
        ifs.read(pContent, length);

        char *s;
        s = pContent;

        // skip the crawler's record head (it ends at a blank line)
        int bytesRead = 0, newlines = 0;
        while (newlines != 2 && bytesRead != HEADER_BUF_SIZE-1) 
        {
            if (*s == '\n')
                newlines++;
            else
                newlines = 0;
            s++;
            bytesRead++;
        }
        if (bytesRead == HEADER_BUF_SIZE-1) continue;

        // skip the HTTP header (it also ends at a blank line)
        bytesRead = 0, newlines = 0;
        while (newlines != 2 && bytesRead != HEADER_BUF_SIZE-1) 
        {
            if (*s == '\n')
                newlines++;
            else
                newlines = 0;
            s++;
            bytesRead++;
        }
        if (bytesRead == HEADER_BUF_SIZE-1) continue;

        //iDocument.m_sBody = s;
        iDocument.RemoveTags(s);    // strip <...> markup
        iDocument.m_sBodyNoTags = s;

        delete[] pContent;
        string strLine = iDocument.m_sBodyNoTags;

        CStrFun::ReplaceStr(strLine, "&nbsp;", " ");
        CStrFun::EmptyStr(strLine); // set " \t\r\n" to " "

        // segment the document — the actual word segmentation
        CHzSeg iHzSeg;
        strLine = iHzSeg.SegmentSentenceMM(iDict, strLine);
        fout << docId << endl << strLine;
        fout << endl;
    }

    return(0);
}
This was only a quick skim over the code; later I will write dedicated posts covering the details of parsing HTML and segmenting documents.

 

 



Top-down into search engines — PKU Tianwang (TSE) analyzed and fully annotated [6]: building the inverted index, program analysis (1)
http://www.shnenglu.com/jrckkyy/archive/2009/12/10/102945.html
Author: 学者站在巨人的肩膀上 — Thu, 10 Dec 2009 15:00:00 GMT


The previous article introduced the files and intermediate files involved in building the inverted index.
At the program level, TSE's index building can be simplified into the following steps:

1. Run ./DocIndex
Uses one file: tianwang.raw.520 — the raw crawled file. It holds the complete contents of many web pages, so it is very large, and that raises an unresolved question: store one big file (beyond 2 GB/4 GB you hit file-size limits, and indexing such a file is inefficient), or many small files (too many files and the cost of opening and closing file handles dominates)? Storage will ultimately have to be distributed — the total volume will certainly reach the TB range — but TSE only targets small search-engine workloads.
Produces three files: Doc.idx, Url.idx, DocId2Url.idx (Doc.idx, DocId2Url.idx and Url.idx in the Data folder).

2. Run sort Url.idx|uniq > Url.idx.sort_uniq    (Url.idx.sort_uniq in the Data folder)
Uses one file: Url.idx — pairs of md5-hashed full URLs and document ids.
Produces one file: Url.idx.sort_uniq — the URLs de-duplicated and sorted by md5 hash, which speeds up lookup.

3. Run ./DocSegment Tianwang.raw.2559638448
Uses one file: Tianwang.raw.2559638448 — the crawled file, each page including its HTTP header; segmentation prepares for building the inverted index later.
Produces one file: Tianwang.raw.2559638448.seg — the segmented file, consisting of one line with the document id followed by one line with the document's segmented terms (only the text inside markers such as <html></html>, <head></head> and <body></body> is segmented).

4. Run ./CrtForwardIdx Tianwang.raw.2559638448.seg > moon.fidx    — build the standalone forward index.

5. Run:
#set | grep "LANG"
#LANG=en; export LANG;
#sort moon.fidx > moon.fidx.sort

6. Run ./CrtInvertedIdx moon.fidx.sort > sun.iidx    — build the inverted index.

We begin the analysis with the first indexing program, DocIndex.cpp. (Naming convention: Tianwang.raw.2559638448 is the big merged crawl file, called "the big file" below; it contains many HTML documents, regularly delimited from one another, each called "a document".)


//DocIndex.h start-------------------------------------------------------------

#ifndef _COMM_H_040708_
#define _COMM_H_040708_

// (the original include names were lost in the HTML conversion; these are
// the standard headers the code below needs)
#include <string>
#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <map>
#include <algorithm>
#include <cstdio>

using namespace std;

const unsigned HEADER_BUF_SIZE = 1024;
const unsigned RstPerPage = 20; // number of search results returned per page to the front end

//iceway
//const unsigned MAX_DOC_IDX_ID = 21312;  // used in DocSegment.cpp
const unsigned MAX_DOC_IDX_ID = 22104;


//const string IMG_INFO_NAME("./Data/s1.1");
const string INF_INFO_NAME("./Data/sun.iidx"); // inverted index file
// e.g. 朱德    14383 16151 16151 16151 1683 207 6302 7889 8218 8218 8637
//      朱古力  1085 1222

// 90,000+ entries; the character file covers special symbols, punctuation and hanzi
const string DOC_IDX_NAME("./Data/Doc.idx"); // document index file
const string RAWPAGE_FILE_NAME("./Data/Tianwang.swu.iceway.1.0");

//iceway
const string DOC_FILE_NAME = "Tianwang.swu.iceway.1.0";              // used in DocIndex.cpp
const string Data_DOC_FILE_NAME = "./Data/Tianwang.swu.iceway.1.0";  // used in Snapshot.cpp


//const string RM_THUMBNAIL_FILES("rm -f ~/public_html/ImgSE/timg/*");

//const string THUMBNAIL_DIR("/ImgSE/timg/");


#endif // _COMM_H_040708_
//DocIndex.h end--------------------------------------------------------------

//DocIndex.cpp start-----------------------------------------------------------

#include <iostream>
#include <fstream>
#include "Md5.h"
#include "Url.h"
#include "Document.h"

//iceway(mnsc)
#include "Comm.h"
#include <cstdio>

using namespace std;

int main(int argc, char* argv[])
{
    //ifstream ifs("Tianwang.raw.2559638448");
    //ifstream ifs("Tianwang.raw.3023555472");
    //iceway(mnsc)
    ifstream ifs(DOC_FILE_NAME.c_str()); // open the raw crawl file
    if (!ifs)
    {
        cerr << "Cannot open " << "tianwang.img.info" << " for input\n";
        return -1;
    }
    ofstream ofsUrl("Url.idx", ios::in|ios::out|ios::trunc|ios::binary); // create and open Url.idx
    if( !ofsUrl )
    {
        cout << "error open file " << endl;
    }

    ofstream ofsDoc("Doc.idx", ios::in|ios::out|ios::trunc|ios::binary); // create and open Doc.idx
    if( !ofsDoc )
    {
        cout << "error open file " << endl;
    }

    ofstream ofsDocId2Url("DocId2Url.idx", ios::in|ios::out|ios::trunc|ios::binary); // create and open DocId2Url.idx
    if( !ofsDocId2Url )
    {
        cout << "error open file " << endl;
    }

    int cnt = 0; // document ids count from 0
    string strLine, strPage;
    CUrl iUrl;
    CDocument iDocument;
    CMD5 iMD5;

    int nOffset = ifs.tellg();
    while (getline(ifs, strLine))
    {
        if (strLine[0]=='\0' || strLine[0]=='#' || strLine[0]=='\n')
        {
            nOffset = ifs.tellg();
            continue;
        }

        if (!strncmp(strLine.c_str(), "version: 1.0", 12)) // parse on only if the first line is "version: 1.0"
        {
            if(!getline(ifs, strLine)) break;
            if (!strncmp(strLine.c_str(), "url: ", 4)) // parse on only if the second line starts with "url: "
            {
                iUrl.m_sUrl = strLine.substr(5); // the url itself, after the "url: " prefix
                iMD5.GenerateMD5( (unsigned char*)iUrl.m_sUrl.c_str(), iUrl.m_sUrl.size() ); // md5-hash the url
                iUrl.m_sChecksum = iMD5.ToString(); // join the char array into a string (implemented in Md5.h)
            } else
            {
                continue;
            }

            while (getline(ifs, strLine))
            {
                if (!strncmp(strLine.c_str(), "length: ", 8)) // keep reading until the "length: " line
                {
                    sscanf(strLine.substr(8).c_str(), "%d", &(iDocument.m_nLength)); // store the page's actual content length in iDocument
                    break;
                }
            }

            getline(ifs, strLine); // skip the deliberate blank line after the record head

            iDocument.m_nDocId = cnt;     // assign the document id
            iDocument.m_nPos = nOffset;   // this record's offset in the big file
            char *pContent = new char[iDocument.m_nLength+1]; // buffer of the document's length

            memset(pContent, 0, iDocument.m_nLength+1); // zero-initialize
            ifs.read(pContent, iDocument.m_nLength); // read the document body (HTTP header included), per the recorded length
            iMD5.GenerateMD5( (unsigned char*)pContent, iDocument.m_nLength );
            iDocument.m_sChecksum = iMD5.ToString(); // join the char array into a string (implemented in Md5.h)

            delete[] pContent;

            ofsUrl << iUrl.m_sChecksum;                      // md5 hash of the url, into Url.idx
            ofsUrl << "\t" << iDocument.m_nDocId << endl;    // tab, then the document id

            ofsDoc << iDocument.m_nDocId;                    // document id, into Doc.idx
            ofsDoc << "\t" << iDocument.m_nPos;              // tab, then this record's offset (also where the next record begins)
            //ofsDoc << "\t" << iDocument.m_nLength;
            ofsDoc << "\t" << iDocument.m_sChecksum << endl; // tab, then the document's md5 checksum

            ofsDocId2Url << iDocument.m_nDocId;              // document id, into DocId2Url.idx
            ofsDocId2Url << "\t" << iUrl.m_sUrl << endl;     // tab, then the document's full url

            cnt++; // this document is done; move on to the next id
        }

        nOffset = ifs.tellg();

    }

    // the last line holds only the final document count and the end offset
    ofsDoc << cnt;
    ofsDoc << "\t" << nOffset << endl;


    return(0);
}

//DocIndex.cpp end-----------------------------------------------------------


 

 



Top-down into search engines — PKU Tianwang (TSE) analyzed and fully annotated [5]: building the inverted index and the files involved
http://www.shnenglu.com/jrckkyy/archive/2009/12/10/102943.html
Author: 学者站在巨人的肩膀上 — Thu, 10 Dec 2009 14:55:00 GMT

Sorry to have kept everyone waiting — I was busy with exams for a while, but they are finally over. Without further ado, let's begin!

TSE loads all the crawled web documents into one big file and then builds a single unified index over the data in that file, which involves several steps.

1.  The document index (Doc.idx) keeps information about each document.  
 
It is a fixed width ISAM (Index sequential access mode) index, orderd by docID.  
 
The information stored in each entry includes a pointer into the repository,  
 
a document length, a document checksum.  
 
 
 
//Doc.idx  文~号 文长度    checksum hash?nbsp; 
 
0   0   bc9ce846d7987c4534f53d423380ba70  
 
1   76760   4f47a3cad91f7d35f4bb6b2a638420e5  
 
2   141624  d019433008538f65329ae8e39b86026c  
 
3   142350  5705b8f58110f9ad61b1321c52605795  
 
//Doc.idx   end  
 
 
 
  The url index (url.idx) is used to convert URLs into docIDs.  
 
 
 
//url.idx  
 
5c36868a9c5117eadbda747cbdb0725f    0 
 
3272e136dd90263ee306a835c6c70d77    1 
 
6b8601bb3bb9ab80f868d549b5c5a5f3    2 
 
3f9eba99fa788954b5ff7f35a5db6e1f    3 
 
//url.idx   end  
 
 
 
It is a list of URL checksums with their corresponding docIDs and is sorted by  
 
checksum. In order to find the docID of a particular URL, the URL's checksum  
 
is computed and a binary search is performed on the checksums file to find its  
 
docID.  
 
 
 
    ./DocIndex  
 
        got Doc.idx, Url.idx, DocId2Url.idx //Data文g夹中的Doc.idx DocId2Url.idx和Doc.idx?nbsp; 
 
 
 
//DocId2Url.idx  
 
0   http://*.*.edu.cn/index.aspx  
 
1   http://*.*.edu.cn/showcontent1.jsp?NewsID=118  
 
2   http://*.*.edu.cn/0102.html  
 
3   http://*.*.edu.cn/0103.html  
 
//DocId2Url.idx end  
 
 
 
2.  sort Url.idx|uniq > Url.idx.sort_uniq    //Data文g夹中的Url.idx.sort_uniq  
 
 
 
//Url.idx.sort_uniq  
 
//对hashD行排?nbsp; 
 
000bfdfd8b2dedd926b58ba00d40986b    1111 
 
000c7e34b653b5135a2361c6818e48dc    1831 
 
0019d12f438eec910a06a606f570fde8    366 
 
0033f7c005ec776f67f496cd8bc4ae0d    2103 
 
 
 
3. Segment document to terms, (with finding document according to the url)  
 
    ./DocSegment Tianwang.raw.2559638448        //Tianwang.raw.2559638448为爬回来的文?Q每个页面包含http?nbsp; 
 
        got Tianwang.raw.2559638448.seg       
 
 
 
//Tianwang.raw.2559638448   爬取的原始网|件在文内部每一个文之间应该是通过versionQ?lt;/html>和回车做标志位分割的  
 
version: 1.0 
 
url: http://***.105.138.175/Default2.asp?lang=gb  
 
origin: http://***.105.138.175/  
 
date: Fri, 23 May 2008 20:01:36 GMT  
 
ip: 162.105.138.175 
 
length: 38413 
 
 
 
HTTP/1.1 200 OK  
 
Server: Microsoft-IIS/5.0 
 
Date: Fri, 23 May 2008 11:17:49 GMT  
 
Connection: keep-alive  
 
Connection: Keep-Alive  
 
Content-Length: 38088 
 
Content-Type: text/html; Charset=gb2312  
 
Expires: Fri, 23 May 2008 11:17:49 GMT  
 
Set-Cookie: ASPSESSIONIDSSTRDCAB=IMEOMBIAIPDFCKPAEDJFHOIH; path=/  
 
Cache-control: private 
 
 
 
 
 
 
 
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">

<html>  
 
<head>  
 
<title>Apabi数字资源平台</title>  
 
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">  
 
<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">  
 
<META NAME="DESCRIPTION" CONTENT="数字图书馆 方正数字图书馆 电子图书 电子书 ebook e书 Apabi 数字资源平台">  
 
<link rel="stylesheet" type="text/css" href="css\common.css">  
 
 
 
<style type="text/css">  
 
<!--  
 
.style4 {color: #666666}  
 
-->  
 
</style>  
 
 
 
<script LANGUAGE="vbscript">  
 
...  
 
</script>  
 
 
 
<Script Language="javascript">  
 
...  
 
</Script>  
 
</head>  
 
<body leftmargin="0" topmargin="0">  
 
</body>  
 
</html>  
 
//Tianwang.raw.2559638448   end  
 
 
 
//Tianwang.raw.2559638448.seg   each page is segmented onto one line, as below (note: no carriage return is used as a separator in between)

1

...

...

...

2

...

...

...

//Tianwang.raw.2559638448.seg   end  
 
 
 
// The steps below are optional for Tiny search

4. Create forward index (docid-->termid)     // build the forward index

    ./CrtForwardIdx Tianwang.raw.2559638448.seg > moon.fidx  
 
 
 
//Tianwang.raw.2559638448.seg   each page becomes a DocID line followed by its segmented terms, e.g.:
1
三星/  s/  手机/  论坛/  ,/  手机/  铃声/  下载/  ,/  手机/  图片/  下载/  ,/  手机/
2
...
...
...
//Tianwang.raw.2559638448.seg   end

 
 
//moon.fidx  

// one "term  DocID" pair per term segmented out of each document  

都会  2391 
?   2391 
那些  2391 
拥有  2391 
?   2391 
?   2391 
?   2391 
?   2391 
视野  2391 
?   2391 
?   2391 
?   2180 
研究生部    2180 
主页  2180 
培养  2180 
管理  2180 
栏目  2180 
下载  2180 
?   2180 
?   2180 
关于  2180 
做好  2180 
?   2180 
国家  2180 
公派  2180 
研究生  2180 
项目  2180 

//moon.fidx end  
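Step 4 boils down to replaying the .seg file as (term, docID) pairs. A minimal sketch of what CrtForwardIdx does (an illustrative reimplementation with a hypothetical function name, assuming the .seg layout above: a docID line followed by one line of "/  "-separated terms):

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

// Emit one "term<TAB>docid" line per segmented term (moon.fidx-style).
void EmitForwardIndex(std::istream& in, std::ostream& out) {
    std::string docid, terms;
    // Each record: a docID line, then a line of "/  "-separated terms.
    while (std::getline(in, docid) && std::getline(in, terms)) {
        std::string::size_type pos;
        while ((pos = terms.find("/  ")) != std::string::npos) {
            std::string term = terms.substr(0, pos);
            if (!term.empty())
                out << term << '\t' << docid << '\n';
            terms = terms.substr(pos + 3);  // skip the 3-char separator
        }
    }
}
```

Sorting this output (step 5) groups all postings of a term together, which is what makes the inverted-index merge in step 6 a single pass.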
 
 
 
5. Sort the forward index. Export LANG=en first so that sort uses plain byte-wise collation:

# set | grep "LANG"

LANG=en; export LANG;  

sort moon.fidx > moon.fidx.sort  
 
 
 
6. Create inverted index (termid-->docid)    // build the inverted index
 
    ./CrtInvertedIdx moon.fidx.sort > sun.iidx  
 
 
 
//sun.iidx  // roughly half the size of the forward-index file

花工   236 
?   2103 
花卉   1018 1061 1061 1061 1730 1730 1730 1730 1730 1852 949 949 
?   447 447 
花木   1061 
花呢   1430 
花期   447 447 447 447 447 525 
花钱   174 236 
?   1730 1730 
?品种     1660 
?   450 526 
花式   1428 1430 1430 1430 
?   1430 1430 
花序   447 447 447 447 447 450 
?   136 137 
?   450 450 

//sun.iidx  end  
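Step 6 is essentially one merge pass over the sorted pairs. A sketch (illustrative reimplementation, hypothetical function name) that concatenates the docIDs of equal terms onto one posting line:

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>

// Merge sorted "term docid" lines (moon.fidx.sort-style) into one
// "term<TAB>docid docid ..." line per term (sun.iidx-style).
std::string BuildInvertedIndex(std::istream& in) {
    std::ostringstream out;
    std::string term, docid, curTerm, postings;
    while (in >> term >> docid) {
        if (term != curTerm) {             // a new term begins
            if (!curTerm.empty())
                out << curTerm << '\t' << postings << '\n';
            curTerm = term;
            postings.clear();
        }
        if (!postings.empty()) postings += ' ';
        postings += docid;                 // append this posting
    }
    if (!curTerm.empty())                  // flush the last term
        out << curTerm << '\t' << postings << '\n';
    return out.str();
}
```

Because the input is already sorted by term, no in-memory map is needed; the merge runs in a single streaming pass, which matters at crawl scale.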
 
 
 
TSESearch   CGI program for query  
 
Snapshot    CGI program for page snapshot  
 
 
author: http://hi.baidu.com/jrckkyy  
author: http://blog.csdn.net/jrckkyy  

 



Learning search engines top-down: analysis and full annotation of the PKU Tianwang search engine TSE [4]: Summary
(http://www.shnenglu.com/jrckkyy/archive/2009/12/10/102942.html, Thu, 10 Dec 2009)

The previous three articles should have given you an intuitive feel for the once-mysterious search engine. Much like an ordinary PHP-style server-side script, it fetches the keywords from the front end, segments them against a dictionary, runs a relevance analysis against the prebuilt inverted index, and formats the matching results for output. The hard technical points are:

1. Dictionary selection (language habits differ across eras and regions, so the smallest units in the dictionary differ as well).

2. Building the inverted index (this involves the crawler's fetching and the construction of the index, both covered in depth later; a search engine's efficiency, service quality, and freshness bottlenecks live here).

3. Relevance analysis (the segmentation algorithm used when indexing fetched documents must correspond to the one applied to user keywords).

Later articles will focus on the crawler's fetching and on building the index.

Learning search engines top-down: analysis and full annotation of the PKU Tianwang search engine TSE [3]: Keyword segmentation and relevance analysis
(http://www.shnenglu.com/jrckkyy/archive/2009/12/10/102941.html, Thu, 10 Dec 2009)

From the annotations so far we know that once the query keywords and the dictionary file are ready, the user-keyword segmentation stage begins.

//TSESearch.cpp:

 CHzSeg iHzSeg;      // include ChSeg/HzSeg.h

 // segment the query string fetched from the front end into the
 // form "term1/  term2/  term3/  "
 iQuery.m_sSegQuery = iHzSeg.SegmentSentenceMM(iDict, iQuery.m_sQuery);

 vector<string> vecTerm;
 iQuery.ParseQuery(vecTerm);     // push the "/"-separated keywords, in order, into a vector

 set<string> setRelevantRst;
 iQuery.GetRelevantRst(vecTerm, mapBuckets, setRelevantRst);

 gettimeofday(&end_tv, &tz);
 // search end

Now look at this method in CHzSeg:

//ChSeg/HzSeg.h
/**
 * Pre-processes a sentence before segmentation: cleans the data and
 * splits off ASCII runs and non-hanzi double-byte characters, handing
 * only hanzi runs to the dictionary segmenter.
 * @access  public
 * @param   dict the dictionary; s1 the query string
 * @return  the segmented string
 */
// process a sentence before segmentation
string CHzSeg::SegmentSentenceMM (CDict &dict, string s1) const
{
    string s2="";
    unsigned int i,len;

    while (!s1.empty())
    {
        unsigned char ch=(unsigned char) s1[0];
        if(ch<128)
        { // deal with ASCII
            i=1;
            len = s1.size();
            while (i<len && (unsigned char)s1[i]<128 && s1[i]!=13)  // ch==13: CR, added by yhf
                i++;

            if (ch!=32 && ch!=13)       // neither space nor CR
                s2 += s1.substr(0, i) + SEPARATOR;
            else if (ch==13)            // keep the carriage return itself
                s2 += s1.substr(0, i);

            if (i <= s1.size())  // yhf
                s1 = s1.substr(i);
            else break;          // yhf

            continue;
        }
        else if (ch<176)
        { // deal with Chinese punctuation and other non-hanzi double-byte characters
            i = 0;
            len = s1.length();

            while (i<len && (unsigned char)s1[i]<176 && (unsigned char)s1[i]>=161
              && (!((unsigned char)s1[i]==161 && ((unsigned char)s1[i+1]>=162 && (unsigned char)s1[i+1]<=168)))
              && (!((unsigned char)s1[i]==161 && ((unsigned char)s1[i+1]>=171 && (unsigned char)s1[i+1]<=191)))
              && (!((unsigned char)s1[i]==163 && ((unsigned char)s1[i+1]==172 || (unsigned char)s1[i+1]==161
              || (unsigned char)s1[i+1]==168 || (unsigned char)s1[i+1]==169 || (unsigned char)s1[i+1]==186
              || (unsigned char)s1[i+1]==187 || (unsigned char)s1[i+1]==191))))
            {
                i=i+2; // assume there is no half hanzi
            }

            if (i==0) i=i+2;

            // do not output the full-width Chinese space
            if (!(ch==161 && (unsigned char)s1[1]==161))
            {
                if (i <= s1.size())  // yhf
                    // other non-hanzi double-byte characters may be output consecutively
                    s2 += s1.substr(0, i) + SEPARATOR;
                else break; // yhf
            }

            if (i <= s1.size())  // yhf
                s1 = s1.substr(i);
            else break;     // yhf

            continue;
        }

        // from here on, handle a run of hanzi

        i = 2;
        len = s1.length();

        while (i<len && (unsigned char)s1[i]>=176)
            i+=2;

        s2 += SegmentHzStrMM(dict, s1.substr(0,i));

        if (i <= len)    // yhf
            s1 = s1.substr(i);
        else break; // yhf
    }

    return s2;
}
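SegmentSentenceMM hands each hanzi run to SegmentHzStrMM (not shown), which does dictionary-based forward maximum matching. A minimal sketch of that idea, using a toy dictionary and a hypothetical SegmentMM helper rather than the TSE implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>

// Forward maximum matching over a toy dictionary: at each position take
// the longest dictionary word (up to maxLen units), otherwise a single
// unit. Single chars serve as units here for clarity; TSE walks GB2312
// double-byte hanzi instead.
std::string SegmentMM(const std::set<std::string>& dict,
                      const std::string& s, std::size_t maxLen = 4) {
    std::string out;
    std::size_t pos = 0;
    while (pos < s.size()) {
        std::size_t take = 1;
        for (std::size_t len = std::min(maxLen, s.size() - pos); len > 1; --len)
            if (dict.count(s.substr(pos, len))) { take = len; break; }
        out += s.substr(pos, take) + "/  ";  // "/  " is TSE's SEPARATOR
        pos += take;
    }
    return out;
}
```

Forward maximum matching is greedy, which is why indexing and query-time segmentation must use the same dictionary and direction, as article [4] stresses.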
  

//Query.cpp
/**
 * Push the "/  "-separated keywords, in order, into a vector.
 *
 * @access  public
 * @param   vecTerm  the output vector of terms
 * @return  void
 */
void CQuery::ParseQuery(vector<string> &vecTerm)
{
    string::size_type idx;
    while ( (idx = m_sSegQuery.find("/  ")) != string::npos ) {
        vecTerm.push_back(m_sSegQuery.substr(0,idx));
        m_sSegQuery = m_sSegQuery.substr(idx+3);
    }
}
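As a standalone illustration of the same "/  " split (a hypothetical free function, not part of TSE):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Split a segmented query on the 3-character "/  " separator,
// mirroring what CQuery::ParseQuery does on m_sSegQuery.
std::vector<std::string> SplitSegQuery(std::string seg) {
    std::vector<std::string> terms;
    std::string::size_type idx;
    while ((idx = seg.find("/  ")) != std::string::npos) {
        terms.push_back(seg.substr(0, idx));  // text before "/  "
        seg = seg.substr(idx + 3);            // skip the separator
    }
    return terms;
}
```

Note that anything after the last separator is dropped, which is harmless here because SegmentSentenceMM always appends the separator after each term.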
/**
 * Relevance analysis: for each query term, count term frequency per
 * document, rank, and intersect with the running result set to build
 * setRelevantRst.   // this is the performance bottleneck
 *
 * @access  public
 * @param   vecTerm         the segmented user keywords
 * @param   mapBuckets      inverted-index map: term -> "docid docid ..."
 * @param   setRelevantRst  the relevant result set
 * @return  true on success
 */
bool CQuery::GetRelevantRst
(
    vector<string> &vecTerm,
    map<string, string> &mapBuckets,
    set<string> &setRelevantRst
) const
{
    set<string> setSRst;

    bool bFirst=true;
    vector<string>::iterator itTerm = vecTerm.begin();

    for ( ; itTerm != vecTerm.end(); ++itTerm )
    {
        setSRst.clear();
        copy(setRelevantRst.begin(), setRelevantRst.end(), inserter(setSRst,setSRst.begin()));

        map<string, int> mapRstDoc;
        string docid;
        int doccnt;

        map<string, string>::iterator itBuckets = mapBuckets.find(*itTerm);
        if (itBuckets != mapBuckets.end())
        {
            string strBucket = (*itBuckets).second;
            string::size_type idx;
            idx = strBucket.find_first_not_of(" ");
            strBucket = strBucket.substr(idx);

            while ( (idx = strBucket.find(" ")) != string::npos )
            {
                docid = strBucket.substr(0,idx);
                doccnt = 0;

                if (docid.empty()) {    // skip consecutive spaces, advancing past them
                    strBucket = strBucket.substr(idx+1);
                    continue;
                }

                map<string, int>::iterator it = mapRstDoc.find(docid);
                if ( it != mapRstDoc.end() )
                {
                    doccnt = (*it).second + 1;
                    mapRstDoc.erase(it);
                }
                mapRstDoc.insert( pair<string, int>(docid,doccnt) );

                strBucket = strBucket.substr(idx+1);
            }

            // remember the last one
            docid = strBucket;
            doccnt = 0;
            map<string, int>::iterator it = mapRstDoc.find(docid);
            if ( it != mapRstDoc.end() )
            {
                doccnt = (*it).second + 1;
                mapRstDoc.erase(it);
            }
            mapRstDoc.insert( pair<string, int>(docid,doccnt) );
        }

        // sort by term frequency (descending)
        multimap<int, string, greater<int> > newRstDoc;
        map<string, int>::iterator it0 = mapRstDoc.begin();
        for ( ; it0 != mapRstDoc.end(); ++it0 ){
            newRstDoc.insert( pair<int, string>((*it0).second,(*it0).first) );
        }

        multimap<int, string, greater<int> >::iterator itNewRstDoc = newRstDoc.begin();
        setRelevantRst.clear();
        for ( ; itNewRstDoc != newRstDoc.end(); ++itNewRstDoc ){
            string docid = (*itNewRstDoc).second;

            if (bFirst==true) {
                setRelevantRst.insert(docid);
                continue;
            }

            if ( setSRst.find(docid) != setSRst.end() ){
                setRelevantRst.insert(docid);
            }
        }

        //cout << "setRelevantRst.size(): " << setRelevantRst.size() << "<br>";
        bFirst = false;
    }
    return true;
}
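The core of GetRelevantRst is a per-term posting scan followed by an intersection with the running result set. Stripped of the frequency ranking, that intersection can be sketched as follows (a hypothetical helper, assuming the sun.iidx posting format term -> "docid docid ..."):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Keep only the docIDs that occur in the postings of every query term,
// mimicking the setSRst/setRelevantRst intersection in GetRelevantRst.
std::set<std::string> IntersectPostings(
    const std::vector<std::string>& terms,
    const std::map<std::string, std::string>& mapBuckets) {
    std::set<std::string> result;
    bool first = true;
    for (const std::string& term : terms) {
        std::set<std::string> docs;
        auto it = mapBuckets.find(term);
        if (it != mapBuckets.end()) {
            std::istringstream posting(it->second);
            std::string docid;
            while (posting >> docid) docs.insert(docid);  // parse the posting list
        }
        if (first) { result = docs; first = false; }
        else {
            std::set<std::string> kept;   // AND-semantics: intersect with result
            for (const std::string& d : docs)
                if (result.count(d)) kept.insert(d);
            result = kept;
        }
    }
    return result;
}
```

This is AND semantics: a document must contain every query term to survive, which is why adding terms to a TSE query narrows the result set.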
What remains is just display. Everything up to here was data processing to obtain the result set setRelevantRst; from this point on it works much like a PHP-style scripting language: format the result set and print it out.


//TSESearch.cpp

// display the results
    CDisplayRst iDisplayRst;
    iDisplayRst.ShowTop();

    float used_msec = (end_tv.tv_sec-begin_tv.tv_sec)*1000
        +((float)(end_tv.tv_usec-begin_tv.tv_usec))/(float)1000;

    iDisplayRst.ShowMiddle(iQuery.m_sQuery,used_msec,
            setRelevantRst.size(), iQuery.m_iStart);

    iDisplayRst.ShowBelow(vecTerm,setRelevantRst,vecDocIdx,iQuery.m_iStart);

 


