欧美成人免费在线,欧美成人乱码一区二区三区,国产日韩欧美精品一区

自頂向下學(xué)搜索引擎——北大天網(wǎng)搜索引擎TSE分析及完全注釋[6]倒排索引的建立的程序分析(2)

前面的DocIndex程序輸入一個(gè)Tianwang.raw.*****文件，會(huì)產(chǎn)生一下三個(gè)文件 Doc.idx, Url.idx, DocId2Url.idx，我們這里對(duì)DocSegment程序進(jìn)行分析。

這里輸入 Tianwang.raw.*****，Doc.idx，Url.idx.sort_uniq等三個(gè)文件，輸出一個(gè)Tianwang.raw.***.seg 分詞完畢的文件

int main(int argc, char* argv[])
{
    string strLine, strFileName=argv[1];
    CUrl iUrl;
    vector<CUrl> vecCUrl;
    CDocument iDocument;
    vector<CDocument> vecCDocument;
    unsigned int docId = 0;

    //ifstream ifs("Tianwang.raw.2559638448");
    ifstream ifs(strFileName.c_str()); //DocSegment Tianwang.raw.****
    if (!ifs)
    {
        cerr << "Cannot open tianwang.img.info for input\n";
        return -1;
    }

    ifstream ifsUrl("Url.idx.sort_uniq");   //排序并消重后的url字典
    if (!ifsUrl)
    {
        cerr << "Cannot open Url.idx.sort_uniq for input\n";
        return -1;
    }
    ifstream ifsDoc("Doc.idx"); //字典文件
    if (!ifsDoc)
    {
        cerr << "Cannot open Doc.idx for input\n";
        return -1;
    }

    while (getline(ifsUrl,strLine)) //偏離url字典存入一個(gè)向量?jī)?nèi)存中
    {
        char chksum[33];
        int docid;

        memset(chksum, 0, 33);
        sscanf( strLine.c_str(), "%s%d", chksum, &docid );
        iUrl.m_sChecksum = chksum;
        iUrl.m_nDocId = docid;
        vecCUrl.push_back(iUrl);
    }

    while (getline(ifsDoc,strLine))     //偏離字典文件將其放入一個(gè)向量?jī)?nèi)存中
    {
        int docid,pos,length;
        char chksum[33];

        memset(chksum, 0, 33);
        sscanf( strLine.c_str(), "%d%d%d%s", &docid, &pos, &length,chksum );
        iDocument.m_nDocId = docid;
        iDocument.m_nPos = pos;
        iDocument.m_nLength = length;
        iDocument.m_sChecksum = chksum;
        vecCDocument.push_back(iDocument);
    }

    strFileName += ".seg";
    ofstream fout(strFileName.c_str(), ios::in|ios::out|ios::trunc|ios::binary);    //設(shè)置完成分詞后的數(shù)據(jù)輸出文件
    for ( docId=0; docId<MAX_DOC_ID; docId++ )
    {

        // find document according to docId
        int length = vecCDocument[docId+1].m_nPos - vecCDocument[docId].m_nPos -1;
        char *pContent = new char[length+1];
        memset(pContent, 0, length+1);
        ifs.seekg(vecCDocument[docId].m_nPos);
        ifs.read(pContent, length);

char *s;
s = pContent;

        // skip Head
        int bytesRead = 0,newlines = 0;
        while (newlines != 2 && bytesRead != HEADER_BUF_SIZE-1)
        {
            if (*s == '\n')
                newlines++;
            else
                newlines = 0;
            s++;
            bytesRead++;
        }
        if (bytesRead == HEADER_BUF_SIZE-1) continue;

        // skip header
        bytesRead = 0,newlines = 0;
        while (newlines != 2 && bytesRead != HEADER_BUF_SIZE-1)
        {
            if (*s == '\n')
                newlines++;
            else
                newlines = 0;
            s++;
            bytesRead++;
        }
        if (bytesRead == HEADER_BUF_SIZE-1) continue;

        //iDocument.m_sBody = s;
        iDocument.RemoveTags(s);    //去除<>
        iDocument.m_sBodyNoTags = s;

delete[] pContent;
string strLine = iDocument.m_sBodyNoTags;

CStrFun::ReplaceStr(strLine, " ", " ");
CStrFun::EmptyStr(strLine); // set " \t\r\n" to " "

        // segment the document 具體分詞處理
        CHzSeg iHzSeg;
        strLine = iHzSeg.SegmentSentenceMM(iDict,strLine);
        fout << docId << endl << strLine;
        fout << endl;

    }

return(0);
}
這里只是浮光掠影式的過(guò)一遍大概的代碼，后面我會(huì)有專題詳細(xì)講解 parse html 和 segment docment 等技術(shù)

posted on 2009-12-10 23:02 學(xué)者站在巨人的肩膀上閱讀(1167) 評(píng)論(1) 編輯收藏引用所屬分類: 中文文本信息處理

評(píng)論

# re: 自頂向下學(xué)搜索引擎——北大天網(wǎng)搜索引擎TSE分析及完全注釋[6]倒排索引的建立的程序分析(2) 2009-12-12 13:17 凡客誠(chéng)品網(wǎng)

捱三頂四看來(lái)達(dá)到回復(fù) 更多評(píng)論

刷新評(píng)論列表

只有注冊(cè)用戶登錄后才能發(fā)表評(píng)論。
【推薦】100%開(kāi)源！大型工業(yè)跨平臺(tái)軟件C++源碼提供，建模，組態(tài)！

相關(guān)文章: 自頂向下學(xué)搜索引擎——北大天網(wǎng)搜索引擎TSE分析及完全注釋[6]倒排索引的建立的程序分析(4) 自頂向下學(xué)搜索引擎——北大天網(wǎng)搜索引擎TSE分析及完全注釋[6]倒排索引的建立的程序分析(3) 自頂向下學(xué)搜索引擎——北大天網(wǎng)搜索引擎TSE分析及完全注釋[6]倒排索引的建立的程序分析(2) 自頂向下學(xué)搜索引擎——北大天網(wǎng)搜索引擎TSE分析及完全注釋[6]倒排索引的建立的程序分析(1) 自頂向下學(xué)搜索引擎——北大天網(wǎng)搜索引擎TSE分析及完全注釋[5]倒排索引的建立及文件介紹自頂向下學(xué)搜索引擎——北大天網(wǎng)搜索引擎TSE分析及完全注釋[4]小結(jié) 自頂向下學(xué)搜索引擎——北大天網(wǎng)搜索引擎TSE分析及完全注釋[3]來(lái)到關(guān)鍵字分詞及相關(guān)性分析程序自頂向下學(xué)搜索引擎——北大天網(wǎng)搜索引擎TSE分析及完全注釋[2]路過(guò)查詢處理程序自頂向下學(xué)搜索引擎——北大天網(wǎng)搜索引擎TSE分析及完全注釋[1]尋找搜索引擎入口

網(wǎng)站導(dǎo)航: 博客園 IT新聞 BlogJava 博問(wèn) Chat2DB 管理

學(xué)著站在巨人的肩膀上

公告

常用鏈接

留言簿(1)

隨筆分類

隨筆檔案

搜索

最新評(píng)論

閱讀排行榜

評(píng)論排行榜

評(píng)論