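Sorry to keep everyone waiting. I was busy with exams for a while, and they are finally over. Enough chatter, let's get started!
TSE packs all the crawled web pages into one big file and then builds the index over the data in that single file as a whole. The process consists of several steps.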
1. The document index (Doc.idx) keeps information about each document.
It is a fixed-width ISAM (indexed sequential access method) index, ordered by docID.
The information stored in each entry includes a pointer into the repository,
a document length, a document checksum.
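//Doc.idx columns: docID, document length, checksum (hash). The second column only ever increases in this sample, so it reads like a byte offset into the raw file.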
0 0 bc9ce846d7987c4534f53d423380ba70
1 76760 4f47a3cad91f7d35f4bb6b2a638420e5
2 141624 d019433008538f65329ae8e39b86026c
3 142350 5705b8f58110f9ad61b1321c52605795
//Doc.idx end
The url index (url.idx) is used to convert URLs into docIDs.
//url.idx
5c36868a9c5117eadbda747cbdb0725f 0
3272e136dd90263ee306a835c6c70d77 1
6b8601bb3bb9ab80f868d549b5c5a5f3 2
3f9eba99fa788954b5ff7f35a5db6e1f 3
//url.idx end
It is a list of URL checksums with their corresponding docIDs and is sorted by
checksum. In order to find the docID of a particular URL, the URL's checksum
is computed and a binary search is performed on the checksums file to find its
docID.
./DocIndex
got Doc.idx, Url.idx, DocId2Url.idx //these are Doc.idx, Url.idx and DocId2Url.idx in the Data folder
//DocId2Url.idx
0 http://*.*.edu.cn/index.aspx
1 http://*.*.edu.cn/showcontent1.jsp?NewsID=118
2 http://*.*.edu.cn/0102.html
3 http://*.*.edu.cn/0103.html
//DocId2Url.idx end
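To make the Doc.idx layout a bit more concrete, here is a minimal Python sketch that parses the sample rows above. It is not TSE's own code. The names `parse_doc_idx` and `doc_lengths` are made up for illustration, and it assumes the second column is a byte offset into the raw file, so that a document's length is the gap to the next entry.

```python
# Sample Doc.idx rows copied from the listing above.
DOC_IDX = """\
0 0 bc9ce846d7987c4534f53d423380ba70
1 76760 4f47a3cad91f7d35f4bb6b2a638420e5
2 141624 d019433008538f65329ae8e39b86026c
3 142350 5705b8f58110f9ad61b1321c52605795
"""


def parse_doc_idx(text):
    """Return a list of (doc_id, offset, checksum) tuples, one per row."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        doc_id, offset, checksum = line.split()
        entries.append((int(doc_id), int(offset), checksum))
    return entries


def doc_lengths(entries):
    """Assumption: column 2 is a byte offset, so length = next offset - this offset.

    The last entry has no successor here, so it gets no length.
    """
    lengths = {}
    for (doc_id, offset, _), (_, next_offset, _) in zip(entries, entries[1:]):
        lengths[doc_id] = next_offset - offset
    return lengths


entries = parse_doc_idx(DOC_IDX)
lengths = doc_lengths(entries)
```

If the offset reading is right, `lengths` gives the size in bytes of every document except the last one.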
2. sort Url.idx|uniq > Url.idx.sort_uniq //this is Url.idx.sort_uniq in the Data folder
//Url.idx.sort_uniq
//sorted by hash value
000bfdfd8b2dedd926b58ba00d40986b 1111
000c7e34b653b5135a2361c6818e48dc 1831
0019d12f438eec910a06a606f570fde8 366
0033f7c005ec776f67f496cd8bc4ae0d 2103
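The URL-to-docID lookup described in step 1 (compute the URL's checksum, then binary-search the sorted checksum list) can be sketched in Python as follows. This is only an illustration: the 32-hex-digit checksums look like MD5, so the sketch assumes MD5 over the URL string, and the function names and `example.edu.cn` URLs are hypothetical.

```python
import bisect
import hashlib


def url_checksum(url):
    # Assumption: the checksum is the MD5 hex digest of the URL string.
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def build_url_index(urls):
    """Return (checksums, doc_ids) as parallel lists sorted by checksum."""
    pairs = sorted((url_checksum(u), doc_id) for doc_id, u in enumerate(urls))
    return [c for c, _ in pairs], [d for _, d in pairs]


def lookup_doc_id(checksums, doc_ids, url):
    """Binary-search the sorted checksums; return None if the URL is unknown."""
    target = url_checksum(url)
    i = bisect.bisect_left(checksums, target)
    if i < len(checksums) and checksums[i] == target:
        return doc_ids[i]
    return None


# Hypothetical URLs; the docID is simply the position in this list.
urls = [
    "http://example.edu.cn/index.aspx",
    "http://example.edu.cn/0102.html",
    "http://example.edu.cn/0103.html",
]
checksums, doc_ids = build_url_index(urls)
```

Calling `lookup_doc_id(checksums, doc_ids, urls[1])` returns the docID for that URL, and an unknown URL returns `None`. The sorted order is what makes the binary search valid, which is why step 2 sorts the file first.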
3. Segment document to terms, (with finding document according to the url)
./DocSegment Tianwang.raw.2559638448 //Tianwang.raw.2559638448 is the crawled file; each page includes its HTTP header
got Tianwang.raw.2559638448.seg
//Tianwang.raw.2559638448: the raw crawled pages. Inside the file, documents appear to be separated by the version line, </html>, and a newline
version: 1.0
url: http://***.105.138.175/Default2.asp?lang=gb
origin: http://***.105.138.175/
date: Fri, 23 May 2008 20:01:36 GMT
ip: 162.105.138.175
length: 38413
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Fri, 23 May 2008 11:17:49 GMT
Connection: keep-alive
Connection: Keep-Alive
Content-Length: 38088
Content-Type: text/html; Charset=gb2312
Expires: Fri, 23 May 2008 11:17:49 GMT
Set-Cookie: ASPSESSIONIDSSTRDCAB=IMEOMBIAIPDFCKPAEDJFHOIH; path=/
Cache-control: private
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"
<html>
<head>
<title>Apabi數字資源平臺</title>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">
<META NAME="DESCRIPTION" CONTENT="數字圖書館 方正數字圖書館 電子圖書 電子書 ebook e書 Apabi 數字資源平臺">
<link rel="stylesheet" type="text/css" href="css\common.css">
<style type="text/css">
<!--
.style4 {color: #666666}
-->
</style>
<script LANGUAGE="vbscript">
...
</script>
<Script Language="javascript">
...
</Script>
</head>
<body leftmargin="0" topmargin="0">
</body>
</html>
//Tianwang.raw.2559638448 end
//Tianwang.raw.2559638448.seg puts each page on a single line, as below (note that there are no newlines inside a page acting as separators)
1
...
...
...
2
...
...
...
//Tianwang.raw.2559638448.seg end
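As a rough illustration of how the raw file could be cut into per-page records, the Python sketch below starts a new record at every line beginning with `version:`, which matches the guess above about how pages are separated. It is not how DocSegment is actually written, the function names are invented, and a page body that happens to contain a line starting with `version:` would fool it; a sturdier reader would probably rely on the `length:` field instead.

```python
def split_records(raw_text):
    """Split a raw crawl dump into records, one per 'version:' line."""
    records, current = [], None
    for line in raw_text.splitlines():
        if line.startswith("version:"):
            if current is not None:
                records.append(current)
            current = []
        if current is not None:
            current.append(line)
    if current is not None:
        records.append(current)
    return records


def record_header(record):
    """Collect the 'key: value' lines that come before the HTTP status line."""
    header = {}
    for line in record:
        if line.startswith("HTTP/"):
            break
        key, sep, value = line.partition(": ")
        if sep:
            header[key] = value
    return header


# Tiny made-up dump with two pages, shaped like the listing above.
RAW = """\
version: 1.0
url: http://a.example/
length: 30
HTTP/1.1 200 OK
<html></html>
version: 1.0
url: http://b.example/
length: 30
HTTP/1.1 200 OK
<html></html>
"""

records = split_records(RAW)
urls = [record_header(r)["url"] for r in records]
```

Each record keeps its crawler header, HTTP header and body together, so the `url` field can later be used to find the document's docID.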
//What follows is not essential for Tiny search
4. Create forward index (docid-->termid) //build the forward index
./CrtForwardIdx Tianwang.raw.2559638448.seg > moon.fidx
//Tianwang.raw.2559638448.seg puts each page on a single line, as below
//DocID, then the segmented terms
1
三星/ s/ 手機/ 論壇/ ,/ 手機/ 鈴聲/ 下載/ ,/ 手機/ 圖片/ 下載/ ,/ 手機/
2
...
...
...
//Tianwang.raw.2559638448.seg end
//moon.fidx
//each term segmented out of a document, paired with that document's ID; columns: term DocID
都會 2391
使 2391
那些 2391
擁有 2391
它 2391
的 2391
人 2391
的 2391
視野 2391
變 2391
窄 2391
在 2180
研究生部 2180
主頁 2180
培養 2180
管理 2180
欄目 2180
下載 2180
) 2180
、 2180
關于 2180
做好 2180
年 2180
國家 2180
公派 2180
研究生 2180
項目 2180
//moon.fidx end
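The forward index step can be pictured with a short Python sketch: read a docID line, read the line of `term/ ` tokens after it, and emit one `term docID` pair per token. This is a simplification rather than CrtForwardIdx itself. It assumes every document occupies exactly two lines in the .seg file, and `forward_index` is a made-up name.

```python
def forward_index(seg_text):
    """Return (term, doc_id) pairs from .seg-style text.

    Assumption: the text alternates between a docID line and one line of
    terms written as 'term/ term/ ...'.
    """
    lines = [ln for ln in seg_text.splitlines() if ln.strip()]
    pairs = []
    for doc_id_line, terms_line in zip(lines[0::2], lines[1::2]):
        doc_id = int(doc_id_line)
        for token in terms_line.split("/"):
            term = token.strip()
            if term:  # skip the empty piece after the final '/'
                pairs.append((term, doc_id))
    return pairs


# Shortened sample in the same shape as the .seg excerpt above.
SEG = "1\n三星/ s/ 手機/ 論壇/\n2\n手機/ 鈴聲/\n"
pairs = forward_index(SEG)
fidx_lines = ["%s %d" % (term, doc_id) for term, doc_id in pairs]
```

`fidx_lines` has the same `term docID` shape as moon.fidx, with repeated terms kept once per occurrence.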
5. Sort the forward index (check the locale first)
# set | grep "LANG"
LANG=en; export LANG;
sort moon.fidx > moon.fidx.sort
6. Create inverted index (termid-->docid) //build the inverted index
./CrtInvertedIdx moon.fidx.sort > sun.iidx
//sun.iidx //the file shrinks to roughly half the size
花工 236
花海 2103
花卉 1018 1061 1061 1061 1730 1730 1730 1730 1730 1852 949 949
花蕾 447 447
花木 1061
花呢 1430
花期 447 447 447 447 447 525
花錢 174 236
花色 1730 1730
花色品種 1660
花生 450 526
花式 1428 1430 1430 1430
花紋 1430 1430
花序 447 447 447 447 447 450
花絮 136 137
花芽 450 450
//sun.iidx end
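Steps 5 and 6 together amount to "sort the `term docID` lines, then merge neighbouring lines that share a term". A minimal Python sketch of that idea follows. It is not the CrtInvertedIdx source, `invert` is a hypothetical name, and Python's `sorted` stands in for the `sort` command, so the exact ordering of terms may differ from what `sort` produces under a given locale.

```python
from itertools import groupby


def invert(sorted_lines):
    """Group sorted 'term docID' lines into (term, [docID, ...]) postings."""
    rows = (ln.split() for ln in sorted_lines if ln.strip())
    postings = []
    for term, group in groupby(rows, key=lambda row: row[0]):
        postings.append((term, [doc_id for _, doc_id in group]))
    return postings


# A few forward-index lines in arbitrary order (values taken from sun.iidx).
FIDX = ["花卉 949", "花蕾 447", "花卉 1018", "花卉 1852", "花蕾 447", "花卉 949"]

postings = invert(sorted(FIDX))
iidx_lines = [term + " " + " ".join(ids) for term, ids in postings]
```

Because the lines are sorted as text, docID `949` lands after `1852`, which is the same ordering visible in the sun.iidx listing above. Duplicate docIDs are kept, one per occurrence of the term.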
TSESearch CGI program for query
Snapshot CGI program for page snapshot
author: http://hi.baidu.com/jrckkyy
author: http://blog.csdn.net/jrckkyy