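Sorry to keep everyone waiting. I have been busy with exams for a while, and they are finally over. Enough small talk, let's get started!
TSE loads all the crawled web pages into one large file and then builds the index over the data in that file as a whole. The process consists of several steps.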
1. The document index (Doc.idx) keeps information about each document.
It is a fixed width ISAM (Indexed Sequential Access Method) index, ordered by docID.
The information stored in each entry includes a pointer into the repository,
a document length, a document checksum.
//Doc.idx columns: docID, document length, checksum (hash code)
0 0 bc9ce846d7987c4534f53d423380ba70
1 76760 4f47a3cad91f7d35f4bb6b2a638420e5
2 141624 d019433008538f65329ae8e39b86026c
3 142350 5705b8f58110f9ad61b1321c52605795
//Doc.idx end
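To make the layout of Doc.idx more concrete, here is a minimal Python sketch that parses the sample rows above. The second column grows monotonically in the sample (0, 76760, 141624, ...), so the sketch assumes it is a byte offset into the raw file and derives each document's size from the gap to the next entry. That reading is an assumption, and the names `DocEntry`, `parse_doc_idx` and `doc_lengths` are made up for illustration, not taken from TSE.

```python
from typing import Dict, List, NamedTuple, Optional


class DocEntry(NamedTuple):
    doc_id: int
    offset: int               # assumed: byte offset of the document in the raw file
    checksum: Optional[str]   # None if a line carries no checksum column


def parse_doc_idx(text: str) -> List[DocEntry]:
    """Parse whitespace-separated Doc.idx lines into entries."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        checksum = fields[2] if len(fields) > 2 else None
        entries.append(DocEntry(int(fields[0]), int(fields[1]), checksum))
    return entries


def doc_lengths(entries: List[DocEntry]) -> Dict[int, int]:
    """Size of each document, taken as the distance to the next offset."""
    return {a.doc_id: b.offset - a.offset for a, b in zip(entries, entries[1:])}


sample = """\
0 0 bc9ce846d7987c4534f53d423380ba70
1 76760 4f47a3cad91f7d35f4bb6b2a638420e5
2 141624 d019433008538f65329ae8e39b86026c
3 142350 5705b8f58110f9ad61b1321c52605795
"""
entries = parse_doc_idx(sample)
lengths = doc_lengths(entries)
print(lengths)  # {0: 76760, 1: 64864, 2: 726}
```

The last entry has no successor, so its size cannot be derived this way without knowing the total size of the raw file.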
The url index (url.idx) is used to convert URLs into docIDs.
//url.idx
5c36868a9c5117eadbda747cbdb0725f 0
3272e136dd90263ee306a835c6c70d77 1
6b8601bb3bb9ab80f868d549b5c5a5f3 2
3f9eba99fa788954b5ff7f35a5db6e1f 3
//url.idx end
It is a list of URL checksums with their corresponding docIDs and is sorted by
checksum. In order to find the docID of a particular URL, the URL's checksum
is computed and a binary search is performed on the checksums file to find its
docID.
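The lookup described above can be sketched in Python with the standard `bisect` module. This is only an illustration: `url_checksum` uses MD5 because the 32 hex digit values in the samples are consistent with it, but the actual hash function is an assumption, and `find_doc_id` is a hypothetical helper name.

```python
import bisect
import hashlib
from typing import List, Optional, Tuple


def url_checksum(url: str) -> str:
    # Assumption: the checksum is an MD5 hex digest of the URL string.
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def find_doc_id(sorted_index: List[Tuple[str, int]], checksum: str) -> Optional[int]:
    """Binary search a list of (checksum, docID) pairs sorted by checksum."""
    # (checksum,) sorts just before any (checksum, docID) pair with the same checksum.
    i = bisect.bisect_left(sorted_index, (checksum,))
    if i < len(sorted_index) and sorted_index[i][0] == checksum:
        return sorted_index[i][1]
    return None


# The four url.idx rows shown above; binary search needs them sorted by checksum.
rows = [
    ("5c36868a9c5117eadbda747cbdb0725f", 0),
    ("3272e136dd90263ee306a835c6c70d77", 1),
    ("6b8601bb3bb9ab80f868d549b5c5a5f3", 2),
    ("3f9eba99fa788954b5ff7f35a5db6e1f", 3),
]
index = sorted(rows)

print(find_doc_id(index, "6b8601bb3bb9ab80f868d549b5c5a5f3"))  # 2
print(find_doc_id(index, "0" * 32))                            # None
```

For a real URL the call would be `find_doc_id(index, url_checksum(url))`, provided the index was built with the same hash function.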
./DocIndex
got Doc.idx, Url.idx, DocId2Url.idx //Doc.idx, Url.idx and DocId2Url.idx in the Data folder
//DocId2Url.idx
0 http://*.*.edu.cn/index.aspx
1 http://*.*.edu.cn/showcontent1.jsp?NewsID=118
2 http://*.*.edu.cn/0102.html
3 http://*.*.edu.cn/0103.html
//DocId2Url.idx end
2. sort Url.idx|uniq > Url.idx.sort_uniq //Url.idx.sort_uniq in the Data folder
//Url.idx.sort_uniq
//sorted by hash value
000bfdfd8b2dedd926b58ba00d40986b 1111
000c7e34b653b5135a2361c6818e48dc 1831
0019d12f438eec910a06a606f570fde8 366
0033f7c005ec776f67f496cd8bc4ae0d 2103
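For readers who prefer to see step 2 outside the shell, a rough Python equivalent of `sort | uniq` follows. It is a sketch, not part of TSE; `sort_uniq` is a hypothetical name. Note that only exact duplicate lines are dropped, so the same checksum paired with two different docIDs would survive as two lines, just as with `uniq`.

```python
from typing import Iterable, List


def sort_uniq(lines: Iterable[str]) -> List[str]:
    """Return the distinct lines in ascending order, like `sort | uniq`."""
    return sorted(set(lines))


unsorted_rows = [
    "0033f7c005ec776f67f496cd8bc4ae0d 2103",
    "000bfdfd8b2dedd926b58ba00d40986b 1111",
    "0019d12f438eec910a06a606f570fde8 366",
    "000bfdfd8b2dedd926b58ba00d40986b 1111",  # exact duplicate, will be dropped
    "000c7e34b653b5135a2361c6818e48dc 1831",
]
result = sort_uniq(unsorted_rows)
for row in result:
    print(row)
```

Python compares the strings by code point, which for lowercase hex digests gives the same order as a byte-wise `sort`; the shell command's order can depend on the locale.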
3. Segment document to terms, (with finding document according to the url)
./DocSegment Tianwang.raw.2559638448 //Tianwang.raw.2559638448 is the crawled file; each page includes its HTTP headers
got Tianwang.raw.2559638448.seg
//Tianwang.raw.2559638448: the raw crawled pages. Inside the file, consecutive documents are presumably separated by the version line, </html> and a newline
version: 1.0
url: http://***.105.138.175/Default2.asp?lang=gb
origin: http://***.105.138.175/
date: Fri, 23 May 2008 20:01:36 GMT
ip: 162.105.138.175
length: 38413
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Fri, 23 May 2008 11:17:49 GMT
Connection: keep-alive
Connection: Keep-Alive
Content-Length: 38088
Content-Type: text/html; Charset=gb2312
Expires: Fri, 23 May 2008 11:17:49 GMT
Set-Cookie: ASPSESSIONIDSSTRDCAB=IMEOMBIAIPDFCKPAEDJFHOIH; path=/
Cache-control: private
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"
<html>
<head>
<title>Apabi數字資源平臺</title>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">
<META NAME="DESCRIPTION" CONTENT="數字圖書館 方正數字圖書館 電子圖書 電子書 ebook e書 Apabi 數字資源平臺">
<link rel="stylesheet" type="text/css" href="css\common.css">
<style type="text/css">
<!--
.style4 {color: #666666}
-->
</style>
<script LANGUAGE="vbscript">
...
</script>
<Script Language="javascript">
...
</Script>
</head>
<body leftmargin="0" topmargin="0">
</body>
</html>
//Tianwang.raw.2559638448 end
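A record like the one above could be read back with a small parser. The sketch below rests on two assumptions that the listing does not confirm: that a blank line ends the `version:` ... `length:` header block, and that `length` is the number of bytes that follow (HTTP headers plus page body). The function name `iter_records` and the example.com record are hypothetical.

```python
from typing import Dict, Iterator, Tuple


def iter_records(data: bytes) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (header fields, content bytes) for each record in a raw file."""
    pos, n = 0, len(data)
    while pos < n:
        # Skip the newline(s) that separate one record from the next.
        while pos < n and data[pos:pos + 1] in (b"\n", b"\r"):
            pos += 1
        if pos >= n:
            break
        header: Dict[str, str] = {}
        while pos < n:
            end = data.find(b"\n", pos)
            if end == -1:
                end = n
            line = data[pos:end].strip()
            pos = end + 1
            if not line:
                break  # assumed: a blank line closes the header block
            key, _, value = line.partition(b":")
            header[key.strip().decode("ascii", "replace")] = (
                value.strip().decode("ascii", "replace")
            )
        length = int(header.get("length", "0"))
        content = data[pos:pos + length]  # assumed: `length` counts these bytes
        pos += length
        yield header, content


# A tiny made-up record in the same shape as the listing above.
content = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
record = (
    b"version: 1.0\nurl: http://example.com/\nlength: %d\n\n" % len(content)
    + content
    + b"\n"
)
records = list(iter_records(record + record))
print(len(records), records[0][0]["url"])
```

Reading by `length` rather than scanning for `</html>` avoids breaking on pages that contain the closing tag more than once or not at all.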
//Tianwang.raw.2559638448.seg: each page is put on a single line, as below (note that there are no newlines inside a page to act as separators)
1
...
...
...
2
...
...
...
//Tianwang.raw.2559638448.seg end
//The following steps are not essential for Tiny search
4. Create forward index (docid-->termid) //build the forward index
./CrtForwardIdx Tianwang.raw.2559638448.seg > moon.fidx
//Tianwang.raw.2559638448.seg: each page is put on a single line, as below
//DocID, followed by the segmented terms
1
三星/ s/ 手機/ 論壇/ ,/ 手機/ 鈴聲/ 下載/ ,/ 手機/ 圖片/ 下載/ ,/ 手機/
2
...
...
...
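To illustrate what a docid-->termid forward index might look like, here is a rough Python sketch that reads text in the .seg shape (a DocID line followed by one line of "/"-separated terms) and assigns term IDs in order of first appearance. It is not how CrtForwardIdx works internally; the ID scheme, the function name `build_forward_index` and the second sample document are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def build_forward_index(seg_text: str) -> Tuple[Dict[int, List[int]], Dict[str, int]]:
    """Map each docID to the list of term IDs in the order the terms occur."""
    term_ids: Dict[str, int] = {}
    forward: Dict[int, List[int]] = {}
    lines = [ln for ln in seg_text.splitlines() if ln.strip()]
    # Lines alternate: docID, then that document's segmented terms.
    for doc_line, terms_line in zip(lines[0::2], lines[1::2]):
        doc_id = int(doc_line)
        terms = [t.strip() for t in terms_line.split("/") if t.strip()]
        # A new term gets the next free ID; a known term reuses its ID.
        forward[doc_id] = [term_ids.setdefault(t, len(term_ids)) for t in terms]
    return forward, term_ids


seg_sample = (
    "1\n"
    "三星/ s/ 手機/ 論壇/ ,/ 手機/ 鈴聲/ 下載/\n"
    "2\n"
    "手機/ 圖片/ 下載/\n"  # made-up second document
)
forward, term_ids = build_forward_index(seg_sample)
print(forward[1])  # [0, 1, 2, 3, 4, 2, 5, 6]
print(forward[2])  # [2, 7, 6]
```

Inverting this mapping (termid-->docid) is what turns the forward index into the structure a search actually queries.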