Javen-Studio Coffee House

http://javenstudio.org - C++ Java distributed search engine
Naven's Research Laboratory - Thinking of Life, Imagination of Future


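4 How an Index Is Created

To index data with Lucene, you must first convert it into a stream of plain-text tokens and use it to build Document objects, whose Fields hold the text. Once the Document objects are ready, you call the IndexWriter's addDocument(Document) method to hand them to Lucene and write them to the index. As part of this step, Lucene first analyzes the data to make it more suitable for indexing. See "Lucene in Action" for details.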

    // Store the index on disk
    Directory directory = FSDirectory.getDirectory("/tmp/testindex");
    // Use standard analyzer
    Analyzer analyzer = new StandardAnalyzer();
    // Create IndexWriter object
    IndexWriter iwriter = new IndexWriter(directory, analyzer, true);
    iwriter.setMaxFieldLength(25000);
    // make a new, empty document
    Document doc = new Document();
    File f = new File("/tmp/test.txt");
    // Add the path of the file as a field named "path".  Use a field that is
    // indexed (i.e. searchable), but don't tokenize the field into words.
    doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
    String text = "This is the text to be indexed.";
    doc.add(new Field("fieldname", text, Field.Store.YES, Field.Index.TOKENIZED));
    // Add the last modified date of the file as a field named "modified".  Use
    // a field that is indexed (i.e. searchable), but don't tokenize the field
    // into words.
    doc.add(new Field("modified",
        DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
        Field.Store.YES, Field.Index.UN_TOKENIZED));
    // Add the contents of the file to a field named "contents".  Specify a Reader,
    // so that the text of the file is tokenized and indexed, but not stored.
    // Note that FileReader expects the file to be in the system's default encoding.
    // If that's not the case searching for special characters will fail.
    doc.add(new Field("contents", new FileReader(f)));
    iwriter.addDocument(doc);
    iwriter.optimize();
    iwriter.close();

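The sections below describe how each class works in detail.

4.1 The Index Creation Class IndexWriter

An IndexWriter object creates and maintains an index.

The create argument of its constructor determines whether a new index is created or an existing index is opened. Note that you can open an index with create=true even while other readers are using it. The old readers continue to search the "point in time" snapshot they opened and do not see the newly created index until they re-open. There are also constructors without a create argument; they create the index if none exists at the provided path, and otherwise open the existing one.

In either case, documents are added with addDocument(), removed with deleteDocuments(), and updated with updateDocument() (which simply performs a delete followed by an add). When you have finished adding, deleting, and updating documents, call close().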
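The "delete followed by an add" behavior of an update can be pictured with a small toy model. This is only a sketch in plain Java: the class UpdateAsDeleteThenAdd, its id/body document layout, and its method signatures are assumptions made for illustration and are not Lucene's classes or API.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: a "document" is a String[]{id, body}. Not Lucene code.
class UpdateAsDeleteThenAdd {
    private final List<String[]> docs = new ArrayList<>();

    void addDocument(String id, String body) {
        docs.add(new String[] {id, body});
    }

    // Removes every document whose id matches and returns how many were removed.
    int deleteDocuments(String id) {
        int before = docs.size();
        docs.removeIf(d -> d[0].equals(id));
        return before - docs.size();
    }

    // An update is nothing more than a delete followed by an add.
    void updateDocument(String id, String body) {
        deleteDocuments(id);
        addDocument(id, body);
    }

    int count() {
        return docs.size();
    }

    // Returns the body stored for the given id, or null if there is none.
    String bodyOf(String id) {
        for (String[] d : docs) {
            if (d[0].equals(id)) {
                return d[1];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        UpdateAsDeleteThenAdd index = new UpdateAsDeleteThenAdd();
        index.addDocument("a", "old text");
        index.updateDocument("a", "new text");
        System.out.println(index.count() + " document(s), body = " + index.bodyOf("a"));
    }
}
```

Because the update replaces rather than edits, the document count stays the same while the stored body changes.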

These changes are buffered in memory and periodically flushed to the Directory (during the method calls above). A flush is triggered when, since the last flush, there are enough buffered deletes (see setMaxBufferedDeleteTerms(int)) or enough added documents (see setMaxBufferedDocs(int)), whichever is sooner. When a flush occurs, both the pending deletes and the pending added documents are flushed to the index. A flush may also trigger one or more segment merges.
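The "whichever is sooner" rule can be sketched as two counters that share one flush. The class below is a toy model under stated assumptions: FlushTriggerSketch and its fields are hypothetical names, it only counts operations, and it does not reproduce how Lucene actually buffers or writes segments.

```java
// Toy model of the flush trigger: two thresholds, one shared flush. Not Lucene code.
class FlushTriggerSketch {
    private final int maxBufferedDocs;
    private final int maxBufferedDeleteTerms;
    private int bufferedDocs = 0;
    private int bufferedDeletes = 0;
    private int flushCount = 0;

    FlushTriggerSketch(int maxBufferedDocs, int maxBufferedDeleteTerms) {
        this.maxBufferedDocs = maxBufferedDocs;
        this.maxBufferedDeleteTerms = maxBufferedDeleteTerms;
    }

    void addDocument() {
        bufferedDocs++;
        maybeFlush();
    }

    void deleteTerm() {
        bufferedDeletes++;
        maybeFlush();
    }

    // Whichever threshold is reached first triggers the flush, and the flush
    // writes out BOTH the pending adds and the pending deletes.
    private void maybeFlush() {
        if (bufferedDocs >= maxBufferedDocs || bufferedDeletes >= maxBufferedDeleteTerms) {
            bufferedDocs = 0;
            bufferedDeletes = 0;
            flushCount++;
        }
    }

    int flushCount() {
        return flushCount;
    }

    int pendingDocs() {
        return bufferedDocs;
    }

    public static void main(String[] args) {
        FlushTriggerSketch writer = new FlushTriggerSketch(3, 2);
        writer.addDocument();
        writer.addDocument();
        writer.deleteTerm();
        System.out.println("flushes before second delete: " + writer.flushCount());
        writer.deleteTerm();
        System.out.println("flushes after second delete: " + writer.flushCount());
    }
}
```

In this model the second delete reaches the delete threshold before the add threshold is hit, so the two buffered documents are flushed together with the deletes.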

The optional autoCommit argument to the constructors controls the visibility of changes to IndexReader instances reading the same index. When it is false, changes are not visible until close() is called. Note that changes are still flushed to the Directory as new files, but they are not committed (no new segments_N file referencing the new files is written) until close() is called. If something goes terribly wrong before close() is called (for example, the JVM crashes), the index reflects none of the changes made and remains in its starting state. You can also call abort(), which closes the writer without committing any changes and removes any index files that were flushed but are now unreferenced. This mode is useful for preventing readers from refreshing at a bad time (for example, after you have finished all your deletes but before you have finished your adds). It can also be used to implement simple single-writer transactional semantics ("all or none").
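The "all or none" semantics can be illustrated with a writer that works on a private copy and publishes it in a single step. This is a minimal sketch, assuming an in-memory list stands in for the index; AllOrNothingWriterSketch and its methods are hypothetical names, and publishing a list copy only stands in for writing a new segments_N file.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of single-writer "all or none" semantics. Not Lucene code.
class AllOrNothingWriterSketch {
    private List<String> committed;  // what readers can see
    private List<String> pending;    // the writer's private working copy

    AllOrNothingWriterSketch(List<String> initial) {
        this.committed = new ArrayList<>(initial);
        this.pending = new ArrayList<>(initial);
    }

    void add(String doc) {
        pending.add(doc);
    }

    void delete(String doc) {
        pending.remove(doc);
    }

    // Publishes every pending change in one step, like committing on close.
    void close() {
        committed = new ArrayList<>(pending);
    }

    // Discards every pending change; the committed state is left untouched.
    void abort() {
        pending = new ArrayList<>(committed);
    }

    // A reader only ever sees the last committed state.
    List<String> readerView() {
        return new ArrayList<>(committed);
    }

    public static void main(String[] args) {
        List<String> start = new ArrayList<>();
        start.add("a");
        start.add("b");
        AllOrNothingWriterSketch writer = new AllOrNothingWriterSketch(start);
        writer.delete("a");
        writer.add("c");
        System.out.println("before close: " + writer.readerView());
        writer.close();
        System.out.println("after close:  " + writer.readerView());
    }
}
```

A reader in this model never observes the intermediate state in which the delete has been applied but the add has not.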

When autoCommit is true, every flush is also a commit (IndexReader instances see each flush as a commit). This is the default, chosen to match the behavior before version 2.2. When running in this mode, be careful not to refresh your readers while an optimize or segment merges are taking place, because this can tie up substantial disk space.

When an index will have no more documents added for a while and optimal search performance is desired, call optimize() before closing the index.

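Opening an IndexWriter creates a lock file for the Directory in use. Trying to open another IndexWriter on the same Directory leads to a LockObtainFailedException. The same exception is thrown if an IndexReader on the same Directory is used to delete documents from the index.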
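The locking idea can be sketched with an atomically created marker file. The following is an illustration under stated assumptions, not Lucene's lock implementation: WriteLockSketch, the nested LockFailedException (a stand-in for LockObtainFailedException), and the lock file name used here are all choices made for this example.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Toy model of a per-directory writer lock built on a marker file. Not Lucene code.
class WriteLockSketch {
    // Hypothetical stand-in for Lucene's LockObtainFailedException.
    static class LockFailedException extends IOException {
        LockFailedException(String message) {
            super(message);
        }
    }

    private final File lockFile;

    // Acquires the lock or fails if another writer already holds it.
    WriteLockSketch(File indexDir) throws IOException {
        lockFile = new File(indexDir, "write.lock");  // illustrative file name
        // createNewFile() atomically creates the file and returns false if it already exists.
        if (!lockFile.createNewFile()) {
            throw new LockFailedException("Lock already held: " + lockFile);
        }
    }

    // Releases the lock so that another writer can open the directory.
    void close() {
        lockFile.delete();
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("lock-sketch").toFile();
        WriteLockSketch first = new WriteLockSketch(dir);
        try {
            new WriteLockSketch(dir);
        } catch (LockFailedException e) {
            System.out.println("second writer rejected: " + e.getMessage());
        }
        first.close();
    }
}
```

A marker file left behind by a crashed process would block later writers in this sketch, so a real implementation needs some way to clear stale locks.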

Expert: IndexWriter allows an optional IndexDeletionPolicy implementation to be specified. You can use it to control when prior commits are deleted from the index. The default policy is KeepOnlyLastCommitDeletionPolicy, which removes all prior commits as soon as a new commit is done (this matches the behavior before version 2.2). Creating your own policy lets you explicitly keep previous "point in time" commits alive in the index for some time, so that readers can refresh to the new commit without having the old commit deleted out from under them. This is necessary on filesystems such as NFS that do not support "delete on last close" semantics, which Lucene's "point in time" search normally relies on.

Annotated Lucene. Author: naven. Date: 2007-5-1

posted on 2007-05-10 00:07 Javen-Studio
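The difference between keeping only the last commit and keeping several can be shown with a toy model in which commits are plain generation numbers. This is a simplified sketch: the Policy interface, keepLast(n), and run(...) are hypothetical names for this example and do not mirror the signatures of Lucene's IndexDeletionPolicy.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of commit deletion policies. Commits are generation numbers. Not Lucene code.
class DeletionPolicySketch {
    // Called after each commit with the live commits, oldest first.
    // The policy removes the commits it wants deleted from the list.
    interface Policy {
        void onCommit(List<Integer> commits);
    }

    // Mirrors the default behavior described above: drop all prior commits.
    static final Policy KEEP_ONLY_LAST = commits -> {
        while (commits.size() > 1) {
            commits.remove(0);
        }
    };

    // Custom policy: keep the newest n commits so slow readers can finish.
    static Policy keepLast(int n) {
        return commits -> {
            while (commits.size() > n) {
                commits.remove(0);
            }
        };
    }

    // Simulates numCommits successive commits and returns the survivors.
    static List<Integer> run(Policy policy, int numCommits) {
        List<Integer> live = new ArrayList<>();
        for (int generation = 1; generation <= numCommits; generation++) {
            live.add(generation);
            policy.onCommit(live);
        }
        return live;
    }

    public static void main(String[] args) {
        System.out.println("keep only last: " + run(KEEP_ONLY_LAST, 5));
        System.out.println("keep last 3:    " + run(keepLast(3), 5));
    }
}
```

Keeping a few older generations trades disk space for the guarantee that a reader still working on an earlier commit keeps its files.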
