Term frequencies in a Lucene index

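When building the index with Lucene, specify Field.TermVector.YES on the field. (The term vector is what per-document lookups such as GetTermFreqVector read; the TermEnum/TermDocs enumeration further down reads the inverted index itself.)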

Document document = new Document();
// Tokenized and indexed, not stored, with a term vector
document.Add(new Field("word", pageText, Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));
// Stored as-is, not indexed
document.Add(new Field("concept", pageTitle, Field.Store.YES, Field.Index.NO));
indexWriter.AddDocument(document);

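Once the index is built, the following code walks through every term in the index and gets each term's frequency in every document that contains it: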

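IndexReader reader = IndexReader.Open(indexDir);
TermEnum termEnum = reader.Terms();
while (termEnum.Next())
{
    // The term, and the number of documents that contain it
    Console.WriteLine(termEnum.Term());
    Console.WriteLine("DocFreq=" + termEnum.DocFreq());

    // Each document containing the term, with the term's frequency in it
    TermDocs termDocs = reader.TermDocs(termEnum.Term());
    while (termDocs.Next())
    {
        Console.WriteLine("DocNo:   " + termDocs.Doc() + "  Freq:   " + termDocs.Freq());
    }
    termDocs.Close();
}
termEnum.Close();
reader.Close();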

Reposted from http://lqgao.spaces.live.com/blog/cns!3BB36966ED98D3E5!408.entry?_c11_blogpart_blogpart=blogview&_c=blogpart#permalink

Lucene source code analysis (1): how to read Lucene index data

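I finally understand how to read a Lucene index :-). This post shows how to get information out of an index with IndexReader. Why read the index at all? Because I need to implement the following:
(1) count the document frequency (DF) of a term across the whole collection;
(2) count the total number of occurrences of a term in the whole collection (term frequency in whole collection);
(3) count the frequency of a term within a given document (term frequency, TF);
(4) list the positions at which a term occurs in a given document;
(5) get the number of documents in the whole collection.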

Why are these numbers needed? They are the essential, already processed raw material for text retrieval (TR). Before indexing there is only raw text (raw data); after the indexer has processed it, the text has been broken into individual terms (or tokens), and the indexer has recorded where each one occurs and how many times. With that data and a suitable retrieval model, you can implement what a search engine does: text retrieval.
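To make the five statistics concrete, here is a minimal sketch of a toy in-memory inverted index that computes each of them. It is not Lucene's API or storage format: the ToyIndex class, its method names, and the naive lowercase-and-split tokenizer are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy in-memory inverted index. Illustrative only: this is not Lucene's
// API or file format, and the tokenizer is a deliberately naive assumption.
public sealed class ToyIndex
{
    // term -> (document number -> positions of the term in that document)
    private readonly Dictionary<string, SortedDictionary<int, List<int>>> postings =
        new Dictionary<string, SortedDictionary<int, List<int>>>();

    // (5) number of documents in the collection
    public int NumDocs { get; private set; }

    public void AddDocument(string text)
    {
        int docNo = NumDocs++;
        // Hypothetical tokenizer: lowercase, then split on spaces.
        string[] tokens = text.ToLowerInvariant()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        for (int pos = 0; pos < tokens.Length; pos++)
        {
            if (!postings.TryGetValue(tokens[pos], out var docs))
                postings[tokens[pos]] = docs = new SortedDictionary<int, List<int>>();
            if (!docs.TryGetValue(docNo, out var positions))
                docs[docNo] = positions = new List<int>();
            positions.Add(pos);
        }
    }

    // (1) document frequency: how many documents contain the term
    public int DocFreq(string term) =>
        postings.TryGetValue(term, out var docs) ? docs.Count : 0;

    // (2) occurrences of the term in the whole collection
    public int CollectionFreq(string term) =>
        postings.TryGetValue(term, out var docs) ? docs.Values.Sum(p => p.Count) : 0;

    // (3) term frequency within one document
    public int TermFreq(string term, int docNo) => Positions(term, docNo).Count;

    // (4) positions of the term within one document (0-based in this toy)
    public IList<int> Positions(string term, int docNo) =>
        postings.TryGetValue(term, out var docs) && docs.TryGetValue(docNo, out var p)
            ? p : new List<int>();

    public static void Main()
    {
        var index = new ToyIndex();
        index.AddDocument("the cat sat");
        index.AddDocument("the dog sat on the mat");

        Console.WriteLine("NumDocs        = " + index.NumDocs);
        Console.WriteLine("DocFreq(the)   = " + index.DocFreq("the"));
        Console.WriteLine("CollFreq(the)  = " + index.CollectionFreq("the"));
        Console.WriteLine("TF(the, doc 1) = " + index.TermFreq("the", 1));
        Console.WriteLine("Positions      = " + string.Join(",", index.Positions("the", 1)));
    }
}
```

Everything here lives in one dictionary in memory, which is fine for two sentences and is exactly the approach that stops scaling once the collection grows.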

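You may well say this looks easy: it is just counting. True, it is counting, or statistics if you prefer. But a process that looks simple stops being simple once memory is limited. Suppose each document has 100 terms and each term needs 10 bytes of information; storing 1,000,000 documents then takes 10 x 100 x 10^6 = 10^9 bytes, which is roughly 2^30 bytes, or about 1 GB. 1 GB of RAM is nothing special these days, but you cannot keep 1 GB of data resident in memory all the time. So put it on disk, and when the data is needed, move the 1 GB from disk into memory. Fine, go make a coffee and carry on when you get back. And that is for 1,000,000 documents; with more, an approach that has no auxiliary data structures becomes very inefficient.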

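A Lucene index splits the data into segments and reads them only when needed; the rest of the time the data stays quietly on disk. Lucene is an excellent indexing engine with efficient indexing and retrieval mechanisms. The purpose of this post is to show how to use the Lucene API to read the information you need from an index that has already been built. How to use Lucene itself I will cover in later posts.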

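Let's go step by step. Assume the index has already been built and is stored in the index directory. To read it, we first need an index reader (an instance of Lucene's IndexReader). So we write the following (the code is C#; this post uses DotLucene).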
IndexReader reader;
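Here is the problem: IndexReader is an abstract class and cannot be instantiated. So let's try a derived class. IndexReader has two children, SegmentReader and MultiReader. Which one? Both need a pile of arguments (it took me a good deal of effort to work out what they are for; more on that later), so using Lucene's index data does not look that easy. After stepping through the code and reading the documentation, I finally found the key to IndexReader: it has a factory-style static interface, IndexReader.Open, defined as follows: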
public static IndexReader Open(System.String path)
public static IndexReader Open(System.IO.FileInfo path)
public static IndexReader Open(Directory directory)
private static IndexReader Open(Directory directory, bool closeDirectory)
Three of these are public and can be called. Opening an index is as simple as this (index is the path to the index directory):
IndexReader reader = IndexReader.Open(index);

Internally, opening the index goes through a process like this:
#0001 SegmentInfos infos = new SegmentInfos();
#0002 Directory directory = FSDirectory.GetDirectory(index, false);
#0003 infos.Read(directory);
#0004 bool closeDirectory = false;
#0005 if (infos.Count == 1)
#0006 {
#0007 // index is optimized
#0008 return new SegmentReader(infos, infos.Info(0), closeDirectory);
#0009 }
#0010 else
#0011 {
#0012 IndexReader[] readers = new IndexReader[infos.Count];
#0013 for (int i = 0; i < infos.Count; i++)
#0014 readers[i] = new SegmentReader(infos.Info(i));
#0015 return new MultiReader(directory, infos, closeDirectory, readers);
#0016 }

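First the segment information of the index is read (#0001~#0003), then the number of segments is checked: if there is only one, the index has probably been optimized and that single segment is read directly (#0008); otherwise each segment is read in turn (#0013~#0014) and the readers are combined into a MultiReader (#0015). That is the whole process of opening the index files.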
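What "combining" several segment readers involves can be sketched as follows. This is a simplified illustration of the idea, not the DotLucene source: ToySegment and ToyMultiReader are hypothetical types. The assumption is that each segment numbers its documents from 0, so the combined reader shifts every segment's document numbers by the total size of the segments before it and sums the per-segment document frequencies.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for one segment of an index (not a Lucene type).
public sealed class ToySegment
{
    public int MaxDoc; // number of document slots in this segment
    // term -> (segment-local document number -> frequency)
    public Dictionary<string, SortedDictionary<int, int>> Postings =
        new Dictionary<string, SortedDictionary<int, int>>();
}

public static class ToyMultiReader
{
    // starts[i] is the global number of the first document in segment i;
    // the last entry is the total number of document slots.
    public static int[] ComputeStarts(IList<ToySegment> segments)
    {
        var starts = new int[segments.Count + 1];
        for (int i = 0; i < segments.Count; i++)
            starts[i + 1] = starts[i] + segments[i].MaxDoc;
        return starts;
    }

    // Document frequency over the whole index: the sum over all segments.
    public static int DocFreq(IList<ToySegment> segments, string term)
    {
        int total = 0;
        foreach (var segment in segments)
            if (segment.Postings.TryGetValue(term, out var docs))
                total += docs.Count;
        return total;
    }

    // (global document number, frequency) pairs for a term, in document order.
    public static List<KeyValuePair<int, int>> TermDocs(IList<ToySegment> segments, string term)
    {
        int[] starts = ComputeStarts(segments);
        var result = new List<KeyValuePair<int, int>>();
        for (int i = 0; i < segments.Count; i++)
        {
            if (!segments[i].Postings.TryGetValue(term, out var docs)) continue;
            foreach (var entry in docs)
                result.Add(new KeyValuePair<int, int>(starts[i] + entry.Key, entry.Value));
        }
        return result;
    }

    public static void Main()
    {
        var a = new ToySegment { MaxDoc = 2 };
        a.Postings["lucene"] = new SortedDictionary<int, int> { { 0, 1 }, { 1, 3 } };
        var b = new ToySegment { MaxDoc = 3 };
        b.Postings["lucene"] = new SortedDictionary<int, int> { { 2, 2 } };
        var segments = new List<ToySegment> { a, b };

        Console.WriteLine("DocFreq=" + DocFreq(segments, "lucene"));
        foreach (var pair in TermDocs(segments, "lucene"))
            Console.WriteLine("DocNo:" + pair.Key + ", Freq:" + pair.Value);
    }
}
```

In this sketch, local document 2 of the second segment becomes global document 4, because the first segment occupies document numbers 0 and 1.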

Next, let's see how to read the information. The following code illustrates it.
#0001 public static void PrintIndex(IndexReader reader)
#0002 {
#0003      // Show how many documents there are
#0004      System.Console.WriteLine(reader + "\tNumDocs = " + reader.NumDocs());
#0005      for (int i = 0; i < reader.NumDocs(); i++)
#0006      {
#0007          System.Console.WriteLine(reader.Document(i));
#0008      }
#0009
#0010      // Enumerate the terms to get <document, term freq, position*> information
#0011      TermEnum termEnum = reader.Terms();
#0012      while (termEnum.Next())
#0013      {
#0014          System.Console.Write(termEnum.Term());
#0015          System.Console.WriteLine("\tDocFreq=" + termEnum.DocFreq());
#0016
#0017          TermPositions termPositions = reader.TermPositions(termEnum.Term());
#0018          int i = 0;
#0019          int j = 0;
#0020          while (termPositions.Next())
#0021          {
#0022              System.Console.WriteLine((i++) + "->" + " DocNo:" + termPositions.Doc() + ", Freq:" + termPositions.Freq());
#0023              for (j = 0; j < termPositions.Freq(); j++)
#0024                  System.Console.Write("[" + termPositions.NextPosition() + "]");
#0025              System.Console.WriteLine();
#0026          }
#0027
#0028          // Get the <term freq, document> information directly
#0029          TermDocs termDocs = reader.TermDocs(termEnum.Term());
#0030          while (termDocs.Next())
#0031          {
#0032              System.Console.WriteLine((i++) + "->" + " DocNo:" + termDocs.Doc() + ", Freq:" + termDocs.Freq());
#0033          }
#0034      }
#0035
#0036      // FieldInfos fieldInfos = reader.fieldInfos;
#0037      // FieldInfo pathFieldInfo = fieldInfos.FieldInfo("path");
#0038
#0039      // Show the term frequency vector
#0040      for (int i = 0; i < reader.NumDocs(); i++)
#0041      {
#0042          // The terms produced by tokenizing "contents" are stored in the TermFreqVector
#0043          TermFreqVector termFreqVector = reader.GetTermFreqVector(i, "contents");
#0044
#0045          if (termFreqVector == null)
#0046          {
#0047              System.Console.WriteLine("termFreqVector is null.");
#0048              continue;
#0049          }
#0050
#0051          String fieldName = termFreqVector.GetField();
#0052          String[] terms = termFreqVector.GetTerms();
#0053          int[] frequences = termFreqVector.GetTermFrequencies();
#0054
#0055          System.Console.Write("FieldName:" + fieldName);
#0056          for (int j = 0; j < terms.Length; j++)
#0057          {
#0058              System.Console.Write("[" + terms[j] + ":" + frequences[j] + "]");
#0059          }
#0060          System.Console.WriteLine();
#0061      }
#0062      System.Console.WriteLine();
#0063 }

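#0004 prints the number of documents. (The loop at #0005 assumes no documents have been deleted; once there are deletions, document numbers run up to MaxDoc(), not NumDocs().)
#0012~#0034 enumerate every term in the collection.
Within that, #0017~#0026 enumerate every position of each term in each document that contains it (the token's ordinal within the field; the post says counting starts at 1, but Lucene normally numbers the first token 0, so check against your own output); #0029~#0033 list which documents each term appears in and its frequency in each (i.e. DF and TF).
#0036~#0037 are valid only when reader is a SegmentReader.
#0040~#0061 quickly read the terms that occur in a given document and their frequencies. This part requires storeTermVector to be set to true when the index is built, for example
doc.Add(Field.Text("contents", reader, true));
The third argument is the one that enables it. It defaults to false.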
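As a rough picture of what a stored term vector holds, the sketch below builds the two parallel arrays that GetTerms() and GetTermFrequencies() return for one document. ToyTermVector is a hypothetical helper, not part of Lucene, and the sorted term order is an assumption about how the library lays out the vector.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper showing the shape of a per-document term vector:
// a sorted array of distinct terms plus a parallel array of frequencies.
public static class ToyTermVector
{
    public static void Build(IEnumerable<string> tokens, out string[] terms, out int[] frequencies)
    {
        // Count occurrences, keeping terms in ordinal (byte-wise) order.
        var counts = new SortedDictionary<string, int>(StringComparer.Ordinal);
        foreach (string token in tokens)
        {
            counts.TryGetValue(token, out int seen); // seen stays 0 for a new term
            counts[token] = seen + 1;
        }

        terms = new string[counts.Count];
        frequencies = new int[counts.Count];
        int i = 0;
        foreach (var entry in counts)
        {
            terms[i] = entry.Key;
            frequencies[i] = entry.Value;
            i++;
        }
    }

    public static void Main()
    {
        string[] tokens = "to be or not to be".Split(' ');
        Build(tokens, out string[] terms, out int[] frequencies);
        for (int j = 0; j < terms.Length; j++)
            Console.Write("[" + terms[j] + ":" + frequencies[j] + "]");
        Console.WriteLine(); // prints [be:2][not:1][or:1][to:2]
    }
}
```

Because the vector is stored per document, reading it avoids walking every term in the index just to find the ones a single document contains.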

With this data I can compute the statistics I need. In later posts I will cover how to build an index and how to apply Lucene.

Posted on 2010-03-08 15:31 by baby-fly. Category: Information Retrieval / Data Mining
