99国产精品久久久久老师,久久av一区二区三区,国内伊人久久久久久网站视频

尋找熱點查詢

搜索引擎會通過日志文件把用戶每次檢索使用的所有檢索串都記錄下來，每個查詢串的長度為1-255字節(jié)。
假設(shè)目前有一千萬個記錄（這些查詢串的重復(fù)度比較高，雖然總數(shù)是1千萬，但如果除去重復(fù)后，不超過3百萬個。一個查詢串的重復(fù)度越高，說明查詢它的用戶越多，也就是越熱門。請你統(tǒng)計最熱門的10個查詢串，要求使用的內(nèi)存不能超過1G。

先統(tǒng)計所有查詢的次數(shù)，所有查詢有 300 萬個，255 * 300 * 10000B = 765 MB，可以存入內(nèi)存。這里使用 STL 中的 map。所得時間復(fù)雜度為 O(NlogM)，N 為所有的查詢，包括重復(fù)的，M 為不重復(fù)的查詢。更好的方法是用散列。

然后遍歷 map，維護一個大小為 10 的集合，在遍歷 map 時，比較當(dāng)前查詢的出現(xiàn)次數(shù)與集合中出現(xiàn)次數(shù)最小的查詢的出現(xiàn)此時比較，如果大于，將當(dāng)前查詢替換到集合中。
這里的集合還是用的 map，時間復(fù)雜度為 O（MlogK），這里 K = 10。

總的時間復(fù)雜度為 O(NlogM) + O(MlogK)

也可以將這個過程合二為一。即每次在統(tǒng)計的過程中，查詢大小為 K 的集合。如果符合條件，則將當(dāng)前查詢替換到集合中。但是還要考慮實時更新集合中的元素。
這種方法的時間復(fù)雜度為 O(N(logM + logK + K))。

由于第二種方法還得考慮實時更新。效率遠沒有第一種方案高。

實現(xiàn)：

1 #include <iostream>
2 #include <fstream>
3 #include <map>
4 #include <string>
5 using namespace std;
6
7 void statistics(map<string, int>& data, const string& query)
8 {
9     ++data[query];
10 }
11
12 void findTopK(multimap<int, string>& topK, int k, const map<string, int>& data)
13 {
14     topK.clear();
15     for (map<string, int>::const_iterator cit = data.begin(); cit != data.end(); ++cit)
16     {
17         if (topK.size() < k)
18         {
19             topK.insert(make_pair(cit->second, cit->first));
20         }
21         else
22         {
23             if (cit->second > topK.begin()->first)
24             {
25                 topK.erase(topK.begin());
26                 topK.insert(make_pair(cit->second, cit->first));
27             }
28         }
29     }
30 }
31
32 int main()
33 {
34     ifstream fin("queryfile.txt");
35     map<string, int> data;
36     multimap<int, string> top10;
37     string query;
38     while (getline(fin, query))
39     {
40         statistics(data, query);
41     }
42
43     //for (map<string, int>::const_iterator cit = data.begin(); cit != data.end(); ++cit)
44     //{
45     //    cout << cit->first << '\t' << cit->second << endl;
46     //}
47
48     //cout << endl;
49     findTopK(top10, 10, data);
50
51     for (multimap<int, string>::const_reverse_iterator cit = top10.rbegin(); cit != top10.rend(); ++cit)
52     {
53         cout << cit->second << '\t' << cit->first << endl;
54     }
55
56     return 0;
57 }

http://blog.donews.com/jiji262/2011/03/baidu_top_k_interview/
http://blog.redfox66.com/post/2010/09/23/top-k-algoriyhm-analysis.aspx
http://blog.csdn.net/jasonblog/archive/2010/08/19/5825026.aspx

posted on 2011-04-30 18:06 unixfy 閱讀(210) 評論(0) 編輯收藏引用

常用鏈接

留言簿(3)

隨筆檔案

搜索

最新評論

閱讀排行榜

評論排行榜

只有注冊用戶登錄后才能發(fā)表評論。
【推薦】100%開源！大型工業(yè)跨平臺軟件C++源碼提供，建模，組態(tài)！



網(wǎng)站導(dǎo)航: 博客園 IT新聞 BlogJava 博問 Chat2DB 管理