Posted on 2012-11-06 23:52 by whspecial, filed under: hadoop
When exactly does the combiner in map/reduce run? Most material online says the combiner runs on the map side: after the map emits its output, the data passes through the combiner before being handed to the reducer. But a problem I ran into at work led me to discover that the combiner can in fact also run on the reducer side, and that it can run multiple times.
Searching online, I found that this is a feature introduced in hadoop-0.18:
Changed policy for running combiner. The combiner may be run multiple times as the map's output is sorted and merged. Additionally, it may be run on the reduce side as data is merged. The old semantics are available in Hadoop 0.18 if the user calls: job.setCombineOnlyOnce(true).
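A practical consequence of this policy is that a combiner may run zero, one, or many times, so its operation must be commutative and associative, and its output key/value types must match the map output types. Below is a minimal sketch of a sum-style reducer reused as a combiner with the old mapred API of that era; the class name SumReducer is illustrative, not from any particular codebase:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Summing is commutative and associative, so it is safe to apply
// any number of times, on either the map or the reduce side.
public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Wiring it into a job:

JobConf conf = new JobConf(SumReducer.class);
conf.setCombinerClass(SumReducer.class); // may run 0..n times
conf.setReducerClass(SumReducer.class);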
In fact, the combiner can run on both the mapper side and the reducer side. Reading the code, a combine is triggered at the following points:
1) In the spill phase on the mapper side: when the records buffered in memory exceed the spill threshold, a combine is run as they are spilled (see the buffer-tuning sketch after this list).
if (spstart != spindex) {
  // there are records for the current partition:
  // run the combiner over them while writing the spill file
  …
  combineAndSpill(kvIter, combineInputCounter);
}
2) In the merge phase on the mapper side: when the number of spill files being merged is >= 3, a combine is run (the threshold is configurable; see the sketch after this list).
if (null == combinerClass || numSpills < minSpillsForCombine) {
  // no combiner configured, or too few spill files to be worth it:
  // merge the spills directly to the final output
  Merger.writeFile(kvIter, writer, reporter);
} else {
  // enough spill files: run the combiner over the merged stream
  combineCollector.setWriter(writer);
  combineAndSpill(kvIter, combineInputCounter);
}
3) On the reducer side, a combine is always run as the fetched map outputs are merged.
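Regarding point 1: the spill threshold is governed by the in-memory sort buffer. A hedged tuning sketch for Hadoop of this vintage; the values are illustrative, not recommendations:

JobConf conf = new JobConf();
// 100 MB sort buffer; a spill (and with it a possible combine)
// starts once the buffer is 80% full
conf.setInt("io.sort.mb", 100);
conf.setFloat("io.sort.spill.percent", 0.80f);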
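Regarding point 2: the minSpillsForCombine in the snippet above is read from the min.num.spills.for.combine key, whose default of 3 matches the ">= 3" behavior described here; raising it makes the merge-time combine kick in less often:

JobConf conf = new JobConf();
// run the combiner during the final merge only if at least
// this many spill files were produced (default: 3)
conf.setInt("min.num.spills.for.combine", 3);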
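Finally, a plain-Java illustration (no Hadoop required) of why the framework is free to run the combiner at all of these points: for a commutative, associative operation such as sum, combining partial results any number of times leaves the final answer unchanged. All names here are hypothetical:

public class CombineDemo {
  // stand-in for a sum combiner/reducer
  static int combine(int[] values) {
    int sum = 0;
    for (int v : values) {
      sum += v;
    }
    return sum;
  }

  public static void main(String[] args) {
    // reducer sees the raw map output, no combine
    int direct = combine(new int[] {1, 2, 3, 4, 5});
    // same records, combined once per spill and again at merge time
    int spill1 = combine(new int[] {1, 2});
    int spill2 = combine(new int[] {3, 4, 5});
    int merged = combine(new int[] {spill1, spill2});
    System.out.println(direct == merged); // prints true
  }
}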