Posted on 2012-11-06 23:52 by whspecial, filed under: hadoop
When exactly does the combiner in map/reduce run? Most material online says the combiner runs on the map side: after the map emits its output, the data passes through the combiner before being handed to the reducer. But a problem I ran into at work led me to discover that the combiner can in fact also run on the reducer side, and that it can run multiple times.
Searching online, I found that this is a feature introduced in hadoop-0.18:
Changed policy for running combiner. The combiner may be run multiple times as the map's output is sorted and merged. Additionally, it may be run on the reduce side as data is merged. The old semantics are available in Hadoop 0.18 if the user calls: job.setCombineOnlyOnce(true).
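A practical consequence of this policy is that a combiner may run zero, one, or many times, so its operation must be commutative and associative, and its output key/value types must match the map output types. Below is a minimal sketch of a sum-style reducer reused as a combiner with the old mapred API of that era; the class name SumReducer is illustrative, not from any particular codebase:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Summing is commutative and associative, so it is safe to apply
// any number of times, on either the map or the reduce side.
public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Wiring it into a job:

JobConf conf = new JobConf(SumReducer.class);
conf.setCombinerClass(SumReducer.class); // may run 0..n times
conf.setReducerClass(SumReducer.class);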
In fact, the combiner can run on both the mapper side and the reducer side. Reading the code, a combine is triggered at the following points:
1) In the spill phase on the mapper side: when the records buffered in memory exceed the spill threshold, a combine is run as they are spilled (see the buffer-tuning sketch after this list).
if (spstart != spindex) {
  // there are records for the current partition:
  // run the combiner over them while writing the spill file
  …
  combineAndSpill(kvIter, combineInputCounter);
}
2) In the merge phase on the mapper side: when the number of spill files being merged is >= 3, a combine is run (the threshold is configurable; see the sketch after this list).
if (null == combinerClass || numSpills < minSpillsForCombine) {
  // no combiner configured, or too few spill files to be worth it:
  // merge the spills directly to the final output
  Merger.writeFile(kvIter, writer, reporter);
} else {
  // enough spill files: run the combiner over the merged stream
  combineCollector.setWriter(writer);
  combineAndSpill(kvIter, combineInputCounter);
}
3) On the reducer side, a combine is always run as the fetched map outputs are merged.
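Regarding point 1: the spill threshold is governed by the in-memory sort buffer. A hedged tuning sketch for Hadoop of this vintage; the values are illustrative, not recommendations:

JobConf conf = new JobConf();
// 100 MB sort buffer; a spill (and with it a possible combine)
// starts once the buffer is 80% full
conf.setInt("io.sort.mb", 100);
conf.setFloat("io.sort.spill.percent", 0.80f);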
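Regarding point 2: the minSpillsForCombine in the snippet above is read from the min.num.spills.for.combine key, whose default of 3 matches the ">= 3" behavior described here; raising it makes the merge-time combine kick in less often:

JobConf conf = new JobConf();
// run the combiner during the final merge only if at least
// this many spill files were produced (default: 3)
conf.setInt("min.num.spills.for.combine", 3);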
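Finally, a plain-Java illustration (no Hadoop required) of why the framework is free to run the combiner at all of these points: for a commutative, associative operation such as sum, combining partial results any number of times leaves the final answer unchanged. All names here are hypothetical:

public class CombineDemo {
  // stand-in for a sum combiner/reducer
  static int combine(int[] values) {
    int sum = 0;
    for (int v : values) {
      sum += v;
    }
    return sum;
  }

  public static void main(String[] args) {
    // reducer sees the raw map output, no combine
    int direct = combine(new int[] {1, 2, 3, 4, 5});
    // same records, combined once per spill and again at merge time
    int spill1 = combine(new int[] {1, 2});
    int spill2 = combine(new int[] {3, 4, 5});
    int merged = combine(new int[] {spill1, spill2});
    System.out.println(direct == merged); // prints true
  }
}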