C++ Coder

High-performance computing (HPC) architecture, implementation, compiler directive optimization, algorithm optimization; LLVM, Clang, OpenCL, CUDA, OpenACC, C++ AMP, OpenMP, MPI

http://devgurus.amd.com/thread/159558

Understanding performance counters

This question is marked as assumed answered.

chersanya (Newbie)
chersanya 2012-8-5 12:03 PM

I have a kernel in which each work-item processes tens of elements (first some computation, then a global memory read and write). The profiler has helped me a lot in optimizing it, but I want to go further. The profiling data now looks like this (almost the same for all kernel runs):

 

  Counter             Value
  GlobalWorkSize      126720
  WorkGroupSize       256
  VGPRs               13
  FCStacks            2
  ALUInsts            6787.63
  FetchInsts          52.47
  WriteInsts          26.47
  ALUBusy             98.31
  ALUFetchRatio       129.37
  ALUPacking          72.16
  FetchSize           411503.38
  CacheHit            0.09
  FetchUnitBusy       89.44
  FetchUnitStalled    93.86
  WriteUnitStalled    0.00
  FastPath            9.19
  CompletePath        28.91
  PathUtilization     30.52

 

LDS is not used at all, and kernel occupancy is 50% (limited by VGPRs to 16 waves).
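As a quick sanity check on the figures above, a few derived quantities can be computed directly from the counters. This is only a sketch: the wavefront width of 64 work-items is an assumption (the thread does not state it), and treating ALUFetchRatio as ALUInsts divided by FetchInsts is an interpretation of the counter names that happens to match the reported value to within rounding.

```python
# Values copied from the profiler output quoted in the post.
global_work_size = 126720
work_group_size = 256
alu_insts = 6787.63
fetch_insts = 52.47
reported_alu_fetch_ratio = 129.37

# Assumption: 64 work-items per wavefront (not stated in the thread).
WAVEFRONT_SIZE = 64

# Number of work-groups and wavefronts launched by one kernel run.
work_groups = global_work_size // work_group_size
wavefronts = global_work_size // WAVEFRONT_SIZE

# Assumed reading of ALUFetchRatio: ALU instructions per fetch instruction.
alu_fetch_ratio = alu_insts / fetch_insts

print(f"work-groups:     {work_groups}")
print(f"wavefronts:      {wavefronts}")
print(f"ALU/fetch ratio: {alu_fetch_ratio:.2f} (reported: {reported_alu_fetch_ratio})")
```

The computed ratio agrees with the reported ALUFetchRatio to within rounding of the averaged counters, which suggests the kernel issues far more ALU work than fetches per work-item.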

I can't understand several points here:

  • What exactly does the FCStacks value mean? I have only one loop (a for loop) and no if statements, yet its value is 2.
  • How can ALUBusy be 98% with a low ALUPacking (72%)? Judging from ALUPacking, not all VLIW slots are filled, so ALUBusy shouldn't be so close to 100%.
  • FetchUnitStalled > FetchUnitBusy, although the documentation says that FetchUnitBusy includes the stalled time. How is that possible?

 

And how can I improve ALUPacking up to 100%?

  • Understanding performance counters
    nou (Expert)
    nou 2012-8-6 6:18 AM (in reply to chersanya)

    ALU is counted as busy even with pure scalar code. An ALUPacking of 72% is quite high. You can try putting the work-item code into a static loop, for(int i=0;i<2;i++). The compiler will unroll it, and you get a quick and dirty way to vectorize the code.
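The reasoning behind this suggestion can be illustrated with a toy model: if one work-item's code is a chain of dependent operations, only one slot per VLIW bundle can be filled, whereas processing two elements gives two independent chains that can share a bundle. The sketch below is not how the AMD Shader Compiler actually schedules code; `pack_bundles` and `packing` are hypothetical helpers, and the 4-operation chain is made up for illustration.

```python
def pack_bundles(deps, width=5):
    """Greedily pack operations into bundles of at most `width` slots.

    deps[i] lists the operations that operation i depends on. An operation
    may only go into a bundle after all of its dependencies have been
    placed in earlier bundles.
    """
    placed = set()
    remaining = list(range(len(deps)))
    bundles = []
    while remaining:
        # Operations whose dependencies were all issued in earlier bundles.
        ready = [op for op in remaining if all(d in placed for d in deps[op])]
        bundle = ready[:width]
        if not bundle:
            raise ValueError("cyclic dependencies")
        placed.update(bundle)
        bundles.append(bundle)
        remaining = [op for op in remaining if op not in placed]
    return bundles


def packing(bundles, width=5):
    """Fraction of slots filled, averaged over all bundles."""
    return sum(len(b) for b in bundles) / (width * len(bundles))


# One element: a chain of 4 operations, each depending on the previous one.
one_element = [[], [0], [1], [2]]
# Two elements (the unrolled 2-iteration loop): two independent chains.
two_elements = [[], [0], [1], [2], [], [4], [5], [6]]

packing_one = packing(pack_bundles(one_element))
packing_two = packing(pack_bundles(two_elements))
print(f"one element per work-item:  {packing_one:.0%}")
print(f"two elements per work-item: {packing_two:.0%}")
```

In this model the number of bundles stays the same while the work per bundle doubles. On real hardware the extra registers needed for the second chain can reduce occupancy, so the gain is not guaranteed.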

    • Re: Understanding performance counters
      chersanya (Newbie)
      chersanya 2012-8-6 7:57 AM (in reply to nou)

      I already have a static loop in the kernel, but unrolling it (either with #pragma or manually) leads to poorer performance, probably because of register pressure, but I'm not sure.

      And what exactly is the ALUPacking value? I think it's the average of (used VLIW instruction slots)/(available VLIW instruction slots, i.e. 5 in the case of VLIW5), but that is just speculation.

      • Re: Understanding performance counters
        nou (Expert)
        nou 2012-8-6 2:43 PM (in reply to chersanya)

        From the profiler manual: ALUBusy - The percentage of GPUTime ALU instructions are processed.

        And on ALUPacking: The ALU vector packing efficiency (in percentage). This value indicates how well the Shader Compiler packs the scalar or vector ALU in your kernel to the 5-way VLIW instructions. Value range: 0% (bad) to 100% (optimal). Values below 70 percent indicate that ALU dependency chains may be preventing full utilization of the processor.

        • Re: Understanding performance counters
          chersanya (Newbie)
          chersanya 2012-8-6 3:22 PM (in reply to nou)

          Yes, I've read this.

          But "how well (the Shader Compiler packs...)" is not a very concrete description of a counter. That's why I'm asking whether my guess is right (ALUPacking = [used VLIW instruction slots]/[available VLIW instruction slots, i.e. 5 in the case of VLIW5]).
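To make the guess concrete, here is what it would imply for the reported counter value. This is a sketch of the poster's unconfirmed interpretation only; the thread does not establish that the profiler computes ALUPacking this way.

```python
VLIW_WIDTH = 5        # slots per bundle on VLIW5, as in the post
alu_packing = 72.16   # percent, from the profiler output in the thread

# Under the guessed definition, ALUPacking = used slots / available slots,
# so the average number of filled slots per VLIW bundle is:
avg_slots_used = alu_packing / 100 * VLIW_WIDTH
avg_slots_idle = VLIW_WIDTH - avg_slots_used

print(f"average slots used per bundle: {avg_slots_used:.2f} of {VLIW_WIDTH}")
print(f"average slots idle per bundle: {avg_slots_idle:.2f}")
```

Under this reading, a high ALUBusy and a 72% ALUPacking would not contradict each other: the ALU could be issuing bundles almost all of the time while each bundle is only partly filled.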

        • Re: Understanding performance counters
          binying (Novice)
          binying 2012-8-7 8:34 AM (in reply to nou)
          FCStacks: The size of the flow control stack used by the kernel (valid only for AMD Radeon HD 6000 series GPU devices or older). This number may affect the number of wavefronts in-flight. To reduce the stack size, reduce the amount of flow control nesting in the kernel.

          This is from the profiler manual. Note that it is valid only for the HD 6000 series or older.

          • Re: Understanding performance counters
            binying (Novice)
            binying 2012-8-7 8:58 AM (in reply to binying)

            ALUBusy measures the percentage of GPU time during which ALU instructions are processed. There are many possible reasons for a low ALUBusy number, for example too few active wavefronts to hide instruction latency, or heavy memory access.

            Code divergence can be measured with the VALUUtilization counter if you have SI hardware.

             

            http://devgurus.amd.com/thread/158655

             

            ALUPacking measures the ALU vector packing efficiency (in percentage). This value indicates how well the Shader Compiler packs the scalar or vector ALU in your kernel to the 5-way VLIW instructions. Value range: 0% (bad) to 100% (optimal). Values below 70 percent indicate that ALU dependency chains may be preventing full utilization of the processor.

             

            So I think it makes sense that an ALUBusy of 98% and a low ALUPacking (72%) occur at the same time.

            • Re: Understanding performance counters
              binying (Novice)
              binying 2012-8-7 9:00 AM (in reply to binying)
              FetchUnitBusy: The percentage of GPUTime the Fetch unit is active. The result includes the stall time (FetchUnitStalled). This is measured with all extra fetches and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
              FetchUnitStalled: The percentage of GPUTime the Fetch unit is stalled. Try reducing the number of fetches or reducing the amount per fetch if possible. Value range: 0% (optimal) to 100% (bad).
              • Re: Understanding performance counters
                binying (Novice)
                binying 2012-8-7 9:07 AM (in reply to binying)

                How to improve ALUPacking?

                Use int4/float4/etc. for memory accesses and element-wise operations, as this is the type of workload a graphics card is optimized for, in terms of both memory access and ALU load.

                 

                Try to avoid bank conflicts across the device...

                • Re: Understanding performance counters
                  chersanya (Newbie)
                  chersanya 2012-8-7 10:53 AM (in reply to binying)

                  To be honest, your answers didn't give any new information: everything is already in the manual. But the manual is not 100% clear. For example, look at the FetchUnitBusy and FetchUnitStalled counters: according to the documentation, Stalled time can't be greater than Busy time, but here it is.

                  • Re: Understanding performance counters
                    binying (Novice)
                    binying 2012-8-7 10:54 PM (in reply to chersanya)

                    ALUBusy is the percentage of time the ALU is actually executing.

                    ALUPacking is the percentage of code that has been successfully packed into VLIW instructions. The compiler takes scalar code and generates vector or VLIW code for the hardware. Sometimes code cannot be vectorized, which results in lower performance, i.e. a lower ALUPacking.

                    To improve ALUPacking, you could also reduce conditional statements, for loops and the like, so that the compiler can vectorize more easily. But I think it is very difficult to reach 100% ALUPacking.

                    As for the FetchUnitBusy and FetchUnitStalled counters, I would speculate that their relationship is something like that of ALUBusy and ALUPacking.

Posted on 2013-01-09 13:36 by jackdong. Category: OpenCL