青青草原综合久久大伊人导航_色综合久久天天综合_日日噜噜夜夜狠狠久久丁香五月_热久久这里只有精品

C++ Coder

HCP高性能計算架構,實現,編譯器指令優化,算法優化, LLVM CLANG OpenCL CUDA OpenACC C++AMP OpenMP MPI

C++博客 首頁 新隨筆 聯系 聚合 管理
  98 Posts :: 0 Stories :: 0 Comments :: 0 Trackbacks
http://devgurus.amd.com/thread/158866

VLIW on Cypress and vector addition

此問題被 假設已回答。

cadorinoNewbie
cadorino 2012-7-2 上午10:31

Hi to everybody.
I'm thinking about VLIW utilization on a 5870 HD.

Suppose you have the following kernel:

 

__kernel void saxpy(const __global float * x, __global float * y, const float a)

{

          uint guid = get_global_id(0);

                    y[guid] = a * x[guid] + y[guid];

}

 

Each work item operates on a single vector element and no vectorization (float4).
Is the compiler still capable of packing instructions to exploit the 4 ALUs of each processing element?

Is there any tool to determine the way instructions are packed into VLIW?

 

Thank you very much!

  • 633 瀏覽次數
  • Re: VLIW on Cypress and vector addition
    MicahVillmowModerator
    MicahVillmow 2012-7-2 上午10:44 (回復 cadorino)

    The compiler only packs the VLIW with computation within a work-item, not across work-items. Multicore ware has some software that will allow you to pack across work-items, which can be found here:http://www.multicorewareinc.com/index.php?option=com_content&view=article&id=71&Itemid=106

    Micah Villmow
    Advanced Micro Devices Inc.
    --------------------------------
    The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
    • Re: VLIW on Cypress and vector addition
      cadorinoNewbie
      cadorino 2012-7-2 上午11:18 (回復 MicahVillmow)

      Ok, so what if there are not enough alu instructions per work-item to be packed to exploit the 4 ALUs?

      • Re: VLIW on Cypress and vector addition
        MicahVillmowModerator
        MicahVillmow 2012-7-2 上午11:27 (回復 cadorino)

        Then your program is not utilizing the entire machine and cannot reach peak efficiency.

        Micah Villmow
        Advanced Micro Devices Inc.
        --------------------------------
        The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
        • Re: VLIW on Cypress and vector addition
          cadorinoNewbie
          cadorino 2012-7-2 下午1:00 (回復 MicahVillmow)

          Ok, so it is possible for a PE to get a VLIW containing less than 4 operations, and in this case some ALUs are idle. Right?

        • Re: VLIW on Cypress and vector addition
          cadorinoNewbie
          cadorino 2012-7-2 下午1:00 (回復 MicahVillmow)

          Ok, so it is possible for a PE to get a VLIW containing less than 4 operations, and in this case some ALUs are idle. Right?

          • Re: VLIW on Cypress and vector addition
            realhetNovice
            realhet 2012-7-3 上午3:05 (回復 cadorino)

            Peak efficiency is only when all your 4 or 5 ALUs have work to do in every cycles.

            If you disassemble your code (use kernel analyzer), you can easily spot the idle ALUs.

             

            for example:

            10  x: SUB_INT     T0.x,  PV9.z,  KC0[2].x

                       w: SETGE_UINT  ____,  PV9.z,  KC0[2].x    ;y,z,t sleeps

            11  z: AND_INT     ____,  T0.y,  PV10.w              ;x,y,w,t sleeps

            12  y: CNDE_INT    T1.y,  PV11.z,  T0.z,  T0.x     ;x,z,w,t sleeps

            13  x: ADD_INT     ____,  KC0[2].x,  PV12.y        ;y,z,w,t sleeps

            this is so unoptimal that is does only 5 operations under 4 clocks, the possible maximum would be 4*5=20 operations (on vliw5)

             

            826  x: XOR_INT     T1.x,  R28.w,  T0.w

                         y: SETGT_UINT  ____,  T1.x,  T0.w

                         z: XOR_INT     T3.z,  KC0[13].z,  R20.y      VEC_021

                         w: SETGT_UINT  T2.w,  T2.w,  R15.y      VEC_201

                         t: SETGT_UINT  T0.w,  R9.x,  T1.y

            827  x: ADD_INT     ____,  T0.z,  T2.z

                         y: ADD_INT     T2.y,  T0.y,  T2.x      VEC_021

                         z: ADD_INT     T0.z,  T1.z,  T2.y      VEC_210

                         w: ADD_INT     ____,  PV826.y,  T3.y      VEC_021

                         t: SETGT_UINT  ____,  T3.w,  R5.x

            this one is maximum utilization. 10 operationc in 2 clocks.

             

            There are tricks to improve local paralellism in code (other than simply vectorizing everything) like breaking dependency chains:

            for example  a+b+c+d  -> (a+b)+(c+d)

            • Re: VLIW on Cypress and vector addition
              nouExpert
              nou 2012-7-3 上午4:06 (回復 realhet)

              or you can proceed 2-4 work items in one work item. like this. but it has disadvantage as it increase register usage.

              __kernel void saxpy(const __global float * x, __global float * y, const float a)

              {

                        uint guid = get_global_id(0);

                             for(int i=0;i<4;i++)

                                  y[guid*4+1] = a * x[guid*4+i] + y[guid*4+i];

              }

  • Re: VLIW on Cypress and vector addition
    kbraffordNovice
    kbrafford 2012-7-3 下午12:24 (回復 cadorino)

    If he rewrote the kernel to use float4 types, would that also cause MMX/SSE registers to be used with compiled for a CPU device?

     

    __kernel void saxpy(const __global float4 * x, __global float4 * y, const float a)

    {

              uint guid = get_global_id(0);

     

              y[guid] = (float4)a * x[guid] + y[guid];

    }

    • Re: VLIW on Cypress and vector addition
      nouExpert
      nou 2012-7-3 下午1:58 (回復 kbrafford)

      yes you need use explicit vector types to utilize SSE instructions on CPU with AMD OpenCL

      • Re: VLIW on Cypress and vector addition
        cadorinoNewbie
        cadorino 2012-7-3 下午7:48 (回復 nou)

        Thank you very much you all, your answers are really useful!
        I already wrote a float4 version of the kernel, but I posted the float version since I'm trying to get into details of low-level aspects of VLIW compilation and execution.

         

        For what regards vectorized types, I agree that on CPU this enables SSE execution. Bu what for GPUs? For example, is the sum of two float4 elements spread across the 4 ALUs or the sum of each component is executed sequentially on a single ALU?

        • Re: VLIW on Cypress and vector addition
          nouExpert
          nou 2012-7-3 下午11:34 (回復 cadorino)

          I am not entirely sure what you mean with single ALU. but one workitem is executed only on one 5D/4D unit. and you will get packed instruction if is there enough independent instructions.

        • Re: VLIW on Cypress and vector addition
          realhetNovice
          realhet 2012-7-4 上午1:26 (回復 cadorino)

          On hd6xxx there are 4 physical ALUs for each workitems. The compiler will schedule operations for each of the ALUs on every single clocks. Also the compiler must ensure that there are no data dependency across the ALUs (eg. ALU x cannot use the result from ALU y in a single clock).

          In our case [Very Large Instruction Word] means that one instruction contains at most 4 subinstructions for each of the four ALUs.

           

          If it's not complicated enough: on the HD4xxx,5xxx there is a fifth ALU which handles the complicated instructions. So 4 ALUs can do simple math like mul, add, and the 5th can handle special things like cos().

           

          SSE is SIMD. It means one instruction will do the same operation on 4 different datas. SIMD can interpreted as a special case of VLIW where all ALU's have to do the same operation on data packed into vectors (eg. float4)

           

          GCN architecture dropped VLIW. On that a simple sequential code which even contains long dependency chains will do fine. So there is no need to have 4 or 5 independent execution paths in your algo, but for maximum utilization you'll have to feed it with 4x more workitems.

           

          (you know, the example code here is rather theoretical: Its bottleneck is memory IO, all the ALUs are sleeping and waiting for the memory units. Also the hardest ALU calculation is not the a*b+c (1 mad instruction) but get_global_id(0) (modulo/rangecheck/add operations) and address calculations for the 3 indirectly addressed buffers.)

          • Re: VLIW on Cypress and vector addition
            cadorinoNewbie
            cadorino 2012-7-4 下午3:01 (回復 realhet)

            Great answer! Thank you 
            So, summarizing, the grouping of instructions may allow to exploit all the 4/5 ALUs and this exploitation depends on the program and on the compiler. 

          • Re: VLIW on Cypress and vector addition
            bridgmanNewbie
            bridgman 2012-7-4 下午7:31 (回復 realhet)

            >>On hd6xxx there are 4 physical ALUs for each workitems.

             

            It's actually only the hd69xx (and Trinity) which use VLIW4. The rest of the hd6xxx family uses VLIW5, similar to the earlier parts.

             

            The GCN parts (hd77xx and higher) use 4 scalar SIMDs in a CU rather than 1 VLIW4 SIMD.

            The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
  • Re: VLIW on Cypress and vector addition
    cadorinoNewbie
    cadorino 2012-7-5 上午6:46 (回復 cadorino)

    Great topic. I would like to ask you another thing without starting another topic.
    Given a disassembled kernel like:

     

    00 ALU: ADDR(32) CNT(11) KCACHE0(CB0:0-15) KCACHE1(CB1:0-15)

          0  t: MULLO_INT   ____,  R1.x,  KC0[1].x     

          1  x: ADD_INT     ____,  R0.x,  PS0     

          2  w: ADD_INT     ____,  PV1.x,  KC0[6].x     

          3  z: LSHL        ____,  PV2.w,  2     

          4  y: ADD_INT     ____,  KC1[0].x,  PV3.z     

             z: ADD_INT     ____,  KC1[1].x,  PV3.z     

          5  x: LSHR        R2.x,  PV4.z,  2     

             y: LSHR        R0.y,  PV4.z,  2     

             w: LSHR        R0.w,  PV4.y,  2     

    01 TEX: ADDR(48) CNT(2)

          6  VFETCH R0.x___, R0.w, fc153  MEGA(4)

             FETCH_TYPE(NO_INDEX_OFFSET)

          7  VFETCH R1.x___, R0.y, fc153  MEGA(4)

             FETCH_TYPE(NO_INDEX_OFFSET)

    02 ALU: ADDR(43) CNT(1) KCACHE0(CB1:0-15)

          8  x: MULADD_e    R0.x,  KC0[2].x,  R0.x,  R1.x     

    03 MEM_RAT_CACHELESS_STORE_RAW: RAT(11)[R2].x___, R0, ARRAY_SIZE(4)  MARK  VPM

    END_OF_PROGRAM

     

    can I assume that each instruction 0 - 8 is executed in a clock cycle?

    In the example I've 9 instructions. Can I say that they take 9 cycles to get executed? (obviously, after the fetch clause the wavefront will be switched off until data is available, but I'm referring exclusively to executing instructions, ignoring the time spent in waiting for memory accesses to complete)

    • Re: VLIW on Cypress and vector addition
      realhetNovice
      realhet 2012-7-5 上午7:51 (回復 cadorino)

      yes, the first ALU clause takes 6 core clock cycles and the second takes one.

      But these are relatively small clauses interleaved with memory clauses so there are lots of penalties at the transitions of the clauses.

      Ideally alu clauses can hold 128 slots but these are small ones.

       

      Lets say you have 2 wavefronts A,B and C assigned to a compute unit:

      ALU               memory unit

      A: 00 ALU       idle

      B: 00 ALU       A: 01 TEX

      C: 00 ALU       A: 01 TEX still (it's slow operation compared to small ALU stuff)

      A: 02 ALU       B: 01 TEX

      idle                 B: 01 TEX still

      B: 02 ALU       A: 03 MEM

      idle                 C: 01 TEX

      idle                 C: 01 TEX still

      C: 02 ALU       B: 03 MEM

      idle                 C: 03 MEM

       

      (oups maybe the memory output unit is separated from the texture unit but it's only an illustration of how the different parts of a compute unit can work in paralell)

      (and the compiler did a good job compiling that MULADD into one instruction)

posted on 2013-01-09 16:37 jackdong 閱讀(512) 評論(0)  編輯 收藏 引用 所屬分類: OpenCL
青青草原综合久久大伊人导航_色综合久久天天综合_日日噜噜夜夜狠狠久久丁香五月_热久久这里只有精品
  • <ins id="pjuwb"></ins>
    <blockquote id="pjuwb"><pre id="pjuwb"></pre></blockquote>
    <noscript id="pjuwb"></noscript>
          <sup id="pjuwb"><pre id="pjuwb"></pre></sup>
            <dd id="pjuwb"></dd>
            <abbr id="pjuwb"></abbr>
            亚洲电影免费观看高清完整版在线观看| 国产精品99久久久久久久女警 | 欧美视频日韩视频在线观看| 欧美激情小视频| 麻豆精品91| 欧美国产精品v| 欧美日韩国产精品专区| 国产精品裸体一区二区三区| 国产日韩av一区二区| 国内综合精品午夜久久资源| 伊人激情综合| 夜夜嗨av色综合久久久综合网| 亚洲性感激情| 久久精品国内一区二区三区| 欧美成人精品三级在线观看| 亚洲欧洲午夜| 最新中文字幕亚洲| 亚洲伊人网站| 久久国产精品99精品国产| 欧美成人精品不卡视频在线观看| 欧美日本不卡高清| 国产精品视频男人的天堂| 精品va天堂亚洲国产| 日韩一级片网址| 亚洲欧美在线观看| 欧美成人精品在线| 中文在线一区| 亚洲天堂av电影| 久久精品视频在线看| 欧美日韩激情网| 国产日韩欧美中文| 一区电影在线观看| 久久久蜜臀国产一区二区| 亚洲国产精品一区二区第一页| 一区二区电影免费在线观看| 久久久精品国产免费观看同学| 欧美激情无毛| 永久久久久久| 午夜精品久久一牛影视| 欧美成人在线免费视频| 亚洲欧美精品suv| 欧美国产国产综合| 在线色欧美三级视频| 羞羞视频在线观看欧美| 亚洲精品久久久久| 美女视频黄免费的久久| 国产综合色精品一区二区三区| 一区二区国产在线观看| 亚洲国产高清aⅴ视频| 性久久久久久久| 国产精品久久国产精品99gif | 国产精品久久久久一区二区三区共| 狠狠干成人综合网| 欧美在线资源| 亚洲网站在线看| 欧美日韩一区二区视频在线观看 | 亚洲免费精品| 免费在线观看日韩欧美| 亚洲第一色在线| 久久久久综合| 欧美有码视频| 狠狠色丁香婷婷综合| 久久久久久精| 欧美在线亚洲在线| 黄色成人av网| 美女精品网站| 另类综合日韩欧美亚洲| 亚洲第一区在线观看| 欧美chengren| 久久躁日日躁aaaaxxxx| 亚洲国产成人91精品| 欧美高清成人| 男女精品网站| 一本大道久久a久久综合婷婷| 亚洲欧洲一区二区三区久久| 欧美福利视频一区| 日韩亚洲欧美成人一区| 一片黄亚洲嫩模| 国产美女搞久久| 久久夜色精品国产欧美乱| 久久人人九九| 日韩天堂在线观看| 亚洲一区欧美二区| 在线免费日韩片| 最新中文字幕亚洲| 一区二区亚洲精品国产| 一区二区精品国产| 亚洲伦理在线| 欧美在线看片| 久久精品国产精品亚洲| 在线看日韩欧美| 亚洲欧洲视频在线| 国产精品毛片va一区二区三区 | 欧美承认网站| 中文av一区二区| 亚洲欧美日韩在线一区| 亚洲二区在线| 日韩亚洲欧美综合| 国产一区二区精品在线观看| 亚洲国产精品成人久久综合一区| 欧美日韩一级片在线观看| 欧美有码在线观看视频| 久热这里只精品99re8久| 亚洲视频在线看| 欧美一区二区三区日韩| 日韩一级精品视频在线观看| 香蕉久久夜色精品| 这里只有精品丝袜| 老司机精品视频网站| 亚洲一级在线观看| 美女视频黄 久久| 久久久久高清| 欧美四级在线观看| 久久久久五月天| 国产精品高潮在线| 亚洲国产高清在线| 黄色工厂这里只有精品| 一区二区三区免费网站| 国产日产精品一区二区三区四区的观看方式 | 国产精品日韩久久久| 欧美成年人网站| 国产人久久人人人人爽| 99一区二区| 日韩一区二区精品| 久久综合九色| 久久久一区二区| 国产伦精品一区二区三| 一区二区三区四区蜜桃| 亚洲精品久久久久久一区二区| 久久激情视频免费观看| 性视频1819p久久| 欧美午夜精品久久久久久人妖| 亚洲风情在线资源站| 在线观看一区视频| 久久福利一区| 久久久久久久精| 国产区精品视频| 性伦欧美刺激片在线观看| 香蕉久久夜色| 国产精品久久久久久久久久久久久 | 亚洲精品日产精品乱码不卡| 午夜久久久久久久久久一区二区| 亚洲与欧洲av电影| 欧美日韩一区在线观看| 91久久精品久久国产性色也91| 亚洲高清在线| 你懂的一区二区| 亚洲国产精品欧美一二99| 日韩视频在线观看一区二区| 欧美激情一区二区三区| 亚洲另类一区二区| 亚洲视频一区| 国产精品视频xxx| 亚洲欧美福利一区二区| 欧美一乱一性一交一视频| 国产午夜亚洲精品羞羞网站 | 亚洲综合久久久久| 久久精品视频在线看| 悠悠资源网亚洲青| 欧美高清在线一区二区| 99精品热6080yy久久| 先锋a资源在线看亚洲| 国产午夜亚洲精品理论片色戒| 久久一综合视频| 亚洲理论在线观看| 欧美综合国产| 亚洲黄一区二区三区| 欧美系列电影免费观看| 久久久综合免费视频| 亚洲一区二区三区精品动漫| 国产精品高潮呻吟| 久久视频精品在线| 99精品99| 久久婷婷久久一区二区三区| 亚洲美洲欧洲综合国产一区| 欧美性猛片xxxx免费看久爱| 亚洲女人天堂成人av在线| 美女国产精品| 亚洲一区二区少妇| 亚洲国产精品激情在线观看| 欧美系列一区| 狂野欧美激情性xxxx| 一区二区冒白浆视频| 久久亚洲春色中文字幕| 一区二区三区www| 激情自拍一区| 国产精品成人va在线观看| 欧美在线视频网站| 一区二区三区高清在线观看| 欧美.日韩.国产.一区.二区| 亚洲欧美在线免费观看| 亚洲青色在线| 国产一区二区无遮挡| 欧美日韩美女一区二区| 久久久久91| 亚洲欧美日韩爽爽影院| 99精品欧美| 亚洲精品一区二区在线| 男女激情视频一区| 午夜精品福利在线|