Continuing our series of occasional interviews with game developers about current and upcoming hardware and game graphics engines, we chat with Marko Kylmamaa, senior graphics programmer for Digital Illusions' Canadian studio.
This installment's interviewee is Mr. Marko Kylmamaa, a senior graphics programmer at DICE.
FiringSquad: First, Intel and AMD are pushing dual core processors and within the next year four core processors are due to be released. How will DICE support this kind of tech in the Battlefield 2/2142 engine and will there be any need for special programming to fully support multi core CPUs in PCs?
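Question: Intel and AMD are pushing dual-core CPUs hard, and both plan to launch quad-core CPUs next year. How does DICE plan to add support for this technology to the BF2 engine, and does doing so require any special programming techniques?
Marko Kylmamaa: While a program geared towards a single-core machine may run fine, with some exceptions, and perhaps even somewhat faster on a multi-core machine, in order to realize the real performance benefits careful attention has to be paid to structuring the code with the correct granularity in mind, to make it suitable for multi-core execution. With the introduction of the next generation consoles and the PC hardware, the whole industry is in a learning phase for understanding the differences between the traditional multi-threading approaches, and multi-threading for multiple cores. DICE is working closely with hardware vendors in making sure that all of the future titles make the maximum use of the available multi-core architecture.
Answer: A program written for a single-core machine can already run well, sometimes even faster, on a multi-core machine. The real issue is that multi-core processing is more complex than single-core (much like painful multithreading): the code has to be structured correctly and synchronization handled properly. As next-generation hardware spreads, the whole field is starting to learn multithreaded programming. DICE keeps working closely with hardware vendors to get the most out of multi-core architectures.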
FiringSquad: The 64-bit CPU has taken longer to really appear in mainstream PCs than some people expected. Do you think 64-bit CPUs will become more popular and how does DICE support it in their Battlefield 2/2142 engine?
Question: 64-bit CPUs have taken longer to reach the mainstream than people expected. Do you think 64-bit CPUs will become popular, and how does DICE support them in the BF2 engine?
Marko Kylmamaa: One of the problems with harnessing the full power of 64-bit CPUs is the lack of adoption of 64-bit operating systems. Due to this it's difficult for the game developers to make full use of the 64-bit execution potential without providing a separate set of executables compiled for the different operating systems. The current Battlefield 2 technology has been thoroughly tested on the 64-bit architecture for guaranteeing a solid performance, and optimizations have been made where possible with such architectures in mind.
Answer: Because 64-bit operating systems are not yet widely adopted, the full performance of 64-bit CPUs cannot be exploited yet. Without building separate executables for each platform there is no way to use the 64-bit potential, which is a real difficulty. BF2 has already been tested and optimized on 64-bit platforms.
FiringSquad: Game physics are getting more and more attention as well, with more effort being put into destructible objects and better collisions. Where does DICE stand on this kind of support for its engine and what solution is best: having a dedicated card (AGEIA), using a graphics card (ATI/Havok), or using a CPU to handle it?
Question: Game physics is getting more and more attention. How does DICE see it, and which solution do you think is best: a dedicated AGEIA physics card, a graphics card (ATI/Havok), or the CPU?
Marko Kylmamaa: Especially with multiplayer games in mind, it is difficult to make use of scaleable physics, since especially from the gameplay perspective all of the players must experience the same end result in simulation regardless of their hardware. This leads to a lot of the scalability of the physics being used for visual effects such as richer particle effects or fluid simulation. The GPU can of course be used for offloading the physics simulation from the CPU, but this will compete with the remaining processing time for graphics. Therefore in most cases it is necessary to strike the right balance between the CPU and GPU usage with the needs of the particular game in mind. The next generation technology at DICE is being built on the bleeding edge and will make use of very comprehensive physical modeling.
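Answer: Using scalable physics in multiplayer games is quite difficult: from the gameplay perspective, every player must experience the same simulation result no matter what hardware they use. So the scalable part of the physics mostly goes into visual effects such as fluid simulation and particle systems. The GPU can take over some of the CPU's physics simulation work, but that competes with graphics for precious processing time. We therefore still need to balance the load between the CPU and the GPU. DICE will make full use of next-generation technology to build the best possible physics experience for players.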
FiringSquad: HDR lighting is also getting a lot of attention in more PC games. How does the Battlefield 2/2142 engine support those features and how will that help the graphics in games that use it?
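Question: HDR lighting is also being talked about more and more. How does the BF2/2142 engine support this effect, and how will it improve the look of games?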
Marko Kylmamaa: HDR lighting can add significantly to the perceived realism in the modern graphics engines. It is becoming an increasingly common feature as the new hardware supports full floating point surfaces and has the required processing power for supporting a multitude of such high end features.
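Some aspects of the HDR lighting were simulated especially in the Battlefield 2 Expansion Pack: Special Forces, for adding a degree of realism to the night-time look. The effect is fairly subtle and was used mainly for fine tuning the overall look. Battlefield 2142 does not have night-time levels, so the same technology was not applicable to it, however there are a great number of special lighting effects for enhancing the desired futuristic look of the game.
Answer: HDR lighting is becoming a standard feature of modern graphics engines. Now that new hardware fully supports floating-point surfaces, it can raise image quality and make scenes look more realistic, though it also needs considerable processing power. Some HDR was used in BF2: Special Forces for the night-time look. BF2142 has no night-time levels, so this technology (presumably HDR) was not used there, but we use other lighting effects to give the game its intended look.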
FiringSquad: More and more games are using extensive pixel and vertex shading for visual and art effects. How does the Battlefield 2/2142 engine support these features currently and how will pixel and vertex shaders be used in the future, particularly with Windows Vista and DirectX10 support?
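Question: More and more games make extensive use of pixel shaders (PS) and vertex shaders (VS) to improve image quality. How does the BF2/2142 engine support these features, and how will PS and VS be used in the future, especially with Vista and DX10 on the way?
Marko Kylmamaa: The Battlefield 2 engine has been built on the DirectX9 architecture and is a fully shader based model. This allowed for a great flexibility during the development, and not supporting the older fixed function pipeline model allowed us to concentrate solely on the high end features. Battlefield 2142 is based on the improved Battlefield 2 technology and will be released later this year, so considering that the DirectX10 hardware won't be widely available just yet, it hasn't been beneficial to re-architect the engine into a DirectX10 based model for this release. This allowed the available time to be used for adding a number of new special effects and polishing the overall look of the existing engine.
Answer: The BF2 engine is built entirely on the DX9 architecture and is a fully shader-based model. This gave us a lot of flexibility during development, and dropping the fixed-function pipeline let us concentrate on high-end effects. BF2142 is based on improved BF2 engine technology and will be released soon; since DX10 hardware will not be widespread that quickly, we did not re-architect the engine around the DX10 model for this release. That left us time to add new effects and polish the existing engine.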
FiringSquad: What other advanced hardware and graphical features do you think will be supported in upcoming Battlefield 2/2142 engine games and in future graphics engines?
Question: Which advanced hardware and graphics features do you think the BF2/2142 engine will support, and what about future engines?
Marko Kylmamaa: Battlefield 2142 will support a large range of high end special effects geared towards creating the desired futuristic look. These involve for example new atmospheric effects for creating a unique look that is quite different from Battlefield 2.
Answer: BF2142 supports many special effects aimed at creating its futuristic look. Its atmospheric effects, for example, are quite different from those in BF2.
FiringSquad: Finally, Mark Rein from Epic has said that Intel is hurting the PC gaming industry through its use of integrated graphics in PCs. Is this a real threat and if so what can be done about this from the game developer's side?
Question: Finally, Mark Rein of Epic (don't tell me you haven't heard of them: the studio behind the upcoming UT2007) has said that Intel is hurting the PC gaming industry with its integrated graphics hardware. How do you see this from a game developer's point of view?
Marko Kylmamaa: Intel produces what you could call the ultra-low end graphics cards for a market segment that typically doesn't wish to invest the money into a higher end, gaming geared hardware. Clearly there is a demand for this type of hardware as Intel's graphics cards boast a large user base. However, this does impose challenges for the games industry in our attempts at reaching especially for the casual gamer market. Hardware requirements for the next generation games keep growing faster than what is needed for running general applications, which increases the rift between the casual and hardcore hardware markets. I believe that we as an industry will also have to recognize the different requirements these markets impose.
From the perspective of a developer, it can be difficult or in some cases practically impossible to make the high-end game run on the ultra-low end hardware. Supporting such scalability range in performance could be prohibitive with the required development time and cost in mind. It is ultimately up to each developer to find the correct range of hardware which allows for the desired market penetration.
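Answer: The people who buy Intel graphics are the ultra-low-end segment, people who do not want to spend money on a gaming machine. There is clearly demand for such hardware, but it makes the casual market hard for us to reach. Games demand hardware far beyond what business software needs, which widens the gap between the tiers of the hardware market. I believe the industry as a whole will have to face this. From a game developer's perspective, getting a high-end game to run on low-end platforms is genuinely difficult, because supporting hardware of such varied performance drives up development time and cost. In the end each developer has to choose its range of hardware platforms according to the market it wants to reach.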
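I recently found some time to study the WoW server architecture, and along the way I reviewed how the MaNGOS project uses its template-based Singleton. One thing I do not fully understand: with a usage such as Singleton<Master>, if a different type is passed in, does the static that comes back turn out to be the same object? Surely not. What if I print the this pointer and look? I will try it when I have time. Singleton is a very important design pattern in game development, so make sure you learn it well.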
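The question about Singleton<Master> can be checked with a small experiment. The sketch below is a minimal template singleton written for illustration; the names `Singleton`, `Master`, `World` and `SingletonsAreDistinct` are hypothetical and are not taken from the MaNGOS sources. It assumes the common function-local-static form of the pattern.

```cpp
// Hypothetical template singleton, for illustration only.
template <typename T>
struct Singleton {
    // Each distinct T instantiates its own copy of this function and
    // therefore its own function-local static object.
    static T& Instance() {
        static T instance;
        return instance;
    }
};

// Two unrelated example types (assumed names).
struct Master { int value = 0; };
struct World  { int value = 0; };

// True when repeated access to one type yields the same object and
// access through a different type yields a different object.
inline bool SingletonsAreDistinct() {
    const void* m1 = &Singleton<Master>::Instance();
    const void* m2 = &Singleton<Master>::Instance();
    const void* w  = &Singleton<World>::Instance();
    return m1 == m2 && m1 != w;
}
```

Because every template instantiation is a separate class, Singleton<Master> and Singleton<World> each own a separate static; comparing the addresses (or printing them) shows two different objects.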
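Authentication process
The WoW server consists of two parts: the Logon Server (LS below) and the Realm Server (RS below). The LS accepts connections from the WoW client and works through the following steps:
Check client information such as version and region, and check the account and password
Start or resume the patch transfer (if there is one)
Carry out the SRP6 encrypted session with the client and write the generated key to the database
Send the realm list when the client requests it
Once the client has chosen a realm, it disconnects from the LS and connects to the RS:
Authenticate, using the client key that was just generated
If that succeeds, enter the game-loop exchange
The RS and the LS use the same database: after the LS generates the SRP6 key and writes it to the DB, the RS reads it back for the next authentication step.
Logon Server in detail
The basic connection sequence is as follows:
The client prepares to connect and sends a CMD_AUTH_LOGON_CHALLENGE packet containing all the data needed for login, such as the user name and password
The server returns a CMD_AUTH_LOGON_CHALLENGE packet whose fields include the validity check and the server's precomputed SRP6 data
If the account is valid, the client sends a CMD_AUTH_LOGON_PROOF packet filled with its own computed SRP6 data
The server verifies it and sends back CMD_AUTH_LOGON_PROOF containing the result of the SRP6 verification
If everything is in order, the client sends a CMD_REALM_LIST packet requesting the valid realms
The server replies with a CMD_REALM_LIST packet filled with the realm data the client needs
The client refreshes its realm list from the server every 3-4 seconds.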
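The packet order above can be pictured as a small per-connection state machine. This is only a sketch under stated assumptions: the packet names come from the text, while the enum layout, the state names and the rule that any out-of-order packet closes the connection are hypothetical and do not describe the real wire format or the MaNGOS implementation.

```cpp
// Client packets named in the text; enum values are arbitrary (assumption).
enum class AuthCmd { LogonChallenge, LogonProof, RealmList };

// Hypothetical connection states for the logon sequence.
enum class AuthState { AwaitChallenge, AwaitProof, Authed, Closed };

// Advances the state for one client packet. `valid` stands for the
// server-side check of that packet (known account, matching SRP6 proof).
inline AuthState NextState(AuthState state, AuthCmd cmd, bool valid) {
    if (!valid) return AuthState::Closed;
    switch (state) {
        case AuthState::AwaitChallenge:
            return cmd == AuthCmd::LogonChallenge ? AuthState::AwaitProof
                                                  : AuthState::Closed;
        case AuthState::AwaitProof:
            return cmd == AuthCmd::LogonProof ? AuthState::Authed
                                              : AuthState::Closed;
        case AuthState::Authed:
            // The realm list may be requested repeatedly (periodic refresh).
            return cmd == AuthCmd::RealmList ? AuthState::Authed
                                             : AuthState::Closed;
        case AuthState::Closed:
            break;
    }
    return AuthState::Closed;
}
```

Feeding it LogonChallenge, LogonProof and then any number of RealmList packets keeps the connection in the authenticated state; skipping a step closes it.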
N: N = 2q + 1, where q is a prime; every modular operation below is taken modulo N
g: a generator modulo N
k: k = H(N, g); in SRP6, k = 3
s: the user's salt
I: the user name
p: the cleartext password
H(): a one-way hash function
^: (modular) exponentiation
u: random scrambling value
a, b: secret ephemeral values
A, B: public ephemeral values
x: private key (derived from p and s)
v: password verifier
Here x = H(s, p) and v = g^x; s is chosen at random, and v is used to verify the password.
The host stores { I, s, v } in its database. Authentication proceeds as follows:
The client sends the host I and A = g^a (a is a random number)
The host sends the client s and B = kv + g^b (it sends the salt; b is a random number)
Both sides compute u = H(A, B)
The client computes x = H(s, p) (hashing the password), S = (B - kg^x)^(a + ux) and K = H(S) (deriving the session key)
The host computes S = (Av^u)^b and K = H(S), which also yields the session key
To complete authentication, the two sides exchange proofs derived from the key, each performing the following computation:
After receiving the host's proof, the client computes H(A, M, K)
Likewise, the host computes M = H(H(N) xor H(g), H(I), s, A, B, K) and checks that it matches the value it holds. This completes the authentication process.
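To see why both sides end up with the same S, the formulas above can be run with toy numbers. The sketch below is illustrative only: the modulus 2039 (= 2 * 1019 + 1), the base g = 7, the stand-in hash `H` and the function names are assumptions chosen so that the arithmetic fits in 64-bit integers. A real server uses a large N and SHA-1, and this code must not be used for actual authentication.

```cpp
#include <cstdint>

using u64 = std::uint64_t;

// Toy parameters (assumptions): tiny modulus, arbitrary base, k = 3.
constexpr u64 N = 2039;   // 2 * 1019 + 1
constexpr u64 g = 7;
constexpr u64 k = 3;

// Modular exponentiation by repeated squaring.
inline u64 PowMod(u64 base, u64 exp, u64 mod) {
    u64 result = 1;
    base %= mod;
    while (exp > 0) {
        if (exp & 1) result = result * base % mod;
        base = base * base % mod;
        exp >>= 1;
    }
    return result;
}

// Stand-in for the one-way hash H(); simple multiplicative mixing, not SHA-1.
inline u64 H(u64 x, u64 y) {
    u64 h = 14695981039346656037ULL;
    h = (h ^ x) * 1099511628211ULL;
    h = (h ^ y) * 1099511628211ULL;
    return h % N;
}

struct SessionSecrets { u64 client; u64 server; };

// Runs both halves of the exchange and returns the S each side derives.
inline SessionSecrets RunHandshake(u64 salt, u64 password, u64 a, u64 b) {
    u64 x = H(salt, password);               // x = H(s, p)
    u64 v = PowMod(g, x, N);                 // v = g^x, stored by the host
    u64 A = PowMod(g, a, N);                 // client -> host
    u64 B = (k * v + PowMod(g, b, N)) % N;   // host -> client
    u64 u = H(A, B);
    // Client: S = (B - k*g^x)^(a + u*x)
    u64 base = (B + N - (k * PowMod(g, x, N)) % N) % N;
    u64 clientS = PowMod(base, a + u * x, N);
    // Host: S = (A * v^u)^b
    u64 serverS = PowMod(A * PowMod(v, u, N) % N, b, N);
    return {clientS, serverS};
}
```

Both expressions reduce to g^(b(a + ux)) modulo N, which is why the two values agree whenever the client hashes the same salt and password that produced v.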
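Realm Server in detail
After disconnecting from the LS, authentication with the RS begins:
Connect to the RS and send the server an SMSG_AUTH_CHALLENGE packet containing the random seed used last time
The server sends back SMSG_AUTH_CHALLENGE. The client derives a random seed from the seed returned by the server and the SRP6 data, builds a SHA1 string, packs this data into a CMSG_AUTH_SESSION packet, and sends it to the server.
Note that this exchange is not encrypted. When the server receives the authentication reply, it builds its own SHA1 string from the client-generated seed and compares it with the one from the client; if they match, everything is OK.
Next, let's look at operations on an account's characters, such as creating them. An account can hold up to 50 characters, I think; I have not played the game and have only skimmed the manual.
The client sends a CMSG_CHAR_ENUM packet to request its characters
The server sends back an SMSG_CHAR_ENUM packet containing all character information
At this point the client can operate on these characters: CMSG_CHAR_CREATE, CMSG_CHAR_DELETE, CMSG_CHAR_PLAYER_LOGIN
Once the character has logged in, the server sends back an SMSG_CHAR_DATA packet
How does this work inside the game loop?
If the player quits the game immediately, the client sends CMSG_PLAYER_LOGOUT and the server replies with SMSG_LOGOUT_COMPLETE
If the player chooses a delayed logout, the client sends CMSG_LOGOUT_REQUEST and the server replies with SMSG_LOGOUT_RESPONSE. If the player quits during the countdown and CMSG_PLAYER_LOGOUT is sent, the character still stays in the world until the countdown finishes.
If the player cancels the logout and keeps playing, the client sends CMSG_LOGOUT_CANCEL and the server replies with SMSG_LOGOUT_CANCEL_ACK.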
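Today's CPUs still follow the von Neumann architecture and like to execute the way one reads a book: from start to finish, with no jumps, stalls or waits along the way. In reality, though, most programs are full of IF/ELSE-style decisions, and loops are even more common. How to optimize loops is something you can read up on yourself; it is not hard, and the "High-Quality C/C++ Programming Guide" is a useful reference.
Current CPUs have a Level 1 instruction cache (also called the L1 trace cache) and a Level 1 data cache (L1 Data Cache). The PMMX, P2 and P3 provide 16 KB for each; my P4 Northwood (P4NW below) has an 8 KB L1 data cache and a trace cache holding 12K micro-ops. Reading data from the L1 data cache takes the CPU only 1 clock cycle, which is very fast, second only to registers. The data cache consists of 256 or 512 lines of 32 bytes each, so lines are 32-byte aligned; on the P4NW the lines are 64 bytes, organized 4-way, 128 lines in total. When the data you touch is not yet in the cache, the CPU reads a whole cache line from memory, so cache lines are always aligned to physical addresses divisible by 32. Operating on data in the L1 data cache is the fastest case, so memory addresses are best aligned to 32 bytes. Compilers already do a good job here; they typically align to 4 bytes (that is, 32 bits). As you will see later, SSE2 requires data to be 16-byte aligned.
The cache is set-associative (loosely, think of a C++ set container), which means a cache line cannot be assigned to an arbitrary memory address. Each line has a 7-bit set value that must match bits 5 to 11 of the target memory address (bits 0-4 are ignored, since they select the byte within the line); in other words, the set value is part of the memory address. The PPro has 128 set values with two lines each, so at most two cache lines are available for any given memory location. The PMMX, P2, P3 and P4NW have four. As a consequence, the CPU can hold only two or four different cache lines for memory whose address bits 5 to 11 are identical. How can you tell whether two memory addresses have the same set value? Strip the low 5 bits of both addresses so that each is divisible by 32. If the difference between the two truncated addresses is a multiple of 4096 (1000H), the two addresses have the same set value.
Let's use some assembly to make this concrete. Assume ESI holds a 32-byte-aligned address.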
AGAIN: MOV EAX, [ESI]
MOV EBX, [ESI+13*4096+4]
MOV ECX, [ESI+20*4096+28]
DEC EDX
JNZ AGAIN
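Oh yeah: all three addresses here have the same set value, and the span between them exceeds the size of the data cache, so this loop performs very poorly on the PPro. When you want to read the value for ECX there is no free cache line left, because all three share one set value and both of its lines are already in use. The CPU therefore evicts the least recently used of the two lines, the one that was used for EAX, fills it with the memory from [ESI+20*4096] to [ESI+20*4096+31], and then reads ECX from the cache. That already sounds tedious. Worse, when EAX has to be read again the whole procedure repeats, shuffling data between memory and cache over and over; efficiency is terrible, possibly worse than having no cache at all. But suppose we change the third line to: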
MOV ECX, [ESI+20*4096+32]
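Uh-oh, it looks as if the address has now crossed a 32-byte boundary. But that gives it a different set value, which means it gets a line of its own and no longer has to share those two poor lines. The three register loads no longer keep evicting each other from two cache lines; each has its own. This time the loop needs only 3 clock cycles, against 60 for the previous version. That is on the PPro; later CPUs are all 4-way, so the problem above does not arise with three addresses. Amusingly, Intel's documentation wrongly states that the P2's cache is 2-way. Hardly anyone still uses CPUs that old, but the principle is worth understanding.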
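The set-value rule can be checked numerically for the addresses used in the assembly loop. The helper names below are hypothetical, and the sketch assumes the layout described in the text: 32-byte lines with a 7-bit set value taken from address bits 5 to 11.

```cpp
#include <cstdint>

// Bits 0-4 select the byte inside a 32-byte line;
// bits 5-11 form the 7-bit set value (layout as described in the text).
inline std::uint32_t SetValue(std::uint32_t address) {
    return (address >> 5) & 0x7F;
}

// Two addresses compete for the same cache lines when their set values match.
inline bool SameSet(std::uint32_t a, std::uint32_t b) {
    return SetValue(a) == SetValue(b);
}
```

With a 32-byte-aligned base address, base + 13*4096 + 4 and base + 20*4096 + 28 land in the same set as base, while base + 20*4096 + 32 falls into the next set, which is why changing the offset from 28 to 32 removes the conflict.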
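However, working out whether the data you are about to access shares a set value, that is, whether the cache will hit, is quite hard. In assembly it is manageable, but for a program compiled from a high-level language, who knows whether it has been optimized for the cache at all. So the recommendation is: in the core of the program, the part with the highest performance demands, first align the data, then make sure a single data block does not exceed the cache size, and with two data blocks that neither exceeds half the cache size (think about why: because of the set values, the cache can be treated as two halves serving the two blocks). In most cases, though, we work with structures far larger than the data cache and with pointers handed back by the compiler, so for optimization you may want to gather all frequently used variables into one contiguous block to make full use of the cache. One way to do this: copy the values of static variables into local variables on the stack, and copy them back when the subroutine or loop has finished. This effectively places the static variables in a contiguous address range.
When the requested data is not in the L1 cache, the CPU has to fetch an L1 cache line's worth of data from the L2 cache into L1, which takes roughly 200 ns (20 clock cycles on a 100 MHz system), and there is a further delay of 50-100 ns before you can use the data. Worst of all, if the data is not in the L2 cache either, it has to come from main memory, the slowest level, whose snail's pace is no match for a full-speed cache.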
That is enough background on caches; now for how to optimize for them. There are really just three approaches: hardware prefetch, software prefetch, and cache-control instructions. The main points to keep in mind about prefetching are:
1. Lay out data in memory sensibly and use block structures to raise the cache hit rate.
2. Use the prefetch instructions your compiler provides, such as _mm_prefetch and _mm_stream in ICC, or even more "traditional" instructions such as _mm_load.
3. Use global variables and pointers as little as possible.
4. Keep conditional branches, jumps and loops in the program to a minimum.
5. Mark things const, and do not use register declarations in your code.
One reminder, though: the way to really make a program faster is not to dissect it surgically from top to bottom and optimize every single spot. Focus on the core of the program and optimize that; remember the 80-20 rule.
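As an illustration of the software prefetch point, here is a minimal sketch that uses _mm_prefetch from <xmmintrin.h> while summing an array. The function name and the look-ahead distance `kAhead` are assumptions made for this example; a useful distance depends on the CPU and the memory system and has to be measured.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0

// Sums an array while hinting the cache about data further ahead.
// kAhead is a hypothetical tuning value, not a recommendation.
inline float SumWithPrefetch(const float* data, std::size_t count) {
    constexpr std::size_t kAhead = 64;  // 64 floats = 256 bytes ahead
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i) {
        // Only form the look-ahead pointer while it stays inside the array.
        if (i + kAhead < count) {
            _mm_prefetch(reinterpret_cast<const char*>(data + i + kAhead),
                         _MM_HINT_T0);
        }
        sum += data[i];
    }
    return sum;
}
```

A prefetch is only a hint and does not change the result; on a simple sequential scan like this the hardware prefetcher may already do the job, so the explicit hint tends to matter more for irregular access patterns.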
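Using SIMD
First, a quick review of the alignment directive: __declspec(align(#)), where # is replaced by the number of bytes. For example, to declare a 16-byte-aligned float array: __declspec(align(16)) float Array[128]. Note that it pays to know exactly what type of CPU you have and which instruction sets it supports. SIMD is mainly used where large amounts of data must be processed at once, such as 3D graphics (games), physical modeling (CAD), encryption, and scientific computing. As far as I know, today's GPGPU is another representative user of SIMD.
MMX
Main features: 57 instructions; 64-bit registers MM0-MM7, which are mapped onto the eight 80-bit FP registers ST0-ST7. Data must be 8-byte aligned, and it also works on packed numbers.
PS: This raises a question: why did Intel make the MMX registers share the FPU registers? The issue is FPU state switching, mentioned again later: when a piece of code needs both MMX instructions and traditional FPU instructions, the FPU state has to be saved, or MMX mode exited. That operation is very expensive for the FPU, and for a multitasking operating system it would be close to an impossible task: many programs run at once, some needing MMX and some not, and scheduling them correctly would become very difficult. So Intel left the job of saving state entirely to the CPU itself, with little for software developers to do, which kept things forward and backward compatible with multitasking operating systems such as Windows and Linux. Later, as operating systems and CPUs kept evolving, OS developers could ship a patch that let the operating system use new registers. At that point people realized Intel's approach had been rather short-sighted, and it can be counted as a major mistake. Only later, when Intel introduced a new floating-point instruction set, were the XMM registers added. The cause of this whole story was not really a technical one, though; preserving compatibility was one aspect of it, and the full picture is hard to pin down. All you need to remember is that MMX and the FPU cannot be used at the same time: the CPU has to switch modes.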
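SSE1
Main features: 128-bit FP registers XMM0-XMM7. Adds data prefetch instructions. Additional 64-bit integer support. Can process 4 single-precision floating-point numbers at once, that is, float in C/C++.
Applicable to: multimedia signal processing
SSE2
Main features: the 128-bit FP registers can process 2 double-precision (double) floating-point numbers at once, as well as integers packed as 16 bytes, 8 words, 4 dwords or 2 quadwords.
Applicable to: 3D processing, speech recognition, video encoding and decoding
SSE3
Main features: adds SIMD instructions for asymmetric and horizontal computation. Provides a special register load instruction for SIMD. Thread synchronization instructions.
Applicable to: scientific computing, multithreaded programs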
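To make the "4 floats at once" idea concrete, the sketch below adds two float arrays with SSE1 intrinsics from <xmmintrin.h>. The function name `AddFloats` is hypothetical. It assumes 16-byte-aligned buffers and an element count that is a multiple of 4; callers can get that alignment with __declspec(align(16)) as described above, or with the standard C++11 alignas(16).

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE1 intrinsics: __m128, _mm_load_ps, _mm_add_ps

// Adds two float arrays four lanes at a time.
// Assumptions: a, b and out are 16-byte aligned; count is a multiple of 4.
inline void AddFloats(const float* a, const float* b, float* out,
                      std::size_t count) {
    for (std::size_t i = 0; i < count; i += 4) {
        __m128 va = _mm_load_ps(a + i);   // aligned load of 4 floats
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(out + i, _mm_add_ps(va, vb));  // 4 additions at once
    }
}
```

For buffers that cannot be guaranteed aligned, _mm_loadu_ps and _mm_storeu_ps are the unaligned counterparts.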
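Tools at hand
1. Choose a suitable compiler. I recommend the Intel C++ Compiler (ICC below), or the C++ compiler that ships with the Visual Studio .NET 2003 IDE or later. The Microsoft C++ Compiler also supports AMD's 3DNow. I have not tried the GCC C++ compiler.
2. The assembly instruction set manuals from Intel and AMD. These are essential; I strongly recommend that every C++ coder keep a copy at hand.
Usage example:
Vector multiplication is extremely common in 3D processing, mostly for computing the angle between unit vectors.
We first define a vertex structure.
w is the homogeneous-coordinate parameter and is not needed when processing vectors. My function looks like this: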