
            xiaoxiaoling


I have been studying DPDK recently; this is a SIGCOMM 2014 paper, recorded here for future reference.

PS: key terms used in the text:

segment: the TCP PDU (protocol data unit), i.e. a packet at the TCP layer; if a payload is too large, TCP is responsible for splitting it into multiple segments (this concept helps in understanding the rest of the text).

According to UNIX Network Programming Volume 1, page 8, note 2: a packet is the data that the IP layer hands to the link layer and that the link layer encapsulates in a frame (excluding the frame header); the IP-layer unit (excluding the IP header) is properly called a datagram, and the link-layer unit is called a frame. The paper, however, does not make this distinction and uses "packet" loosely to mean a data packet.

DRAM: dynamic random-access memory, the system's main memory.

SRAM: static random-access memory, used for the CPU caches.

             

            Abstract

Contemporary network stacks are masterpieces of generality, supporting many edge-node and middle-node functions. Generality comes at a high performance cost: current APIs, memory models, and implementations drastically limit the effectiveness of increasingly powerful hardware. Generality has historically been required so that individual systems could perform many functions. However, as providers have scaled services to support millions of users, they have transitioned toward thousands (or millions) of dedicated servers, each performing a few functions. We argue that the overhead of generality is now a key obstacle to effective scaling, making specialization not only viable, but necessary.


Today's network stacks excel at generality, supporting many edge-node and middle-node functions. That generality comes at a high cost: current APIs, memory models, and implementations drastically limit how effectively increasingly powerful hardware can be used. Generality has historically been necessary so that individual systems could perform many functions.

However, as providers have scaled services to millions of users, they have shifted to thousands (or millions) of dedicated servers, each performing only a few functions (ps: vertical specialization). We argue that the overhead of generality is now a key obstacle to effective scaling, and that specialization is not only viable but necessary.

We present Sandstorm and Namestorm, web and DNS servers that utilize a clean-slate userspace network stack that exploits knowledge of application-specific workloads. Based on the netmap framework, our novel approach merges application and network-stack memory models, aggressively amortizes protocol-layer costs based on application-layer knowledge, couples tightly with the NIC event model, and exploits microarchitectural features. Simultaneously, the servers retain use of conventional programming frameworks. We compare our approach with the FreeBSD and Linux stacks using the nginx web server and NSD name server, demonstrating 2–10× and 9× improvements in web-server and DNS throughput, lower CPU usage, linear multicore scaling, and saturated NIC hardware.

We present Sandstorm and Namestorm, a web server and a DNS server built on a clean-slate userspace network stack that exploits knowledge of application-specific workloads. Based on the netmap framework, our approach merges the application and network-stack memory models, aggressively amortizes protocol-layer costs using application-layer knowledge, couples tightly with the NIC event model, and exploits microarchitectural features, while the servers retain the use of conventional programming frameworks. We compare our approach against the FreeBSD and Linux stacks running the nginx web server and the NSD name server, demonstrating 2–10× and 9× improvements in web-server and DNS throughput respectively, lower CPU usage, linear multicore scaling, and saturated NIC hardware.

            INTRODUCTION

            Conventional network stacks were designed in an era where individual systems had to perform multiple diverse functions. In the last decade, the advent of cloud computing and the ubiquity of networking has changed this model; today, large content providers serve hundreds of millions of customers. To scale their systems, they are forced to employ many thousands of servers, with each providing only a single network service. Yet most content is still served with conventional general-purpose network stacks.

             

Conventional network stacks were designed in an era when individual systems had to perform many different functions. Over the past decade, the rise of cloud computing and the ubiquity of networking have changed this model; today, large content providers serve hundreds of millions of customers. To scale their systems they are forced to use many thousands of servers, each providing only a single network service. Yet most content is still served by conventional general-purpose network stacks.

             

            These general-purpose stacks have not stood still, but today’s stacks are the result of numerous incremental updates on top of codebases that were originally developed in the early 1990s. Arguably, these network stacks have proved to be quite efficient, flexible, and reliable, and this is the reason that they still form the core of contemporary networked systems. They also provide a stable programming API, simplifying software development. But this generality comes with significant costs, and we argue that the overhead of generality is now a key obstacle to effective scaling, making specialization not only viable, but necessary.

             

These general-purpose stacks have not stood still, but today's stacks are the result of many incremental updates on top of codebases originally developed in the early 1990s. Arguably, these stacks have proven quite efficient, flexible, and reliable, which is why they still form the core of contemporary networked systems; they also provide a stable programming API that simplifies software development. But this generality comes at a significant cost, and we argue that the overhead of generality is now a key obstacle to effective scaling, making specialization not only viable but necessary.

            In this paper we revisit the idea of specialized network stacks. In particular, we develop Sandstorm, a specialized userspace stack for serving static web content, and Namestorm, a specialized stack implementing a high performance DNS server. More importantly, however, our approach does not simply shift the network stack to userspace: we also promote tight integration and specialization of application and stack functionality, achieving cross-layer optimizations antithetical to current design practices.

             

In this paper we revisit the idea of specialized network stacks. In particular, we develop Sandstorm, a specialized userspace stack for serving static web content, and Namestorm, a specialized stack implementing a high-performance DNS server. More importantly, our approach does not simply move the network stack into userspace: we also pursue tight integration and specialization of application and stack functionality, achieving cross-layer optimizations that run counter to current design practice.

             

            Servers such as Sandstorm could be used for serving images such as the Facebook logo, as OCSP [20] responders for certificate revocations, or as front end caches to popular dynamic content. This is a role that conventional stacks should be good at: nginx [6] uses the sendfile() system call to hand over serving static content to the operating system. FreeBSD and Linux then implement zero-copy stacks, at least for the payload data itself, using scatter-gather to directly DMA the payload from the disk buffer cache to the NIC. They also utilize the features of smart network hardware, such as TCP Segmentation Offload (TSO) and Large Receive Offload (LRO) to further improve performance. With such optimizations, nginx does perform well, but as we will demonstrate, a specialized stack can outperform it by a large margin.

             

A server like Sandstorm could be used to serve images such as the Facebook logo, to act as an OCSP [20] responder for certificate revocation, or as a front-end cache for popular dynamic content. This is a role conventional stacks should be good at: nginx [6] uses the sendfile() system call to hand serving of static content over to the operating system. FreeBSD and Linux then implement zero-copy stacks, at least for the payload itself, using scatter-gather to DMA the payload directly from the disk buffer cache to the NIC. They also exploit smart NIC features such as TCP Segmentation Offload (TSO) (ps: a segment is a TCP-layer packet; Segmentation Offload means the splitting of large TCP sends into segments is done in hardware) and Large Receive Offload (LRO) to further improve performance. With such optimizations nginx performs well, but as we will show, a specialized stack can outperform it by a large margin.

             

Namestorm is aimed at handling extreme DNS loads, such as might be seen at the root nameservers, or when a server is under a high-rate DDoS attack. The open-source state of the art here is NSD [5], which, combined with a modern OS that minimizes data copies when sending and receiving UDP packets, performs well. Namestorm, however, can outperform it by a factor of nine.

Namestorm is aimed at extreme DNS loads, such as those seen at the root nameservers or when a server is under a high-rate DDoS attack. The open-source state of the art here is NSD [5], which, combined with a modern OS that minimizes data copies when sending and receiving UDP packets, performs well. Namestorm, however, can outperform it by a factor of nine.

             

Our userspace web server and DNS server are built upon FreeBSD's netmap [31] framework, which directly maps the NIC buffer rings to userspace. We will show that not only is it possible for a specialized stack to beat nginx, but on data-center-style networks when serving small files typical of many web pages, it can achieve three times the throughput on older hardware, and more than six times the throughput on modern hardware supporting DDIO.

             

Our userspace web server and DNS server are built on FreeBSD's netmap [31] framework, which maps the NIC buffer rings directly into userspace. We will show not only that a specialized stack can beat nginx, but that on data-center-style networks, when serving the small files typical of many web pages, it can achieve three times the throughput on older hardware and more than six times the throughput on modern hardware supporting DDIO.

            The demonstrated performance improvements come from four places. First, we implement a complete zero-copy stack, not only for payload but also for all packet headers, so sending data is very efficient. Second, we allow aggressive amortization that spans traditionally stiff boundaries – e.g., application-layer code can request pre-segmentation of data intended to be sent multiple times, and extensive batching is used to mitigate system-call overhead from userspace. Third, our implementation is synchronous, clocked from received packets; this improves cache locality and minimizes the latency of sending the first packet of the response. Finally, on recent systems, Intel’s DDIO provides substantial benefits, but only if packets to be sent are already in the L3 cache and received packets are processed to completion immediately. It is hard to ensure this on conventional stacks, but a special-purpose stack can get much closer to this ideal.

The demonstrated performance improvements come from four places. First, we implement a completely zero-copy stack, not only for payload but for all packet headers (ps: i.e. the full packets including the IP headers), so sending data is very efficient. Second, we allow aggressive amortization across traditionally rigid boundaries: for example, application-layer code can request pre-segmentation of data that will be sent many times (ps: segmentation here means splitting a large block into MSS-sized packets), and extensive batching is used to mitigate the system-call overhead of running in userspace. Third, our implementation is synchronous, clocked by received packets; this improves cache locality and minimizes the latency to the first packet of the response. Finally, on recent systems Intel's DDIO provides substantial benefits, but only if the packets to be sent are already in the L3 cache and received packets are processed to completion immediately. This is hard to guarantee with a conventional stack, but a special-purpose stack can get much closer to this ideal.

Of course, userspace stacks are not a novel concept. Indeed, the Cheetah web server for MIT's XOK Exokernel [19] operating system took a similar approach, and demonstrated significant performance gains over the NCSA web server in 1994. Despite this, the concept has never really taken off, and in the intervening years conventional stacks have improved immensely. Unlike XOK, our specialized userspace stacks are built on top of a conventional FreeBSD operating system. We will show that it is possible to get all the performance gains of a specialized stack without needing to rewrite all the ancillary support functions provided by a mature operating system (e.g., the filesystem). Combined with the need to scale server clusters, we believe that the time has come to re-evaluate special-purpose stacks on today's hardware.

            The key contributions of our work are:

             

Of course, userspace stacks are not a novel concept. Indeed, the Cheetah web server for MIT's XOK Exokernel [19] operating system took a similar approach and showed significant performance gains over the NCSA web server in 1994. Despite this, the concept never really took off, and in the intervening years conventional stacks have improved immensely. Unlike XOK, our specialized userspace stacks are built on top of a conventional FreeBSD operating system. We will show that it is possible to get all the performance gains of a specialized stack without rewriting all the ancillary support functions provided by a mature operating system (e.g., the filesystem). Combined with the need to scale server clusters, we believe the time has come to re-evaluate special-purpose stacks on today's hardware.

The key contributions of our work are:

            We discuss many of the issues that affect performance in conventional stacks, even though they use APIs aimed at high performance such as sendfile() and recvmmsg().

We discuss many of the issues that affect performance in conventional stacks, even when they use APIs aimed at high performance such as sendfile() and recvmmsg().

            We describe the design and implementation of multiple modular, highly specialized, application-specific stacks built over a commodity operating system while avoiding these pitfalls. In contrast to prior work, we demonstrate that it is possible to utilize both conventional and specialized stacks in a single system. This allows us to deploy specialization selectively, optimizing networking while continuing to utilize generic OS components such as filesystems without disruption.

             

We describe the design and implementation of multiple modular, highly specialized, application-specific stacks built over a commodity operating system while avoiding these pitfalls. In contrast to prior work, we show that conventional and specialized stacks can be used side by side in a single system. This lets us deploy specialization selectively, optimizing the network path while continuing to use generic OS components such as filesystems without disruption.

            We demonstrate that specialized network stacks designed for aggressive cross-layer optimizations create opportunities for new and at times counter-intuitive hardware-sensitive optimizations. For example, we find that violating the long-held tenet of data-copy minimization can increase DMA performance for certain workloads on recent CPUs.

We demonstrate that specialized network stacks designed for aggressive cross-layer optimization create opportunities for new, and at times counter-intuitive, hardware-sensitive optimizations. For example, we find that violating the long-held tenet of minimizing data copies can actually increase DMA performance for certain workloads on recent CPUs.

            We present hardware-grounded performance analyses of our specialized network stacks side-by-side with highly optimized conventional network stacks. We evaluate our optimizations over multiple generations of hardware, suggesting portability despite rapid hardware evolution.

We present hardware-grounded performance analyses of our specialized network stacks side by side with highly optimized conventional network stacks. We evaluate our optimizations across multiple generations of hardware, suggesting that they remain applicable even as hardware evolves rapidly.

We explore the potential of a synchronous network stack blended with asynchronous application structures, in stark contrast to conventional asynchronous network stacks supporting synchronous applications. This approach optimizes cache utilization by both the CPU and DMA engines, yielding as much as 2–10× conventional stack performance.

We explore the potential of a synchronous network stack blended with asynchronous application structures, in stark contrast to conventional asynchronous stacks supporting synchronous applications. This approach optimizes cache utilization by both the CPU and the DMA engines, yielding as much as 2–10× the performance of conventional stacks.

            2. SPECIAL-PURPOSE ARCHITECTURE

            What is the minimum amount of work that a web server can perform to serve static content at high speed? It must implement a MAC protocol, IP, TCP (including congestion control), and HTTP.

            However, their implementations do not need to conform to the conventional socket model, split between userspace and kernel, or even implement features such as dynamic TCP segmentation. For a web server that serves the same static content to huge numbers of clients (e.g., the Facebook logo or GMail JavaScript), essentially the same functions are repeated again and again. We wish to explore just how far it is possible to go to improve performance. In particular, we seek to answer the following questions:

What is the minimum amount of work a web server must perform to serve static content at high speed? It must implement a MAC protocol (ps: ARP), IP, TCP (including congestion control), and HTTP.

However, these implementations need not conform to the conventional socket model, be split between userspace and kernel, or even implement features such as dynamic TCP segmentation. For a web server that serves the same static content to huge numbers of clients (e.g., the Facebook logo or GMail JavaScript), essentially the same work is repeated over and over again. We want to explore just how far performance can be pushed. In particular, we seek to answer the following questions:

            Conventional network stacks support zero copy for OSmaintained data – e.g., filesystem blocks in the buffer cache, but not for application-provided HTTP headers or TCP packet headers. Can we take the zero-copy concept to its logical extreme, in which received packet buffers are passed from the NIC all the way to the application, and application packets to be sent are DMAed to the NIC for transmission without even the headers being copied?

Conventional network stacks support zero copy for OS-maintained data, e.g. filesystem blocks in the buffer cache, but not for application-provided HTTP headers or TCP packet headers. Can we take the zero-copy concept to its logical extreme, where received packet buffers are passed from the NIC all the way up to the application, and application packets to be sent are DMAed to the NIC without even the headers being copied? (ps: a received packet is modified in place into the reply and sent back.)
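To make "zero copy all the way" concrete in netmap terms, the sketch below swaps buffer indices between an RX slot and a TX slot so that a received frame can be transmitted (after in-place header edits) without copying the payload. This is the standard netmap zero-copy idiom rather than Sandstorm's actual code; rxring and txring are assumed to have been obtained via NETMAP_RXRING()/NETMAP_TXRING().

    /* Minimal sketch of netmap zero-copy forwarding: swap buffer indices
     * between an RX slot and a TX slot so the payload is never copied. */
    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>

    static void
    zero_copy_swap(struct netmap_ring *rxring, struct netmap_ring *txring)
    {
        struct netmap_slot *rs = &rxring->slot[rxring->cur];
        struct netmap_slot *ts = &txring->slot[txring->cur];
        uint32_t tmp = ts->buf_idx;

        ts->buf_idx = rs->buf_idx;      /* hand the received buffer to TX   */
        rs->buf_idx = tmp;              /* recycle the old TX buffer for RX */
        ts->len = rs->len;
        ts->flags |= NS_BUF_CHANGED;    /* tell netmap the buffers moved    */
        rs->flags |= NS_BUF_CHANGED;

        rxring->head = rxring->cur = nm_ring_next(rxring, rxring->cur);
        txring->head = txring->cur = nm_ring_next(txring, txring->cur);
    }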

            Conventional stacks make extensive use of queuing and buffering to mitigate context switches and keep CPUs and NICs busy, at the cost of substantially increased cache footprint and latency. Can we adopt a bufferless event model that reimposes synchrony and avoids large queues that exceed cache sizes? Can we expose link-layer buffer information, such as available space in the transmit descriptor ring, to prevent buffer bloat and reduce wasted work constructing packets that will only be dropped?

             

Conventional stacks make extensive use of queueing and buffering to mitigate context switches and keep CPUs and NICs busy, at the cost of a substantially larger cache footprint and higher latency. Can we adopt a bufferless event model that restores synchrony and avoids large queues that exceed cache sizes? Can we expose link-layer buffer information, such as the space available in the transmit descriptor ring, to prevent buffer bloat and avoid wasting work building packets that will only be dropped?

             

            Conventional stacks amortize expenses internally, but cannot amortize repetitive costs spanning application and network layers. For example, they amortize TCP connection lookup using Large Receive Offload (LRO) but they cannot amortize the cost of repeated TCP segmentation of the same data transmitted multiple times. Can we design a network-stack API that allows cross-layer amortizations to be accomplished such that after the first client is served, no work is ever repeated when serving subsequent clients?

             

Conventional stacks amortize costs internally, but cannot amortize repetitive costs that span the application and network layers. For example, they amortize TCP connection lookup using Large Receive Offload (LRO), but they cannot amortize the cost of repeatedly TCP-segmenting the same data each time it is transmitted (ps: splitting the same large block into small packets over and over). Can we design a network-stack API that allows such cross-layer amortization, so that after the first client has been served, no work is ever repeated when serving subsequent clients?

             

Conventional stacks embed the majority of network code in the kernel to avoid the cost of domain transitions, limiting two-way communication flow through the stack. Can we make heavy use of batching to allow device drivers to remain in the kernel while colocating stack code with the application and avoiding significant latency overhead?

             

Conventional stacks embed most network code in the kernel to avoid the cost of domain transitions, which limits two-way communication flow through the stack. Can we make heavy use of batching so that the device driver stays in the kernel while the stack code is colocated with the application (ps: i.e. lives in the same userspace process as the application), without incurring significant latency overhead?

             

            Can we avoid any data-structure locking, and even cache-line contention, when dealing with multi-core applications that do not require it?

             

Can we avoid all data-structure locking, and even cache-line contention, when dealing with multicore applications that do not require them?

             

            Finally, while performing all the above, is there a suitable programming abstraction that allows these components to be reused for other applications that may benefit from server specialization?

             

            最后,在執(zhí)行上述所有操作時(shí),是否有合適的編程抽象,允許這些組件重用于可能受益于服務(wù)器專(zhuān)業(yè)化的其他應(yīng)用程序?

             

            2.1 Network-stack Modularization

            Although monolithic kernels are the de facto standard for networked systems, concerns with robustness and flexibility continue to drive exploration of microkernel-like approaches. Both Sandstorm and Namestorm take on several microkernel-like qualities:

Although monolithic kernels are the de facto standard for networked systems, concerns about robustness and flexibility continue to drive exploration of microkernel-like approaches. Both Sandstorm and Namestorm take on several microkernel-like qualities:

             

            Rapid deployment & reusability: Our prototype stack is highly modular, and synthesized from the bottom up using traditional dynamic libraries as building blocks (components) to construct a special-purpose system. Each component corresponds to a standalone service that exposes a well-defined API. Our specialized network stacks are built by combining four basic components:

             

Rapid deployment and reusability: our prototype stack is highly modular, synthesized from the bottom up using traditional dynamic libraries as building blocks (components) to construct a special-purpose system. Each component corresponds to a standalone service exposing a well-defined API. Our specialized network stacks are built by combining four basic components:

             

            The netmap I/O (libnmio) library that abstracts traditional data-movement and event-notification primitives needed by higher levels of the stack.

             

The netmap I/O library (libnmio), which abstracts the traditional data-movement and event-notification primitives needed by the higher layers of the stack.

             

            libeth component, a lightweight Ethernet-layer implementation.

             

The libeth component, a lightweight Ethernet-layer implementation.

             

            libtcpip that implements our optimized TCP/IP layer.

             

libtcpip, which implements our optimized TCP/IP layer.

             

            libudpip that implements a UDP/IP layer.

             

libudpip, which implements a UDP/IP layer.

             

            Figure 1 depicts how some of these components are used with a simple application layer to form Sandstorm, the optimized web server.

            Splitting functionality into reusable components does not require us to sacrifice the benefits of exploiting cross-layer knowledge to optimize performance, as memory and control flow move easily across API boundaries. For example, Sandstorm interacts directly with libnmio to preload and push segments into the appropriate packet-buffer pools. This preserves a service-centric approach.

            Developer-friendly: Despite seeking inspiration from microkernel design, our approach maintains most of the benefits of conventional monolithic systems:

             

Figure 1 shows how some of these components are combined with a simple application layer to form Sandstorm, the optimized web server.

Splitting functionality into reusable components does not force us to give up the benefits of cross-layer knowledge for performance, because memory and control flow move easily across API boundaries. For example, Sandstorm interacts directly with libnmio to preload and push segments into the appropriate packet-buffer pools. This preserves a service-centric approach.

Developer-friendly: despite drawing inspiration from microkernel design, our approach keeps most of the benefits of conventional monolithic systems:

             

Debugging is at least as easy (if not easier) compared to conventional systems, as application-specific, performance-centric code shifts from the kernel to more accessible userspace.

             

Debugging is at least as easy as (if not easier than) on conventional systems, because the application-specific, performance-centric code moves from the kernel into more accessible userspace.

             

            Our approach integrates well with the general-purpose operating systems: rewriting basic components such as device drivers or filesystems is not required. We also have direct access to conventional debugging, tracing, and profiling tools, and can also use the conventional network stack for remote access (e.g., via SSH).

             

Our approach integrates well with a general-purpose operating system: there is no need to rewrite basic components such as device drivers or filesystems. We also have direct access to conventional debugging, tracing, and profiling tools, and can still use the conventional network stack for remote access (e.g., via SSH).

             

            Instrumentation in Sandstorm is a simple and straightforward task that allows us to explore potential bottlenecks as well as necessary and sufficient costs in network processing across application and stack. In addition, off-the-shelf performance monitoring and profiling tools “just work”, and a synchronous design makes them easier to use.

             

Instrumenting Sandstorm is simple and straightforward, which lets us explore potential bottlenecks as well as the necessary and sufficient costs of network processing across the application and the stack. In addition, off-the-shelf performance monitoring and profiling tools "just work", and the synchronous design makes them easier to use.

             

            2.2 Sandstorm web server design

            Rizzo’s netmap framework provides a general-purpose API that allows received packets to be mapped directly to userspace, and packets to be transmitted to be sent directly from userspace to the NIC’s DMA rings. Combined with batching to reduce system calls, this provides a high-performance framework on which to build packet-processing applications. A web server, however, is not normally thought of as a packet-processing application, but one that handles TCP streams.

             

Rizzo's netmap framework provides a general-purpose API that allows received packets to be mapped directly into userspace, and packets to be transmitted to be sent directly from userspace to the NIC's DMA rings. Combined with batching to reduce system calls, this provides a high-performance framework on which to build packet-processing applications. A web server, however, is not normally thought of as a packet-processing application, but as one that handles TCP streams.
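For readers unfamiliar with netmap, a minimal receive loop looks roughly like the following. nm_open(), poll(), and nm_nextpkt() are the real netmap userspace helpers; "netmap:ix0" and process_packet() are placeholders, and error handling is omitted. Batching falls out naturally: one poll() wakeup can deliver an entire ring's worth of packets.

    /* Minimal netmap receive loop (sketch). */
    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>
    #include <poll.h>

    void process_packet(unsigned char *buf, unsigned int len);  /* application handler (placeholder) */

    void rx_loop(void)
    {
        struct nm_desc *d = nm_open("netmap:ix0", NULL, 0, NULL);
        struct pollfd pfd = { .fd = NETMAP_FD(d), .events = POLLIN };
        struct nm_pkthdr h;
        unsigned char *buf;

        for (;;) {
            poll(&pfd, 1, -1);                 /* one system call per batch          */
            while ((buf = nm_nextpkt(d, &h)) != NULL)
                process_packet(buf, h.len);    /* frame mapped directly in userspace */
        }
    }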

             

To serve a static file, we load it into memory, and a priori generate all the packets that will be sent, including TCP, IP, and link-layer headers. When an HTTP request for that file arrives, the server must allocate a TCP-protocol control block (TCB) to keep track of the connection's state, but the packets to be sent have already been created for each file on the server.

             

To serve a static file, we load it into memory and generate, a priori, all the packets that will ever be sent for it, including TCP, IP, and link-layer headers. When an HTTP request for that file arrives, the server must allocate a TCP protocol control block (TCB) to track the connection's state, but the packets to be sent have already been created for every file on the server.
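A sketch of what "prepackaged" might look like: at load time each file is cut into MSS-sized chunks, and each chunk is stored as a complete Ethernet/IP/TCP frame whose payload bytes never change; only a handful of header fields are patched per connection at send time. The names and layout here are illustrative, not Sandstorm's actual data structures.

    /* Illustrative layout for pre-segmented static content. */
    #include <stdint.h>
    #include <stddef.h>

    #define MSS 1448

    struct prepkt {
        uint16_t len;                        /* total frame length                    */
        uint32_t payload_off;                /* byte offset of this chunk in the file */
        uint8_t  frame[14 + 20 + 20 + MSS];  /* Ether + IP + TCP headers + payload    */
    };

    struct prefile {
        size_t         nsegs;                /* number of MSS-sized segments          */
        struct prepkt *segs;                 /* built once at startup, reused forever */
    };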

             

            The majority of the work is performed during inbound TCP ACK processing. The IP header is checked, and if it is acceptable, a hash table is used to locate the TCB. The offset of the ACK number from the start of the connection is used to locate the next prepackaged packet to send, and if permitted by the congestion and receive windows, subsequent packets. To send these packets, the destination address and port must be rewritten, and the TCP and IP checksums incrementally updated. The packet can then be directly fetched by the NIC using netmap. All reads of the ACK header and modifications to the transmitted packets are performed in a single pass, ensuring that both the headers and the TCB remain in the CPU’s L1 cache.

             

Most of the work is performed during inbound TCP ACK processing. The IP header is checked and, if it is acceptable, a hash table is used to locate the TCB (ps: so packets of the same connection are handled by the same TCB). The offset of the ACK number from the start of the connection is used to locate the next prepackaged packet to send and, if the congestion and receive windows permit, subsequent packets. To send these packets, the destination address and port must be rewritten and the TCP and IP checksums incrementally updated. The packet can then be fetched directly by the NIC via netmap. All reads of the ACK header and all modifications to the transmitted packets are performed in a single pass, keeping both the headers and the TCB in the CPU's L1 cache.
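The incremental checksum update mentioned here is the standard RFC 1624 technique: when a 16-bit field changes from an old value to a new one, the checksum can be fixed up without re-summing the payload. A minimal sketch (32-bit fields such as the IP address or sequence number are handled as two 16-bit halves):

    /* Incremental Internet checksum update (RFC 1624: HC' = ~(~HC + ~m + m')).
     * Applies to the IP, TCP and UDP checksums when a header field such as the
     * destination address, port, or sequence number is rewritten. */
    #include <stdint.h>

    static uint16_t
    cksum_update16(uint16_t cksum, uint16_t old16, uint16_t new16)
    {
        uint32_t sum = (uint16_t)~cksum + (uint16_t)~old16 + new16;
        sum = (sum & 0xffff) + (sum >> 16);   /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }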

             

            Once a packet has been DMAed to the NIC, the packet buffer is returned to Sandstorm, ready to be incrementally modified again and sent to a different client. However, under high load, the same packet may need to be queued in the TX ring for a second client before it has finished being sent to the first client. The same packet buffer cannot be in the TX ring twice, with different destination address and port. This presents us with two design options:

             

Once a packet has been DMAed to the NIC, the packet buffer is returned to Sandstorm, ready to be incrementally modified again and sent to a different client. Under high load, however, the same packet may need to be queued in the TX ring for a second client before it has finished being sent to the first. The same packet buffer cannot sit in the TX ring twice with different destination addresses and ports, which leaves two design options:

             

            We can maintain more than one copy of each packet in memory to cope with this eventuality. The extra copy could be created at startup, but a more efficient solution would create extra copies on demand whenever a high-water mark is reached, and then retained for future use.

             

We can keep more than one copy of each packet in memory to cope with this eventuality. The extra copies could be created at startup, but a more efficient solution creates them on demand whenever a high-water mark is reached, and then retains them for future use.

             

We can maintain only one long-term copy of each packet, creating ephemeral copies each time it needs to be sent.

             

We can keep only one long-term copy of each packet and create an ephemeral copy each time it needs to be sent.

             

We call the former a pre-copy stack (it is an extreme form of zero-copy stack because in the steady state it never copies, but differs from the common use of the term "zero copy"), and the latter a memcpy stack. A pre-copy stack performs less per-packet work than a memcpy stack, but requires more memory; because of this, it has the potential to thrash the CPU's L3 cache. With the memcpy stack, it is more likely for the original version of a packet to be in the L3 cache, but more work is done. We will evaluate both approaches, because it is far from obvious how CPU cycles trade off against cache misses in modern processors.

             

We call the former a pre-copy stack (an extreme form of zero-copy stack: in the steady state it never copies, though it differs from the usual meaning of "zero copy"), and the latter a memcpy stack. A pre-copy stack does less per-packet work than a memcpy stack but needs more memory, so it can thrash the CPU's L3 cache. With the memcpy stack, the original version of a packet is more likely to be in the L3 cache, but more work is done per packet. We evaluate both approaches, because it is far from obvious how CPU cycles trade off against cache misses on modern processors.
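The two options can be summarized in code. Below, precopy_pick() stands for the pre-copy strategy (hand out an idle long-lived replica, cloning another only when a high-water mark is reached), while memcpy_emit() is the memcpy strategy (copy the master frame into the TX buffer every time). Both are illustrative sketches with made-up types, not the paper's implementation.

    /* Sketch of the two copy strategies (illustrative types and names). */
    #include <stdint.h>
    #include <string.h>

    struct pkt { uint16_t len; uint8_t frame[1514]; int in_flight; };

    /* Pre-copy: several long-lived replicas of each frame exist; pick an idle
     * one, and only its headers will be patched before transmission. */
    static struct pkt *precopy_pick(struct pkt *replicas, int n)
    {
        for (int i = 0; i < n; i++)
            if (!replicas[i].in_flight)
                return &replicas[i];
        return NULL;   /* caller clones another replica and keeps it for later use */
    }

    /* memcpy: one master copy; clone the whole frame into the TX buffer each
     * time, which costs a copy but keeps the payload hot in the cache. */
    static void memcpy_emit(uint8_t *txbuf, uint16_t *txlen, const struct pkt *master)
    {
        memcpy(txbuf, master->frame, master->len);
        *txlen = master->len;
    }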

             

            Figure 2 illustrates tradeoffs through traces taken on nginx/Linux and pre-copy Sandstorm servers that are busy (but unsaturated). On the one hand, a batched design measurably increases TCP roundtrip time with a relatively idle CPU. On the other hand, Sandstorm amortizes or eliminates substantial parts of per-request processing through a more efficient architecture. Under light load, the benefits are pronounced; at saturation, the effect is even more significant.

             

Figure 2 illustrates the tradeoffs through traces taken on nginx/Linux and pre-copy Sandstorm servers that are busy but not saturated. On the one hand, the batched design measurably increases TCP round-trip time while the CPU is relatively idle. On the other hand, Sandstorm amortizes or eliminates substantial parts of per-request processing through a more efficient architecture. Under light load the benefits are pronounced; at saturation the effect is even more significant.

             

Although most work is synchronous within the ACK processing code path, TCP still needs timers for certain operations. Sandstorm's timers are scheduled by polling the Time Stamp Counter (TSC): although not as accurate as other clock sources, it is accessible from userspace at the cost of a single CPU instruction (on modern hardware). The TCP slow timer routine is invoked periodically (every ~500ms) and traverses the list of active TCBs: on RTO expiration, the congestion window and slow-start threshold are adjusted accordingly, and any unacknowledged segments are retransmitted. The same routine also releases TCBs that have been in TIME_WAIT state for longer than 2*MSL. There is no buffering whatsoever required for retransmissions: we identify the segment that needs to be retransmitted using the oldest unacknowledged number as an offset, retrieve the next available prepackaged packet and adjust its headers accordingly, as with regular transmissions. Sandstorm currently implements TCP Reno for congestion control.

             

Although most work happens synchronously in the ACK-processing path, TCP still needs timers for certain operations. Sandstorm's timers are scheduled by polling the Time Stamp Counter (TSC): although not as accurate as other clock sources, it can be read from userspace at the cost of a single CPU instruction on modern hardware. The TCP slow-timer routine is invoked periodically (every ~500ms) and walks the list of active TCBs: on RTO expiration the congestion window and slow-start threshold are adjusted accordingly and any unacknowledged segments are retransmitted. The same routine also releases TCBs that have been in TIME_WAIT for longer than 2*MSL. No buffering at all is needed for retransmission: we identify the segment to retransmit using the oldest unacknowledged number as an offset, retrieve the next available prepackaged packet, and adjust its headers exactly as for a regular transmission. Sandstorm currently implements TCP Reno congestion control.
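Polling the TSC from userspace costs a single instruction; a hedged sketch of how the ~500ms slow-timer check might be driven is shown below. The __rdtsc() intrinsic is real (GCC/Clang, x86intrin.h), but tsc_hz, next_slowtimer and tcp_slowtimo() are illustrative names, not the paper's code.

    /* Sketch: drive the ~500ms TCP slow timer by polling the TSC. */
    #include <stdint.h>
    #include <x86intrin.h>

    void tcp_slowtimo(void);          /* placeholder: walks active TCBs as described above */

    static uint64_t tsc_hz;           /* cycles per second, calibrated once at startup     */
    static uint64_t next_slowtimer;   /* TSC value at which the timer next fires           */

    static void maybe_run_slow_timer(void)
    {
        uint64_t now = __rdtsc();                /* single instruction, no system call */
        if (now >= next_slowtimer) {
            next_slowtimer = now + tsc_hz / 2;   /* roughly 500ms from now             */
            tcp_slowtimo();
        }
    }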

             

            2.3 The Namestorm DNS server

The same principles applied in the Sandstorm web server also apply to a wide range of servers returning the same content to multiple users. Authoritative DNS servers are often targets of DDoS attacks – they represent a potential single point of failure, and because DNS traditionally uses UDP, it lacks TCP's three-way handshake to protect against attackers using spoofed IP addresses. Thus, high-performance DNS servers are of significant interest.

             

The same principles applied in the Sandstorm web server also apply to a wide range of servers that return the same content to many users. Authoritative DNS servers are frequent DDoS targets: they are a potential single point of failure, and because DNS traditionally uses UDP, it lacks TCP's three-way handshake to protect against attackers using spoofed IP addresses. High-performance DNS servers are therefore of significant interest.

             

            Unlike TCP, the conventional UDP stack is actually quite lightweight, and DNS servers already preprocess zone files and store response data in memory. Is there still an advantage running a specialized stack?

             

Unlike TCP, the conventional UDP stack is actually quite lightweight, and DNS servers already preprocess zone files and keep response data in memory. Is there still an advantage to running a specialized stack?

             

Most DNS-request processing is simple. When a request arrives, the server performs sanity checks, hashes the concatenation of the name and record type being requested to find the response, and sends that data. We can preprocess the responses so that they are already stored as a prepackaged UDP packet. As with HTTP, the destination address and port must be rewritten, the identifier must be updated, and the UDP and IP checksums must be incrementally updated. After the initial hash, all remaining processing is performed in one pass, allowing processing of DNS response headers to be performed from the L1 cache. As with Sandstorm, we can use pre-copy or memcpy approaches so that more than one response for the same name can be placed in the DMA ring at a time.

             

Most DNS-request processing is simple. When a request arrives, the server performs sanity checks, hashes the concatenation of the requested name and record type to find the response, and sends that data. We can preprocess the responses so that they are already stored as prepackaged UDP packets. As with HTTP, the destination address and port must be rewritten, the identifier must be updated, and the UDP and IP checksums must be incrementally updated. After the initial hash, all remaining processing happens in a single pass, so the DNS response headers can be processed out of the L1 cache. As with Sandstorm, we can use the pre-copy or memcpy approach so that more than one response for the same name can sit in the DMA ring at a time.

             

            Our specialized userspace DNS server stack is composed of three reusable components, libnmio, libeth, libudpip, and a DNS-specific application layer. As with Sandstorm, Namestorm uses FreeBSD’s netmap API, implementing the entire stack in userspace, and uses netmap’s batching to amortize system call overhead. libnmio and libeth are the same as used by Sandstorm, whereas libudpip contains UDP-specific code closely integrated with an IP layer. Namestorm is an authoritative nameserver, so it does not need to handle recursive lookups.

             

Our specialized userspace DNS server stack is composed of three reusable components, libnmio, libeth, and libudpip, plus a DNS-specific application layer. As with Sandstorm, Namestorm uses FreeBSD's netmap API, implements the entire stack in userspace, and uses netmap's batching to amortize system-call overhead. libnmio and libeth are the same components Sandstorm uses, while libudpip contains UDP-specific code closely integrated with an IP layer. Namestorm is an authoritative nameserver, so it does not need to handle recursive lookups.

             

Namestorm preprocesses the zone file upon startup, creating DNS response packets for all the entries in the zone, including the answer section and any glue records needed. In addition to type-specific queries for A, NS, MX and similar records, DNS also allows queries for ANY. A full implementation would need to create additional response packets to satisfy these queries; our implementation does not yet do so, but the only effect this would have is to increase the overall memory footprint. In practice, ANY requests prove comparatively rare.

             

Namestorm preprocesses the zone file at startup, creating DNS response packets for every entry in the zone, including the answer section and any glue records needed. Besides type-specific queries for A, NS, MX and similar records, DNS also allows ANY queries. A full implementation would need to create additional response packets to satisfy these; ours does not yet do so, but the only effect would be a larger overall memory footprint. In practice, ANY requests are comparatively rare.

             

Namestorm indexes the prepackaged DNS response packets using a hash table. There are two ways to do this:

             

Namestorm indexes the prepackaged DNS response packets using a hash table. There are two ways to do this:

             


             Index by concatenation of request type (e.g., A, NS, etc) and fully-qualified domain name (FQDN); for example “www.example.com”.

             

Index by the concatenation of the request type (e.g., A, NS, etc.) and the fully-qualified domain name (FQDN); for example "www.example.com".

             

            Index by concatenation of request type and the wire-format FQDN as this appears in an actual query; for example,“[3]www[7]example[3]com[0]” where [3] is a single byte containing the numeric value 3.

             

Index by the concatenation of the request type and the wire-format FQDN as it appears in an actual query; for example "[3]www[7]example[3]com[0]", where [3] is a single byte containing the numeric value 3. (ps: in DNS packets a domain name is encoded as length-prefixed labels, e.g. [3]www[5]baidu[3]com; the length byte before each label makes parsing easy.)

             

Using the wire request format is obviously faster, but DNS permits compression of names. Compression is common in DNS answers, where the same domain name occurs more than once, but proves rare in requests. If we implement wire-format hash keys, we must first perform a check for compression; these requests are decompressed and then re-encoded to uncompressed wire format for hashing. The choice is therefore between optimizing for the common case, using wire-format hash keys, or optimizing for the worst case, assuming compression will be common, and using FQDN hash keys. The former is faster, but the latter is more robust to a DDoS attack by an attacker taking advantage of compression. We evaluate both approaches, as they illustrate different performance tradeoffs.

             

Using the wire-format request is obviously faster, but DNS permits name compression. Compression is common in DNS answers, where the same domain name appears more than once, but rare in requests. If we use wire-format hash keys we must first check for compression; compressed requests are decompressed and then re-encoded into uncompressed wire format before hashing. The choice is therefore between optimizing for the common case, using wire-format hash keys, and optimizing for the worst case, assuming compression will be common, and using FQDN hash keys. The former is faster, but the latter is more robust against a DDoS attacker exploiting compression. We evaluate both, as they illustrate different performance tradeoffs.
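A sketch of the wire-format option described above: the lookup key is built directly from the query's length-prefixed labels plus the 16-bit query type, after a cheap scan that detects compression pointers (top two bits of a length byte set, per RFC 1035) and hands those rare queries off to a decompression path. Function and buffer names are illustrative.

    /* Sketch: build a hash key from a DNS query name in wire format
     * ("[3]www[7]example[3]com[0]") plus the query type. Returns the key
     * length, or -1 if the name uses compression and must be normalized
     * first. Illustrative only. */
    #include <stdint.h>
    #include <string.h>

    static int build_dns_key(const uint8_t *qname, size_t qname_max,
                             uint16_t qtype, uint8_t *key, size_t key_max)
    {
        size_t i = 0;
        while (i < qname_max && qname[i] != 0) {
            if ((qname[i] & 0xC0) == 0xC0)   /* compression pointer: rare in queries */
                return -1;
            i += 1 + qname[i];               /* skip the length byte plus its label  */
        }
        if (i >= qname_max || i + 1 + sizeof(qtype) > key_max)
            return -1;
        memcpy(key, qname, i + 1);           /* include the terminating zero label   */
        memcpy(key + i + 1, &qtype, sizeof(qtype));
        return (int)(i + 1 + sizeof(qtype));
    }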

             

Our implementation does not currently handle referrals, so it can handle only zones for which it is authoritative for all the sub-zones. It could not, for example, handle the .com zone, because it would receive queries for www.example.com, but only have hash table entries for example.com. Truncating the hash key is trivial to do as part of the translation to an FQDN, so if Namestorm were to be used for a domain such as .com, the FQDN version of hashing would be a reasonable approach.

             

Our implementation does not currently handle referrals, so it can only serve zones for which it is authoritative for all sub-zones. It could not, for example, serve the .com zone, because it would receive queries for www.example.com but would only have hash-table entries for example.com. Truncating the hash key is trivial to do as part of the translation to an FQDN, so if Namestorm were used for a domain such as .com, the FQDN version of hashing would be a reasonable approach.

             

             

            Outline of the main Sandstorm event loop

1. Call RX poll to receive a batch of received packets that have been stored in the NIC's RX ring; block if none are available.
2. For each ACK packet in the batch:
3. Perform Ethernet and IP input sanity checks.
4. Locate the TCB for the connection.
5. Update the acknowledged sequence numbers in the TCB; update receive window and congestion window.
6. For each new TCP data packet that can now be sent, or each lost packet that needs retransmitting:
7. Find a free copy of the TCP data packet (or clone one if needed).
8. Correct the destination IP address, destination port, sequence numbers, and incrementally update the TCP checksum.
9. Add the packet to the NIC's TX ring.
10. Check if dt has passed since the last TX poll. If it has, call TX poll to send all queued packets.

             

Outline of Sandstorm's main event loop

1. Call RX poll to receive the batch of packets waiting in the NIC's RX ring; block if none are available.
2. For each ACK packet in the batch:
3. Perform Ethernet- and IP-layer sanity checks.
4. Locate the TCB that handles this connection.
5. Update the acknowledged sequence numbers in the TCB; update the receive window and congestion window.
6. For each new TCP data packet that can now be sent, and each lost packet that needs retransmitting:
7. Find a free copy of the TCP data packet (clone one if needed).
8. Correct the destination IP address, destination port, and sequence numbers, and incrementally update the TCP checksum.
9. Add the packet to the NIC's TX ring.
10. Check whether the TX-poll interval has elapsed since the last TX poll; if so, call TX poll to send all queued packets.

             

            2.4 Main event loop

To understand how the pieces fit together and the nature of interaction between Sandstorm, Namestorm, and netmap, we consider the timeline for processing ACK packets in more detail. Figure 3 summarizes Sandstorm's main loop. SYN/FIN handling, HTTP, and timers are omitted from this outline, but also take place. However, most work is performed in the ACK processing code.

             

To understand how the pieces fit together, and the nature of the interaction between Sandstorm, Namestorm, and netmap, we walk through the handling of ACK packets in more detail. Figure 3 summarizes Sandstorm's main loop. SYN/FIN handling, HTTP, and timers are omitted from this outline but also take place; most of the work, however, is done in the ACK-processing code.

             

One important consequence of this architecture is that the NIC's TX ring serves as the sole output queue, taking the place of conventional socket buffers and software network-interface queues. This is possible because retransmitted TCP packets are generated in the same way as normal data packets. As Sandstorm is fast enough to saturate two 10Gb/s NICs with a single thread on one core, data structures are also lock-free.

             

One important consequence of this architecture is that the NIC's TX ring serves as the sole output queue, taking the place of conventional socket buffers and software network-interface queues. This is possible because retransmitted TCP packets are generated in exactly the same way as normal data packets. Since Sandstorm is fast enough to saturate two 10Gb/s NICs with a single thread on one core, the data structures are also lock-free.

             

            When the workload is heavy enough to saturate the CPU, the system-call rate decreases. The receive batch size increases as calls to RX poll become less frequent, improving efficiency at the expense of increased latency. Under extreme load, the RX ring will fill, dropping packets. At this point the system is saturated and, as with any web server, it must bound the number of open connections by dropping some incoming SYNs.

             

When the workload is heavy enough to saturate the CPU, the system-call rate (ps: the RX/TX poll calls) decreases. The receive batch size grows as calls to RX poll become less frequent, improving efficiency at the cost of increased latency. Under extreme load the RX ring fills up and packets are dropped. At that point the system is saturated and, as with any web server, it must bound the number of open connections by dropping some incoming SYNs.

             

            Under heavier load, the TX-poll system call happens in the RX handler. In our current design, dt, the interval between calls to TX poll in the steady state, is a constant set to 80us. The system-call rate under extreme load could likely be decreased by further increasing dt, but as the pre-copy version of Sandstorm can easily saturate all six 10Gb/s NICs in our systems for all file sizes, we have thus far not needed to examine this. Under lighter load, incoming packets might arrive too rarely to provide acceptable latency for transmitted packets; a 5ms timer will trigger transmission of straggling packets in the NIC’s TX ring.

             

Under heavier load, the TX-poll system call happens in the RX handler. In our current design dt, the interval between TX-poll calls in the steady state, is a constant set to 80us. The system-call rate under extreme load could probably be lowered further by increasing dt, but since the pre-copy version of Sandstorm easily saturates all six 10Gb/s NICs in our systems for all file sizes, we have not yet needed to explore this. Under lighter load, incoming packets may arrive too rarely to provide acceptable latency for transmitted packets, so a 5ms timer triggers transmission of any straggling packets in the NIC's TX ring.

             

            The difference between the pre-copy version and the memcpy version of Sandstorm is purely in step 7, where the memcpy version will simply clone the single original packet rather than search for an unused existing copy.

             

The difference between the pre-copy and memcpy versions of Sandstorm lies purely in step 7: the memcpy version simply clones the single original packet rather than searching for an unused existing copy.

             

            Contemporary Intel server processors support Direct Data I/O (DDIO). DDIO allows NIC-originated Direct Memory Access (DMA) over PCIe to access DRAM through the processor’s Last-Level Cache (LLC). For network transmit, DDIO is able to pull data from the cache without a detour through system memory; likewise, for receive,DMA can place data in the processor cache. DDIO implements administrative limits on LLC utilization intended to prevent DMA from thrashing the cache. This design has the potential to significantly reduce latency and increase I/O bandwidth

             

Contemporary Intel server processors support Direct Data I/O (DDIO). DDIO allows NIC-originated DMA over PCIe to access DRAM through the processor's last-level cache (LLC). For transmission, DDIO can pull data from the cache without a detour through system memory; likewise, on receive, DMA can place data directly in the processor cache. DDIO imposes administrative limits on LLC utilization to prevent DMA from thrashing the cache. This design has the potential to significantly reduce latency and increase I/O bandwidth.

             

            Memcpy Sandstorm forces the payload of the copy to be in the CPU cache from which DDIO can DMA it to the NIC without needing to load it from memory again. With pre-copy, the CPU only touches the packet headers, so if the payload is not in the CPU cache, DDIO must load it, potentially impacting performance. These interactions are subtle, and we will look at them in detail.

             

Memcpy Sandstorm forces the copied payload into the CPU cache, from which DDIO can DMA it to the NIC without loading it from memory again. With pre-copy, the CPU touches only the packet headers, so if the payload is not in the CPU cache, DDIO must load it, potentially hurting performance. (ps: one would expect fewer copies to be faster, but with DDIO the memcpy version guarantees the data is in cache for the CPU-to-NIC transfer, whereas the pre-copy version's data may not be cached and costs an extra load; this point is revisited later.) These interactions are subtle, and we will look at them in detail.

             

            Namestorm follows the same basic outline, but is simpler as DNS is stateless: it does not need a TCB, and sends a single response packet to each request.

             

Namestorm follows the same basic outline but is simpler, because DNS is stateless: it needs no TCB and sends a single response packet for each request.

             

            2.5 API

            As discussed, all of our stack components provide well-defined APIs to promote reusability. Table 1 presents a selection of API functions exposed by libnmio and libtcpip. In this section we describe some of the most interesting properties of the APIs.

             

As discussed, all of our stack components provide well-defined APIs to promote reusability. Table 1 presents a selection of the API functions exposed by libnmio and libtcpip. In this section we describe some of the most interesting properties of these APIs.

             

libnmio is the lowest-level component: it handles all interaction with netmap and abstracts the main event loop. Higher layers (e.g., libeth) register callback functions to receive raw incoming data as well as set timers for periodic events (e.g., TCP slow timer). The function netmap_output() is the main transmission routine: it enqueues a packet to the transmission ring either by memory copy or zero copy, and also implements an adaptive batching algorithm.

            Since there is no socket layer, the application must directly interface with the network stack. With TCP as the transport layer, it acquires a TCB (TCP Control Block), binds it to a specific IPv4 address and port, and sets it to LISTEN state using API functions. The application must also register callback functions to accept connections,receive and process data from active connections, as well as act on successful delivery of sent data (e.g., to close the connection or send more data).

             

libnmio is the lowest-level component: it handles all interaction with netmap and abstracts the main event loop. Higher layers (e.g., libeth) register callbacks to receive raw incoming data and to set timers for periodic events (e.g., the TCP slow timer). The function netmap_output() is the main transmission routine: it enqueues a packet on the transmit ring either by memory copy or by zero copy, and implements an adaptive batching algorithm.

Since there is no socket layer, the application must interface directly with the network stack. With TCP as the transport layer, it acquires a TCB (TCP control block), binds it to a specific IPv4 address and port, and sets it to the LISTEN state using API functions. The application must also register callbacks to accept connections, to receive and process data from active connections, and to act on successful delivery of sent data (e.g., to close the connection or send more data).

             

            3. EVALUATION

To explore Sandstorm and Namestorm's performance and behavior, we evaluated using both older and more recent hardware. On older hardware, we employed Linux 3.6.7 and FreeBSD 9-STABLE. On newer hardware, we used Linux 3.12.5 and FreeBSD 10-STABLE. We ran Sandstorm and Namestorm on FreeBSD.

             

To explore Sandstorm's and Namestorm's performance and behavior, we evaluated them on both older and more recent hardware. On the older hardware we used Linux 3.6.7 and FreeBSD 9-STABLE; on the newer hardware, Linux 3.12.5 and FreeBSD 10-STABLE. Sandstorm and Namestorm run on FreeBSD.

             

For the old hardware, we use three systems: two clients and one server, connected via a 10GbE crossbar switch. All test systems are equipped with an Intel 82598EB dual-port 10GbE NIC, 8GB RAM, and two quad-core 2.66 GHz Intel Xeon X5355 CPUs. In 2006, these were high-end servers. For the new hardware, we use seven systems: six clients and one server, all directly connected via dedicated 10GbE links. The server has three dual-port Intel 82599EB 10GbE NICs, 128GB RAM and a quad-core Intel Xeon E5-2643 CPU. In 2014, these are well-equipped contemporary servers.

             

            對(duì)于舊硬件,我們使用三個(gè)系統(tǒng):兩個(gè)客戶(hù)端和一個(gè)服務(wù)器,通過(guò)10GbE交換機(jī)連接。 所有測(cè)試系統(tǒng)都配備了一個(gè)Intel 82598EB雙端口10GbE NIC,8GB RAM和兩個(gè)四核2.66 GHz Intel Xeon X5355 CPU。 2006年,這些都是高端服務(wù)器。 對(duì)于新硬件,我們使用七個(gè)系統(tǒng); 六個(gè)客戶(hù)端和一個(gè)服務(wù)器,都通過(guò)專(zhuān)用的10GbE鏈路直接連接。 該服務(wù)器有三個(gè)雙端口Intel 82599EB 10GbE NIC,128GB RAM和四核Intel Xeon E5-2643 CPU。 在2014年,這些是設(shè)備齊全的現(xiàn)代服務(wù)器。

             

The most interesting improvements between these hardware generations are in the memory subsystem. The older Xeons have a conventional architecture with a single 1,333MHz memory bus serving both CPUs. The newer machines, as with all recent Intel server processors, support Data Direct I/O (DDIO), so whether data to be sent is in the cache can have a significant impact on performance.


The most interesting improvements between these hardware generations are in the memory subsystem. The older Xeons have a conventional architecture with a single 1,333MHz memory bus serving both CPUs. The newer machines, like all recent Intel server processors, support Data Direct I/O (DDIO), so whether the data to be sent is already in the cache can have a significant impact on performance.

             

            Our hypothesis is that Sandstorm will be significantly faster than nginx on both platforms; however, the reasons for this may differ. Experience [18] has shown that the older systems often bottleneck on memory latency, and as Sandstorm is not CPU-intensive, we would expect this to be the case. A zero-copy stack should thus be a big win. In addition, as cores contend for memory, we would expect that adding more cores does not help greatly.

             

Our hypothesis is that Sandstorm will be significantly faster than nginx on both platforms, though for different reasons. Experience [18] has shown that older systems often bottleneck on memory latency, and since Sandstorm is not CPU-intensive we expect that to be the case here, so a zero-copy stack should be a big win. In addition, as cores contend for memory, we would expect adding more cores not to help much.

             

            On the other hand, with DDIO, the new systems are less likely to bottleneck on memory. The concern, however, would be that DDIO could thrash at least part of the CPU cache. On these systems, we expect that adding more cores would help performance, but that in doing so, we may experience scalability bottlenecks such as lock contention in conventional stacks. Sandstorm’s lock-free stack can simply be replicated onto multiple 10GbE NICs, with one core per two NICs to scale performance. In addition, as load increases, the number of packets to be sent or received per system call will increase due to application-level batching. Thus, under heavy load, we would hope that the number of system calls per second to still be acceptable despite shifting almost all network-stack processing to userspace.

             

On the other hand, with DDIO the new systems are less likely to bottleneck on memory. The concern instead is that DDIO could thrash at least part of the CPU cache. On these systems we expect adding more cores to help performance, but in doing so we may hit scalability bottlenecks such as lock contention in conventional stacks. Sandstorm's lock-free stack can simply be replicated across multiple 10GbE NICs, with one core per two NICs, to scale performance. Furthermore, as load increases, the number of packets sent or received per system call grows due to application-level batching, so under heavy load we would hope the system-call rate stays acceptable even though almost all network-stack processing has moved to userspace.

             

            The question, of course, is how well do these design choices play out in practice?

The question, of course, is how well these design choices play out in practice.

             

            3.1 Experiment Design: Sandstorm

We evaluated the performance of Sandstorm through a set of experiments and compare our results against the nginx web server running on both FreeBSD and Linux. Nginx is a high-performance, low-footprint web server that follows the non-blocking, event-driven model: it relies on OS primitives such as kqueue() for readiness event notifications, it uses sendfile() to send HTTP payload directly from the kernel, and it asynchronously processes requests.

             

We evaluate Sandstorm's performance through a set of experiments and compare the results against the nginx web server running on both FreeBSD and Linux. nginx is a high-performance, low-footprint web server that follows the non-blocking, event-driven model: it relies on OS primitives such as kqueue() for readiness-event notification, uses sendfile() to send the HTTP payload directly from the kernel, and processes requests asynchronously.

             

Contemporary web pages are immensely content-rich, but they mainly consist of smaller web objects such as images and scripts. The distribution of requested object sizes for the Yahoo! CDN reveals that 90% of the content is smaller than 25KB [11]. The conventional network stack and web-server application perform well when delivering large files by utilizing OS primitives and NIC hardware features. Conversely, multiple simultaneous short-lived HTTP connections are considered a heavy workload that stresses the kernel-userspace interface and reveals performance bottlenecks: even with sendfile() to send the payload, the size of the transmitted data is not quite enough to compensate for the system cost.

             

Contemporary web pages are immensely content-rich, but they consist mostly of smaller web objects such as images and scripts. The distribution of requested object sizes for the Yahoo! CDN shows that 90% of the content is smaller than 25KB [11]. The conventional network stack and web server perform well when delivering large files, exploiting OS primitives and NIC hardware features. Conversely, many simultaneous short-lived HTTP connections are a heavy workload that stresses the kernel-userspace interface and exposes performance bottlenecks: even with sendfile(), the amount of data transmitted per request is not quite enough to compensate for the system cost.

             

For all the benchmarks, we configured nginx to serve content from a RAM disk to eliminate disk-related I/O bottlenecks. Similarly, Sandstorm preloads the data to be sent and performs its pre-segmentation phase before the experiments begin. We use weighttp [9] to generate load with multiple concurrent clients. Each client generates a series of HTTP requests, with a new connection being initiated immediately after the previous one terminates. For each experiment we measure throughput, and we vary the size of the file served, exploring possible tradeoffs between throughput and system load. Finally, we run experiments with a realistic workload by using a trace of files with sizes that follow the distribution of requested HTTP objects of the Yahoo! CDN.

             

            對(duì)于所有的基準(zhǔn)測(cè)試,我們配置了nginx來(lái)從RAM磁盤(pán)提供內(nèi)容,以消除磁盤(pán)相關(guān)的I / O瓶頸。 類(lèi)似地,Sandstorm預(yù)加載要發(fā)送的數(shù)據(jù),并在實(shí)驗(yàn)開(kāi)始之前執(zhí)行其預(yù)分割階段。 我們使用weighttp [9]來(lái)生成多個(gè)并發(fā)客戶(hù)端的負(fù)載。 每個(gè)客戶(hù)端生成一系列HTTP請(qǐng)求,在前一個(gè)終止后立即啟動(dòng)新的連接。 對(duì)于每個(gè)實(shí)驗(yàn),我們測(cè)量吞吐量,并且我們改變所服務(wù)的文件的大小,探索吞吐量和系統(tǒng)負(fù)載之間可能的折衷。 最后,我們使用跟蹤文件的實(shí)際工作量進(jìn)行實(shí)驗(yàn),這些文件的大小遵循Yahoo! CDN所請(qǐng)求的HTTP對(duì)象的分布。

             

            3.2 Sandstorm Results

First, we wish to explore how file size affects performance. Sandstorm is designed with small files in mind, and batching to reduce overheads, whereas the conventional sendfile() ought to be better for larger files.

             

First, we want to understand how file size affects performance. Sandstorm is designed with small files in mind and uses batching to reduce overheads, whereas the conventional sendfile() ought to do better for larger files.

             

            Figure 4 shows performance as a function of content size, comparing pre-copy Sandstorm and nginx running on both FreeBSD and Linux. With a single 10GbE NIC (Fig. 4a and 4d), Sandstorm outperforms nginx for smaller files by ~23–240%. For larger files, all three configurations saturate the link. Both conventional stacks are more CPU-hungry for the whole range of file sizes tested, despite potential advantages such as TSO on bulk transfers.

             

Figure 4 shows performance as a function of content size, comparing pre-copy Sandstorm with nginx running on both FreeBSD and Linux. With a single 10GbE NIC (Figures 4a and 4d), Sandstorm outperforms nginx by roughly 23–240% for smaller files; for larger files all three configurations saturate the link. Both conventional stacks use more CPU across the whole range of file sizes tested, despite potential advantages such as TSO for bulk transfers.

             

            To scale to higher bandwidths, we added more 10GbE NICs and client machines. Figure 4b shows aggregate throughput with four 10GbE NICs. Sandstorm saturates all four NICs using just two CPU cores, whereas neither Linux nor FreeBSD can saturate the NICs with files smaller than 128KB, even when using four CPU cores.

             

            As we add yet more NICs, shown in Figure 4c, the difference in performance grows across a wider range of file sizes. With 6×10GbE NICs, Sandstorm gives between 10% and 10× more throughput than FreeBSD for file sizes in the range of 4–256KB. Linux fares worse, experiencing a performance drop (see Figure 4c) compared to FreeBSD with smaller file sizes and 5–6 NICs. Low CPU utilization is normally good, but here (Figures 4f, 5b) idle time is undesirable since the NICs are not yet saturated. We have not identified any single obvious cause for this degradation. Packet traces show the delay occurs between the connection being accepted and the response being sent. No single kernel lock is held especially long, and although locking is not negligible, it does not dominate either. The system suffers one soft page fault for every two connections on average, but no hard faults, so data is already in the disk buffer cache, and TCB recycling is enabled. This is an example of how hard it can be to find performance problems in conventional stacks. Interestingly, this was an application-specific behavior triggered only on Linux: in benchmarks we conducted with other web servers (e.g., lighttpd [3], OpenLiteSpeed [7]) we did not experience a similar performance collapse on Linux with more than four NICs. We have chosen, however, to present the nginx datasets as it offered the greatest overall scalability on both operating systems.

             

            It is clear that Sandstorm dramatically improves network performance when it serves small web objects, but somewhat surprisingly, it performs better for larger files too. For completeness, we evaluate Sandstorm using a realistic workload: following the distribution of requested HTTP object sizes on the Yahoo! CDN [11], we generated a trace of 1000 files ranging from a few KB up to ~20MB, which were served from both Sandstorm and nginx. On the clients, we modified weighttp to benchmark the server by concurrently requesting files in a random order. Figures 5a and 5b show the achieved network throughput and the CPU utilization of the server as a function of the number of network adapters. The network performance improvement is more than 2× while CPU utilization is reduced.

             

            Finally, we evaluated whether Sandstorm handles high packet loss correctly. With 80 simultaneous clients and 1% packet loss, throughput plummets, as expected. FreeBSD achieves approximately 640Mb/s and Sandstorm roughly 25% less. This is not fundamental, but due to FreeBSD's more fine-grained retransmit timer and its use of NewReno congestion control rather than Reno, both of which could also be implemented in Sandstorm. Neither network nor server is stressed in this experiment; had a genuinely congested link been causing the loss, both stacks would have filled it.

             

            Throughout, we have invested considerable effort in profiling and optimizing the conventional network stacks, both to understand their design choices and bottlenecks, and to provide the fairest possible comparison. We applied conventional performance tuning to Linux and FreeBSD, such as increasing hash-table sizes, manually tuning CPU work placement for multiqueue NICs, and adjusting NIC parameters such as interrupt mitigation. In collaboration with Netflix, we also developed a number of TCP and virtual-memory subsystem performance optimizations for FreeBSD, reducing lock contention under high packet loads. One important optimization relates to sendfile(): contention within the VM subsystem occurred while TCP-layer socket-buffer locks were held, triggering a cascade across the system as a whole. These changes have been upstreamed to FreeBSD for inclusion in a future release.

             

            To copy or not to copy

            The pre-copy variant of Sandstorm maintains more than one copy of each segment in memory so that it can send the same segment to multiple clients simultaneously. This requires more memory than nginx serving files from RAM. The memcpy variant only enqueues copies, requiring a single long-lived version of each packet, and uses a similar amount of memory to nginx. How does this memcpy affect performance? Figure 6 explores network throughput, CPU utilization, and system-call rate for two- and six-NIC configurations; the sketch below illustrates the difference between the two transmit paths.
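            A minimal sketch of the two transmit paths being compared (illustrative only; the structure and function names are assumptions, not Sandstorm's implementation): the pre-copy path keeps several ready-made replicas of each packet so one can be handed to the NIC per client, while the memcpy path keeps a single long-lived template and copies it into a transmit buffer at send time.

            /* Illustrative sketch of the two Sandstorm variants discussed above.
             * "struct pkt" stands for a fully formed Ethernet/IP/TCP frame. */
            #include <stdint.h>
            #include <string.h>

            struct pkt { uint8_t data[1514]; uint16_t len; };

            /* Pre-copy: several replicas per segment were built ahead of time; pick a
             * free one, patch the per-connection fields, and enqueue it untouched. */
            static struct pkt *tx_precopy(struct pkt replicas[], int nreplicas, int *next)
            {
                struct pkt *p = &replicas[*next];
                *next = (*next + 1) % nreplicas;
                /* ...patch dst MAC/IP/port, seq/ack, checksum deltas here... */
                return p;                          /* handed to the NIC ring as-is */
            }

            /* memcpy variant: one long-lived template per segment; copy it into the
             * transmit buffer owned by the NIC ring, then patch the same fields. */
            static void tx_memcpy(const struct pkt *tmpl, uint8_t *txbuf)
            {
                memcpy(txbuf, tmpl->data, tmpl->len);   /* the extra copy measured below */
                /* ...patch per-connection fields in txbuf... */
            }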

             

            With six NICs, the additional memcpy() marginally reduces performance (Figure 6b) while exhibiting slightly higher CPU load (Figure 6d). In this experiment, Sandstorm only uses three cores to simplify the comparison, so around 75% utilization saturates those cores. The memcpy variant saturates the CPU for files smaller than 32KB, whereas the pre-copy variant does not. Nginx, using sendfile() and all four cores, only catches up at file sizes of 512KB and above, and even then exhibits higher CPU load.

             

            As file size decreases, the expense of SYN/FIN and HTTP-request processing becomes measurable for both variants, but the pre-copy version has more headroom and so is affected less. It is interesting to observe the effects of batching under overload with the memcpy stack in Figure 6f. With large file sizes, pre-copy and memcpy make the same number of system calls per second. With small files, however, the memcpy stack makes substantially fewer system calls per second. This illustrates the efficacy of batching: memcpy has saturated the CPU, and consequently no longer polls the RX queue as often. As the batch size increases, the per-packet system-call cost decreases, helping the server weather the storm. The pre-copy variant is not stressed here and continues to poll frequently, but would behave the same way under overload. In the end, the cost of the additional memcpy is measurable, but the variant still performs quite well.

             

            Results on contemporary hardware are significantly different from those obtained on older, pre-DDIO hardware. Figure 7 shows the results obtained on our 2006-era servers. On the older machines, Sandstorm outperforms nginx by a factor of three, but the memcpy variant suffers a 30% decrease in throughput compared to pre-copy Sandstorm as a result of adding a single memcpy to the code. It is clear that on these older systems, memory bandwidth is the main performance bottleneck.

             

            With DDIO, memory bandwidth is not such a limiting factor. Figure 9 in Section 3.5 shows the corresponding memory read throughput, as measured using CPU performance counters, for the network-throughput graphs in Figure 6b. With small file sizes, the pre-copy variant of Sandstorm appears to do more work: the L3 cache cannot hold all of the data, so there are many more L3 misses than with memcpy. Memory-read throughput for both pre-copy and nginx is closely correlated with their network throughput, indicating that DDIO is not helping on transmit: DMA comes from memory rather than the cache. The memcpy variant, however, has higher network throughput than memory throughput, indicating that DDIO is transmitting from the cache. Unfortunately, this is offset by much higher memory write throughput. Still, this only causes a small reduction in service throughput. Larger files no longer fit in the L3 cache, even with memcpy, and memory-read throughput starts to rise with files above 64KB. Despite this, performance remains high and CPU load decreases, indicating these systems are not limited by memory bandwidth for this workload.
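            As a rough illustration of counter-based measurement (a sketch, not the paper's methodology: the paper reads DRAM-facing uncore counters, whereas this uses the generic Linux perf_event interface with LLC misses as a proxy for memory traffic):

            /* Hedged sketch: counting last-level-cache misses around a code region with
             * the Linux perf_event_open() syscall. */
            #include <linux/perf_event.h>
            #include <sys/ioctl.h>
            #include <sys/syscall.h>
            #include <unistd.h>
            #include <string.h>
            #include <stdint.h>

            static int open_llc_miss_counter(void)
            {
                struct perf_event_attr attr;
                memset(&attr, 0, sizeof(attr));
                attr.type = PERF_TYPE_HARDWARE;
                attr.size = sizeof(attr);
                attr.config = PERF_COUNT_HW_CACHE_MISSES;   /* LLC misses */
                attr.disabled = 1;
                attr.exclude_kernel = 0;                    /* count kernel work too */
                return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
            }

            static uint64_t measure(void (*workload)(void))
            {
                int fd = open_llc_miss_counter();
                uint64_t count = 0;
                ioctl(fd, PERF_EVENT_IOC_RESET, 0);
                ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
                workload();                                 /* e.g. one benchmark run */
                ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
                if (read(fd, &count, sizeof(count)) != sizeof(count))
                    count = 0;
                close(fd);
                return count;
            }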

             

            3.3 Experiment Design: Namestorm

            We use the same client and server systems to evaluate Namestorm as we used for Sandstorm. Namestorm is expected to be significantly more CPU-intensive than Sandstorm, mostly due to fundamental DNS protocol properties: a high packet rate and small packets. Based on this observation, we changed the network topology of our experiment: we use only one NIC on the server, connected to the client systems via a 10GbE cut-through switch. In order to balance the load on the server across all available CPU cores, we use four dedicated NIC queues and four Namestorm instances, as sketched below.

            We ran Nominum's dnsperf [2] DNS profiling software on the clients. We created zone files of varying sizes, loaded them onto the DNS servers, and configured dnsperf to query the zone repeatedly.
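            A hedged sketch of this layout (the interface name ix0, the worker body, and the core mapping are assumptions, not the paper's code): netmap's per-ring names such as "netmap:ix0-0" let each worker attach to a single hardware queue, and each worker thread is pinned to its own core.

            /* Hedged sketch: one worker per hardware queue, pinned to its own core. */
            #define _GNU_SOURCE
            #include <pthread.h>
            #include <sched.h>
            #include <stdio.h>
            #define NETMAP_WITH_LIBS
            #include <net/netmap_user.h>

            static void *worker(void *arg)
            {
                long q = (long)arg;
                char ifname[32];
                snprintf(ifname, sizeof(ifname), "netmap:ix0-%ld", q); /* one ring pair */
                struct nm_desc *d = nm_open(ifname, NULL, 0, NULL);
                if (d == NULL)
                    return NULL;
                /* ...poll d->fd and run the DNS fast path to completion per batch... */
                nm_close(d);
                return NULL;
            }

            int main(void)
            {
                pthread_t t[4];
                for (long q = 0; q < 4; q++) {
                    pthread_create(&t[q], NULL, worker, (void *)q);
                    cpu_set_t set; CPU_ZERO(&set); CPU_SET((int)q, &set);
                    pthread_setaffinity_np(t[q], sizeof(set), &set);  /* pin to core q */
                }
                for (int q = 0; q < 4; q++)
                    pthread_join(t[q], NULL);
                return 0;
            }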

             

            3.4 Namestorm Results

            Figure 8a shows the performance of Namestorm and NSD running on Linux and FreeBSD when using a single 10GbE NIC. NSD's results are similar on FreeBSD and Linux. Neither operating system can saturate the 10GbE NIC, however, and both show some performance drop as the zone file grows. On Linux, NSD's performance drops by ~14% (from ~689,000 to ~590,000 queries/sec) as the zone file grows from 1 to 10,000 entries, and on FreeBSD it drops by ~20% (from ~720,000 to ~574,000 Qps). For these benchmarks, NSD saturates all CPU cores on both systems.

             

            For Namestorm, we utilized two datasets: one where the hash keys are in wire format (w/o compr.), and one where they are in FQDN format (compr.). The latter requires copying the search term before hashing it in order to handle possibly compressed requests.
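            As a hedged illustration of the two key formats (this is not Namestorm's code; fnv1a() and the helper names are stand-ins): a wire-format QNAME is a sequence of length-prefixed labels that can be hashed in place, whereas a request that may contain DNS compression pointers (bytes whose top two bits are 11) first has to be flattened into a contiguous buffer, which is the extra copy whose cost shows up in the results below.

            #include <stdint.h>
            #include <stddef.h>
            #include <string.h>

            static uint32_t fnv1a(const uint8_t *p, size_t n)   /* any hash would do */
            {
                uint32_t h = 2166136261u;
                while (n--) { h ^= *p++; h *= 16777619u; }
                return h;
            }

            /* Wire-format key: hash the QNAME bytes in place, no copy needed. */
            static uint32_t hash_wire(const uint8_t *qname, size_t len)
            {
                return fnv1a(qname, len);
            }

            /* Compression-tolerant key: labels may be reached through 0xC0 pointers into
             * the message, so the name is first flattened into a contiguous buffer
             * (the extra copy discussed above), then hashed. */
            static uint32_t hash_flattened(const uint8_t *msg, size_t msglen, size_t qoff,
                                           uint8_t *buf, size_t buflen)
            {
                size_t o = qoff, n = 0;
                int hops = 0;
                while (o < msglen && msg[o] != 0 && n + 1 < buflen) {
                    if ((msg[o] & 0xC0) == 0xC0) {           /* compression pointer */
                        if (o + 1 >= msglen || ++hops > 16)
                            break;
                        o = (size_t)(((msg[o] & 0x3F) << 8) | msg[o + 1]);
                        continue;
                    }
                    size_t lab = (size_t)msg[o] + 1;         /* length byte plus label */
                    if (o + lab > msglen || n + lab >= buflen)
                        break;
                    memcpy(&buf[n], &msg[o], lab);           /* the extra copy */
                    n += lab;
                    o += lab;
                }
                buf[n++] = 0;                                /* terminating root label */
                return fnv1a(buf, n);
            }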

             

            With wire-format hashing, Namestorm memcpy performance is ~11–13× better, depending on the zone size, than the best results from NSD on either Linux or FreeBSD. Namestorm's throughput drops by ~30% as the zone file grows from 1 to 10,000 entries (from ~9,310,000 to ~6,410,000 Qps). The reason for this decrease is mainly the LLC miss rate, which more than doubles. Dnsperf does not report throughput in Gbps, but given the typical DNS response size for our zones we can calculate ~8.4Gbps and ~5.9Gbps for the smallest and largest zone respectively.
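            The conversion is simple arithmetic; the per-response size used here is back-calculated from the stated figures rather than given in the text: 9,310,000 responses/s × ~113 bytes/response × 8 bits/byte ≈ 8.4 Gb/s for the smallest zone, and 6,410,000 × ~115 B × 8 ≈ 5.9 Gb/s for the largest, i.e. a small DNS answer plus UDP/IP/Ethernet framing of a bit over 100 bytes on the wire.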

             

            With FQDN-format hashing, Namestorm memcpy performance is worse than with wire-format hashing, but is still ~9–13× better than NSD. The extra processing for FQDN-format hashing costs ~10–20% in throughput, depending on the zone size.

            Finally, in Figure 8a we observe a noticeable performance overhead with the pre-copy stack, which we explore in Section 3.5.

             

            3.4.1 Effectiveness of batching

            One of the biggest performance benefits for Namestorm is that netmap provides an API that facilitates batching across the system-call interface. To explore the effects of batching, we configured a single Namestorm instance and one hardware queue, and reran our benchmark with varying batch sizes. Figure 8b illustrates the results: a more than 2× performance gain when growing the batch size from 1 packet (no batching) to 32 packets. Interestingly, the performance of a single-core Namestorm without any batching remains more than 2× better than NSD.
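            The batching that drives this result can be pictured with the following hedged sketch (illustrative, not libnmio or Namestorm code): a single poll() wakes the worker with a whole ring's worth of received slots, which are processed to completion before head/cur are advanced in one step.

            /* Hedged sketch of batching over the netmap system-call boundary. */
            #define NETMAP_WITH_LIBS
            #include <net/netmap_user.h>
            #include <poll.h>

            static void rx_loop(struct nm_desc *d)
            {
                struct pollfd pfd = { .fd = d->fd, .events = POLLIN };
                for (;;) {
                    poll(&pfd, 1, -1);                         /* one syscall per batch */
                    for (int r = d->first_rx_ring; r <= d->last_rx_ring; r++) {
                        struct netmap_ring *ring = NETMAP_RXRING(d->nifp, r);
                        unsigned n = nm_ring_space(ring);      /* packets in this batch */
                        unsigned cur = ring->cur;
                        for (unsigned i = 0; i < n; i++) {
                            struct netmap_slot *slot = &ring->slot[cur];
                            char *buf = NETMAP_BUF(ring, slot->buf_idx);
                            /* ...parse the DNS query in buf[0..slot->len) and build the
                             *    response into a TX slot; cost amortized over the batch... */
                            (void)buf;
                            cur = nm_ring_next(ring, cur);
                        }
                        ring->head = ring->cur = cur;          /* return slots in bulk */
                    }
                }
            }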

             

            At a minimum, NSD has to make one system call to receive each request and one to send a response. Linux recently added the recvmmsg() and sendmmsg() system calls to receive and send multiple UDP messages with a single call. These may go some way toward improving NSD's performance relative to Namestorm. They are, however, UDP-specific, and sendmmsg() requires the application to manage its own transmit-queue batching. When we implemented Namestorm, we already had libnmio, which abstracts and handles all the batching interactions with netmap, so there is no application-specific batching code in Namestorm.
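            For reference, a minimal hedged sketch of the Linux batched-UDP receive path mentioned above (the buffer sizes and the BATCH constant are arbitrary choices):

            /* One recvmmsg() call drains up to BATCH datagrams, amortizing the syscall
             * cost much as netmap batching does; UDP-only, and TX batching with
             * sendmmsg() still has to be managed by the application. */
            #define _GNU_SOURCE
            #include <sys/socket.h>
            #include <netinet/in.h>
            #include <string.h>

            #define BATCH 32

            static int drain_udp(int fd)
            {
                struct mmsghdr msgs[BATCH];
                struct iovec iovs[BATCH];
                static char bufs[BATCH][1500];

                memset(msgs, 0, sizeof(msgs));
                for (int i = 0; i < BATCH; i++) {
                    iovs[i].iov_base = bufs[i];
                    iovs[i].iov_len = sizeof(bufs[i]);
                    msgs[i].msg_hdr.msg_iov = &iovs[i];
                    msgs[i].msg_hdr.msg_iovlen = 1;
                }
                int n = recvmmsg(fd, msgs, BATCH, 0, NULL);  /* up to 32 datagrams, 1 syscall */
                for (int i = 0; i < n; i++) {
                    /* msgs[i].msg_len bytes of datagram i are in bufs[i] */
                }
                return n;
            }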

             

            3.5 DDIO

            With DDIO, incoming packets are DMAed directly to the CPU's L3 cache, and outgoing packets are DMAed directly from the L3 cache, avoiding round trips from the CPU to the memory subsystem. For lightly loaded servers in which the working set is smaller than the L3 cache, or in which data is accessed with temporal locality by the processor and DMA engine (e.g., touched and immediately sent, or received and immediately accessed), DDIO can dramatically reduce latency by avoiding memory traffic. Thus DDIO is ideal for RPC-like mechanisms in which processing latency is low and data is used immediately before or after DMA. On heavily loaded systems, it is far from clear whether DDIO is a win. For applications with a larger cache footprint, or in which communication occurs at some delay from CPU generation or use of packet data, DDIO can unnecessarily pollute the cache and trigger additional memory traffic, damaging performance.

             

            Intuitively, one might reasonably assume that Sandstorm's pre-copy mode would interact best with DDIO: as with sendfile()-based designs, only packet headers enter the L1/L2 caches, with payload content rarely touched by the CPU. Figure 9 therefore illustrates a surprising effect when operating on small file sizes: overall memory throughput from the CPU package, as measured using performance counters on the DRAM-facing interface of the LLC, shows significantly less traffic for the memcpy implementation than for the pre-copy one, which sustains a constant rate roughly equal to its network throughput.

             

            We believe this occurs because DDIO is, by policy, limited from occupying most of the LLC. In the pre-copy case, DDIO is responsible for pulling untouched data into the cache; as the file data cannot fit in this subset of the cache, DMA access thrashes the cache and all network transmission is done from DRAM. In the memcpy case, the CPU loads data into the cache, allowing more complete utilization of the cache for network data. However, as the DRAM memory interface is not a bottleneck in the system as configured, the net result of the additional memcpy, despite better cache utilization, is reduced performance. As file sizes increase, the overall footprint of memory copying rapidly exceeds the LLC size and then network throughput, at which point pre-copy becomes more efficient. Likewise, one might mistakenly conclude simply from inspection of CPU memory counters that nginx is somehow benefiting from the same effect: in fact, nginx is experiencing CPU saturation, and it is not until the file size reaches 512KB that sufficient CPU is available to converge with pre-copy's saturation of the network link.

             

            By contrast, Namestorm sees improved performance with the memcpy implementation: the cache lines holding packet data must be dirtied anyway due to protocol requirements, so performing the memcpy adds little CPU overhead yet allows much more efficient use of the cache by DDIO.

            (Ps: this example is quite striking. Although the memcpy variant has to copy the data on every send, it is more flexible: the payload ends up in the cache, and network throughput roughly tracks memory throughput. One would normally assume the pre-built packets are faster because they need no copy, but since they are not resident in the cache every transmission has to come from DRAM, and the larger the file the more pronounced this becomes.)

             

            4. DISCUSSION

            We developed Sandstorm and Namestorm to explore the hypothesis that fundamental architectural change might be required to properly exploit rapidly growing CPU core counts and NIC capacity. Comparisons with Linux and FreeBSD appear to confirm this conclusion far more dramatically than we expected: while there are small-factor differences between the Linux and FreeBSD performance curves, we observe that their shapes are fundamentally the same. We believe this reflects near-identical underlying architectural decisions stemming from common intellectual ancestry (the BSD network stack and sockets API) and largely incremental changes from that original design.

             

            Sandstorm and Namestorm adopt fundamentally different architectural approaches, emphasizing transparent memory flow within applications (not across expensive protection-domain boundaries), process-to-completion, heavy amortization, batching, and application-specific customizations that seem antithetical to general-purpose stack design. The results are dramatic, accomplishing near-linear speedup with increases in core and NIC capacity – completely different curves, possible only with a completely different design.

             

            4.1 Current network-stack specialization

            Over the years there have been many attempts to add specialized features to general-purpose stacks such as FreeBSD and Linux. Examples include sendfile(), primarily for web servers; recvmmsg(), mostly aimed at DNS servers; and assorted socket options for telnet. In some cases, entire applications have been moved into the kernel [13, 24] because it was too difficult to achieve performance through the existing APIs. The problem with these enhancements is that each serves a narrow role, yet must still fit within a general OS architecture, and so each is constrained in what it can do. Special-purpose userspace stacks do not suffer from these constraints, and free the programmer to solve a narrow problem in an application-specific manner while still retaining the other advantages of a general-purpose OS stack.

             

            4.2 The generality of specialization

            Our approach tightly integrates the network stack and application within a single process. This model, together with optimizations aimed at cache locality or pre-packetization, naturally fits a reasonably wide range of performance-critical, event-driven applications such as web servers, key-value stores, RPC-based services, and name servers. Even rate-adaptive video streaming may benefit, as developments such as MPEG-DASH and Apple's HLS have moved intelligence to the client, leaving servers as simple static-content farms.

             

            Not all network services are a natural fit. For example, CGI-based web services and general-purpose databases have inherently different properties and are generally CPU- or filesystem-intensive, de-emphasizing networking bottlenecks. In our design, the control loop and transport-protocol correctness depend on the timely execution of application-layer functions; blocking in the application cannot be tolerated. A thread-based approach might be more suitable for such cases. Isolating the network stack and application into different threads still yields benefits: OS-bypass networking costs less, and the saved CPU cycles are available to the application. However, such an approach requires synchronization (a minimal single-producer/single-consumer handoff is sketched below), and so increases complexity and offers less room for cross-layer optimization.
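            A hedged sketch of the kind of synchronization such a split would need (illustrative only, not a design from the paper): a lock-free single-producer/single-consumer ring through which the stack thread hands packets to the application thread.

            #include <stdatomic.h>
            #include <stdbool.h>
            #include <stddef.h>

            #define QSIZE 1024                      /* power of two */

            struct spsc {
                void *slot[QSIZE];
                _Atomic size_t head;                /* written by producer (stack thread) */
                _Atomic size_t tail;                /* written by consumer (app thread) */
            };

            static bool spsc_push(struct spsc *q, void *pkt)
            {
                size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
                size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
                if (h - t == QSIZE)
                    return false;                   /* full: stack must back off or drop */
                q->slot[h & (QSIZE - 1)] = pkt;
                atomic_store_explicit(&q->head, h + 1, memory_order_release);
                return true;
            }

            static void *spsc_pop(struct spsc *q)
            {
                size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
                size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
                if (t == h)
                    return NULL;                    /* empty */
                void *pkt = q->slot[t & (QSIZE - 1)];
                atomic_store_explicit(&q->tail, t + 1, memory_order_release);
                return pkt;
            }

            Even this minimal handoff reintroduces queueing between the stack and the application, which is exactly the coupling the synchronous, run-to-completion design avoids.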

             

            We are arguing neither for the exclusive use of specialized stacks over generalized ones, nor for the deployment of general-purpose network stacks in userspace. Instead, we propose selectively identifying key scale-out applications where informed but aggressive exploitation of domain-specific knowledge and micro-architectural properties will allow cross-layer optimizations. In such cases, the benefits outweigh the costs of developing and maintaining a specialized stack.

             

            4.3 Tracing, profiling, and measurement

            One of our greatest challenges in this work was root-cause analysis of performance issues in contemporary hardware-software implementations. The amount of time spent analyzing network-stack behavior (often unsuccessfully) dwarfed the time required to implement Sandstorm and Namestorm.

             

            An enormous variety of tools exist – OS-specific PMC tools, lock-contention measurement tools, tcpdump, Intel vTune, DTrace, and a plethora of application-specific tracing features – but they suffer significant limitations. Perhaps most problematic is that the tools are not holistic: each captures only a fragment of the analysis space, with different configuration models, file formats, and feature sets.

             

            Worse, as we attempted inter-OS analysis (e.g., comparing Linux and FreeBSD lock profiling), we discovered that tools often measure and report results differently, preventing sensible comparison. For example, we found that Linux takes packet timestamps at different points than FreeBSD, that FreeBSD uses different clocks for DTrace and BPF, and that while FreeBSD exports both per-process and per-core PMC stats, Linux supports only the former. Where supported, DTrace attempts to bridge these gaps by unifying configuration, trace formats, and event namespaces [15]. However, DTrace also suffers high overhead, causing bespoke tools to persist, and is not integrated with packet-level tools, preventing side-by-side comparison of packet and execution traces. We feel certain that improvement in the state of the art would benefit not only research, but also the practice of network-stack implementation.

             

            Our special-purpose stacks are synchronous: after netmap hands packets off to userspace, the control flow is generally linear, and we process packets to completion. This, combined with a lock-free design, means that it is very simple to reason about where time goes when handling a request flow. General-purpose stacks cannot, by their nature, be synchronous. They must be asynchronous to balance all the conflicting demands of hardware and applications, managing queues without application knowledge, allocating processing to threads in order to handle those queues, and ensuring safety via locking. To reason about performance in such systems, we often resort to statistical sampling because it is not possible to follow the control flow directly. Of course, not all network applications are well suited to synchronous models; we argue, however, that imposing the asynchrony of a general-purpose stack on all applications can unnecessarily complicate debugging, performance analysis, and performance optimization.


            5. RELATED WORK

            Web-server and network-stack performance optimization is not a new research area. Past studies have produced many optimization techniques as well as completely different design choices. These designs range from userspace and kernel-based implementations to specialized operating systems.

             

            With the conventional approaches, userspace applications [1, 6] utilize general-purpose network stacks, relying heavily on operating-system primitives for data movement and event notification [26]. Several proposals [23, 12, 30] focus on reducing the overhead of such primitives (e.g., KQueue, epoll, sendfile()). IO-Lite [27] unifies data management between OS subsystems and userspace applications by providing page-based mechanisms to safely and concurrently share data. Fbufs [17] utilize techniques such as page remapping and shared memory to provide high-performance cross-domain transfers and buffer management. Pesterev and Wickizer [28, 14] have proposed efficient techniques to improve commodity-stack performance by controlling connection locality and taking advantage of modern multicore systems. Similarly, MegaPipe [21] shows significant performance gains by introducing a bidirectional, per-core pipe to facilitate data exchange and event notification between the kernel and userspace applications.

             

            A significant number of research proposals follow a substantially different approach: they propose partial or full implementation of network applications in the kernel, aiming to eliminate the cost of communication between kernel and userspace. Although this design decision improves performance significantly, it comes at the cost of reduced security and reliability. A representative example of this category is kHTTPd [13], a kernel-based web server that uses the socket interface. Similar to kHTTPd, TUX [24] is another noteworthy example of an in-kernel network application. TUX achieves greater performance by eliminating the socket layer and pinning the static content it serves in memory. We have adopted several of these ideas in our prototype, although our approach is not kernel-based.

             

            Microkernel designs such as Mach [10] have long appealed to OS designers, pushing core services (such as network stacks) into user processes so that they can be more easily developed, customized, and multiply-instantiated. In this direction, Thekkath et al. [32] prototyped capability-enabled, library-synthesized userspace network stacks implemented on Mach. The Cheetah web server is built on top of an Exokernel [19] library operating system that provides a filesystem and an optimized TCP/IP implementation. Lightweight libOSes enable application developers to exploit domain-specific knowledge and improve performance. Unikernel designs such as MirageOS [25] likewise blend operating-system and application components at compile time, trimming unneeded software elements to achieve extremely small memory footprints – although through static code analysis rather than application-specific specialization.

             

            6. CONCLUSION

            In this paper, we have demonstrated that specialized userspace stacks, built on top of the netmap framework, can vastly improve the performance of scale-out applications. These performance gains sacrifice generality by adopting design principles at odds with contemporary stack design: application-specific cross-layer cost amortizations, synchronous and buffering-free protocol implementations, and an extreme focus on the interactions between processors, caches, and NICs. This approach reflects the widespread adoption of scale-out computing in data centers, which de-emphasizes multifunction hosts in favor of increased large-scale specialization. Our performance results are compelling: a 2–10× improvement for web service, and a roughly 9× improvement for DNS service. Further, these stacks have proven easier to develop and tune than conventional stacks, and their performance improvements are portable across multiple generations of hardware.

             

            General-purpose operating system stacks have been around for a long time and have demonstrated the ability to transcend multiple generations of hardware. We believe the same should be true of special-purpose stacks, but that tuning for particular hardware should be easier. We examined performance on servers manufactured seven years apart, and demonstrated that although the performance bottlenecks now lie in different places, the same design delivered significant benefits on both platforms.

            posted on 2017-01-22 18:01 by clcl
