I have been studying DPDK recently; this is a SIGCOMM 2014 paper, and I am recording these notes here for future reference.
PS: key terms used below:
segment: corresponds to the TCP PDU (protocol data unit), i.e. a TCP-layer packet; if the data to send is too large, TCP is responsible for splitting it into multiple segments (this concept helps with the rest of the text).
According to UNIX Network Programming, Volume 1 (page 8, note 2): a packet is the data that the IP layer hands to the link layer and that the link layer encapsulates in a frame (excluding the frame header); the IP-layer unit (excluding the IP header) is properly called a datagram, and the link-layer unit is called a frame. Here, though, "packet" is used loosely to mean any data packet, without making these distinctions.
DRAM: dynamic random-access memory, the system's main memory.
SRAM: static random-access memory, used for the CPU's caches.
Abstract
Contemporary network stacks are masterpieces of generality, supporting many edge-node and middle-node functions. Generality comes at a high performance cost: current APIs, memory models, and implementations drastically limit the effectiveness of increasingly powerful hardware. Generality has historically been required so that individual systems could perform many functions.
However, as providers have scaled services to support millions of users, they have transitioned toward thousands (or millions) of dedicated servers, each performing a few functions. We argue that the overhead of generality is now a key obstacle to effective scaling, making specialization not only viable, but necessary.
(PS: i.e., the server fleet is partitioned vertically, with each dedicated machine handling only a few functions.)
We present Sandstorm and Namestorm, web and DNS servers that utilize a clean-slate userspace network stack that exploits knowledge of application-specific workloads. Based on the netmap framework, our novel approach merges application and network-stack memory models, aggressively amortizes protocol-layer costs based on application-layer knowledge, couples tightly with the NIC event model, and exploits microarchitectural features. Simultaneously, the servers retain use of conventional programming frameworks. We compare our approach with the FreeBSD and Linux stacks using the nginx web server and NSD name server, demonstrating 2–10× and 9× improvements in web-server and DNS throughput, lower CPU usage, linear multicore scaling, and saturated NIC hardware.
1. INTRODUCTION
Conventional network stacks were designed in an era where individual systems had to perform multiple diverse functions. In the last decade, the advent of cloud computing and the ubiquity of networking has changed this model; today, large content providers serve hundreds of millions of customers. To scale their systems, they are forced to employ many thousands of servers, with each providing only a single network service. Yet most content is still served with conventional general-purpose network stacks.
These general-purpose stacks have not stood still, but today’s stacks are the result of numerous incremental updates on top of codebases that were originally developed in the early 1990s. Arguably, these network stacks have proved to be quite efficient, flexible, and reliable, and this is the reason that they still form the core of contemporary networked systems. They also provide a stable programming API, simplifying software development. But this generality comes with significant costs, and we argue that the overhead of generality is now a key obstacle to effective scaling, making specialization not only viable, but necessary.
In this paper we revisit the idea of specialized network stacks. In particular, we develop Sandstorm, a specialized userspace stack for serving static web content, and Namestorm, a specialized stack implementing a high performance DNS server. More importantly, however, our approach does not simply shift the network stack to userspace: we also promote tight integration and specialization of application and stack functionality, achieving cross-layer optimizations antithetical to current design practices.
Servers such as Sandstorm could be used for serving images such as the Facebook logo, as OCSP [20] responders for certificate revocations, or as front end caches to popular dynamic content. This is a role that conventional stacks should be good at: nginx [6] uses the sendfile() system call to hand over serving static content to the operating system. FreeBSD and Linux then implement zero-copy stacks, at least for the payload data itself, using scatter-gather to directly DMA the payload from the disk buffer cache to the NIC. They also utilize the features of smart network hardware, such as TCP Segmentation Offload (TSO) and Large Receive Offload (LRO) to further improve performance. With such optimizations, nginx does perform well, but as we will demonstrate, a specialized stack can outperform it by a large margin.
(PS: a TCP segment is the TCP-layer packet; TSO moves the job of splitting a large TCP send into MSS-sized segments onto the NIC hardware.)
Namestorm is aimed at handling extreme DNS loads, such as might be seen at the root nameservers, or when a server is under a high-rate DDoS attack. The open-source state of the art here is NSD [5], which, combined with a modern OS that minimizes data copies when sending and receiving UDP packets, performs well. Namestorm, however, can outperform it by a factor of nine.
Our userspace web server and DNS server are built upon FreeBSD's netmap [31] framework, which directly maps the NIC buffer rings to userspace. We will show that not only is it possible for a specialized stack to beat nginx, but on data-center-style networks when serving small files typical of many web pages, it can achieve three times the throughput on older hardware, and more than six times the throughput on modern hardware supporting DDIO.
The demonstrated performance improvements come from four places. First, we implement a complete zero-copy stack, not only for payload but also for all packet headers, so sending data is very efficient. Second, we allow aggressive amortization that spans traditionally stiff boundaries – e.g., application-layer code can request pre-segmentation of data intended to be sent multiple times, and extensive batching is used to mitigate system-call overhead from userspace. Third, our implementation is synchronous, clocked from received packets; this improves cache locality and minimizes the latency of sending the first packet of the response. Finally, on recent systems, Intel’s DDIO provides substantial benefits, but only if packets to be sent are already in the L3 cache and received packets are processed to completion immediately. It is hard to ensure this on conventional stacks, but a special-purpose stack can get much closer to this ideal.
(PS: "all packet headers" here means the link-layer, IP, and TCP headers, not just the payload; "pre-segmentation" means splitting data that will be sent many times into packets ahead of time.)
Of course, userspace stacks are not a novel concept. Indeed, the Cheetah web server for MIT's XOK Exokernel [19] operating system took a similar approach, and demonstrated significant performance gains over the NCSA web server in 1994. Despite this, the concept has never really taken off, and in the intervening years conventional stacks have improved immensely. Unlike XOK, our specialized userspace stacks are built on top of a conventional FreeBSD operating system. We will show that it is possible to get all the performance gains of a specialized stack without needing to rewrite all the ancillary support functions provided by a mature operating system (e.g., the filesystem). Combined with the need to scale server clusters, we believe that the time has come to re-evaluate special-purpose stacks on today's hardware.
The key contributions of our work are:
We discuss many of the issues that affect performance in conventional stacks, even though they use APIs aimed at high performance such as sendfile() and recvmmsg().
We describe the design and implementation of multiple modular, highly specialized, application-specific stacks built over a commodity operating system while avoiding these pitfalls. In contrast to prior work, we demonstrate that it is possible to utilize both conventional and specialized stacks in a single system. This allows us to deploy specialization selectively, optimizing networking while continuing to utilize generic OS components such as filesystems without disruption.
We demonstrate that specialized network stacks designed for aggressive cross-layer optimizations create opportunities for new and at times counter-intuitive hardware-sensitive optimizations. For example, we find that violating the long-held tenet of data-copy minimization can increase DMA performance for certain workloads on recent CPUs.
We present hardware-grounded performance analyses of our specialized network stacks side-by-side with highly optimized conventional network stacks. We evaluate our optimizations over multiple generations of hardware, suggesting portability despite rapid hardware evolution.
We explore the potential of a synchronous network stack blended with asynchronous application structures, in stark contrast to conventional asynchronous network stacks supporting synchronous applications. This approach optimizes cache utilization by both the CPU and DMA engines, yielding as much as 2–10× conventional stack performance.
2. SPECIAL-PURPOSE ARCHITECTURE
What is the minimum amount of work that a web server can perform to serve static content at high speed? It must implement a MAC protocol, IP, TCP (including congestion control), and HTTP.
However, their implementations do not need to conform to the conventional socket model, split between userspace and kernel, or even implement features such as dynamic TCP segmentation. For a web server that serves the same static content to huge numbers of clients (e.g., the Facebook logo or GMail JavaScript), essentially the same functions are repeated again and again. We wish to explore just how far it is possible to go to improve performance. In particular, we seek to answer the following questions:
(PS: the MAC-protocol requirement essentially amounts to Ethernet framing plus ARP.)
Conventional network stacks support zero copy for OS-maintained data – e.g., filesystem blocks in the buffer cache, but not for application-provided HTTP headers or TCP packet headers. Can we take the zero-copy concept to its logical extreme, in which received packet buffers are passed from the NIC all the way to the application, and application packets to be sent are DMAed to the NIC for transmission without even the headers being copied?
(PS: i.e., a received packet buffer can be rewritten in place into the response and handed straight back to the NIC.)
Conventional stacks make extensive use of queuing and buffering to mitigate context switches and keep CPUs and NICs busy, at the cost of substantially increased cache footprint and latency. Can we adopt a bufferless event model that reimposes synchrony and avoids large queues that exceed cache sizes? Can we expose link-layer buffer information, such as available space in the transmit descriptor ring, to prevent buffer bloat and reduce wasted work constructing packets that will only be dropped?
Conventional stacks amortize expenses internally, but cannot amortize repetitive costs spanning application and network layers. For example, they amortize TCP connection lookup using Large Receive Offload (LRO) but they cannot amortize the cost of repeated TCP segmentation of the same data transmitted multiple times. Can we design a network-stack API that allows cross-layer amortizations to be accomplished such that after the first client is served, no work is ever repeated when serving subsequent clients?
(PS: i.e., re-segmenting the same payload into TCP packets every time it is transmitted.)
Conventional stacks embed the majority of network code in the kernel to avoid the cost of domain transitions, limiting two-way communication flow through the stack. Can we make heavy use of batching to allow device drivers to remain in the kernel while colocating stack code with the application and avoiding significant latency overhead?
Can we avoid any data-structure locking, and even cache-line contention, when dealing with multi-core applications that do not require it?
Finally, while performing all the above, is there a suitable programming abstraction that allows these components to be reused for other applications that may benefit from server specialization?
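Several of these questions start from what netmap already makes possible at the buffer level. As a minimal illustration of the zero-copy question (a sketch in the spirit of netmap's forwarding examples, not Sandstorm's actual code), a received buffer can be handed to the TX ring simply by swapping buffer indices, so neither headers nor payload are ever copied:

    #include <net/netmap_user.h>   /* struct netmap_ring/slot, nm_ring_*() */
    #include <stdint.h>

    /* Hand a received frame to a TX ring without copying a byte: the RX and
     * TX slots swap buffer indices, so the NIC later DMAs out the very buffer
     * the packet arrived in.  The rings would come from a netmap descriptor
     * opened with nm_open(); error handling is omitted. */
    static void
    zero_copy_forward(struct netmap_ring *rx, struct netmap_ring *tx)
    {
        while (!nm_ring_empty(rx) && !nm_ring_empty(tx)) {
            struct netmap_slot *rs = &rx->slot[rx->cur];
            struct netmap_slot *ts = &tx->slot[tx->cur];
            uint32_t idx = ts->buf_idx;

            ts->buf_idx = rs->buf_idx;    /* give the received buffer to TX */
            rs->buf_idx = idx;            /* recycle the old TX buffer for RX */
            ts->len     = rs->len;
            ts->flags  |= NS_BUF_CHANGED; /* tell netmap the buffers moved */
            rs->flags  |= NS_BUF_CHANGED;

            rx->head = rx->cur = nm_ring_next(rx, rx->cur);
            tx->head = tx->cur = nm_ring_next(tx, tx->cur);
        }
    }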
2.1 Network-stack Modularization
Although monolithic kernels are the de facto standard for networked systems, concerns with robustness and flexibility continue to drive exploration of microkernel-like approaches. Both Sandstorm and Namestorm take on several microkernel-like qualities:
Rapid deployment & reusability: Our prototype stack is highly modular, and synthesized from the bottom up using traditional dynamic libraries as building blocks (components) to construct a special-purpose system. Each component corresponds to a standalone service that exposes a well-defined API. Our specialized network stacks are built by combining four basic components:
The netmap I/O (libnmio) library that abstracts traditional data-movement and event-notification primitives needed by higher levels of the stack.
libeth component, a lightweight Ethernet-layer implementation.
libtcpip that implements our optimized TCP/IP layer.
libudpip that implements a UDP/IP layer.
Figure 1 depicts how some of these components are used with a simple application layer to form Sandstorm, the optimized web server.
Splitting functionality into reusable components does not require us to sacrifice the benefits of exploiting cross-layer knowledge to optimize performance, as memory and control flow move easily across API boundaries. For example, Sandstorm interacts directly with libnmio to preload and push segments into the appropriate packet-buffer pools. This preserves a service-centric approach.
Developer-friendly: Despite seeking inspiration from microkernel design, our approach maintains most of the benefits of conventional monolithic systems:
Debugging is at least as easy (if not easier) compared to conventional systems, as application-specific, performance-centric code shifts from the kernel to more accessible userspace.
Our approach integrates well with the general-purpose operating systems: rewriting basic components such as device drivers or filesystems is not required. We also have direct access to conventional debugging, tracing, and profiling tools, and can also use the conventional network stack for remote access (e.g., via SSH).
Instrumentation in Sandstorm is a simple and straightforward task that allows us to explore potential bottlenecks as well as necessary and sufficient costs in network processing across application and stack. In addition, off-the-shelf performance monitoring and profiling tools “just work”, and a synchronous design makes them easier to use.
2.2 Sandstorm web server design
Rizzo’s netmap framework provides a general-purpose API that allows received packets to be mapped directly to userspace, and packets to be transmitted to be sent directly from userspace to the NIC’s DMA rings. Combined with batching to reduce system calls, this provides a high-performance framework on which to build packet-processing applications. A web server, however, is not normally thought of as a packet-processing application, but one that handles TCP streams.
To serve a static file, we load it into memory, and a priori generate all the packets that will be sent, including TCP, IP, and link-layer headers. When an HTTP request for that file arrives, the server must allocate a TCP-protocol control block (TCB) to keep track of the connection's state, but the packets to be sent have already been created for each file on the server.
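A minimal sketch of this load-time pre-segmentation, using the BSD <netinet/*> structs; the prepkt container, the MSS constant, and the field choices are illustrative rather than Sandstorm's actual layout:

    #include <net/ethernet.h>      /* ETHER_HDR_LEN */
    #include <netinet/in.h>
    #include <netinet/ip.h>        /* struct ip (BSD field names) */
    #include <netinet/tcp.h>       /* struct tcphdr, TH_ACK */
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define MSS 1460               /* payload bytes per segment (illustrative) */

    /* One pre-built frame: Ethernet + IP + TCP headers plus up to MSS bytes
     * of payload.  "prepkt" is a hypothetical container for illustration. */
    struct prepkt {
        uint32_t seq_off;          /* payload offset of this segment in the file */
        uint16_t len;              /* total frame length */
        uint8_t  frame[ETHER_HDR_LEN + sizeof(struct ip) +
                       sizeof(struct tcphdr) + MSS];
    };

    /* Pre-segment a loaded file once, at startup. */
    static struct prepkt *
    presegment_file(const uint8_t *data, size_t size, size_t *npkts)
    {
        *npkts = (size + MSS - 1) / MSS;
        struct prepkt *pkts = calloc(*npkts, sizeof(*pkts));

        for (size_t i = 0; pkts != NULL && i < *npkts; i++) {
            size_t chunk = (i + 1 < *npkts) ? MSS : size - i * MSS;
            struct prepkt *p = &pkts[i];
            struct ip *iph = (struct ip *)(p->frame + ETHER_HDR_LEN);
            struct tcphdr *th = (struct tcphdr *)(iph + 1);

            p->seq_off = (uint32_t)(i * MSS);
            p->len = (uint16_t)(ETHER_HDR_LEN + sizeof(*iph) + sizeof(*th) + chunk);

            /* Static fields are filled once; addresses, ports and sequence
             * numbers are placeholders rewritten per connection with an
             * incremental checksum fix (see the next sketch). */
            iph->ip_v   = 4;
            iph->ip_hl  = sizeof(*iph) >> 2;
            iph->ip_len = htons((uint16_t)(p->len - ETHER_HDR_LEN));
            iph->ip_ttl = 64;
            iph->ip_p   = IPPROTO_TCP;
            th->th_off   = sizeof(*th) >> 2;
            th->th_flags = TH_ACK;
            th->th_seq   = htonl(p->seq_off);   /* relative; rebased per TCB */
            memcpy(p->frame + ETHER_HDR_LEN + sizeof(*iph) + sizeof(*th),
                   data + p->seq_off, chunk);
            /* iph->ip_sum and th->th_sum would also be computed once, here. */
        }
        return pkts;
    }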
The majority of the work is performed during inbound TCP ACK processing. The IP header is checked, and if it is acceptable, a hash table is used to locate the TCB. The offset of the ACK number from the start of the connection is used to locate the next prepackaged packet to send, and if permitted by the congestion and receive windows, subsequent packets. To send these packets, the destination address and port must be rewritten, and the TCP and IP checksums incrementally updated. The packet can then be directly fetched by the NIC using netmap. All reads of the ACK header and modifications to the transmitted packets are performed in a single pass, ensuring that both the headers and the TCB remain in the CPU’s L1 cache.
(PS: the hash lookup ensures that packets belonging to the same connection are handled by the same TCB.)
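The per-client rewrite can be done with the standard incremental Internet-checksum update of RFC 1624. A sketch (sequence-number rebasing is omitted; dst_ip and dst_port are in network byte order):

    #include <netinet/in.h>
    #include <netinet/ip.h>
    #include <netinet/tcp.h>
    #include <stdint.h>

    /* Incremental checksum update (RFC 1624, eqn. 3) for one 16-bit field
     * changing from 'oldv' to 'newv'; all values in network byte order. */
    static inline uint16_t
    cksum_adjust(uint16_t sum, uint16_t oldv, uint16_t newv)
    {
        uint32_t s = (uint16_t)~sum + (uint16_t)~oldv + newv;
        s = (s & 0xffff) + (s >> 16);      /* fold the carries back in */
        s = (s & 0xffff) + (s >> 16);
        return (uint16_t)~s;
    }

    /* Retarget a prepackaged segment to a new client: rewrite the destination
     * address and port and patch the IP and TCP checksums in place. */
    static void
    retarget(struct ip *iph, struct tcphdr *th, uint32_t dst_ip, uint16_t dst_port)
    {
        uint16_t *o = (uint16_t *)&iph->ip_dst;
        uint16_t *n = (uint16_t *)&dst_ip;

        for (int i = 0; i < 2; i++) {      /* the address is two 16-bit words */
            iph->ip_sum = cksum_adjust(iph->ip_sum, o[i], n[i]);
            th->th_sum  = cksum_adjust(th->th_sum,  o[i], n[i]); /* pseudo-header */
        }
        iph->ip_dst.s_addr = dst_ip;

        th->th_sum   = cksum_adjust(th->th_sum, th->th_dport, dst_port);
        th->th_dport = dst_port;
    }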
Once a packet has been DMAed to the NIC, the packet buffer is returned to Sandstorm, ready to be incrementally modified again and sent to a different client. However, under high load, the same packet may need to be queued in the TX ring for a second client before it has finished being sent to the first client. The same packet buffer cannot be in the TX ring twice, with different destination address and port. This presents us with two design options:
We can maintain more than one copy of each packet in memory to cope with this eventuality. The extra copy could be created at startup, but a more efficient solution would create extra copies on demand whenever a high-water mark is reached, and then retained for future use.
We can maintain only one long-term copy of each packet, creating ephemeral copies each time it needs to be sent.
We call the former a pre-copy stack (it is an extreme form of zero-copy stack because in the steady state it never copies, but differs from the common use of the term "zero copy"), and the latter a memcpy stack. A pre-copy stack performs less per-packet work than a memcpy stack, but requires more memory; because of this, it has the potential to thrash the CPU's L3 cache. With the memcpy stack, it is more likely for the original version of a packet to be in the L3 cache, but more work is done. We will evaluate both approaches, because it is far from obvious how CPU cycles trade off against cache misses in modern processors.
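A minimal sketch of the bookkeeping behind the two options; the names and the fixed pool size are hypothetical, not Sandstorm's own:

    #include <stdint.h>
    #include <string.h>

    #define NREPLICAS 4            /* illustrative pool size */

    /* One pre-built segment and its replicas. */
    struct seg {
        uint8_t *copy[NREPLICAS];  /* identical pre-built frames */
        uint8_t  busy[NREPLICAS];  /* still referenced by the TX ring? */
        uint16_t len;
    };

    /* Pre-copy strategy: hand out an idle replica.  Returning NULL signals
     * the high-water mark, at which point the caller would clone one more
     * long-lived copy and retain it for future use. */
    static uint8_t *
    precopy_get(struct seg *s)
    {
        for (int i = 0; i < NREPLICAS; i++)
            if (!s->busy[i]) {
                s->busy[i] = 1;
                return s->copy[i];
            }
        return NULL;
    }

    /* Memcpy strategy: keep a single master frame and copy it into the TX
     * buffer (in practice a netmap slot buffer) every time it is sent. */
    static void
    memcpy_put(const struct seg *s, uint8_t *txbuf)
    {
        memcpy(txbuf, s->copy[0], s->len);
    }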
Figure 2 illustrates tradeoffs through traces taken on nginx/Linux and pre-copy Sandstorm servers that are busy (but unsaturated). On the one hand, a batched design measurably increases TCP roundtrip time with a relatively idle CPU. On the other hand, Sandstorm amortizes or eliminates substantial parts of per-request processing through a more efficient architecture. Under light load, the benefits are pronounced; at saturation, the effect is even more significant.
Although most work is synchronous within the ACK processing code path, TCP still needs timers for certain operations. Sandstorm's timers are scheduled by polling the Time Stamp Counter (TSC): although not as accurate as other clock sources, it is accessible from userspace at the cost of a single CPU instruction (on modern hardware). The TCP slow timer routine is invoked periodically (every ~500ms) and traverses the list of active TCBs: on RTO expiration, the congestion window and slow-start threshold are adjusted accordingly, and any unacknowledged segments are retransmitted. The same routine also releases TCBs that have been in TIME_WAIT state for longer than 2*MSL. There is no buffering whatsoever required for retransmissions: we identify the segment that needs to be retransmitted using the oldest unacknowledged number as an offset, retrieve the next available prepackaged packet and adjust its headers accordingly, as with regular transmissions. Sandstorm currently implements TCP Reno for congestion control.
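A sketch of the TSC-polled timer described above; tsc_hz, tcp_slowtimer(), and the other names are illustrative placeholders:

    #include <stdint.h>
    #include <x86intrin.h>           /* __rdtsc() */

    /* Placeholder for the routine described above: walk the TCB list, handle
     * RTO expiry, adjust cwnd/ssthresh, and expire TIME_WAIT entries. */
    extern void tcp_slowtimer(void);

    static uint64_t tsc_hz;          /* cycles per second, calibrated at startup */
    static uint64_t next_slowtimer;  /* TSC value at which the timer next fires */

    /* Called from the main loop: fire the slow timer roughly every 500 ms
     * without making any system call. */
    static inline void
    maybe_run_slowtimer(void)
    {
        uint64_t now = __rdtsc();

        if (now >= next_slowtimer) {
            next_slowtimer = now + tsc_hz / 2;   /* ~500 ms from now */
            tcp_slowtimer();
        }
    }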
2.3 The Namestorm DNS server
The same principles applied in the Sandstorm web server also apply to a wide range of servers returning the same content to multiple users. Authoritative DNS servers are often targets of DDoS attacks – they represent a potential single point of failure, and because DNS traditionally uses UDP, it lacks TCP's three-way handshake to protect against attackers using spoofed IP addresses. Thus, high-performance DNS servers are of significant interest.
Unlike TCP, the conventional UDP stack is actually quite lightweight, and DNS servers already preprocess zone files and store response data in memory. Is there still an advantage running a specialized stack?
Most DNS-request processing is simple. When a request arrives, the server performs sanity checks, hashes the concatenation of the name and record type being requested to find the response, and sends that data. We can preprocess the responses so that they are already stored as a prepackaged UDP packet. As with HTTP, the destination address and port must be rewritten, the identifier must be updated, and the UDP and IP checksums must be incrementally updated. After the initial hash, all remaining processing is performed in one pass, allowing processing of DNS response headers to be performed from the L1 cache. As with Sandstorm, we can use pre-copy or memcpy approaches so that more than one response for the same name can be placed in the DMA ring at a time.
Our specialized userspace DNS server stack is composed of three reusable components (libnmio, libeth, and libudpip) and a DNS-specific application layer. As with Sandstorm, Namestorm uses FreeBSD's netmap API, implementing the entire stack in userspace, and uses netmap's batching to amortize system call overhead. libnmio and libeth are the same as used by Sandstorm, whereas libudpip contains UDP-specific code closely integrated with an IP layer. Namestorm is an authoritative nameserver, so it does not need to handle recursive lookups.
Namestorm preprocesses the zone file upon startup, creating DNS response packets for all the entries in the zone, including the answer section and any glue records needed. In addition to type-specific queries for A, NS, MX and similar records, DNS also allows queries for ANY. A full implementation would need to create additional response packets to satisfy these queries; our implementation does not yet do so, but the only effect this would have is to increase the overall memory footprint. In practice, ANY requests prove comparatively rare.
Namestorm indexes the prepackaged DNS response packets using a hash table. There are two ways to do this:
Index by concatenation of request type (e.g., A, NS, etc.) and fully-qualified domain name (FQDN); for example "www.example.com".
Index by concatenation of request type and the wire-format FQDN as this appears in an actual query; for example, "[3]www[7]example[3]com[0]" where [3] is a single byte containing the numeric value 3.
(PS: in a DNS packet the name is encoded as length-prefixed labels, e.g. "[3]www[7]example[3]com[0]"; each count byte gives the length of the label that follows, which makes parsing easy.)
Using the wire request format is obviously faster, but DNS permits compression of names. Compression is common in DNS answers, where the same domain name occurs more than once, but proves rare in requests. If we implement wire-format hash keys, we must first perform a check for compression; these requests are decompressed and then reencoded to uncompressed wire-format for hashing. The choice is therefore between optimizing for the common case, using wire-format hash keys, or optimizing for the worst case, assuming compression will be common, and using FQDN hash keys. The former is faster, but the latter is more robust to a DDoS attack by an attacker taking advantage of compression. We evaluate both approaches, as they illustrate different performance tradeoffs.
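A sketch of the request-side name handling this describes: a hypothetical helper that builds an FQDN-style hash key from the QNAME and bails out when it sees a compression pointer, in which case the caller would fall back to a slower decompression path. The query type would be appended to the key before hashing:

    #include <ctype.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Wire format is a series of length-prefixed labels ending in a 0 byte;
     * a count byte with the top two bits set (0xc0) is a compression pointer.
     * Returns the key length, or -1 on compressed or malformed names. */
    static int
    qname_to_key(const uint8_t *qname, size_t avail, char *key, size_t keymax)
    {
        size_t in = 0, out = 0;

        while (in < avail && qname[in] != 0) {
            uint8_t len = qname[in++];

            if ((len & 0xc0) == 0xc0)               /* compression pointer */
                return -1;
            if (len > 63 || in + len > avail || out + len + 1 > keymax)
                return -1;                          /* malformed or too long */
            if (out > 0)
                key[out++] = '.';
            for (uint8_t i = 0; i < len; i++)       /* names are case-insensitive */
                key[out++] = (char)tolower(qname[in++]);
        }
        return (in < avail) ? (int)out : -1;        /* must end on the 0 byte */
    }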
Our implementation does not currently handle referrals, so it can handle only zones for which it is authoritative for all the sub-zones. It could not, for example, handle the .com zone, because it would receive queries for www.example.com, but only have hash table entries for example.com. Truncating the hash key is trivial to do as part of the translation to an FQDN, so if Namestorm were to be used for a domain such as .com, the FQDN version of hashing would be a reasonable approach.
Outline of the main Sandstorm event loop
1. Call RX poll to receive a batch of received packets that have been stored in the NIC's RX ring; block if none are available.
2. For each ACK packet in the batch:
3. Perform Ethernet and IP input sanity checks.
4. Locate the TCB for the connection.
5. Update the acknowledged sequence numbers in the TCB; update the receive window and congestion window.
6. For each new TCP data packet that can now be sent, or each lost packet that needs retransmitting:
7. Find a free copy of the TCP data packet (or clone one if needed).
8. Correct the destination IP address, destination port, sequence numbers, and incrementally update the TCP checksum.
9. Add the packet to the NIC's TX ring.
10. Check if dt has passed since the last TX poll. If it has, call TX poll to send all queued packets.
2.4 Main event loop
To understand how the pieces fit together and the nature of interaction between Sandstorm, Namestorm, and netmap, we consider the timeline for processing ACK packets in more detail. Figure 3 summarizes Sandstorm's main loop. SYN/FIN handling, HTTP, and timers are omitted from this outline, but also take place. However, most work is performed in the ACK processing code.
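A condensed C sketch of the outlined loop on top of the netmap API; handle_ack() and tx_poll_due() are placeholders for steps 3-9 and for the dt bookkeeping of step 10:

    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>    /* struct nm_desc, NETMAP_RXRING, nm_ring_*() */
    #include <sys/ioctl.h>
    #include <poll.h>
    #include <stdint.h>

    extern void handle_ack(uint8_t *pkt, uint16_t len);  /* steps 3-9 */
    extern int  tx_poll_due(void);                       /* dt check, step 10 */

    static void
    main_loop(struct nm_desc *d)    /* d comes from nm_open("netmap:ix0", ...) */
    {
        struct pollfd pfd = { .fd = d->fd, .events = POLLIN };

        for (;;) {
            poll(&pfd, 1, 5);       /* step 1: block briefly for received packets */

            for (int r = d->first_rx_ring; r <= d->last_rx_ring; r++) {
                struct netmap_ring *rx = NETMAP_RXRING(d->nifp, r);

                while (!nm_ring_empty(rx)) {                     /* step 2 */
                    struct netmap_slot *slot = &rx->slot[rx->cur];

                    /* steps 3-9: sanity checks, TCB lookup, window updates,
                     * and enqueueing of pre-built segments on the TX ring */
                    handle_ack((uint8_t *)NETMAP_BUF(rx, slot->buf_idx),
                               slot->len);
                    rx->head = rx->cur = nm_ring_next(rx, rx->cur);
                }
            }
            if (tx_poll_due())                                   /* step 10 */
                ioctl(d->fd, NIOCTXSYNC, NULL);  /* flush queued TX packets */
        }
    }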
One important consequence of this architecture is that the NIC's TX ring serves as the sole output queue, taking the place of conventional socket buffers and software network-interface queues. This is possible because retransmitted TCP packets are generated in the same way as normal data packets. As Sandstorm is fast enough to saturate two 10Gb/s NICs with a single thread on one core, data structures are also lock-free.
When the workload is heavy enough to saturate the CPU, the system-call rate decreases. The receive batch size increases as calls to RX poll become less frequent, improving efficiency at the expense of increased latency. Under extreme load, the RX ring will fill, dropping packets. At this point the system is saturated and, as with any web server, it must bound the number of open connections by dropping some incoming SYNs.
(PS: the system calls in question are the netmap RX/TX poll calls.)
Under heavier load, the TX-poll system call happens in the RX handler. In our current design, dt, the interval between calls to TX poll in the steady state, is a constant set to 80us. The system-call rate under extreme load could likely be decreased by further increasing dt, but as the pre-copy version of Sandstorm can easily saturate all six 10Gb/s NICs in our systems for all file sizes, we have thus far not needed to examine this. Under lighter load, incoming packets might arrive too rarely to provide acceptable latency for transmitted packets; a 5ms timer will trigger transmission of straggling packets in the NIC’s TX ring.
The difference between the pre-copy version and the memcpy version of Sandstorm is purely in step 7, where the memcpy version will simply clone the single original packet rather than search for an unused existing copy.
Contemporary Intel server processors support Direct Data I/O (DDIO). DDIO allows NIC-originated Direct Memory Access (DMA) over PCIe to access DRAM through the processor's Last-Level Cache (LLC). For network transmit, DDIO is able to pull data from the cache without a detour through system memory; likewise, for receive, DMA can place data in the processor cache. DDIO implements administrative limits on LLC utilization intended to prevent DMA from thrashing the cache. This design has the potential to significantly reduce latency and increase I/O bandwidth.
Memcpy Sandstorm forces the payload of the copy to be in the CPU cache from which DDIO can DMA it to the NIC without needing to load it from memory again. With pre-copy, the CPU only touches the packet headers, so if the payload is not in the CPU cache, DDIO must load it, potentially impacting performance. These interactions are subtle, and we will look at them in detail.
(PS: intuitively, avoiding the copy and reusing the pre-built packet should be faster, but with DDIO the transmit DMA can be served straight from the CPU cache; with the pre-copy variant the pre-built packet is not necessarily in the cache, so it may cost an extra load from memory. This is revisited in the evaluation.)
Namestorm follows the same basic outline, but is simpler as DNS is stateless: it does not need a TCB, and sends a single response packet to each request.
2.5 API
As discussed, all of our stack components provide well-defined APIs to promote reusability. Table 1 presents a selection of API functions exposed by libnmio and libtcpip. In this section we describe some of the most interesting properties of the APIs.
libnmio is the lowest-level component: it handles all interaction with netmap and abstracts the main event loop. Higher layers (e.g., libeth) register callback functions to receive raw incoming data as well as set timers for periodic events (e.g., the TCP slow timer). The function netmap_output() is the main transmission routine: it enqueues a packet to the transmission ring, either by memory copy or zero copy, and also implements an adaptive batching algorithm.
Since there is no socket layer, the application must directly interface with the network stack. With TCP as the transport layer, it acquires a TCB (TCP Control Block), binds it to a specific IPv4 address and port, and sets it to LISTEN state using API functions. The application must also register callback functions to accept connections, receive and process data from active connections, as well as act on successful delivery of sent data (e.g., to close the connection or send more data).
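To make that flow concrete, here is an illustration: apart from netmap_output(), which the text names, the signatures below are hypothetical stand-ins for the Table 1 API, sketching only the bind/listen/callback pattern described above:

    #include <arpa/inet.h>
    #include <stddef.h>
    #include <stdint.h>

    struct tcb;                                   /* opaque TCP control block */

    /* Hypothetical libtcpip entry points, for illustration only. */
    extern struct tcb *tcp_tcb_alloc(void);
    extern void tcp_bind(struct tcb *, uint32_t ip, uint16_t port);
    extern void tcp_listen(struct tcb *);
    extern void tcp_set_callbacks(struct tcb *,
        void (*accepted)(struct tcb *),
        void (*received)(struct tcb *, uint8_t *, size_t),
        void (*delivered)(struct tcb *));

    static void on_accept(struct tcb *t) { (void)t; /* new connection */ }

    static void on_data(struct tcb *t, uint8_t *req, size_t len)
    {
        (void)t; (void)req; (void)len;
        /* parse the HTTP request, then hand the matching pre-built
         * segments to the stack for transmission */
    }

    static void on_delivered(struct tcb *t) { (void)t; /* send more or close */ }

    static void
    setup_listener(void)
    {
        struct tcb *t = tcp_tcb_alloc();               /* acquire a TCB         */
        tcp_bind(t, inet_addr("10.0.0.1"), htons(80)); /* bind address and port */
        tcp_listen(t);                                  /* enter LISTEN state    */
        tcp_set_callbacks(t, on_accept, on_data, on_delivered);
    }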
3. EVALUATION
To explore Sandstorm and Namestorm's performance and behavior, we evaluated using both older and more recent hardware. On older hardware, we employed Linux 3.6.7 and FreeBSD 9-STABLE. On newer hardware, we used Linux 3.12.5 and FreeBSD 10-STABLE. We ran Sandstorm and Namestorm on FreeBSD.
For the old hardware, we use three systems: two clients and one server, connected via a 10GbE crossbar switch. All test systems are equipped with an Intel 82598EB dual port 10GbE NIC, 8GB RAM,and two quad-core 2.66 GHz Intel Xeon X5355 CPUs. In 2006,these were high-end servers. For the new hardware, we use seven systems; six clients and one server, all directly connected via dedicated 10GbE links. The server has three dual-port Intel 82599EB 10GbE NICs, 128GB RAM and a quad-core Intel Xeon E5-2643 CPU. In 2014, these are well-equipped contemporary servers.
The most interesting improvements between these hardware generations are in the memory subsystem. The older Xeons have a conventional architecture with a single 1,333MHz memory bus serving both CPUs. The newer machines, as with all recent Intel server processors, support Data Direct I/O (DDIO), so whether data to be sent is in the cache can have a significant impact on performance.
Our hypothesis is that Sandstorm will be significantly faster than nginx on both platforms; however, the reasons for this may differ. Experience [18] has shown that the older systems often bottleneck on memory latency, and as Sandstorm is not CPU-intensive, we would expect this to be the case. A zero-copy stack should thus be a big win. In addition, as cores contend for memory, we would expect that adding more cores does not help greatly.
On the other hand, with DDIO, the new systems are less likely to bottleneck on memory. The concern, however, would be that DDIO could thrash at least part of the CPU cache. On these systems, we expect that adding more cores would help performance, but that in doing so, we may experience scalability bottlenecks such as lock contention in conventional stacks. Sandstorm’s lock-free stack can simply be replicated onto multiple 10GbE NICs, with one core per two NICs to scale performance. In addition, as load increases, the number of packets to be sent or received per system call will increase due to application-level batching. Thus, under heavy load, we would hope that the number of system calls per second to still be acceptable despite shifting almost all network-stack processing to userspace.
The question, of course, is how well do these design choices play out in practice?
3.1 Experiment Design: Sandstorm
We evaluated the performance of Sandstorm through a set of experiments and compared our results against the nginx web server running on both FreeBSD and Linux. Nginx is a high-performance, low-footprint web server that follows the non-blocking, event-driven model: it relies on OS primitives such as kqueue() for readiness event notifications, it uses sendfile() to send HTTP payload directly from the kernel, and it asynchronously processes requests.
Contemporary web pages are immensely content-rich, but they mainly consist of smaller web objects such as images and scripts. The distribution of requested object sizes for the Yahoo! CDN reveals that 90% of the content is smaller than 25KB [11]. The conventional network stack and web-server application perform well when delivering large files by utilizing OS primitives and NIC hardware features. Conversely, multiple simultaneous short-lived HTTP connections are considered a heavy workload that stresses the kernel-userspace interface and reveals performance bottlenecks: even with sendfile() to send the payload, the size of the transmitted data is not quite enough to compensate for the system cost.
For all the benchmarks, we configured nginx to serve content from a RAM disk to eliminate disk-related I/O bottlenecks. Similarly, Sandstorm preloads the data to be sent and performs its pre-segmentation phase before the experiments begin. We use weighttp [9] to generate load with multiple concurrent clients. Each client generates a series of HTTP requests, with a new connection being initiated immediately after the previous one terminates. For each experiment we measure throughput, and we vary the size of the file served, exploring possible tradeoffs between throughput and system load. Finally, we run experiments with a realistic workload by using a trace of files with sizes that follow the distribution of requested HTTP objects of the Yahoo! CDN.
3.2 Sandstorm Results
First, we wish to explore how file size affects performance. Sandstorm is designed with small files in mind, and batching to reduce overheads, whereas the conventional sendfile() ought to be better for larger files.
Figure 4 shows performance as a function of content size, comparing pre-copy Sandstorm and nginx running on both FreeBSD and Linux. With a single 10GbE NIC (Fig. 4a and 4d), Sandstorm outperforms nginx for smaller files by ~23–240%. For larger files, all three configurations saturate the link. Both conventional stacks are more CPU-hungry for the whole range of file sizes tested, despite potential advantages such as TSO on bulk transfers.
To scale to higher bandwidths, we added more 10GbE NICs and client machines. Figure 4b shows aggregate throughput with four 10GbE NICs. Sandstorm saturates all four NICs using just two CPU cores, but neither Linux nor FreeBSD can saturate the NICs with files smaller than 128KB, even though they use four CPU cores.
As we add yet more NICs, shown in Figure 4c, the difference in performance gets larger for a wider range of file sizes. With 6×10GbE NICs, Sandstorm gives between 10% and 10× more throughput than FreeBSD for file sizes in the range of 4–256KB. Linux fares worse, experiencing a performance drop (see Figure 4c) compared to FreeBSD with smaller file sizes and 5–6 NICs. Low CPU utilization is normally good, but here (Figures 4f, 5b), idle time is undesirable since the NICs are not yet saturated. We have not identified any single obvious cause for this degradation. Packet traces show the delay to occur between the connection being accepted and the response being sent. There is no single kernel lock being held for especially long, and although locking is not negligible, it does not dominate, either. The system suffers one soft page fault for every two connections on average, but no hard faults, so data is already in the disk buffer cache, and TCB recycling is enabled. This is an example of how hard it can be to find performance problems with conventional stacks. Interestingly, this was an application-specific behavior triggered only on Linux: in benchmarks we conducted with other web servers (e.g., lighttpd [3], OpenLiteSpeed [7]) we did not experience a similar performance collapse on Linux with more than four NICs. We have chosen, however, to present the nginx datasets as it offered the greatest overall scalability in both operating systems.
It is clear that Sandstorm dramatically improves network performance when it serves small web objects, but somewhat surprisingly, it performs better for larger files too. For completeness, we evaluate Sandstorm using a realistic workload: following the distribution of requested HTTP object sizes of the Yahoo! CDN [11], we generated a trace of 1000 files ranging from a few KB up to ~20MB which were served from both Sandstorm and nginx. On the clients, we modified weighttp to benchmark the server by concurrently requesting files in a random order. Figures 5a and 5b highlight the achieved network throughput and the CPU utilization of the server as a function of the number of network adapters. The network performance improvement is more than 2× while CPU utilization is reduced.
Finally, we evaluated whether Sandstorm handles high packet loss correctly. With 80 simultaneous clients and 1% packet loss, as expected, throughput plummets. FreeBSD achieves approximately 640Mb/s and Sandstorm roughly 25% less. This is not fundamental, but due to FreeBSD's more fine-grained retransmit timer and its use of NewReno congestion control rather than Reno, which could also be implemented in Sandstorm. Neither network nor server is stressed in this experiment – if there had been a real congested link causing the loss, both stacks would have filled it.
Throughout, we have invested considerable effort in profiling and optimizing conventional network stacks, both to understand their design choices and bottlenecks, and to provide the fairest possible comparison. We applied conventional performance tuning to Linux and FreeBSD, such as increasing hash-table sizes, manually tuning CPU work placement for multiqueue NICs, and adjusting NIC parameters such as interrupt mitigation. In collaboration with Netflix, we also developed a number of TCP and virtual-memory subsystem performance optimizations for FreeBSD, reducing lock contention under high packet loads. One important optimization is related to sendfile(), in which contention within the VM subsystem occurred while TCP-layer socket-buffer locks were held, triggering a cascade to the system as a whole. These changes have been upstreamed to FreeBSD for inclusion in a future release.
To copy or not to copy
The pre-copy variant of Sandstorm maintains more than one copy of each segment in memory so that it can send the same segment to multiple clients simultaneously. This requires more memory than nginx serving files from RAM. The memcpy variant only enqueues copies, requiring a single long-lived version of each packet, and uses a similar amount of memory to nginx. How does this memcpy affect performance? Figure 6 explores network throughput, CPU utilization, and system-call rate for two- and six-NIC configurations.
With six NICs, the additional memcpy() marginally reduces performance (Figure 6b) while exhibiting slightly higher CPU load (Figure 6d). In this experiment, Sandstorm only uses three cores to simplify the comparison, so around 75% utilization saturates those cores. The memcpy variant saturates the CPU for files smaller than 32KB, whereas the pre-copy variant does not. Nginx, using sendfile() and all four cores, only catches up for file sizes of 512KB and above, and even then exhibits higher CPU load.
As file size decreases, the expense of SYN/FIN and HTTP-request processing becomes measurable for both variants, but the pre-copy version has more headroom so is affected less. It is interesting to observe the effects of batching under overload with the memcpy stack in Figure 6f. With large file sizes, pre-copy and memcpy make the same number of system calls per second. With small files, however, the memcpy stack makes substantially fewer system calls per second. This illustrates the efficacy of batching: memcpy has saturated the CPU, and consequently no longer polls the RX queue as often. As the batch size increases, the system-call cost decreases, helping the server weather the storm. The pre-copy variant is not stressed here and continues to poll frequently, but would behave the same way under overload. In the end, the cost of the additional memcpy is measurable, but still performs quite well.
Results on contemporary hardware are significantly different from those run on older pre-DDIO hardware. Figure 7 shows the results obtained on our 2006-era servers. On the older machines, Sandstorm outperforms nginx by a factor of three, but the memcpy variant suffers a 30% decrease in throughput compared to pre-copy Sandstorm as a result of adding a single memcpy to the code. It is clear that on these older systems, memory bandwidth is the main performance bottleneck.
With DDIO, memory bandwidth is not such a limiting factor. Figure 9 in Section 3.5 shows the corresponding memory read throughput, as measured using CPU performance counters, for the network-throughput graphs in Figure 6b. With small file sizes, the pre-copy variant of Sandstorm appears to do more work: the L3 cache cannot hold all of the data, so there are many more L3 misses than with memcpy. Memory-read throughput for both pre-copy and nginx are closely correlated with their network throughput, indicating that DDIO is not helping on transmit: DMA comes from memory rather than the cache. The memcpy variant, however, has higher network throughput than memory throughput, indicating that DDIO is transmitting from the cache. Unfortunately, this is offset by much higher memory write throughput. Still, this only causes a small reduction in service throughput. Larger files no longer fit in the L3 cache, even with memcpy. Memory-read throughput starts to rise with files above 64KB. Despite this, performance remains high and CPU load decreases, indicating these systems are not limited by memory bandwidth for this workload.
使用DDIO后,內(nèi)存帶寬不再是那么大的限制因素。3.5節(jié)中的圖9給出了與圖6b網(wǎng)絡吞吐量曲線對應的內(nèi)存讀取吞吐量,由CPU性能計數(shù)器測得。對于小文件,Sandstorm的預拷貝版似乎做了更多的工作:L3緩存放不下全部數(shù)據(jù),因此L3缺失比memcpy版多得多。預拷貝版和nginx的內(nèi)存讀取吞吐量都與各自的網(wǎng)絡吞吐量密切相關(guān),說明DDIO在發(fā)送方向上沒有幫上忙:DMA來自內(nèi)存而不是緩存。而memcpy版的網(wǎng)絡吞吐量反而高于內(nèi)存吞吐量,說明DDIO在從緩存發(fā)送數(shù)據(jù)。不幸的是,這被高得多的內(nèi)存寫入吞吐量所抵消,不過這只造成服務吞吐量的小幅下降。更大的文件即使用memcpy也放不進L3緩存:文件超過64KB后,內(nèi)存讀取吞吐量開始上升。盡管如此,性能仍然很高且CPU負載下降,說明在這種工作負載下這些系統(tǒng)并不受內(nèi)存帶寬限制。
3.3 Experiment Design: Namestorm
We use the same clients and server systems to evaluate Namestorm as we used for Sandstorm. Namestorm is expected to be significantly more CPU-intensive than Sandstorm, mostly due to fundamental DNS protocol properties: high packet rate and small packets. Based on this observation, we have changed the network topology of our experiment: we use only one NIC on the server connected to the client systems via a 10GbE cut-through switch. In order to balance the load on the server across all available CPU cores, we use four dedicated NIC queues and four Namestorm instances.
We ran Nominum’s dnsperf [2] DNS profiling software on the clients. We created zone files of varying sizes, loaded them onto the DNS servers, and configured dnsperf to query the zone repeatedly.
我們使用與評估Sandstorm時相同的客戶端和服務器系統(tǒng)來評估Namestorm。由于DNS協(xié)議的基本特性(高包速率、小數(shù)據(jù)包),Namestorm預計會比Sandstorm明顯更加CPU密集?;谶@一觀察,我們改變了實驗的網(wǎng)絡拓撲:服務器上只使用一個NIC,通過一臺10GbE直通交換機連接到客戶端系統(tǒng)。為了把服務器上的負載均衡到所有可用的CPU核,我們使用四個專用的NIC隊列和四個Namestorm實例。
我們在客戶端上運行Nominum的dnsperf [2] DNS性能測試軟件。我們創(chuàng)建了不同大小的區(qū)域文件,把它們加載到DNS服務器上,并配置dnsperf反復查詢這些區(qū)域。
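(ps: 補充一個小示意,說明"四個NIC隊列對應四個Namestorm實例"大致可以怎么做:netmap允許在端口名后加"-N"后綴,只attach第N對硬件收發(fā)環(huán);配合網(wǎng)卡的RSS,每個實例綁定一個隊列即可互不干擾地并行工作。網(wǎng)卡名"ix0"只是示例,并非論文的原始配置)

#define NETMAP_WITH_LIBS
#include <net/netmap_user.h>
#include <stdio.h>

/* 打開第qid個硬件隊列對應的netmap端口,每個Namestorm實例各調(diào)用一次 */
static struct nm_desc *open_queue(int qid)
{
    char name[64];
    snprintf(name, sizeof(name), "netmap:ix0-%d", qid);
    return nm_open(name, NULL, 0, NULL);
}
/* 例如:四個實例分別調(diào)用 open_queue(0)、open_queue(1)、open_queue(2)、open_queue(3) */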
3.4 Namestorm Results
Figure 8a shows the performance of Namestorm and NSD running on Linux and FreeBSD when using a single 10GbE NIC. Performance results of NSD are similar on both FreeBSD and Linux. Neither operating system can saturate the 10GbE NIC, however, and both show some performance drop as the zone file grows. On Linux, NSD's performance drops by ~14% (from ~689,000 to ~590,000 Queries/sec) as the zone file grows from 1 to 10,000 entries, and on FreeBSD, it drops by ~20% (from ~720,000 to ~574,000 Qps). For these benchmarks, NSD saturates all CPU cores on both systems.
圖8a顯示了使用單個10GbE NIC時,Namestorm和運行在Linux與FreeBSD上的NSD的性能。NSD在FreeBSD和Linux上的結(jié)果相近,但兩個操作系統(tǒng)都無法跑滿10GbE NIC,并且隨著區(qū)域文件增大,兩者都出現(xiàn)一定的性能下降。在Linux上,當區(qū)域文件從1個條目增長到10,000個條目時,NSD的性能下降了約14%(從約689,000降到約590,000次查詢/秒);在FreeBSD上下降了約20%(從約720,000降到約574,000 Qps)。在這些基準測試中,NSD在兩個系統(tǒng)上都跑滿了所有CPU核。
For Namestorm, we utilized two datasets, one where the hash keys are in wire-format (w/o compr.), and one where they are in FQDN format (compr.). The latter requires copying the search term before hashing it to handle possible compressed requests.
對于Namestorm,我們使用了兩個數(shù)據(jù)集:一個的哈希鍵是wire格式(w/o compr.),另一個是FQDN格式(compr.)。后者為了處理可能被壓縮的請求,需要在哈希之前先把查詢的域名復制出來。
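(ps: 用一段C代碼補充說明這兩種key的差別:wire格式下,查詢名未被壓縮,可以直接對報文里的字節(jié)序列做哈希,零拷貝;而要兼容DNS壓縮指針(最高兩位為1的長度字節(jié)),就必須先跟著指針把完整域名復制出來再哈希,這就是正文里提到的那次額外拷貝。hash()是假設的任意字節(jié)串哈希函數(shù),代碼只是示意一般做法、省略了部分健壯性檢查,并非Namestorm的原始實現(xiàn))

#include <stddef.h>
#include <stdint.h>
#include <string.h>

extern uint32_t hash(const uint8_t *p, size_t len);   /* 假設的哈希函數(shù) */

/* wire格式key:name指向報文中未壓縮的查詢名(長度前綴標簽序列,以0x00結(jié)尾) */
static uint32_t key_wire(const uint8_t *name, size_t maxlen)
{
    size_t i = 0;
    while (i < maxlen && name[i] != 0)
        i += name[i] + 1;                 /* 跳過一個標簽:長度字節(jié)+標簽內(nèi)容 */
    return hash(name, i + 1);             /* 原地哈希,無需拷貝 */
}

/* 兼容壓縮:從報文offset處展開域名,先復制到buf再哈希 */
static uint32_t key_uncompressed(const uint8_t *msg, size_t msglen, size_t off)
{
    uint8_t buf[256];
    size_t n = 0;
    while (off < msglen && msg[off] != 0) {
        if ((msg[off] & 0xC0) == 0xC0) {                  /* 壓縮指針:14位偏移 */
            if (off + 1 >= msglen)
                break;
            off = (size_t)((msg[off] & 0x3F) << 8) | msg[off + 1];
            continue;
        }
        size_t l = (size_t)msg[off] + 1;                  /* 長度字節(jié)+標簽 */
        if (off + l > msglen || n + l >= sizeof(buf))
            break;
        memcpy(buf + n, msg + off, l);                    /* 額外的拷貝發(fā)生在這里 */
        n += l;
        off += l;
    }
    buf[n++] = 0;
    return hash(buf, n);
}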
With wire-format hashing, Namestorm memcpy performance is ~11–13× better, depending on the zone size, when compared to the best results from NSD with either Linux or FreeBSD. Namestorm's throughput drops by ~30% as the zone file grows from 1 to 10,000 entries (from ~9,310,000 to ~6,410,000 Qps). The reason for this decrease is mainly the LLC miss rate, which more than doubles. Dnsperf does not report throughput in Gbps, but given the typical DNS response size for our zones we can calculate ~8.4Gbps and ~5.9Gbps for the smallest and largest zone respectively.
使用wire格式哈希時,與NSD在Linux或FreeBSD上的最佳結(jié)果相比,Namestorm memcpy版的性能要好約11–13倍(取決于區(qū)域大小)。當區(qū)域文件從1個條目增長到10,000個條目時,Namestorm的吞吐量下降約30%(從約9,310,000降到約6,410,000 Qps)。下降的主要原因是LLC未命中率增加了一倍多。Dnsperf不報告以Gbps為單位的吞吐量,但根據(jù)我們區(qū)域的典型DNS響應大小,可以算出最小和最大區(qū)域分別約為8.4Gbps和5.9Gbps。
With FQDN-format hashing, Namestorm memcpy performance is worse than with wire-format hashing, but is still ~9–13× better compared to NSD. The extra processing with FQDN-format hashing costs ~10–20% in throughput, depending on the zone size.
Finally, in Figure 8a we observe a noticeable performance overhead with the pre-copy stack, which we explore in Section 3.5.
使用FQDN格式哈希時,Namestorm memcpy版的性能比wire格式哈希差,但仍比NSD好約9–13倍。FQDN格式哈希的額外處理會使吞吐量損失約10–20%,具體取決于區(qū)域大小。
最后,在圖8a中,我們觀察到預拷貝堆棧有明顯的性能開銷,這一點將在3.5節(jié)中探討。
3.4.1 Effectiveness of batching
One of the biggest performance benefits for Namestorm is that netmap provides an API that facilitates batching across the system-call interface. To explore the effects of batching, we configured a single Namestorm instance and one hardware queue, and reran our benchmark with varying batch sizes. Figure 8b illustrates the results:
a more than 2× performance gain when growing the batch size from 1 packet (no batching) to 32 packets. Interestingly, the performance of a single-core Namestorm without any batching remains more than 2× better than NSD.
Batching的效果
Namestorm最大的性能優(yōu)勢之一在于netmap提供了便于跨系統(tǒng)調(diào)用接口做批處理的API。為了探究批處理的效果,我們配置了單個Namestorm實例和一個硬件隊列,并用不同的批大小重新運行基準測試。圖8b給出了結(jié)果:
把批大小從1個包(無批處理)增加到32個包,可以獲得超過2倍的性能提升。有趣的是,即使不做任何批處理,單核Namestorm的性能仍比NSD好2倍以上。
At a minimum, NSD has to make one system call to receive each request and one to send a response. Recently Linux added the new recvmmsg() and sendmmsg() system calls to receive and send multiple UDP messages with a single call. These may go some way to improving NSD's performance compared to Namestorm. They are, however, UDP-specific, and sendmmsg() requires the application to manage its own transmit-queue batching. When we implemented Namestorm, we already had libnmio, which abstracts and handles all the batching interactions with netmap, so there is no application-specific batching code in Namestorm.
至少,NSD每接收一個請求就要進行一次系統(tǒng)調(diào)用,發(fā)送響應還要再進行一次。最近Linux新增了recvmmsg()和sendmmsg()系統(tǒng)調(diào)用,可以用一次調(diào)用接收或發(fā)送多個UDP消息,這在一定程度上可能縮小NSD與Namestorm的性能差距。然而它們只適用于UDP,而且sendmmsg()要求應用程序自己管理發(fā)送隊列的批處理。我們實現(xiàn)Namestorm時已經(jīng)有了libnmio,它抽象并處理了與netmap的所有批處理交互,因此Namestorm中沒有任何應用特定的批處理代碼。
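(ps: 下面是recvmmsg()的一個最小用法示意:一次系統(tǒng)調(diào)用最多收BATCH個UDP報文,NSD這類服務可以借此減少每請求的系統(tǒng)調(diào)用次數(shù)。BATCH、MSGSZ都是示例取值,報文解析留作注釋,僅為示意,與NSD的實際代碼無關(guān))

#define _GNU_SOURCE
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>

#define BATCH 32
#define MSGSZ 512                        /* 足以容納典型的DNS查詢 */

static int recv_batch(int fd)
{
    static char bufs[BATCH][MSGSZ];
    static struct sockaddr_storage peers[BATCH];
    struct iovec iov[BATCH];
    struct mmsghdr msgs[BATCH];

    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < BATCH; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len  = MSGSZ;
        msgs[i].msg_hdr.msg_iov     = &iov[i];
        msgs[i].msg_hdr.msg_iovlen  = 1;
        msgs[i].msg_hdr.msg_name    = &peers[i];      /* 記錄來源地址以便回包 */
        msgs[i].msg_hdr.msg_namelen = sizeof(peers[i]);
    }

    int n = recvmmsg(fd, msgs, BATCH, 0, NULL);       /* 一次系統(tǒng)調(diào)用收一批 */
    for (int i = 0; i < n; i++) {
        /* bufs[i]的前msgs[i].msg_len字節(jié)是第i個DNS查詢,在這里解析并構(gòu)造響應 */
    }
    return n;
}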
3.5 DDIO
With DDIO, incoming packets are DMAed directly to the CPU’s L3 cache, and outgoing packets are DMAed directly from the L3 cache, avoiding round trips from the CPU to the memory subsystem. For lightly loaded servers in which the working set is smaller than the L3 cache, or in which data is accessed with temporal locality by the processor and DMA engine (e.g., touched and immediately sent, or received and immediately accessed), DDIO can dramatically reduce latency by avoiding memory traffic. Thus DDIO is ideal for RPC-like mechanisms in which processing latency is low and data will be used immediately before or after DMA. On heavily loaded systems, it is far from clear whether DDIO will be a win or not. For applications with a larger cache footprint, or in which communication occurs at some delay from CPU generation or use of packet data, DDIO could unnecessarily pollute the cache and trigger additional memory traffic, damaging performance.
使用DDIO時,傳入的數(shù)據(jù)包被直接DMA到CPU的L3緩存,傳出的數(shù)據(jù)包也直接從L3緩存DMA出去,避免了CPU與內(nèi)存子系統(tǒng)之間的往返。對于工作集小于L3緩存、或者處理器與DMA引擎對數(shù)據(jù)的訪問具有時間局部性(例如寫完立即發(fā)送,或收到后立即訪問)的輕負載服務器,DDIO可以通過避免內(nèi)存流量來顯著降低延遲。因此,DDIO非常適合處理延遲低、數(shù)據(jù)在DMA前后會被立即使用的類RPC機制。而在重負載系統(tǒng)上,DDIO是否有收益遠沒有那么明確:對于緩存占用更大的應用,或者CPU生成/使用包數(shù)據(jù)與通信之間存在一定延遲的應用,DDIO可能會不必要地污染緩存并觸發(fā)額外的內(nèi)存流量,從而損害性能。
Intuitively, one might reasonably assume that Sandstorm's pre-copy mode would interact best with DDIO: as with sendfile()-based designs, only packet headers enter the L1/L2 caches, with payload content rarely touched by the CPU. Figure 9 therefore illustrates a surprising effect when operating on small file sizes: overall memory throughput from the CPU package, as measured using performance counters situated on the DRAM-facing interface of the LLC, sees significantly less traffic for the memcpy implementation relative to the pre-copy one, which shows a constant rate roughly equal to network throughput.
直觀上,人們可能會合理地認為Sandstorm的預拷貝模式與DDIO的配合最好:和基于sendfile()的設計一樣,只有包頭進入L1/L2緩存,有效載荷內(nèi)容很少被CPU觸碰。因此圖9在小文件場景下展示了一個令人驚訝的現(xiàn)象:用位于LLC面向DRAM一側(cè)接口上的性能計數(shù)器測量CPU封裝的總內(nèi)存吞吐量,memcpy實現(xiàn)的流量明顯少于預拷貝實現(xiàn),而后者呈現(xiàn)出大致等于網(wǎng)絡吞吐量的恒定速率。
We believe this occurs because DDIO is, by policy, limited from occupying most of the LLC: in the pre-copy cases, DDIO is responsible for pulling untouched data into the cache – as the file data cannot fit in this subset of the cache, DMA access thrashes the cache and all network transmit is done from DRAM. In the memcpy case, the CPU loads data into the cache, allowing more complete utilization of the cache for network data. However, as the DRAM memory interface is not a bottleneck in the system as configured, the net result of the additional memcpy, despite better cache utilization, is reduced performance. As file sizes increase, the overall footprint of memory copying rapidly exceeds the LLC size, exceeding network throughput, at which point pre-copy becomes more efficient. Likewise, one might mistakenly believe simply from inspection of CPU memory counters that nginx is somehow benefiting from this same effect: in fact, nginx is experiencing CPU saturation, and it is not until file size reaches 512K that sufficient CPU is available to converge with pre-copy's saturation of the network link.
我們認為出現(xiàn)這種情況是因為DDIO在策略上被限制為只能占用LLC的一小部分:在預拷貝的情況下,由DDIO負責把未被CPU觸碰的數(shù)據(jù)拉進緩存;由于文件數(shù)據(jù)放不進緩存的這個子集,DMA訪問不斷沖刷緩存,所有網(wǎng)絡發(fā)送實際上都來自DRAM。而在memcpy的情況下,是CPU把數(shù)據(jù)加載進緩存,使網(wǎng)絡數(shù)據(jù)可以更充分地利用整個緩存。然而,由于在當前配置下DRAM接口并不是系統(tǒng)瓶頸,額外memcpy的凈效果盡管帶來了更好的緩存利用率,卻仍是性能下降。隨著文件增大,內(nèi)存拷貝的總占用很快超過LLC大小并超過網(wǎng)絡吞吐量,此時預拷貝變得更高效。同樣地,僅從CPU內(nèi)存計數(shù)器來看,人們可能會誤以為nginx也在受益于同樣的效應:事實上nginx處于CPU飽和狀態(tài),直到文件大小達到512K,才有足夠的CPU讓它追上預拷貝版已經(jīng)跑滿的網(wǎng)絡鏈路。
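(ps: 用一段假想的C代碼把上面兩種發(fā)送路徑的差別寫得更直白一些。pre-copy路徑直接引用啟動時就打包好的segment緩沖區(qū),CPU基本不碰payload,于是DMA多半只能從DRAM取數(shù)據(jù);memcpy路徑在發(fā)送時把payload拷進一小塊反復復用的發(fā)送緩沖區(qū),這次寫操作順便把數(shù)據(jù)帶進了LLC,DDIO便可以直接從緩存發(fā)包。其中tx_slot()/tx_enqueue()/tx_commit()都是假設的發(fā)送環(huán)封裝,并非論文代碼)

#include <stdint.h>
#include <string.h>

/* 假設的發(fā)送環(huán)封裝:取一個TX slot緩沖區(qū)、掛上一個現(xiàn)成緩沖區(qū)、提交長度 */
extern uint8_t *tx_slot(void);
extern void tx_enqueue(const uint8_t *pkt, uint16_t len);
extern void tx_commit(uint16_t len);

struct seg {                  /* 啟動時預打包好的一個帶包頭的TCP segment */
    uint8_t  *pkt;
    uint16_t  len;
};

/* pre-copy路徑:零拷貝,直接引用預打包緩沖區(qū)(通常不在cache里) */
static void tx_precopy(const struct seg *s)
{
    tx_enqueue(s->pkt, s->len);
}

/* memcpy路徑:把包頭和payload拷進復用的發(fā)送緩沖區(qū)(工作集小,留在LLC里) */
static void tx_memcpy(const uint8_t *hdr, uint16_t hlen,
                      const uint8_t *payload, uint16_t plen)
{
    uint8_t *pkt = tx_slot();
    memcpy(pkt, hdr, hlen);
    memcpy(pkt + hlen, payload, plen);    /* 正是這次拷貝讓數(shù)據(jù)進入LLC */
    tx_commit((uint16_t)(hlen + plen));
}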
By contrast, Namestorm sees improved performance using the memcpy implementation, as the cache lines holding packet data must be dirtied due to protocol requirements, in which case performing the memcpy has little CPU overhead yet allows much more efficient use of the cache by DDIO.
相比之下,Namestorm使用memcpy實現(xiàn)反而性能更好,因為協(xié)議要求決定了存放包數(shù)據(jù)的緩存行必然會被寫臟;在這種情況下,執(zhí)行memcpy幾乎不增加CPU開銷,卻讓DDIO能更高效地利用緩存。
(ps: 這個例子很有意思:memcpy版雖然每次發(fā)送都要做一次拷貝,但這次拷貝恰好把數(shù)據(jù)帶進了cache,DDIO可以直接從cache發(fā)包,所以小文件時它的網(wǎng)絡吞吐量反而高于內(nèi)存讀取吞吐量;而預拷貝版雖然省掉了拷貝,數(shù)據(jù)卻不在cache里,每次發(fā)送都得從DRAM走。不過當文件大到連拷貝的工作集也放不進LLC時,預拷貝又會重新占優(yōu))
4. DISCUSSION
We developed Sandstorm and Namestorm to explore the hypothesis that fundamental architectural change might be required to properly exploit rapidly growing CPU core counts and NIC capacity. Comparisons with Linux and FreeBSD appear to confirm this conclusion far more dramatically than we expected: while there are small-factor differences between Linux and FreeBSD performance curves, we observe that their shapes are fundamentally the same. We believe that this reflects near-identical underlying architectural decisions stemming from common intellectual ancestry (the BSD network stack and sockets API) and largely incremental changes from that original design.
我們開發(fā)Sandstorm和Namestorm是為了檢驗這樣一個假設:要充分利用快速增長的CPU核數(shù)和NIC容量,可能需要根本性的架構(gòu)變革。與Linux和FreeBSD的比較對這一結(jié)論的印證遠比我們預期的更鮮明:盡管Linux和FreeBSD的性能曲線之間存在小幅差異,但我們觀察到它們的形狀本質(zhì)上是一樣的。我們認為,這反映了兩者幾乎相同的底層架構(gòu)決策,它們源自共同的知識血統(tǒng)(BSD網(wǎng)絡棧和套接字API),之后基本只做了增量式的修改。
Sandstorm and Namestorm adopt fundamentally different architectural approaches, emphasizing transparent memory flow within applications (and not across expensive protection-domain boundaries), process-to-completion, heavy amortization, batching, and application-specific customizations that seem antithetical to general-purpose stack design. The results are dramatic, accomplishing near-linear speedup with increases in core and NIC capacity – completely different curves possible only with a completely different design.
Sandstorm和Namestorm采用了根本不同的架構(gòu)方法,強調(diào)在應用程序內(nèi)部(而不是跨越昂貴的保護域邊界)的透明內(nèi)存流動、處理到完成(process-to-completion)、大量攤銷、批處理,以及看似與通用堆棧設計背道而馳的應用特定定制。結(jié)果是驚人的:隨著核數(shù)和NIC容量的增加實現(xiàn)了近線性的加速,這樣完全不同的曲線只有依靠完全不同的設計才可能得到。
4.1 Current network-stack specialization
Over the years there have been many attempts to add specialized features to general-purpose stacks such as FreeBSD and Linux. Examples include sendfile(), primarily for web servers, recvmmsg(), mostly aimed at DNS servers, and assorted socket options for telnet. In some cases, entire applications have been moved to the kernel [13, 24] because it was too difficult to achieve performance through the existing APIs. The problem with these enhancements is that each serves a narrow role, yet still must fit within a general OS architecture, and thus are constrained in what they can do. Special-purpose userspace stacks do not suffer from these constraints, and free the programmer to solve a narrow problem in an application-specific manner while still having the other advantages of a general-purpose OS stack.
多年來,人們多次嘗試為FreeBSD和Linux這類通用堆棧添加專門的功能,例如主要面向Web服務器的sendfile()、主要針對DNS服務器的recvmmsg(),以及為telnet準備的各種套接字選項。在某些情況下,由于通過現(xiàn)有API難以獲得足夠的性能,整個應用程序被搬進了內(nèi)核[13, 24]。這些增強的問題在于,每一項都只服務于一個狹窄的用途,卻仍必須適配通用的OS體系結(jié)構(gòu),因此能做的事情受到限制。專用的用戶空間堆棧不受這些約束,讓程序員可以用應用特定的方式解決一個狹窄的問題,同時仍然保有通用OS堆棧的其他優(yōu)點。
4.2 The generality of specialization
Our approach tightly integrates the network stack and application within a single process. This model, together with optimizations aimed at cache locality or pre-packetization, naturally fit a reasonably wide range of performance-critical, event-driven applications such as web servers, key-value stores, RPC-based services and name servers. Even rate-adaptive video streaming may benefit, as developments such as MPEG-DASH and Apple’s HLS have moved intelligence to the client leaving servers as dumb static-content farms.
我們的方法把網(wǎng)絡堆棧和應用程序緊密整合在同一個進程中。這種模型與面向緩存局部性或預打包的優(yōu)化一起,自然地適合相當廣泛的一類性能關(guān)鍵的事件驅(qū)動應用,例如Web服務器、鍵值存儲、基于RPC的服務和域名服務器。甚至速率自適應視頻流也可能受益,因為MPEG-DASH和蘋果HLS等技術(shù)的發(fā)展已經(jīng)把智能移到了客戶端,把服務器變成了單純提供靜態(tài)內(nèi)容的服務器群。
Not all network services are a natural fit. For example, CGI-based web services and general-purpose databases have inherently different properties and are generally CPU- or filesystem-intensive, deemphasizing networking bottlenecks. In our design, the control loop and transport-protocol correctness depend on the timely execution of application-layer functions; blocking in the application cannot be tolerated. A thread-based approach might be more suitable for such cases. Isolating the network stack and application into different threads still yields benefits: OS-bypass networking costs less, and saved CPU cycles are available for the application. However, such an approach requires synchronization, and so increases complexity and offers less room for cross-layer optimization.
并非所有網(wǎng)絡服務都天然適合這種模型。例如,基于CGI的Web服務和通用數(shù)據(jù)庫具有本質(zhì)上不同的屬性,通常是CPU或文件系統(tǒng)密集型的,網(wǎng)絡并不是主要瓶頸。在我們的設計中,控制循環(huán)和傳輸協(xié)議的正確性依賴于應用層函數(shù)的及時執(zhí)行,應用程序中的阻塞是不能容忍的;對這類場景,基于線程的方法可能更合適。即便把網(wǎng)絡堆棧和應用程序隔離到不同線程也仍有好處:繞過OS的網(wǎng)絡開銷更低,省下來的CPU周期可以留給應用程序。然而這種方法需要同步,因此增加了復雜性,也壓縮了跨層優(yōu)化的空間。
We are neither arguing for the exclusive use of specialized stacks over generalized ones, nor deployment of general-purpose network stacks in userspace. Instead, we propose selectively identifying key scale-out applications where informed but aggressive exploitation of domain-specific knowledge and micro-architectural properties will allow cross-layer optimizations. In such cases, the benefits outweigh the costs of developing and maintaining a specialized stack.
我們既不是主張用專用堆棧完全取代通用堆棧,也不是主張在用戶空間部署通用網(wǎng)絡堆棧。相反,我們建議有選擇地找出關(guān)鍵的橫向擴展應用,在這些應用中,對領(lǐng)域特定知識和微架構(gòu)特性的充分而激進的利用可以帶來跨層優(yōu)化。在這類場景下,收益超過了開發(fā)和維護專用堆棧的成本。
4.3 Tracing, profiling, and measurement
One of our greatest challenges in this work was the root-cause analysis of performance issues in contemporary hardware-software implementations. The amount of time spent analyzing network-stack behavior (often unsuccessfully) dwarfed the amount of time required to implement Sandstorm and Namestorm.
這項工作中我們面臨的最大挑戰(zhàn)之一,是對當代軟硬件實現(xiàn)中的性能問題做根因分析。分析網(wǎng)絡堆棧行為所花的時間(而且常常無果)遠遠超過了實現(xiàn)Sandstorm和Namestorm所需的時間。
An enormous variety of tools exist – OS-specific PMC tools, lock contention measurement tools, tcpdump, Intel vTune, DTrace, and a plethora of application-specific tracing features – but they suffer many significant limitations. Perhaps most problematic is that the tools are not holistic: each captures only a fragment of the analysis space – different configuration models, file formats, and feature sets.
現(xiàn)有的工具五花八門:操作系統(tǒng)特定的PMC工具、鎖爭用測量工具、tcpdump、Intel vTune、DTrace,以及大量應用特定的跟蹤功能,但它們都有明顯的局限。也許最成問題的是這些工具不成體系:每個工具只覆蓋分析空間的一個片段,配置模型、文件格式和特性集各不相同。
Worse, as we attempted inter-OS analysis (e.g., comparing Linux and FreeBSD lock profiling), we discovered that tools often measure and report results differently, preventing sensible comparison. For example, we found that Linux took packet timestamps at different points than FreeBSD, FreeBSD uses different clocks for DTrace and BPF, and that while FreeBSD exports both per-process and per-core PMC stats, Linux supports only the former. Where supported, DTrace attempts to bridge these gaps by unifying configuration, trace formats, and event namespaces [15]. However, DTrace also experiences high overhead causing bespoke tools to persist, and is unintegrated with packet-level tools, preventing side-by-side comparison of packet and execution traces. We feel certain that improvement in the state-of-the-art would benefit not only research, but also the practice of network-stack implementation.
更糟的是,當我們嘗試跨操作系統(tǒng)分析時(例如比較Linux和FreeBSD的鎖分析),我們發(fā)現(xiàn)工具的測量和結(jié)果報告方式經(jīng)常不同,導致無法進行有意義的比較。例如,我們發(fā)現(xiàn)Linux打包時間戳的位置與FreeBSD不同;FreeBSD的DTrace和BPF使用不同的時鐘;FreeBSD同時導出每進程和每核的PMC統(tǒng)計,而Linux只支持前者。在受支持的情況下,DTrace試圖通過統(tǒng)一配置、跟蹤格式和事件命名空間來彌合這些差距[15]。然而DTrace的開銷很高,使得各種定制工具仍然長期存在,而且它與包級工具沒有整合,無法把包軌跡和執(zhí)行軌跡并排比較。我們確信,這方面工具水平的提升不僅有益于研究,也有益于網(wǎng)絡棧實現(xiàn)的實踐。
Our special-purpose stacks are synchronous; after netmap hands off packets to userspace, the control flow is generally linear, and we process packets to completion. This, combined with lock-free design, means that it is very simple to reason about where time goes when handling a request flow. General-purpose stacks cannot, by their nature, be synchronous. They must be asynchronous to balance all the conflicting demands of hardware and applications, managing queues without application knowledge, allocating processing to threads in order to handle those queues, and ensuring safety via locking. To reason about performance in such systems, we often resort to statistical sampling because it is not possible to directly follow the control flow. Of course, not all network applications are well suited to synchronous models; we argue, however, that imposing the asynchrony of a general-purpose stack on all applications can unnecessarily complicate debugging, performance analysis, and performance optimization.
我們的專用堆棧是同步的:netmap把數(shù)據(jù)包交給用戶空間之后,控制流基本是線性的,我們把每個包處理到底。這一點加上無鎖設計,意味著在處理一個請求流時,很容易推斷時間都花在了哪里。通用堆棧因其本質(zhì)無法做到同步:它們必須用異步來平衡硬件和應用程序之間各種相互沖突的需求,在不了解應用的情況下管理隊列、把處理分派給線程去消化這些隊列,并通過加鎖保證安全。要推斷這類系統(tǒng)的性能,我們常常只能求助于統(tǒng)計采樣,因為無法直接跟蹤控制流。當然,并不是所有網(wǎng)絡應用都適合同步模型;但我們認為,把通用堆棧的異步性強加給所有應用,可能會不必要地使調(diào)試、性能分析和性能優(yōu)化復雜化。
5. RELATED WORK
Web server and network-stack performance optimization is not a new research area. Past studies have come up with many optimization techniques as well as completely different design choices. These designs range from userspace and kernel-based implementations to specialized operating systems.
Web服務器和網(wǎng)絡堆棧的性能優(yōu)化并不是一個新的研究領(lǐng)域。過去的研究提出了許多優(yōu)化技術(shù)以及完全不同的設計選擇,這些設計涵蓋了從用戶空間實現(xiàn)、內(nèi)核實現(xiàn)到專用操作系統(tǒng)的各種方案。
With the conventional approaches, userspace applications [1, 6] utilize general-purpose network stacks, relying heavily on operating-system primitives to achieve data movement and event notification [26]. Several proposals [23, 12, 30] focus on reducing the overhead of such primitives (e.g., KQueue, epoll, sendfile()). IO-Lite [27] unifies the data management between OS subsystems and userspace applications by providing page-based mechanisms to safely and concurrently share data. Fbufs [17] utilize techniques such as page remapping and shared memory to provide high-performance cross-domain transfers and buffer management. Pesterev and Wickizer [28, 14] have proposed efficient techniques to improve commodity-stack performance by controlling connection locality and taking advantage of modern multicore systems. Similarly, MegaPipe [21] shows significant performance gain by introducing a bidirectional, per-core pipe to facilitate data exchange and event notification between kernel and userspace applications.
在傳統(tǒng)方法中,用戶空間應用[1, 6]使用通用網(wǎng)絡堆棧,在數(shù)據(jù)移動和事件通知上嚴重依賴操作系統(tǒng)原語[26]。有幾項提案[23, 12, 30]著眼于降低這類原語(例如KQueue、epoll、sendfile())的開銷。IO-Lite [27]通過提供基于頁的機制來安全地并發(fā)共享數(shù)據(jù),統(tǒng)一了操作系統(tǒng)子系統(tǒng)與用戶空間應用之間的數(shù)據(jù)管理。Fbufs [17]利用頁重映射和共享內(nèi)存等技術(shù)提供高性能的跨域傳輸和緩沖區(qū)管理。Pesterev和Wickizer等人[28, 14]提出了通過控制連接局部性并利用現(xiàn)代多核系統(tǒng)來提升現(xiàn)成通用堆棧性能的高效技術(shù)。類似地,MegaPipe [21]通過引入雙向的每核管道來促進內(nèi)核與用戶空間應用之間的數(shù)據(jù)交換和事件通知,獲得了顯著的性能提升。
A significant number of research proposals follow a substantially different approach: they propose partial or full implementation of network applications in kernel, aiming to eliminate the cost of communication between kernel and userspace. Although this design decision improves performance significantly, it comes at the cost of limited security and reliability. A representative example of this category is kHTTPd [13], a kernel-based web server which uses the socket interface. Similar to kHTTPd, TUX [24] is another noteworthy example of in-kernel network applications. TUX achieves greater performance by eliminating the socket layer and pinning the static content it serves in memory. We have adopted several of these ideas in our prototype, although our approach is not kernel based.
大量研究提案采用了截然不同的方法:在內(nèi)核中部分或全部實現(xiàn)網(wǎng)絡應用,以消除內(nèi)核與用戶空間之間的通信成本。雖然這種設計決策顯著提高了性能,但代價是安全性和可靠性受限。這一類的代表例子是kHTTPd [13],一個使用套接字接口的內(nèi)核態(tài)Web服務器。與kHTTPd類似,TUX [24]是另一個值得注意的內(nèi)核態(tài)網(wǎng)絡應用:TUX通過去掉套接字層并把它所服務的靜態(tài)內(nèi)容固定在內(nèi)存中獲得了更高的性能。我們在原型中借鑒了其中的一些想法,盡管我們的方法并不基于內(nèi)核。
Microkernel designs such as Mach [10] have long appealed to OS designers, pushing core services (such as network stacks) into user processes so that they can be more easily developed, customized, and multiply-instantiated. In this direction, Thekkath et al [32], have prototyped capability-enabled, library-synthesized userspace network stacks implemented on Mach. The Cheetah web server is built on top of an Exokernel [19] library operating system that provides a filesystem and an optimized TCP/IP implementation. Lightweight libOSes enable application developers to exploit domain-specific knowledge and improve performance. Unikernel designs such as MirageOS [25] likewise blend operating-system and application components at compile-time, trimming unneeded software elements to accomplish extremely small memory footprints – although by static code analysis rather than application-specific specialization.
諸如Mach [10]的微內(nèi)核設計長期以來一直吸引著OS設計者:把核心服務(例如網(wǎng)絡堆棧)推進用戶進程,使它們更容易開發(fā)、定制和多實例化。沿著這個方向,Thekkath等人[32]在Mach上原型實現(xiàn)了基于capability、由庫合成的用戶空間網(wǎng)絡棧。Cheetah Web服務器構(gòu)建在Exokernel [19]的庫操作系統(tǒng)之上,后者提供了文件系統(tǒng)和經(jīng)過優(yōu)化的TCP/IP實現(xiàn)。輕量級的libOS使應用開發(fā)者能夠利用領(lǐng)域特定知識并提升性能。MirageOS [25]等Unikernel設計同樣在編譯期把操作系統(tǒng)與應用組件融合在一起,裁剪掉不需要的軟件元素以獲得極小的內(nèi)存占用,不過靠的是靜態(tài)代碼分析而非應用特定的專門化。
6. CONCLUSION
In this paper, we have demonstrated that specialized userspace stacks, built on top of the netmap framework, can vastly improve the performance of scale-out applications. These performance gains sacrifice generality by adopting design principles at odds with contemporary stack design: application-specific cross-layer cost amortizations, synchronous and buffering-free protocol implementations, and an extreme focus on interactions between processors, caches, and NICs. This approach reflects a widespread adoption of scale-out computing in data centers, which deemphasizes multifunction hosts in favor of increased large-scale specialization. Our performance results are compelling: a 2–10× improvement for web service, and a roughly 9× improvement for DNS service. Further, these stacks have proven easier to develop and tune than conventional stacks, and their performance improvements are portable over multiple generations of hardware.
在本文中,我們證明了構(gòu)建在netmap框架之上的專用用戶空間堆棧可以大幅提升橫向擴展應用的性能。這些性能收益以犧牲通用性為代價,采用了與當代堆棧設計相悖的設計原則:應用特定的跨層成本攤銷、同步且無緩沖的協(xié)議實現(xiàn),以及對處理器、緩存和NIC之間交互的極度關(guān)注。這種方法順應了數(shù)據(jù)中心里橫向擴展計算的普及:多功能主機被弱化,取而代之的是更大規(guī)模的專門化。我們的性能結(jié)果令人信服:Web服務提升2–10倍,DNS服務提升約9倍。此外,事實證明這些堆棧比常規(guī)堆棧更容易開發(fā)和調(diào)優(yōu),而且它們的性能改進可以跨多代硬件移植。
General-purpose operating system stacks have been around a long time, and have demonstrated the ability to transcend multiple generations of hardware. We believe the same should be true of special-purpose stacks, but that tuning for particular hardware should be easier. We examined performance on servers manufactured seven years apart, and demonstrated that although the performance bottlenecks were now in different places, the same design delivered significant benefits on both platforms.
通用操作系統(tǒng)堆棧已經(jīng)有很長時間了,并且已經(jīng)證明了超越多代硬件的能力。 我們認為專用堆棧也應該是這樣,但是對于特定硬件的調(diào)整應該更容易。 我們研究了相隔七年的服務器上的性能,并證明盡管性能瓶頸現(xiàn)在在不同的地方,但是相同的設計在這兩個平臺上帶來了顯著的優(yōu)勢。