
            xiaoxiaoling


I have been studying DPDK recently; this is a SIGCOMM 2014 paper, recorded here for future reference.

PS: key terms used in the text:

segment: the TCP PDU (protocol data unit), i.e. a packet at the TCP layer; if a payload is too large, TCP is responsible for splitting it into multiple segments (this concept helps in understanding the rest of the text).

According to UNIX Network Programming Volume 1, page 8, note 2: a packet is the data that the IP layer hands to the link layer and that the link layer encapsulates in a frame (excluding the frame header); the IP-layer unit (excluding the IP header) is properly called a datagram, and the link-layer unit is called a frame. The paper, however, does not make this distinction and uses "packet" loosely to mean a data packet.

DRAM: dynamic random-access memory, the system's main memory.

SRAM: static random-access memory, used for the CPU caches.

             

            Abstract

Contemporary network stacks are masterpieces of generality, supporting many edge-node and middle-node functions. Generality comes at a high performance cost: current APIs, memory models, and implementations drastically limit the effectiveness of increasingly powerful hardware. Generality has historically been required so that individual systems could perform many functions. However, as providers have scaled services to support millions of users, they have transitioned toward thousands (or millions) of dedicated servers, each performing a few functions. We argue that the overhead of generality is now a key obstacle to effective scaling, making specialization not only viable, but necessary.


Today's network stacks excel at generality, supporting many edge-node and middle-node functions. That generality comes at a high cost: current APIs, memory models, and implementations drastically limit how effectively increasingly powerful hardware can be used. Generality has historically been necessary so that individual systems could perform many functions.

However, as providers have scaled services to millions of users, they have shifted to thousands (or millions) of dedicated servers, each performing only a few functions (ps: vertical specialization). We argue that the overhead of generality is now a key obstacle to effective scaling, and that specialization is not only viable but necessary.

We present Sandstorm and Namestorm, web and DNS servers that utilize a clean-slate userspace network stack that exploits knowledge of application-specific workloads. Based on the netmap framework, our novel approach merges application and network-stack memory models, aggressively amortizes protocol-layer costs based on application-layer knowledge, couples tightly with the NIC event model, and exploits microarchitectural features. Simultaneously, the servers retain use of conventional programming frameworks. We compare our approach with the FreeBSD and Linux stacks using the nginx web server and NSD name server, demonstrating 2–10× and 9× improvements in web-server and DNS throughput, lower CPU usage, linear multicore scaling, and saturated NIC hardware.

We present Sandstorm and Namestorm, a web server and a DNS server built on a clean-slate userspace network stack that exploits knowledge of application-specific workloads. Based on the netmap framework, our approach merges the application and network-stack memory models, aggressively amortizes protocol-layer costs using application-layer knowledge, couples tightly with the NIC event model, and exploits microarchitectural features, while the servers retain the use of conventional programming frameworks. We compare our approach against the FreeBSD and Linux stacks running the nginx web server and the NSD name server, demonstrating 2–10× and 9× improvements in web-server and DNS throughput respectively, lower CPU usage, linear multicore scaling, and saturated NIC hardware.

            INTRODUCTION

            Conventional network stacks were designed in an era where individual systems had to perform multiple diverse functions. In the last decade, the advent of cloud computing and the ubiquity of networking has changed this model; today, large content providers serve hundreds of millions of customers. To scale their systems, they are forced to employ many thousands of servers, with each providing only a single network service. Yet most content is still served with conventional general-purpose network stacks.

             

Conventional network stacks were designed in an era when individual systems had to perform many different functions. Over the past decade, the rise of cloud computing and the ubiquity of networking have changed this model; today, large content providers serve hundreds of millions of customers. To scale their systems they are forced to use many thousands of servers, each providing only a single network service. Yet most content is still served by conventional general-purpose network stacks.

             

            These general-purpose stacks have not stood still, but today’s stacks are the result of numerous incremental updates on top of codebases that were originally developed in the early 1990s. Arguably, these network stacks have proved to be quite efficient, flexible, and reliable, and this is the reason that they still form the core of contemporary networked systems. They also provide a stable programming API, simplifying software development. But this generality comes with significant costs, and we argue that the overhead of generality is now a key obstacle to effective scaling, making specialization not only viable, but necessary.

             

These general-purpose stacks have not stood still, but today's stacks are the result of many incremental updates on top of codebases originally developed in the early 1990s. Arguably, these stacks have proven quite efficient, flexible, and reliable, which is why they still form the core of contemporary networked systems; they also provide a stable programming API that simplifies software development. But this generality comes at a significant cost, and we argue that the overhead of generality is now a key obstacle to effective scaling, making specialization not only viable but necessary.

            In this paper we revisit the idea of specialized network stacks. In particular, we develop Sandstorm, a specialized userspace stack for serving static web content, and Namestorm, a specialized stack implementing a high performance DNS server. More importantly, however, our approach does not simply shift the network stack to userspace: we also promote tight integration and specialization of application and stack functionality, achieving cross-layer optimizations antithetical to current design practices.

             

In this paper we revisit the idea of specialized network stacks. In particular, we develop Sandstorm, a specialized userspace stack for serving static web content, and Namestorm, a specialized stack implementing a high-performance DNS server. More importantly, our approach does not simply move the network stack into userspace: we also pursue tight integration and specialization of application and stack functionality, achieving cross-layer optimizations that run counter to current design practice.

             

            Servers such as Sandstorm could be used for serving images such as the Facebook logo, as OCSP [20] responders for certificate revocations, or as front end caches to popular dynamic content. This is a role that conventional stacks should be good at: nginx [6] uses the sendfile() system call to hand over serving static content to the operating system. FreeBSD and Linux then implement zero-copy stacks, at least for the payload data itself, using scatter-gather to directly DMA the payload from the disk buffer cache to the NIC. They also utilize the features of smart network hardware, such as TCP Segmentation Offload (TSO) and Large Receive Offload (LRO) to further improve performance. With such optimizations, nginx does perform well, but as we will demonstrate, a specialized stack can outperform it by a large margin.

             

A server like Sandstorm could be used to serve images such as the Facebook logo, to act as an OCSP [20] responder for certificate revocation, or as a front-end cache for popular dynamic content. This is a role conventional stacks should be good at: nginx [6] uses the sendfile() system call to hand serving of static content over to the operating system. FreeBSD and Linux then implement zero-copy stacks, at least for the payload itself, using scatter-gather to DMA the payload directly from the disk buffer cache to the NIC. They also exploit smart NIC features such as TCP Segmentation Offload (TSO) (ps: a segment is a TCP-layer packet; Segmentation Offload means the splitting of large TCP sends into segments is done in hardware) and Large Receive Offload (LRO) to further improve performance. With such optimizations nginx performs well, but as we will show, a specialized stack can outperform it by a large margin.

             

Namestorm is aimed at handling extreme DNS loads, such as might be seen at the root nameservers, or when a server is under a high-rate DDoS attack. The open-source state of the art here is NSD [5], which, combined with a modern OS that minimizes data copies when sending and receiving UDP packets, performs well. Namestorm, however, can outperform it by a factor of nine.

Namestorm is aimed at extreme DNS loads, such as those seen at the root nameservers or when a server is under a high-rate DDoS attack. The open-source state of the art here is NSD [5], which, combined with a modern OS that minimizes data copies when sending and receiving UDP packets, performs well. Namestorm, however, can outperform it by a factor of nine.

             

Our userspace web server and DNS server are built upon FreeBSD's netmap [31] framework, which directly maps the NIC buffer rings to userspace. We will show that not only is it possible for a specialized stack to beat nginx, but on data-center-style networks when serving small files typical of many web pages, it can achieve three times the throughput on older hardware, and more than six times the throughput on modern hardware supporting DDIO.

             

Our userspace web server and DNS server are built on FreeBSD's netmap [31] framework, which maps the NIC buffer rings directly into userspace. We will show not only that a specialized stack can beat nginx, but that on data-center-style networks, when serving the small files typical of many web pages, it can achieve three times the throughput on older hardware and more than six times the throughput on modern hardware supporting DDIO.

            The demonstrated performance improvements come from four places. First, we implement a complete zero-copy stack, not only for payload but also for all packet headers, so sending data is very efficient. Second, we allow aggressive amortization that spans traditionally stiff boundaries – e.g., application-layer code can request pre-segmentation of data intended to be sent multiple times, and extensive batching is used to mitigate system-call overhead from userspace. Third, our implementation is synchronous, clocked from received packets; this improves cache locality and minimizes the latency of sending the first packet of the response. Finally, on recent systems, Intel’s DDIO provides substantial benefits, but only if packets to be sent are already in the L3 cache and received packets are processed to completion immediately. It is hard to ensure this on conventional stacks, but a special-purpose stack can get much closer to this ideal.

The demonstrated performance improvements come from four places. First, we implement a completely zero-copy stack, not only for payload but for all packet headers (ps: i.e. the full packets including the IP headers), so sending data is very efficient. Second, we allow aggressive amortization across traditionally rigid boundaries: for example, application-layer code can request pre-segmentation of data that will be sent many times (ps: segmentation here means splitting a large block into MSS-sized packets), and extensive batching is used to mitigate the system-call overhead of running in userspace. Third, our implementation is synchronous, clocked by received packets; this improves cache locality and minimizes the latency to the first packet of the response. Finally, on recent systems Intel's DDIO provides substantial benefits, but only if the packets to be sent are already in the L3 cache and received packets are processed to completion immediately. This is hard to guarantee with a conventional stack, but a special-purpose stack can get much closer to this ideal.

Of course, userspace stacks are not a novel concept. Indeed, the Cheetah web server for MIT's XOK Exokernel [19] operating system took a similar approach, and demonstrated significant performance gains over the NCSA web server in 1994. Despite this, the concept has never really taken off, and in the intervening years conventional stacks have improved immensely. Unlike XOK, our specialized userspace stacks are built on top of a conventional FreeBSD operating system. We will show that it is possible to get all the performance gains of a specialized stack without needing to rewrite all the ancillary support functions provided by a mature operating system (e.g., the filesystem). Combined with the need to scale server clusters, we believe that the time has come to re-evaluate special-purpose stacks on today's hardware.

            The key contributions of our work are:

             

Of course, userspace stacks are not a novel concept. Indeed, the Cheetah web server for MIT's XOK Exokernel [19] operating system took a similar approach and showed significant performance gains over the NCSA web server in 1994. Despite this, the concept never really took off, and in the intervening years conventional stacks have improved immensely. Unlike XOK, our specialized userspace stacks are built on top of a conventional FreeBSD operating system. We will show that it is possible to get all the performance gains of a specialized stack without rewriting all the ancillary support functions provided by a mature operating system (e.g., the filesystem). Combined with the need to scale server clusters, we believe the time has come to re-evaluate special-purpose stacks on today's hardware.

The key contributions of our work are:

            We discuss many of the issues that affect performance in conventional stacks, even though they use APIs aimed at high performance such as sendfile() and recvmmsg().

We discuss many of the issues that affect performance in conventional stacks, even when they use APIs aimed at high performance such as sendfile() and recvmmsg().

            We describe the design and implementation of multiple modular, highly specialized, application-specific stacks built over a commodity operating system while avoiding these pitfalls. In contrast to prior work, we demonstrate that it is possible to utilize both conventional and specialized stacks in a single system. This allows us to deploy specialization selectively, optimizing networking while continuing to utilize generic OS components such as filesystems without disruption.

             

We describe the design and implementation of multiple modular, highly specialized, application-specific stacks built over a commodity operating system while avoiding these pitfalls. In contrast to prior work, we show that conventional and specialized stacks can be used side by side in a single system. This lets us deploy specialization selectively, optimizing the network path while continuing to use generic OS components such as filesystems without disruption.

            We demonstrate that specialized network stacks designed for aggressive cross-layer optimizations create opportunities for new and at times counter-intuitive hardware-sensitive optimizations. For example, we find that violating the long-held tenet of data-copy minimization can increase DMA performance for certain workloads on recent CPUs.

We demonstrate that specialized network stacks designed for aggressive cross-layer optimization create opportunities for new, and at times counter-intuitive, hardware-sensitive optimizations. For example, we find that violating the long-held tenet of minimizing data copies can actually increase DMA performance for certain workloads on recent CPUs.

            We present hardware-grounded performance analyses of our specialized network stacks side-by-side with highly optimized conventional network stacks. We evaluate our optimizations over multiple generations of hardware, suggesting portability despite rapid hardware evolution.

We present hardware-grounded performance analyses of our specialized network stacks side by side with highly optimized conventional network stacks. We evaluate our optimizations across multiple generations of hardware, suggesting that they remain applicable even as hardware evolves rapidly.

We explore the potential of a synchronous network stack blended with asynchronous application structures, in stark contrast to conventional asynchronous network stacks supporting synchronous applications. This approach optimizes cache utilization by both the CPU and DMA engines, yielding as much as 2–10× conventional stack performance.

We explore the potential of a synchronous network stack blended with asynchronous application structures, in stark contrast to conventional asynchronous stacks supporting synchronous applications. This approach optimizes cache utilization by both the CPU and the DMA engines, yielding as much as 2–10× the performance of conventional stacks.

            2. SPECIAL-PURPOSE ARCHITECTURE

            What is the minimum amount of work that a web server can perform to serve static content at high speed? It must implement a MAC protocol, IP, TCP (including congestion control), and HTTP.

            However, their implementations do not need to conform to the conventional socket model, split between userspace and kernel, or even implement features such as dynamic TCP segmentation. For a web server that serves the same static content to huge numbers of clients (e.g., the Facebook logo or GMail JavaScript), essentially the same functions are repeated again and again. We wish to explore just how far it is possible to go to improve performance. In particular, we seek to answer the following questions:

What is the minimum amount of work a web server must perform to serve static content at high speed? It must implement a MAC protocol (ps: ARP), IP, TCP (including congestion control), and HTTP.

However, these implementations need not conform to the conventional socket model, be split between userspace and kernel, or even implement features such as dynamic TCP segmentation. For a web server that serves the same static content to huge numbers of clients (e.g., the Facebook logo or GMail JavaScript), essentially the same work is repeated over and over again. We want to explore just how far performance can be pushed. In particular, we seek to answer the following questions:

            Conventional network stacks support zero copy for OSmaintained data – e.g., filesystem blocks in the buffer cache, but not for application-provided HTTP headers or TCP packet headers. Can we take the zero-copy concept to its logical extreme, in which received packet buffers are passed from the NIC all the way to the application, and application packets to be sent are DMAed to the NIC for transmission without even the headers being copied?

Conventional network stacks support zero copy for OS-maintained data, e.g. filesystem blocks in the buffer cache, but not for application-provided HTTP headers or TCP packet headers. Can we take the zero-copy concept to its logical extreme, where received packet buffers are passed from the NIC all the way up to the application, and application packets to be sent are DMAed to the NIC without even the headers being copied? (ps: a received packet is modified in place into the reply and sent back.)
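To make "zero copy all the way" concrete in netmap terms, the sketch below swaps buffer indices between an RX slot and a TX slot so that a received frame can be transmitted (after in-place header edits) without copying the payload. This is the standard netmap zero-copy idiom rather than Sandstorm's actual code; rxring and txring are assumed to have been obtained via NETMAP_RXRING()/NETMAP_TXRING().

    /* Minimal sketch of netmap zero-copy forwarding: swap buffer indices
     * between an RX slot and a TX slot so the payload is never copied. */
    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>

    static void
    zero_copy_swap(struct netmap_ring *rxring, struct netmap_ring *txring)
    {
        struct netmap_slot *rs = &rxring->slot[rxring->cur];
        struct netmap_slot *ts = &txring->slot[txring->cur];
        uint32_t tmp = ts->buf_idx;

        ts->buf_idx = rs->buf_idx;      /* hand the received buffer to TX   */
        rs->buf_idx = tmp;              /* recycle the old TX buffer for RX */
        ts->len = rs->len;
        ts->flags |= NS_BUF_CHANGED;    /* tell netmap the buffers moved    */
        rs->flags |= NS_BUF_CHANGED;

        rxring->head = rxring->cur = nm_ring_next(rxring, rxring->cur);
        txring->head = txring->cur = nm_ring_next(txring, txring->cur);
    }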

            Conventional stacks make extensive use of queuing and buffering to mitigate context switches and keep CPUs and NICs busy, at the cost of substantially increased cache footprint and latency. Can we adopt a bufferless event model that reimposes synchrony and avoids large queues that exceed cache sizes? Can we expose link-layer buffer information, such as available space in the transmit descriptor ring, to prevent buffer bloat and reduce wasted work constructing packets that will only be dropped?

             

Conventional stacks make extensive use of queueing and buffering to mitigate context switches and keep CPUs and NICs busy, at the cost of a substantially larger cache footprint and higher latency. Can we adopt a bufferless event model that restores synchrony and avoids large queues that exceed cache sizes? Can we expose link-layer buffer information, such as the space available in the transmit descriptor ring, to prevent buffer bloat and avoid wasting work building packets that will only be dropped?

             

            Conventional stacks amortize expenses internally, but cannot amortize repetitive costs spanning application and network layers. For example, they amortize TCP connection lookup using Large Receive Offload (LRO) but they cannot amortize the cost of repeated TCP segmentation of the same data transmitted multiple times. Can we design a network-stack API that allows cross-layer amortizations to be accomplished such that after the first client is served, no work is ever repeated when serving subsequent clients?

             

Conventional stacks amortize costs internally, but cannot amortize repetitive costs that span the application and network layers. For example, they amortize TCP connection lookup using Large Receive Offload (LRO), but they cannot amortize the cost of repeatedly TCP-segmenting the same data each time it is transmitted (ps: splitting the same large block into small packets over and over). Can we design a network-stack API that allows such cross-layer amortization, so that after the first client has been served, no work is ever repeated when serving subsequent clients?

             

Conventional stacks embed the majority of network code in the kernel to avoid the cost of domain transitions, limiting two-way communication flow through the stack. Can we make heavy use of batching to allow device drivers to remain in the kernel while colocating stack code with the application and avoiding significant latency overhead?

             

Conventional stacks embed most network code in the kernel to avoid the cost of domain transitions, which limits two-way communication flow through the stack. Can we make heavy use of batching so that the device driver stays in the kernel while the stack code is colocated with the application (ps: i.e. lives in the same userspace process as the application), without incurring significant latency overhead?

             

            Can we avoid any data-structure locking, and even cache-line contention, when dealing with multi-core applications that do not require it?

             

Can we avoid all data-structure locking, and even cache-line contention, when dealing with multicore applications that do not require them?

             

            Finally, while performing all the above, is there a suitable programming abstraction that allows these components to be reused for other applications that may benefit from server specialization?

             

            最后,在執(zhí)行上述所有操作時(shí),是否有合適的編程抽象,允許這些組件重用于可能受益于服務(wù)器專(zhuān)業(yè)化的其他應(yīng)用程序?

             

            2.1 Network-stack Modularization

            Although monolithic kernels are the de facto standard for networked systems, concerns with robustness and flexibility continue to drive exploration of microkernel-like approaches. Both Sandstorm and Namestorm take on several microkernel-like qualities:

Although monolithic kernels are the de facto standard for networked systems, concerns about robustness and flexibility continue to drive exploration of microkernel-like approaches. Both Sandstorm and Namestorm take on several microkernel-like qualities:

             

            Rapid deployment & reusability: Our prototype stack is highly modular, and synthesized from the bottom up using traditional dynamic libraries as building blocks (components) to construct a special-purpose system. Each component corresponds to a standalone service that exposes a well-defined API. Our specialized network stacks are built by combining four basic components:

             

Rapid deployment and reusability: our prototype stack is highly modular, synthesized from the bottom up using traditional dynamic libraries as building blocks (components) to construct a special-purpose system. Each component corresponds to a standalone service exposing a well-defined API. Our specialized network stacks are built by combining four basic components:

             

            The netmap I/O (libnmio) library that abstracts traditional data-movement and event-notification primitives needed by higher levels of the stack.

             

The netmap I/O library (libnmio), which abstracts the traditional data-movement and event-notification primitives needed by the higher layers of the stack.

             

            libeth component, a lightweight Ethernet-layer implementation.

             

The libeth component, a lightweight Ethernet-layer implementation.

             

            libtcpip that implements our optimized TCP/IP layer.

             

libtcpip, which implements our optimized TCP/IP layer.

             

            libudpip that implements a UDP/IP layer.

             

libudpip, which implements a UDP/IP layer.

             

            Figure 1 depicts how some of these components are used with a simple application layer to form Sandstorm, the optimized web server.

            Splitting functionality into reusable components does not require us to sacrifice the benefits of exploiting cross-layer knowledge to optimize performance, as memory and control flow move easily across API boundaries. For example, Sandstorm interacts directly with libnmio to preload and push segments into the appropriate packet-buffer pools. This preserves a service-centric approach.

            Developer-friendly: Despite seeking inspiration from microkernel design, our approach maintains most of the benefits of conventional monolithic systems:

             

Figure 1 shows how some of these components are combined with a simple application layer to form Sandstorm, the optimized web server.

Splitting functionality into reusable components does not force us to give up the benefits of cross-layer knowledge for performance, because memory and control flow move easily across API boundaries. For example, Sandstorm interacts directly with libnmio to preload and push segments into the appropriate packet-buffer pools. This preserves a service-centric approach.

Developer-friendly: despite drawing inspiration from microkernel design, our approach keeps most of the benefits of conventional monolithic systems:

             

Debugging is at least as easy (if not easier) compared to conventional systems, as application-specific, performance-centric code shifts from the kernel to more accessible userspace.

             

Debugging is at least as easy as (if not easier than) on conventional systems, because the application-specific, performance-centric code moves from the kernel into more accessible userspace.

             

            Our approach integrates well with the general-purpose operating systems: rewriting basic components such as device drivers or filesystems is not required. We also have direct access to conventional debugging, tracing, and profiling tools, and can also use the conventional network stack for remote access (e.g., via SSH).

             

Our approach integrates well with a general-purpose operating system: there is no need to rewrite basic components such as device drivers or filesystems. We also have direct access to conventional debugging, tracing, and profiling tools, and can still use the conventional network stack for remote access (e.g., via SSH).

             

            Instrumentation in Sandstorm is a simple and straightforward task that allows us to explore potential bottlenecks as well as necessary and sufficient costs in network processing across application and stack. In addition, off-the-shelf performance monitoring and profiling tools “just work”, and a synchronous design makes them easier to use.

             

Instrumenting Sandstorm is simple and straightforward, which lets us explore potential bottlenecks as well as the necessary and sufficient costs of network processing across the application and the stack. In addition, off-the-shelf performance monitoring and profiling tools "just work", and the synchronous design makes them easier to use.

             

            2.2 Sandstorm web server design

            Rizzo’s netmap framework provides a general-purpose API that allows received packets to be mapped directly to userspace, and packets to be transmitted to be sent directly from userspace to the NIC’s DMA rings. Combined with batching to reduce system calls, this provides a high-performance framework on which to build packet-processing applications. A web server, however, is not normally thought of as a packet-processing application, but one that handles TCP streams.

             

Rizzo's netmap framework provides a general-purpose API that allows received packets to be mapped directly into userspace, and packets to be transmitted to be sent directly from userspace to the NIC's DMA rings. Combined with batching to reduce system calls, this provides a high-performance framework on which to build packet-processing applications. A web server, however, is not normally thought of as a packet-processing application, but as one that handles TCP streams.
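For readers unfamiliar with netmap, a minimal receive loop looks roughly like the following. nm_open(), poll(), and nm_nextpkt() are the real netmap userspace helpers; "netmap:ix0" and process_packet() are placeholders, and error handling is omitted. Batching falls out naturally: one poll() wakeup can deliver an entire ring's worth of packets.

    /* Minimal netmap receive loop (sketch). */
    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>
    #include <poll.h>

    void process_packet(unsigned char *buf, unsigned int len);  /* application handler (placeholder) */

    void rx_loop(void)
    {
        struct nm_desc *d = nm_open("netmap:ix0", NULL, 0, NULL);
        struct pollfd pfd = { .fd = NETMAP_FD(d), .events = POLLIN };
        struct nm_pkthdr h;
        unsigned char *buf;

        for (;;) {
            poll(&pfd, 1, -1);                 /* one system call per batch          */
            while ((buf = nm_nextpkt(d, &h)) != NULL)
                process_packet(buf, h.len);    /* frame mapped directly in userspace */
        }
    }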

             

To serve a static file, we load it into memory, and a priori generate all the packets that will be sent, including TCP, IP, and link-layer headers. When an HTTP request for that file arrives, the server must allocate a TCP-protocol control block (TCB) to keep track of the connection's state, but the packets to be sent have already been created for each file on the server.

             

To serve a static file, we load it into memory and generate, a priori, all the packets that will ever be sent for it, including TCP, IP, and link-layer headers. When an HTTP request for that file arrives, the server must allocate a TCP protocol control block (TCB) to track the connection's state, but the packets to be sent have already been created for every file on the server.
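A sketch of what "prepackaged" might look like: at load time each file is cut into MSS-sized chunks, and each chunk is stored as a complete Ethernet/IP/TCP frame whose payload bytes never change; only a handful of header fields are patched per connection at send time. The names and layout here are illustrative, not Sandstorm's actual data structures.

    /* Illustrative layout for pre-segmented static content. */
    #include <stdint.h>
    #include <stddef.h>

    #define MSS 1448

    struct prepkt {
        uint16_t len;                        /* total frame length                    */
        uint32_t payload_off;                /* byte offset of this chunk in the file */
        uint8_t  frame[14 + 20 + 20 + MSS];  /* Ether + IP + TCP headers + payload    */
    };

    struct prefile {
        size_t         nsegs;                /* number of MSS-sized segments          */
        struct prepkt *segs;                 /* built once at startup, reused forever */
    };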

             

            The majority of the work is performed during inbound TCP ACK processing. The IP header is checked, and if it is acceptable, a hash table is used to locate the TCB. The offset of the ACK number from the start of the connection is used to locate the next prepackaged packet to send, and if permitted by the congestion and receive windows, subsequent packets. To send these packets, the destination address and port must be rewritten, and the TCP and IP checksums incrementally updated. The packet can then be directly fetched by the NIC using netmap. All reads of the ACK header and modifications to the transmitted packets are performed in a single pass, ensuring that both the headers and the TCB remain in the CPU’s L1 cache.

             

Most of the work is performed during inbound TCP ACK processing. The IP header is checked and, if it is acceptable, a hash table is used to locate the TCB (ps: so packets of the same connection are handled by the same TCB). The offset of the ACK number from the start of the connection is used to locate the next prepackaged packet to send and, if the congestion and receive windows permit, subsequent packets. To send these packets, the destination address and port must be rewritten and the TCP and IP checksums incrementally updated. The packet can then be fetched directly by the NIC via netmap. All reads of the ACK header and all modifications to the transmitted packets are performed in a single pass, keeping both the headers and the TCB in the CPU's L1 cache.
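The incremental checksum update mentioned here is the standard RFC 1624 technique: when a 16-bit field changes from an old value to a new one, the checksum can be fixed up without re-summing the payload. A minimal sketch (32-bit fields such as the IP address or sequence number are handled as two 16-bit halves):

    /* Incremental Internet checksum update (RFC 1624: HC' = ~(~HC + ~m + m')).
     * Applies to the IP, TCP and UDP checksums when a header field such as the
     * destination address, port, or sequence number is rewritten. */
    #include <stdint.h>

    static uint16_t
    cksum_update16(uint16_t cksum, uint16_t old16, uint16_t new16)
    {
        uint32_t sum = (uint16_t)~cksum + (uint16_t)~old16 + new16;
        sum = (sum & 0xffff) + (sum >> 16);   /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }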

             

            Once a packet has been DMAed to the NIC, the packet buffer is returned to Sandstorm, ready to be incrementally modified again and sent to a different client. However, under high load, the same packet may need to be queued in the TX ring for a second client before it has finished being sent to the first client. The same packet buffer cannot be in the TX ring twice, with different destination address and port. This presents us with two design options:

             

Once a packet has been DMAed to the NIC, the packet buffer is returned to Sandstorm, ready to be incrementally modified again and sent to a different client. Under high load, however, the same packet may need to be queued in the TX ring for a second client before it has finished being sent to the first. The same packet buffer cannot sit in the TX ring twice with different destination addresses and ports, which leaves two design options:

             

            We can maintain more than one copy of each packet in memory to cope with this eventuality. The extra copy could be created at startup, but a more efficient solution would create extra copies on demand whenever a high-water mark is reached, and then retained for future use.

             

We can keep more than one copy of each packet in memory to cope with this eventuality. The extra copies could be created at startup, but a more efficient solution creates them on demand whenever a high-water mark is reached, and then retains them for future use.

             

We can maintain only one long-term copy of each packet, creating ephemeral copies each time it needs to be sent.

             

We can keep only one long-term copy of each packet and create an ephemeral copy each time it needs to be sent.

             

We call the former a pre-copy stack (it is an extreme form of zero-copy stack because in the steady state it never copies, but differs from the common use of the term "zero copy"), and the latter a memcpy stack. A pre-copy stack performs less per-packet work than a memcpy stack, but requires more memory; because of this, it has the potential to thrash the CPU's L3 cache. With the memcpy stack, it is more likely for the original version of a packet to be in the L3 cache, but more work is done. We will evaluate both approaches, because it is far from obvious how CPU cycles trade off against cache misses in modern processors.

             

We call the former a pre-copy stack (an extreme form of zero-copy stack: in the steady state it never copies, though it differs from the usual meaning of "zero copy"), and the latter a memcpy stack. A pre-copy stack does less per-packet work than a memcpy stack but needs more memory, so it can thrash the CPU's L3 cache. With the memcpy stack, the original version of a packet is more likely to be in the L3 cache, but more work is done per packet. We evaluate both approaches, because it is far from obvious how CPU cycles trade off against cache misses on modern processors.
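The two options can be summarized in code. Below, precopy_pick() stands for the pre-copy strategy (hand out an idle long-lived replica, cloning another only when a high-water mark is reached), while memcpy_emit() is the memcpy strategy (copy the master frame into the TX buffer every time). Both are illustrative sketches with made-up types, not the paper's implementation.

    /* Sketch of the two copy strategies (illustrative types and names). */
    #include <stdint.h>
    #include <string.h>

    struct pkt { uint16_t len; uint8_t frame[1514]; int in_flight; };

    /* Pre-copy: several long-lived replicas of each frame exist; pick an idle
     * one, and only its headers will be patched before transmission. */
    static struct pkt *precopy_pick(struct pkt *replicas, int n)
    {
        for (int i = 0; i < n; i++)
            if (!replicas[i].in_flight)
                return &replicas[i];
        return NULL;   /* caller clones another replica and keeps it for later use */
    }

    /* memcpy: one master copy; clone the whole frame into the TX buffer each
     * time, which costs a copy but keeps the payload hot in the cache. */
    static void memcpy_emit(uint8_t *txbuf, uint16_t *txlen, const struct pkt *master)
    {
        memcpy(txbuf, master->frame, master->len);
        *txlen = master->len;
    }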

             

            Figure 2 illustrates tradeoffs through traces taken on nginx/Linux and pre-copy Sandstorm servers that are busy (but unsaturated). On the one hand, a batched design measurably increases TCP roundtrip time with a relatively idle CPU. On the other hand, Sandstorm amortizes or eliminates substantial parts of per-request processing through a more efficient architecture. Under light load, the benefits are pronounced; at saturation, the effect is even more significant.

             

Figure 2 illustrates the tradeoffs through traces taken on nginx/Linux and pre-copy Sandstorm servers that are busy but not saturated. On the one hand, the batched design measurably increases TCP round-trip time while the CPU is relatively idle. On the other hand, Sandstorm amortizes or eliminates substantial parts of per-request processing through a more efficient architecture. Under light load the benefits are pronounced; at saturation the effect is even more significant.

             

Although most work is synchronous within the ACK processing code path, TCP still needs timers for certain operations. Sandstorm's timers are scheduled by polling the Time Stamp Counter (TSC): although not as accurate as other clock sources, it is accessible from userspace at the cost of a single CPU instruction (on modern hardware). The TCP slow timer routine is invoked periodically (every ~500ms) and traverses the list of active TCBs: on RTO expiration, the congestion window and slow-start threshold are adjusted accordingly, and any unacknowledged segments are retransmitted. The same routine also releases TCBs that have been in TIME_WAIT state for longer than 2*MSL. There is no buffering whatsoever required for retransmissions: we identify the segment that needs to be retransmitted using the oldest unacknowledged number as an offset, retrieve the next available prepackaged packet and adjust its headers accordingly, as with regular transmissions. Sandstorm currently implements TCP Reno for congestion control.

             

Although most work happens synchronously in the ACK-processing path, TCP still needs timers for certain operations. Sandstorm's timers are scheduled by polling the Time Stamp Counter (TSC): although not as accurate as other clock sources, it can be read from userspace at the cost of a single CPU instruction on modern hardware. The TCP slow-timer routine is invoked periodically (every ~500ms) and walks the list of active TCBs: on RTO expiration the congestion window and slow-start threshold are adjusted accordingly and any unacknowledged segments are retransmitted. The same routine also releases TCBs that have been in TIME_WAIT for longer than 2*MSL. No buffering at all is needed for retransmission: we identify the segment to retransmit using the oldest unacknowledged number as an offset, retrieve the next available prepackaged packet, and adjust its headers exactly as for a regular transmission. Sandstorm currently implements TCP Reno congestion control.
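Polling the TSC from userspace costs a single instruction; a hedged sketch of how the ~500ms slow-timer check might be driven is shown below. The __rdtsc() intrinsic is real (GCC/Clang, x86intrin.h), but tsc_hz, next_slowtimer and tcp_slowtimo() are illustrative names, not the paper's code.

    /* Sketch: drive the ~500ms TCP slow timer by polling the TSC. */
    #include <stdint.h>
    #include <x86intrin.h>

    void tcp_slowtimo(void);          /* placeholder: walks active TCBs as described above */

    static uint64_t tsc_hz;           /* cycles per second, calibrated once at startup     */
    static uint64_t next_slowtimer;   /* TSC value at which the timer next fires           */

    static void maybe_run_slow_timer(void)
    {
        uint64_t now = __rdtsc();                /* single instruction, no system call */
        if (now >= next_slowtimer) {
            next_slowtimer = now + tsc_hz / 2;   /* roughly 500ms from now             */
            tcp_slowtimo();
        }
    }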

             

            2.3 The Namestorm DNS server

The same principles applied in the Sandstorm web server also apply to a wide range of servers returning the same content to multiple users. Authoritative DNS servers are often targets of DDoS attacks – they represent a potential single point of failure, and because DNS traditionally uses UDP, it lacks TCP's three-way handshake to protect against attackers using spoofed IP addresses. Thus, high-performance DNS servers are of significant interest.

             

The same principles applied in the Sandstorm web server also apply to a wide range of servers that return the same content to many users. Authoritative DNS servers are frequent DDoS targets: they are a potential single point of failure, and because DNS traditionally uses UDP, it lacks TCP's three-way handshake to protect against attackers using spoofed IP addresses. High-performance DNS servers are therefore of significant interest.

             

            Unlike TCP, the conventional UDP stack is actually quite lightweight, and DNS servers already preprocess zone files and store response data in memory. Is there still an advantage running a specialized stack?

             

Unlike TCP, the conventional UDP stack is actually quite lightweight, and DNS servers already preprocess zone files and keep response data in memory. Is there still an advantage to running a specialized stack?

             

Most DNS-request processing is simple. When a request arrives, the server performs sanity checks, hashes the concatenation of the name and record type being requested to find the response, and sends that data. We can preprocess the responses so that they are already stored as a prepackaged UDP packet. As with HTTP, the destination address and port must be rewritten, the identifier must be updated, and the UDP and IP checksums must be incrementally updated. After the initial hash, all remaining processing is performed in one pass, allowing processing of DNS response headers to be performed from the L1 cache. As with Sandstorm, we can use pre-copy or memcpy approaches so that more than one response for the same name can be placed in the DMA ring at a time.

             

Most DNS-request processing is simple. When a request arrives, the server performs sanity checks, hashes the concatenation of the requested name and record type to find the response, and sends that data. We can preprocess the responses so that they are already stored as prepackaged UDP packets. As with HTTP, the destination address and port must be rewritten, the identifier must be updated, and the UDP and IP checksums must be incrementally updated. After the initial hash, all remaining processing happens in a single pass, so the DNS response headers can be processed out of the L1 cache. As with Sandstorm, we can use the pre-copy or memcpy approach so that more than one response for the same name can sit in the DMA ring at a time.

             

            Our specialized userspace DNS server stack is composed of three reusable components, libnmio, libeth, libudpip, and a DNS-specific application layer. As with Sandstorm, Namestorm uses FreeBSD’s netmap API, implementing the entire stack in userspace, and uses netmap’s batching to amortize system call overhead. libnmio and libeth are the same as used by Sandstorm, whereas libudpip contains UDP-specific code closely integrated with an IP layer. Namestorm is an authoritative nameserver, so it does not need to handle recursive lookups.

             

Our specialized userspace DNS server stack is composed of three reusable components, libnmio, libeth, and libudpip, plus a DNS-specific application layer. As with Sandstorm, Namestorm uses FreeBSD's netmap API, implements the entire stack in userspace, and uses netmap's batching to amortize system-call overhead. libnmio and libeth are the same components Sandstorm uses, while libudpip contains UDP-specific code closely integrated with an IP layer. Namestorm is an authoritative nameserver, so it does not need to handle recursive lookups.

             

Namestorm preprocesses the zone file upon startup, creating DNS response packets for all the entries in the zone, including the answer section and any glue records needed. In addition to type-specific queries for A, NS, MX and similar records, DNS also allows queries for ANY. A full implementation would need to create additional response packets to satisfy these queries; our implementation does not yet do so, but the only effect this would have is to increase the overall memory footprint. In practice, ANY requests prove comparatively rare.

             

Namestorm preprocesses the zone file at startup, creating DNS response packets for every entry in the zone, including the answer section and any glue records needed. Besides type-specific queries for A, NS, MX and similar records, DNS also allows ANY queries. A full implementation would need to create additional response packets to satisfy these; ours does not yet do so, but the only effect would be a larger overall memory footprint. In practice, ANY requests are comparatively rare.

             

Namestorm indexes the prepackaged DNS response packets using a hash table. There are two ways to do this:

             

Namestorm indexes the prepackaged DNS response packets using a hash table. There are two ways to do this:

             


             Index by concatenation of request type (e.g., A, NS, etc) and fully-qualified domain name (FQDN); for example “www.example.com”.

             

Index by the concatenation of the request type (e.g., A, NS, etc.) and the fully-qualified domain name (FQDN); for example "www.example.com".

             

            Index by concatenation of request type and the wire-format FQDN as this appears in an actual query; for example,“[3]www[7]example[3]com[0]” where [3] is a single byte containing the numeric value 3.

             

Index by the concatenation of the request type and the wire-format FQDN as it appears in an actual query; for example "[3]www[7]example[3]com[0]", where [3] is a single byte containing the numeric value 3. (ps: in DNS packets a domain name is encoded as length-prefixed labels, e.g. [3]www[5]baidu[3]com; the length byte before each label makes parsing easy.)

             

Using the wire request format is obviously faster, but DNS permits compression of names. Compression is common in DNS answers, where the same domain name occurs more than once, but proves rare in requests. If we implement wire-format hash keys, we must first perform a check for compression; these requests are decompressed and then re-encoded to uncompressed wire format for hashing. The choice is therefore between optimizing for the common case, using wire-format hash keys, or optimizing for the worst case, assuming compression will be common, and using FQDN hash keys. The former is faster, but the latter is more robust to a DDoS attack by an attacker taking advantage of compression. We evaluate both approaches, as they illustrate different performance tradeoffs.

             

Using the wire-format request is obviously faster, but DNS permits name compression. Compression is common in DNS answers, where the same domain name appears more than once, but rare in requests. If we use wire-format hash keys we must first check for compression; compressed requests are decompressed and then re-encoded into uncompressed wire format before hashing. The choice is therefore between optimizing for the common case, using wire-format hash keys, and optimizing for the worst case, assuming compression will be common, and using FQDN hash keys. The former is faster, but the latter is more robust against a DDoS attacker exploiting compression. We evaluate both, as they illustrate different performance tradeoffs.
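A sketch of the wire-format option described above: the lookup key is built directly from the query's length-prefixed labels plus the 16-bit query type, after a cheap scan that detects compression pointers (top two bits of a length byte set, per RFC 1035) and hands those rare queries off to a decompression path. Function and buffer names are illustrative.

    /* Sketch: build a hash key from a DNS query name in wire format
     * ("[3]www[7]example[3]com[0]") plus the query type. Returns the key
     * length, or -1 if the name uses compression and must be normalized
     * first. Illustrative only. */
    #include <stdint.h>
    #include <string.h>

    static int build_dns_key(const uint8_t *qname, size_t qname_max,
                             uint16_t qtype, uint8_t *key, size_t key_max)
    {
        size_t i = 0;
        while (i < qname_max && qname[i] != 0) {
            if ((qname[i] & 0xC0) == 0xC0)   /* compression pointer: rare in queries */
                return -1;
            i += 1 + qname[i];               /* skip the length byte plus its label  */
        }
        if (i >= qname_max || i + 1 + sizeof(qtype) > key_max)
            return -1;
        memcpy(key, qname, i + 1);           /* include the terminating zero label   */
        memcpy(key + i + 1, &qtype, sizeof(qtype));
        return (int)(i + 1 + sizeof(qtype));
    }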

             

Our implementation does not currently handle referrals, so it can handle only zones for which it is authoritative for all the sub-zones. It could not, for example, handle the .com zone, because it would receive queries for www.example.com, but only have hash table entries for example.com. Truncating the hash key is trivial to do as part of the translation to an FQDN, so if Namestorm were to be used for a domain such as .com, the FQDN version of hashing would be a reasonable approach.

             

Our implementation does not currently handle referrals, so it can only serve zones for which it is authoritative for all sub-zones. It could not, for example, serve the .com zone, because it would receive queries for www.example.com but would only have hash-table entries for example.com. Truncating the hash key is trivial to do as part of the translation to an FQDN, so if Namestorm were used for a domain such as .com, the FQDN version of hashing would be a reasonable approach.

             

             

            Outline of the main Sandstorm event loop

1. Call RX poll to receive a batch of received packets that have been stored in the NIC's RX ring; block if none are available.
2. For each ACK packet in the batch:
3. Perform Ethernet and IP input sanity checks.
4. Locate the TCB for the connection.
5. Update the acknowledged sequence numbers in the TCB; update receive window and congestion window.
6. For each new TCP data packet that can now be sent, or each lost packet that needs retransmitting:
7. Find a free copy of the TCP data packet (or clone one if needed).
8. Correct the destination IP address, destination port, sequence numbers, and incrementally update the TCP checksum.
9. Add the packet to the NIC's TX ring.
10. Check if dt has passed since the last TX poll. If it has, call TX poll to send all queued packets.

             

Outline of Sandstorm's main event loop

1. Call RX poll to receive the batch of packets waiting in the NIC's RX ring; block if none are available.
2. For each ACK packet in the batch:
3. Perform Ethernet- and IP-layer sanity checks.
4. Locate the TCB that handles this connection.
5. Update the acknowledged sequence numbers in the TCB; update the receive window and congestion window.
6. For each new TCP data packet that can now be sent, and each lost packet that needs retransmitting:
7. Find a free copy of the TCP data packet (clone one if needed).
8. Correct the destination IP address, destination port, and sequence numbers, and incrementally update the TCP checksum.
9. Add the packet to the NIC's TX ring.
10. Check whether the TX-poll interval has elapsed since the last TX poll; if so, call TX poll to send all queued packets.

             

            2.4 Main event loop

To understand how the pieces fit together and the nature of interaction between Sandstorm, Namestorm, and netmap, we consider the timeline for processing ACK packets in more detail. Figure 3 summarizes Sandstorm's main loop. SYN/FIN handling, HTTP, and timers are omitted from this outline, but also take place. However, most work is performed in the ACK processing code.

             

To understand how the pieces fit together, and the nature of the interaction between Sandstorm, Namestorm, and netmap, we walk through the handling of ACK packets in more detail. Figure 3 summarizes Sandstorm's main loop. SYN/FIN handling, HTTP, and timers are omitted from this outline but also take place; most of the work, however, is done in the ACK-processing code.

             

One important consequence of this architecture is that the NIC's TX ring serves as the sole output queue, taking the place of conventional socket buffers and software network-interface queues. This is possible because retransmitted TCP packets are generated in the same way as normal data packets. As Sandstorm is fast enough to saturate two 10Gb/s NICs with a single thread on one core, data structures are also lock-free.

             

One important consequence of this architecture is that the NIC's TX ring serves as the sole output queue, taking the place of conventional socket buffers and software network-interface queues. This is possible because retransmitted TCP packets are generated in exactly the same way as normal data packets. Since Sandstorm is fast enough to saturate two 10Gb/s NICs with a single thread on one core, the data structures are also lock-free.

             

            When the workload is heavy enough to saturate the CPU, the system-call rate decreases. The receive batch size increases as calls to RX poll become less frequent, improving efficiency at the expense of increased latency. Under extreme load, the RX ring will fill, dropping packets. At this point the system is saturated and, as with any web server, it must bound the number of open connections by dropping some incoming SYNs.

             

When the workload is heavy enough to saturate the CPU, the system-call rate (ps: the RX/TX poll calls) decreases. The receive batch size grows as calls to RX poll become less frequent, improving efficiency at the cost of increased latency. Under extreme load the RX ring fills up and packets are dropped. At that point the system is saturated and, as with any web server, it must bound the number of open connections by dropping some incoming SYNs.

             

            Under heavier load, the TX-poll system call happens in the RX handler. In our current design, dt, the interval between calls to TX poll in the steady state, is a constant set to 80us. The system-call rate under extreme load could likely be decreased by further increasing dt, but as the pre-copy version of Sandstorm can easily saturate all six 10Gb/s NICs in our systems for all file sizes, we have thus far not needed to examine this. Under lighter load, incoming packets might arrive too rarely to provide acceptable latency for transmitted packets; a 5ms timer will trigger transmission of straggling packets in the NIC’s TX ring.

             

Under heavier load, the TX-poll system call happens in the RX handler. In our current design dt, the interval between TX-poll calls in the steady state, is a constant set to 80us. The system-call rate under extreme load could probably be lowered further by increasing dt, but since the pre-copy version of Sandstorm easily saturates all six 10Gb/s NICs in our systems for all file sizes, we have not yet needed to explore this. Under lighter load, incoming packets may arrive too rarely to provide acceptable latency for transmitted packets, so a 5ms timer triggers transmission of any straggling packets in the NIC's TX ring.

             

            The difference between the pre-copy version and the memcpy version of Sandstorm is purely in step 7, where the memcpy version will simply clone the single original packet rather than search for an unused existing copy.

             

The difference between the pre-copy and memcpy versions of Sandstorm lies purely in step 7: the memcpy version simply clones the single original packet rather than searching for an unused existing copy.

             

            Contemporary Intel server processors support Direct Data I/O (DDIO). DDIO allows NIC-originated Direct Memory Access (DMA) over PCIe to access DRAM through the processor’s Last-Level Cache (LLC). For network transmit, DDIO is able to pull data from the cache without a detour through system memory; likewise, for receive,DMA can place data in the processor cache. DDIO implements administrative limits on LLC utilization intended to prevent DMA from thrashing the cache. This design has the potential to significantly reduce latency and increase I/O bandwidth

             

Contemporary Intel server processors support Direct Data I/O (DDIO). DDIO allows NIC-originated DMA over PCIe to access DRAM through the processor's last-level cache (LLC). For transmission, DDIO can pull data from the cache without a detour through system memory; likewise, on receive, DMA can place data directly in the processor cache. DDIO imposes administrative limits on LLC utilization to prevent DMA from thrashing the cache. This design has the potential to significantly reduce latency and increase I/O bandwidth.

             

            Memcpy Sandstorm forces the payload of the copy to be in the CPU cache from which DDIO can DMA it to the NIC without needing to load it from memory again. With pre-copy, the CPU only touches the packet headers, so if the payload is not in the CPU cache, DDIO must load it, potentially impacting performance. These interactions are subtle, and we will look at them in detail.

             

Memcpy Sandstorm forces the copied payload into the CPU cache, from which DDIO can DMA it to the NIC without loading it from memory again. With pre-copy, the CPU touches only the packet headers, so if the payload is not in the CPU cache, DDIO must load it, potentially hurting performance. (ps: one would expect fewer copies to be faster, but with DDIO the memcpy version guarantees the data is in cache for the CPU-to-NIC transfer, whereas the pre-copy version's data may not be cached and costs an extra load; this point is revisited later.) These interactions are subtle, and we will look at them in detail.

             

            Namestorm follows the same basic outline, but is simpler as DNS is stateless: it does not need a TCB, and sends a single response packet to each request.

             

Namestorm follows the same basic outline but is simpler, because DNS is stateless: it needs no TCB and sends a single response packet for each request.

             

            2.5 API

            As discussed, all of our stack components provide well-defined APIs to promote reusability. Table 1 presents a selection of API functions exposed by libnmio and libtcpip. In this section we describe some of the most interesting properties of the APIs.

             

As discussed, all of our stack components provide well-defined APIs to promote reusability. Table 1 presents a selection of the API functions exposed by libnmio and libtcpip. In this section we describe some of the most interesting properties of these APIs.

             

libnmio is the lowest-level component: it handles all interaction with netmap and abstracts the main event loop. Higher layers (e.g., libeth) register callback functions to receive raw incoming data as well as set timers for periodic events (e.g., TCP slow timer). The function netmap_output() is the main transmission routine: it enqueues a packet to the transmission ring either by memory copy or zero copy, and also implements an adaptive batching algorithm.

            Since there is no socket layer, the application must directly interface with the network stack. With TCP as the transport layer, it acquires a TCB (TCP Control Block), binds it to a specific IPv4 address and port, and sets it to LISTEN state using API functions. The application must also register callback functions to accept connections,receive and process data from active connections, as well as act on successful delivery of sent data (e.g., to close the connection or send more data).

             

libnmio is the lowest-level component: it handles all interaction with netmap and abstracts the main event loop. Higher layers (e.g., libeth) register callbacks to receive raw incoming data and to set timers for periodic events (e.g., the TCP slow timer). The function netmap_output() is the main transmission routine: it enqueues a packet on the transmit ring either by memory copy or by zero copy, and implements an adaptive batching algorithm.

Since there is no socket layer, the application must interface directly with the network stack. With TCP as the transport layer, it acquires a TCB (TCP control block), binds it to a specific IPv4 address and port, and sets it to the LISTEN state using API functions. The application must also register callbacks to accept connections, to receive and process data from active connections, and to act on successful delivery of sent data (e.g., to close the connection or send more data).

             

            3. EVALUATION

To explore Sandstorm and Namestorm's performance and behavior, we evaluated using both older and more recent hardware. On older hardware, we employed Linux 3.6.7 and FreeBSD 9-STABLE. On newer hardware, we used Linux 3.12.5 and FreeBSD 10-STABLE. We ran Sandstorm and Namestorm on FreeBSD.

             

To explore Sandstorm's and Namestorm's performance and behavior, we evaluated them on both older and more recent hardware. On the older hardware we used Linux 3.6.7 and FreeBSD 9-STABLE; on the newer hardware, Linux 3.12.5 and FreeBSD 10-STABLE. Sandstorm and Namestorm run on FreeBSD.

             

For the old hardware, we use three systems: two clients and one server, connected via a 10GbE crossbar switch. All test systems are equipped with an Intel 82598EB dual-port 10GbE NIC, 8GB RAM, and two quad-core 2.66 GHz Intel Xeon X5355 CPUs. In 2006, these were high-end servers. For the new hardware, we use seven systems: six clients and one server, all directly connected via dedicated 10GbE links. The server has three dual-port Intel 82599EB 10GbE NICs, 128GB RAM and a quad-core Intel Xeon E5-2643 CPU. In 2014, these are well-equipped contemporary servers.

             

            對(duì)于舊硬件,我們使用三個(gè)系統(tǒng):兩個(gè)客戶(hù)端和一個(gè)服務(wù)器,通過(guò)10GbE交換機(jī)連接。 所有測(cè)試系統(tǒng)都配備了一個(gè)Intel 82598EB雙端口10GbE NIC,8GB RAM和兩個(gè)四核2.66 GHz Intel Xeon X5355 CPU。 2006年,這些都是高端服務(wù)器。 對(duì)于新硬件,我們使用七個(gè)系統(tǒng); 六個(gè)客戶(hù)端和一個(gè)服務(wù)器,都通過(guò)專(zhuān)用的10GbE鏈路直接連接。 該服務(wù)器有三個(gè)雙端口Intel 82599EB 10GbE NIC,128GB RAM和四核Intel Xeon E5-2643 CPU。 在2014年,這些是設(shè)備齊全的現(xiàn)代服務(wù)器。

             

The most interesting improvements between these hardware generations are in the memory subsystem. The older Xeons have a conventional architecture with a single 1,333MHz memory bus serving both CPUs. The newer machines, as with all recent Intel server processors, support Data Direct I/O (DDIO), so whether data to be sent is in the cache can have a significant impact on performance.


The most interesting improvements between these hardware generations are in the memory subsystem. The older Xeons have a conventional architecture with a single 1,333MHz memory bus serving both CPUs. The newer machines, like all recent Intel server processors, support Data Direct I/O (DDIO), so whether the data to be sent is already in the cache can have a significant impact on performance.

             

            Our hypothesis is that Sandstorm will be significantly faster than nginx on both platforms; however, the reasons for this may differ. Experience [18] has shown that the older systems often bottleneck on memory latency, and as Sandstorm is not CPU-intensive, we would expect this to be the case. A zero-copy stack should thus be a big win. In addition, as cores contend for memory, we would expect that adding more cores does not help greatly.

             

Our hypothesis is that Sandstorm will be significantly faster than nginx on both platforms, though for different reasons. Experience [18] has shown that older systems often bottleneck on memory latency, and since Sandstorm is not CPU-intensive we expect that to be the case here, so a zero-copy stack should be a big win. In addition, as cores contend for memory, we would expect adding more cores not to help much.

             

            On the other hand, with DDIO, the new systems are less likely to bottleneck on memory. The concern, however, would be that DDIO could thrash at least part of the CPU cache. On these systems, we expect that adding more cores would help performance, but that in doing so, we may experience scalability bottlenecks such as lock contention in conventional stacks. Sandstorm’s lock-free stack can simply be replicated onto multiple 10GbE NICs, with one core per two NICs to scale performance. In addition, as load increases, the number of packets to be sent or received per system call will increase due to application-level batching. Thus, under heavy load, we would hope that the number of system calls per second to still be acceptable despite shifting almost all network-stack processing to userspace.

             

On the other hand, with DDIO the new systems are less likely to bottleneck on memory. The concern instead is that DDIO could thrash at least part of the CPU cache. On these systems we expect adding more cores to help performance, but in doing so we may hit scalability bottlenecks such as lock contention in conventional stacks. Sandstorm's lock-free stack can simply be replicated across multiple 10GbE NICs, with one core per two NICs, to scale performance. Furthermore, as load increases, the number of packets sent or received per system call grows due to application-level batching, so under heavy load we would hope the system-call rate stays acceptable even though almost all network-stack processing has moved to userspace.

             

            The question, of course, is how well do these design choices play out in practice?

The question, of course, is how well these design choices play out in practice.

             

            3.1 Experiment Design: Sandstorm

We evaluated the performance of Sandstorm through a set of experiments and compare our results against the nginx web server running on both FreeBSD and Linux. Nginx is a high-performance, low-footprint web server that follows the non-blocking, event-driven model: it relies on OS primitives such as kqueue() for readiness event notifications, it uses sendfile() to send HTTP payload directly from the kernel, and it asynchronously processes requests.

             

We evaluate Sandstorm's performance through a set of experiments and compare the results against the nginx web server running on both FreeBSD and Linux. nginx is a high-performance, low-footprint web server that follows the non-blocking, event-driven model: it relies on OS primitives such as kqueue() for readiness-event notification, uses sendfile() to send the HTTP payload directly from the kernel, and processes requests asynchronously.

             

Contemporary web pages are immensely content-rich, but they mainly consist of smaller web objects such as images and scripts. The distribution of requested object sizes for the Yahoo! CDN reveals that 90% of the content is smaller than 25KB [11]. The conventional network stack and web-server application perform well when delivering large files by utilizing OS primitives and NIC hardware features. Conversely, multiple simultaneous short-lived HTTP connections are considered a heavy workload that stresses the kernel-userspace interface and reveals performance bottlenecks: even with sendfile() to send the payload, the size of the transmitted data is not quite enough to compensate for the system cost.

             

Contemporary web pages are immensely content-rich, but they consist mostly of smaller web objects such as images and scripts. The distribution of requested object sizes for the Yahoo! CDN shows that 90% of the content is smaller than 25KB [11]. The conventional network stack and web server perform well when delivering large files, exploiting OS primitives and NIC hardware features. Conversely, many simultaneous short-lived HTTP connections are a heavy workload that stresses the kernel-userspace interface and exposes performance bottlenecks: even with sendfile(), the amount of data transmitted per request is not quite enough to compensate for the system cost.

             

For all the benchmarks, we configured nginx to serve content from a RAM disk to eliminate disk-related I/O bottlenecks. Similarly, Sandstorm preloads the data to be sent and performs its pre-segmentation phase before the experiments begin. We use weighttp [9] to generate load with multiple concurrent clients. Each client generates a series of HTTP requests, with a new connection being initiated immediately after the previous one terminates. For each experiment we measure throughput, and we vary the size of the file served, exploring possible tradeoffs between throughput and system load. Finally, we run experiments with a realistic workload by using a trace of files with sizes that follow the distribution of requested HTTP objects of the Yahoo! CDN.

             

            對(duì)于所有的基準(zhǔn)測(cè)試,我們配置了nginx來(lái)從RAM磁盤(pán)提供內(nèi)容,以消除磁盤(pán)相關(guān)的I / O瓶頸。 類(lèi)似地,Sandstorm預(yù)加載要發(fā)送的數(shù)據(jù),并在實(shí)驗(yàn)開(kāi)始之前執(zhí)行其預(yù)分割階段。 我們使用weighttp [9]來(lái)生成多個(gè)并發(fā)客戶(hù)端的負(fù)載。 每個(gè)客戶(hù)端生成一系列HTTP請(qǐng)求,在前一個(gè)終止后立即啟動(dòng)新的連接。 對(duì)于每個(gè)實(shí)驗(yàn),我們測(cè)量吞吐量,并且我們改變所服務(wù)的文件的大小,探索吞吐量和系統(tǒng)負(fù)載之間可能的折衷。 最后,我們使用跟蹤文件的實(shí)際工作量進(jìn)行實(shí)驗(yàn),這些文件的大小遵循Yahoo! CDN所請(qǐng)求的HTTP對(duì)象的分布。

             

            3.2 Sandstorm Results

First, we wish to explore how file size affects performance. Sandstorm is designed with small files in mind, and batching to reduce overheads, whereas the conventional sendfile() ought to be better for larger files.

             

First, we want to understand how file size affects performance. Sandstorm is designed with small files in mind and uses batching to reduce overheads, whereas the conventional sendfile() ought to do better for larger files.

             

            Figure 4 shows performance as a function of content size, comparing pre-copy Sandstorm and nginx running on both FreeBSD and Linux. With a single 10GbE NIC (Fig. 4a and 4d), Sandstorm outperforms nginx for smaller files by ~23–240%. For larger files, all three configurations saturate the link. Both conventional stacks are more CPU-hungry for the whole range of file sizes tested, despite potential advantages such as TSO on bulk transfers.

             

Figure 4 shows performance as a function of content size, comparing pre-copy Sandstorm with nginx running on both FreeBSD and Linux. With a single 10GbE NIC (Figures 4a and 4d), Sandstorm outperforms nginx by roughly 23–240% for smaller files; for larger files all three configurations saturate the link. Both conventional stacks use more CPU across the whole range of file sizes tested, despite potential advantages such as TSO for bulk transfers.

             

            To scale to higher bandwidths, we added more 10GbE NICs and client machines. Figure 4b shows aggregate throughput with four 10GbE NICs. Sandstorm saturates all four NICs using just two CPU cores, whereas neither Linux nor FreeBSD can saturate the NICs with files smaller than 128KB, even when using four CPU cores.

             

            As we add yet more NICs, shown in Figure 4c, the difference in performance grows across a wider range of file sizes. With 6×10GbE NICs, Sandstorm gives between 10% and 10× more throughput than FreeBSD for file sizes in the range of 4–256KB. Linux fares worse, experiencing a performance drop (see Figure 4c) compared to FreeBSD with smaller file sizes and 5–6 NICs. Low CPU utilization is normally good, but here (Figures 4f, 5b) idle time is undesirable since the NICs are not yet saturated. We have not identified any single obvious cause for this degradation. Packet traces show the delay occurs between the connection being accepted and the response being sent. No single kernel lock is held especially long, and although locking is not negligible, it does not dominate either. The system suffers one soft page fault for every two connections on average, but no hard faults, so data is already in the disk buffer cache, and TCB recycling is enabled. This is an example of how hard it can be to find performance problems in conventional stacks. Interestingly, this was an application-specific behavior triggered only on Linux: in benchmarks we conducted with other web servers (e.g., lighttpd [3], OpenLiteSpeed [7]) we did not experience a similar performance collapse on Linux with more than four NICs. We have chosen, however, to present the nginx datasets as it offered the greatest overall scalability on both operating systems.

             

            It is clear that Sandstorm dramatically improves network performance when it serves small web objects, but somewhat surprisingly, it performs better for larger files too. For completeness, we evaluate Sandstorm using a realistic workload: following the distribution of requested HTTP object sizes on the Yahoo! CDN [11], we generated a trace of 1000 files ranging from a few KB up to ~20MB, which were served from both Sandstorm and nginx. On the clients, we modified weighttp to benchmark the server by concurrently requesting files in a random order. Figures 5a and 5b show the achieved network throughput and the CPU utilization of the server as a function of the number of network adapters. The network performance improvement is more than 2× while CPU utilization is reduced.

             

            Finally, we evaluated whether Sandstorm handles high packet loss correctly. With 80 simultaneous clients and 1% packet loss, throughput plummets, as expected. FreeBSD achieves approximately 640Mb/s and Sandstorm roughly 25% less. This is not fundamental, but due to FreeBSD's more fine-grained retransmit timer and its use of NewReno congestion control rather than Reno, both of which could also be implemented in Sandstorm. Neither network nor server is stressed in this experiment; had a genuinely congested link been causing the loss, both stacks would have filled it.

             

            Throughout, we have invested considerable effort in profiling and optimizing the conventional network stacks, both to understand their design choices and bottlenecks, and to provide the fairest possible comparison. We applied conventional performance tuning to Linux and FreeBSD, such as increasing hash-table sizes, manually tuning CPU work placement for multiqueue NICs, and adjusting NIC parameters such as interrupt mitigation. In collaboration with Netflix, we also developed a number of TCP and virtual-memory subsystem performance optimizations for FreeBSD, reducing lock contention under high packet loads. One important optimization relates to sendfile(): contention within the VM subsystem occurred while TCP-layer socket-buffer locks were held, triggering a cascade across the system as a whole. These changes have been upstreamed to FreeBSD for inclusion in a future release.

             

            To copy or not to copy

            The pre-copy variant of Sandstorm maintains more than one copy of each segment in memory so that it can send the same segment to multiple clients simultaneously. This requires more memory than nginx serving files from RAM. The memcpy variant only enqueues copies, requiring a single long-lived version of each packet, and uses a similar amount of memory to nginx. How does this memcpy affect performance? Figure 6 explores network throughput, CPU utilization, and system-call rate for two- and six-NIC configurations; the sketch below illustrates the difference between the two transmit paths.
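            A minimal sketch of the two transmit paths being compared (illustrative only; the structure and function names are assumptions, not Sandstorm's implementation): the pre-copy path keeps several ready-made replicas of each packet so one can be handed to the NIC per client, while the memcpy path keeps a single long-lived template and copies it into a transmit buffer at send time.

            /* Illustrative sketch of the two Sandstorm variants discussed above.
             * "struct pkt" stands for a fully formed Ethernet/IP/TCP frame. */
            #include <stdint.h>
            #include <string.h>

            struct pkt { uint8_t data[1514]; uint16_t len; };

            /* Pre-copy: several replicas per segment were built ahead of time; pick a
             * free one, patch the per-connection fields, and enqueue it untouched. */
            static struct pkt *tx_precopy(struct pkt replicas[], int nreplicas, int *next)
            {
                struct pkt *p = &replicas[*next];
                *next = (*next + 1) % nreplicas;
                /* ...patch dst MAC/IP/port, seq/ack, checksum deltas here... */
                return p;                          /* handed to the NIC ring as-is */
            }

            /* memcpy variant: one long-lived template per segment; copy it into the
             * transmit buffer owned by the NIC ring, then patch the same fields. */
            static void tx_memcpy(const struct pkt *tmpl, uint8_t *txbuf)
            {
                memcpy(txbuf, tmpl->data, tmpl->len);   /* the extra copy measured below */
                /* ...patch per-connection fields in txbuf... */
            }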

             

            With six NICs, the additional memcpy() marginally reduces performance (Figure 6b) while exhibiting slightly higher CPU load (Figure 6d). In this experiment, Sandstorm only uses three cores to simplify the comparison, so around 75% utilization saturates those cores. The memcpy variant saturates the CPU for files smaller than 32KB, whereas the pre-copy variant does not. Nginx, using sendfile() and all four cores, only catches up at file sizes of 512KB and above, and even then exhibits higher CPU load.

             

            As file size decreases, the expense of SYN/FIN and HTTP-request processing becomes measurable for both variants, but the pre-copy version has more headroom and so is affected less. It is interesting to observe the effects of batching under overload with the memcpy stack in Figure 6f. With large file sizes, pre-copy and memcpy make the same number of system calls per second. With small files, however, the memcpy stack makes substantially fewer system calls per second. This illustrates the efficacy of batching: memcpy has saturated the CPU, and consequently no longer polls the RX queue as often. As the batch size increases, the per-packet system-call cost decreases, helping the server weather the storm. The pre-copy variant is not stressed here and continues to poll frequently, but would behave the same way under overload. In the end, the cost of the additional memcpy is measurable, but the variant still performs quite well.

             

            Results on contemporary hardware are significantly different from those obtained on older, pre-DDIO hardware. Figure 7 shows the results obtained on our 2006-era servers. On the older machines, Sandstorm outperforms nginx by a factor of three, but the memcpy variant suffers a 30% decrease in throughput compared to pre-copy Sandstorm as a result of adding a single memcpy to the code. It is clear that on these older systems, memory bandwidth is the main performance bottleneck.

             

            With DDIO, memory bandwidth is not such a limiting factor. Figure 9 in Section 3.5 shows the corresponding memory read throughput, as measured using CPU performance counters, for the network-throughput graphs in Figure 6b. With small file sizes, the pre-copy variant of Sandstorm appears to do more work: the L3 cache cannot hold all of the data, so there are many more L3 misses than with memcpy. Memory-read throughput for both pre-copy and nginx is closely correlated with their network throughput, indicating that DDIO is not helping on transmit: DMA comes from memory rather than the cache. The memcpy variant, however, has higher network throughput than memory throughput, indicating that DDIO is transmitting from the cache. Unfortunately, this is offset by much higher memory write throughput. Still, this only causes a small reduction in service throughput. Larger files no longer fit in the L3 cache, even with memcpy, and memory-read throughput starts to rise with files above 64KB. Despite this, performance remains high and CPU load decreases, indicating these systems are not limited by memory bandwidth for this workload.
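            As a rough illustration of counter-based measurement (a sketch, not the paper's methodology: the paper reads DRAM-facing uncore counters, whereas this uses the generic Linux perf_event interface with LLC misses as a proxy for memory traffic):

            /* Hedged sketch: counting last-level-cache misses around a code region with
             * the Linux perf_event_open() syscall. */
            #include <linux/perf_event.h>
            #include <sys/ioctl.h>
            #include <sys/syscall.h>
            #include <unistd.h>
            #include <string.h>
            #include <stdint.h>

            static int open_llc_miss_counter(void)
            {
                struct perf_event_attr attr;
                memset(&attr, 0, sizeof(attr));
                attr.type = PERF_TYPE_HARDWARE;
                attr.size = sizeof(attr);
                attr.config = PERF_COUNT_HW_CACHE_MISSES;   /* LLC misses */
                attr.disabled = 1;
                attr.exclude_kernel = 0;                    /* count kernel work too */
                return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
            }

            static uint64_t measure(void (*workload)(void))
            {
                int fd = open_llc_miss_counter();
                uint64_t count = 0;
                ioctl(fd, PERF_EVENT_IOC_RESET, 0);
                ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
                workload();                                 /* e.g. one benchmark run */
                ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
                if (read(fd, &count, sizeof(count)) != sizeof(count))
                    count = 0;
                close(fd);
                return count;
            }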

             

            3.3 Experiment Design: Namestorm

            We use the same client and server systems to evaluate Namestorm as we used for Sandstorm. Namestorm is expected to be significantly more CPU-intensive than Sandstorm, mostly due to fundamental DNS protocol properties: a high packet rate and small packets. Based on this observation, we changed the network topology of our experiment: we use only one NIC on the server, connected to the client systems via a 10GbE cut-through switch. In order to balance the load on the server across all available CPU cores, we use four dedicated NIC queues and four Namestorm instances, as sketched below.

            We ran Nominum's dnsperf [2] DNS profiling software on the clients. We created zone files of varying sizes, loaded them onto the DNS servers, and configured dnsperf to query the zone repeatedly.
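            A hedged sketch of this layout (the interface name ix0, the worker body, and the core mapping are assumptions, not the paper's code): netmap's per-ring names such as "netmap:ix0-0" let each worker attach to a single hardware queue, and each worker thread is pinned to its own core.

            /* Hedged sketch: one worker per hardware queue, pinned to its own core. */
            #define _GNU_SOURCE
            #include <pthread.h>
            #include <sched.h>
            #include <stdio.h>
            #define NETMAP_WITH_LIBS
            #include <net/netmap_user.h>

            static void *worker(void *arg)
            {
                long q = (long)arg;
                char ifname[32];
                snprintf(ifname, sizeof(ifname), "netmap:ix0-%ld", q); /* one ring pair */
                struct nm_desc *d = nm_open(ifname, NULL, 0, NULL);
                if (d == NULL)
                    return NULL;
                /* ...poll d->fd and run the DNS fast path to completion per batch... */
                nm_close(d);
                return NULL;
            }

            int main(void)
            {
                pthread_t t[4];
                for (long q = 0; q < 4; q++) {
                    pthread_create(&t[q], NULL, worker, (void *)q);
                    cpu_set_t set; CPU_ZERO(&set); CPU_SET((int)q, &set);
                    pthread_setaffinity_np(t[q], sizeof(set), &set);  /* pin to core q */
                }
                for (int q = 0; q < 4; q++)
                    pthread_join(t[q], NULL);
                return 0;
            }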

             

            3.4 Namestorm Results

            Figure 8a shows the performance of Namestorm and NSD running on Linux and FreeBSD when using a single 10GbE NIC. NSD's results are similar on FreeBSD and Linux. Neither operating system can saturate the 10GbE NIC, however, and both show some performance drop as the zone file grows. On Linux, NSD's performance drops by ~14% (from ~689,000 to ~590,000 queries/sec) as the zone file grows from 1 to 10,000 entries, and on FreeBSD it drops by ~20% (from ~720,000 to ~574,000 Qps). For these benchmarks, NSD saturates all CPU cores on both systems.

             

            For Namestorm, we utilized two datasets: one where the hash keys are in wire format (w/o compr.), and one where they are in FQDN format (compr.). The latter requires copying the search term before hashing it in order to handle possibly compressed requests.
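            As a hedged illustration of the two key formats (this is not Namestorm's code; fnv1a() and the helper names are stand-ins): a wire-format QNAME is a sequence of length-prefixed labels that can be hashed in place, whereas a request that may contain DNS compression pointers (bytes whose top two bits are 11) first has to be flattened into a contiguous buffer, which is the extra copy whose cost shows up in the results below.

            #include <stdint.h>
            #include <stddef.h>
            #include <string.h>

            static uint32_t fnv1a(const uint8_t *p, size_t n)   /* any hash would do */
            {
                uint32_t h = 2166136261u;
                while (n--) { h ^= *p++; h *= 16777619u; }
                return h;
            }

            /* Wire-format key: hash the QNAME bytes in place, no copy needed. */
            static uint32_t hash_wire(const uint8_t *qname, size_t len)
            {
                return fnv1a(qname, len);
            }

            /* Compression-tolerant key: labels may be reached through 0xC0 pointers into
             * the message, so the name is first flattened into a contiguous buffer
             * (the extra copy discussed above), then hashed. */
            static uint32_t hash_flattened(const uint8_t *msg, size_t msglen, size_t qoff,
                                           uint8_t *buf, size_t buflen)
            {
                size_t o = qoff, n = 0;
                int hops = 0;
                while (o < msglen && msg[o] != 0 && n + 1 < buflen) {
                    if ((msg[o] & 0xC0) == 0xC0) {           /* compression pointer */
                        if (o + 1 >= msglen || ++hops > 16)
                            break;
                        o = (size_t)(((msg[o] & 0x3F) << 8) | msg[o + 1]);
                        continue;
                    }
                    size_t lab = (size_t)msg[o] + 1;         /* length byte plus label */
                    if (o + lab > msglen || n + lab >= buflen)
                        break;
                    memcpy(&buf[n], &msg[o], lab);           /* the extra copy */
                    n += lab;
                    o += lab;
                }
                buf[n++] = 0;                                /* terminating root label */
                return fnv1a(buf, n);
            }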

             

            With wire-format hashing, Namestorm memcpy performance is ~11–13× better, depending on the zone size, than the best results from NSD on either Linux or FreeBSD. Namestorm's throughput drops by ~30% as the zone file grows from 1 to 10,000 entries (from ~9,310,000 to ~6,410,000 Qps). The reason for this decrease is mainly the LLC miss rate, which more than doubles. Dnsperf does not report throughput in Gbps, but given the typical DNS response size for our zones we can calculate ~8.4Gbps and ~5.9Gbps for the smallest and largest zone respectively.
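            The conversion is simple arithmetic; the per-response size used here is back-calculated from the stated figures rather than given in the text: 9,310,000 responses/s × ~113 bytes/response × 8 bits/byte ≈ 8.4 Gb/s for the smallest zone, and 6,410,000 × ~115 B × 8 ≈ 5.9 Gb/s for the largest, i.e. a small DNS answer plus UDP/IP/Ethernet framing of a bit over 100 bytes on the wire.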

             

            With FQDN-format hashing, Namestorm memcpy performance is worse than with wire-format hashing, but is still ~9–13× better than NSD. The extra processing for FQDN-format hashing costs ~10–20% in throughput, depending on the zone size.

            Finally, in Figure 8a we observe a noticeable performance overhead with the pre-copy stack, which we explore in Section 3.5.

             

            3.4.1 Effectiveness of batching

            One of the biggest performance benefits for Namestorm is that netmap provides an API that facilitates batching across the system-call interface. To explore the effects of batching, we configured a single Namestorm instance and one hardware queue, and reran our benchmark with varying batch sizes. Figure 8b illustrates the results: a more than 2× performance gain when growing the batch size from 1 packet (no batching) to 32 packets. Interestingly, the performance of a single-core Namestorm without any batching remains more than 2× better than NSD.
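            The batching that drives this result can be pictured with the following hedged sketch (illustrative, not libnmio or Namestorm code): a single poll() wakes the worker with a whole ring's worth of received slots, which are processed to completion before head/cur are advanced in one step.

            /* Hedged sketch of batching over the netmap system-call boundary. */
            #define NETMAP_WITH_LIBS
            #include <net/netmap_user.h>
            #include <poll.h>

            static void rx_loop(struct nm_desc *d)
            {
                struct pollfd pfd = { .fd = d->fd, .events = POLLIN };
                for (;;) {
                    poll(&pfd, 1, -1);                         /* one syscall per batch */
                    for (int r = d->first_rx_ring; r <= d->last_rx_ring; r++) {
                        struct netmap_ring *ring = NETMAP_RXRING(d->nifp, r);
                        unsigned n = nm_ring_space(ring);      /* packets in this batch */
                        unsigned cur = ring->cur;
                        for (unsigned i = 0; i < n; i++) {
                            struct netmap_slot *slot = &ring->slot[cur];
                            char *buf = NETMAP_BUF(ring, slot->buf_idx);
                            /* ...parse the DNS query in buf[0..slot->len) and build the
                             *    response into a TX slot; cost amortized over the batch... */
                            (void)buf;
                            cur = nm_ring_next(ring, cur);
                        }
                        ring->head = ring->cur = cur;          /* return slots in bulk */
                    }
                }
            }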

             

            At a minimum, NSD has to make one system call to receive each request and one to send a response. Linux recently added the recvmmsg() and sendmmsg() system calls to receive and send multiple UDP messages with a single call. These may go some way toward improving NSD's performance relative to Namestorm. They are, however, UDP-specific, and sendmmsg() requires the application to manage its own transmit-queue batching. When we implemented Namestorm, we already had libnmio, which abstracts and handles all the batching interactions with netmap, so there is no application-specific batching code in Namestorm.
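            For reference, a minimal hedged sketch of the Linux batched-UDP receive path mentioned above (the buffer sizes and the BATCH constant are arbitrary choices):

            /* One recvmmsg() call drains up to BATCH datagrams, amortizing the syscall
             * cost much as netmap batching does; UDP-only, and TX batching with
             * sendmmsg() still has to be managed by the application. */
            #define _GNU_SOURCE
            #include <sys/socket.h>
            #include <netinet/in.h>
            #include <string.h>

            #define BATCH 32

            static int drain_udp(int fd)
            {
                struct mmsghdr msgs[BATCH];
                struct iovec iovs[BATCH];
                static char bufs[BATCH][1500];

                memset(msgs, 0, sizeof(msgs));
                for (int i = 0; i < BATCH; i++) {
                    iovs[i].iov_base = bufs[i];
                    iovs[i].iov_len = sizeof(bufs[i]);
                    msgs[i].msg_hdr.msg_iov = &iovs[i];
                    msgs[i].msg_hdr.msg_iovlen = 1;
                }
                int n = recvmmsg(fd, msgs, BATCH, 0, NULL);  /* up to 32 datagrams, 1 syscall */
                for (int i = 0; i < n; i++) {
                    /* msgs[i].msg_len bytes of datagram i are in bufs[i] */
                }
                return n;
            }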

             

            3.5 DDIO

            With DDIO, incoming packets are DMAed directly to the CPU's L3 cache, and outgoing packets are DMAed directly from the L3 cache, avoiding round trips from the CPU to the memory subsystem. For lightly loaded servers in which the working set is smaller than the L3 cache, or in which data is accessed with temporal locality by the processor and DMA engine (e.g., touched and immediately sent, or received and immediately accessed), DDIO can dramatically reduce latency by avoiding memory traffic. Thus DDIO is ideal for RPC-like mechanisms in which processing latency is low and data is used immediately before or after DMA. On heavily loaded systems, it is far from clear whether DDIO is a win. For applications with a larger cache footprint, or in which communication occurs at some delay from CPU generation or use of packet data, DDIO can unnecessarily pollute the cache and trigger additional memory traffic, damaging performance.

             

            Intuitively, one might reasonably assume that Sandstorm's pre-copy mode would interact best with DDIO: as with sendfile()-based designs, only packet headers enter the L1/L2 caches, with payload content rarely touched by the CPU. Figure 9 therefore illustrates a surprising effect when operating on small file sizes: overall memory throughput from the CPU package, as measured using performance counters on the DRAM-facing interface of the LLC, shows significantly less traffic for the memcpy implementation than for the pre-copy one, which sustains a constant rate roughly equal to its network throughput.

             

            We believe this occurs because DDIO is, by policy, limited from occupying most of the LLC. In the pre-copy case, DDIO is responsible for pulling untouched data into the cache; as the file data cannot fit in this subset of the cache, DMA access thrashes the cache and all network transmission is done from DRAM. In the memcpy case, the CPU loads data into the cache, allowing more complete utilization of the cache for network data. However, as the DRAM memory interface is not a bottleneck in the system as configured, the net result of the additional memcpy, despite better cache utilization, is reduced performance. As file sizes increase, the overall footprint of memory copying rapidly exceeds the LLC size and then network throughput, at which point pre-copy becomes more efficient. Likewise, one might mistakenly conclude simply from inspection of CPU memory counters that nginx is somehow benefiting from the same effect: in fact, nginx is experiencing CPU saturation, and it is not until the file size reaches 512KB that sufficient CPU is available to converge with pre-copy's saturation of the network link.

             

            By contrast, Namestorm sees improved performance with the memcpy implementation: the cache lines holding packet data must be dirtied anyway due to protocol requirements, so performing the memcpy adds little CPU overhead yet allows much more efficient use of the cache by DDIO.

            (Ps: this example is quite striking. Although the memcpy variant has to copy the data on every send, it is more flexible: the payload ends up in the cache, and network throughput roughly tracks memory throughput. One would normally assume the pre-built packets are faster because they need no copy, but since they are not resident in the cache every transmission has to come from DRAM, and the larger the file the more pronounced this becomes.)

             

            4. DISCUSSION

            We developed Sandstorm and Namestorm to explore the hypothesis that fundamental architectural change might be required to properly exploit rapidly growing CPU core counts and NIC capacity. Comparisons with Linux and FreeBSD appear to confirm this conclusion far more dramatically than we expected: while there are small-factor differences between the Linux and FreeBSD performance curves, we observe that their shapes are fundamentally the same. We believe this reflects near-identical underlying architectural decisions stemming from common intellectual ancestry (the BSD network stack and sockets API) and largely incremental changes from that original design.

             

            Sandstorm and Namestorm adopt fundamentally different architectural approaches, emphasizing transparent memory flow within applications (not across expensive protection-domain boundaries), process-to-completion, heavy amortization, batching, and application-specific customizations that seem antithetical to general-purpose stack design. The results are dramatic, accomplishing near-linear speedup with increases in core and NIC capacity – completely different curves, possible only with a completely different design.

             

            4.1 Current network-stack specialization

            Over the years there have been many attempts to add specialized features to general-purpose stacks such as FreeBSD and Linux. Examples include sendfile(), primarily for web servers; recvmmsg(), mostly aimed at DNS servers; and assorted socket options for telnet. In some cases, entire applications have been moved into the kernel [13, 24] because it was too difficult to achieve performance through the existing APIs. The problem with these enhancements is that each serves a narrow role, yet must still fit within a general OS architecture, and so each is constrained in what it can do. Special-purpose userspace stacks do not suffer from these constraints, and free the programmer to solve a narrow problem in an application-specific manner while still retaining the other advantages of a general-purpose OS stack.

             

            4.2 The generality of specialization

            Our approach tightly integrates the network stack and application within a single process. This model, together with optimizations aimed at cache locality or pre-packetization, naturally fits a reasonably wide range of performance-critical, event-driven applications such as web servers, key-value stores, RPC-based services, and name servers. Even rate-adaptive video streaming may benefit, as developments such as MPEG-DASH and Apple's HLS have moved intelligence to the client, leaving servers as simple static-content farms.

             

            Not all network services are a natural fit. For example, CGI-based web services and general-purpose databases have inherently different properties and are generally CPU- or filesystem-intensive, de-emphasizing networking bottlenecks. In our design, the control loop and transport-protocol correctness depend on the timely execution of application-layer functions; blocking in the application cannot be tolerated. A thread-based approach might be more suitable for such cases. Isolating the network stack and application into different threads still yields benefits: OS-bypass networking costs less, and the saved CPU cycles are available to the application. However, such an approach requires synchronization (a minimal single-producer/single-consumer handoff is sketched below), and so increases complexity and offers less room for cross-layer optimization.
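            A hedged sketch of the kind of synchronization such a split would need (illustrative only, not a design from the paper): a lock-free single-producer/single-consumer ring through which the stack thread hands packets to the application thread.

            #include <stdatomic.h>
            #include <stdbool.h>
            #include <stddef.h>

            #define QSIZE 1024                      /* power of two */

            struct spsc {
                void *slot[QSIZE];
                _Atomic size_t head;                /* written by producer (stack thread) */
                _Atomic size_t tail;                /* written by consumer (app thread) */
            };

            static bool spsc_push(struct spsc *q, void *pkt)
            {
                size_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
                size_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
                if (h - t == QSIZE)
                    return false;                   /* full: stack must back off or drop */
                q->slot[h & (QSIZE - 1)] = pkt;
                atomic_store_explicit(&q->head, h + 1, memory_order_release);
                return true;
            }

            static void *spsc_pop(struct spsc *q)
            {
                size_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
                size_t h = atomic_load_explicit(&q->head, memory_order_acquire);
                if (t == h)
                    return NULL;                    /* empty */
                void *pkt = q->slot[t & (QSIZE - 1)];
                atomic_store_explicit(&q->tail, t + 1, memory_order_release);
                return pkt;
            }

            Even this minimal handoff reintroduces queueing between the stack and the application, which is exactly the coupling the synchronous, run-to-completion design avoids.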

             

            We are arguing neither for the exclusive use of specialized stacks over generalized ones, nor for the deployment of general-purpose network stacks in userspace. Instead, we propose selectively identifying key scale-out applications where informed but aggressive exploitation of domain-specific knowledge and micro-architectural properties will allow cross-layer optimizations. In such cases, the benefits outweigh the costs of developing and maintaining a specialized stack.

             

            4.3 Tracing, profiling, and measurement

            One of our greatest challenges in this work was root-cause analysis of performance issues in contemporary hardware-software implementations. The amount of time spent analyzing network-stack behavior (often unsuccessfully) dwarfed the time required to implement Sandstorm and Namestorm.

             

            An enormous variety of tools exist – OS-specific PMC tools, lock-contention measurement tools, tcpdump, Intel vTune, DTrace, and a plethora of application-specific tracing features – but they suffer significant limitations. Perhaps most problematic is that the tools are not holistic: each captures only a fragment of the analysis space, with different configuration models, file formats, and feature sets.

             

            Worse, as we attempted inter-OS analysis (e.g., comparing Linux and FreeBSD lock profiling), we discovered that tools often measure and report results differently, preventing sensible comparison. For example, we found that Linux takes packet timestamps at different points than FreeBSD, that FreeBSD uses different clocks for DTrace and BPF, and that while FreeBSD exports both per-process and per-core PMC stats, Linux supports only the former. Where supported, DTrace attempts to bridge these gaps by unifying configuration, trace formats, and event namespaces [15]. However, DTrace also suffers high overhead, causing bespoke tools to persist, and is not integrated with packet-level tools, preventing side-by-side comparison of packet and execution traces. We feel certain that improvement in the state of the art would benefit not only research, but also the practice of network-stack implementation.

             

            Our special-purpose stacks are synchronous: after netmap hands packets off to userspace, the control flow is generally linear, and we process packets to completion. This, combined with a lock-free design, means that it is very simple to reason about where time goes when handling a request flow. General-purpose stacks cannot, by their nature, be synchronous. They must be asynchronous to balance all the conflicting demands of hardware and applications, managing queues without application knowledge, allocating processing to threads in order to handle those queues, and ensuring safety via locking. To reason about performance in such systems, we often resort to statistical sampling because it is not possible to follow the control flow directly. Of course, not all network applications are well suited to synchronous models; we argue, however, that imposing the asynchrony of a general-purpose stack on all applications can unnecessarily complicate debugging, performance analysis, and performance optimization.


            5. RELATED WORK

            Web-server and network-stack performance optimization is not a new research area. Past studies have produced many optimization techniques as well as completely different design choices. These designs range from userspace and kernel-based implementations to specialized operating systems.

             

            With the conventional approaches, userspace applications [1, 6] utilize general-purpose network stacks, relying heavily on operating-system primitives for data movement and event notification [26]. Several proposals [23, 12, 30] focus on reducing the overhead of such primitives (e.g., KQueue, epoll, sendfile()). IO-Lite [27] unifies data management between OS subsystems and userspace applications by providing page-based mechanisms to safely and concurrently share data. Fbufs [17] utilize techniques such as page remapping and shared memory to provide high-performance cross-domain transfers and buffer management. Pesterev and Wickizer [28, 14] have proposed efficient techniques to improve commodity-stack performance by controlling connection locality and taking advantage of modern multicore systems. Similarly, MegaPipe [21] shows significant performance gains by introducing a bidirectional, per-core pipe to facilitate data exchange and event notification between the kernel and userspace applications.

             

            A significant number of research proposals follow a substantially different approach: they propose partial or full implementation of network applications in the kernel, aiming to eliminate the cost of communication between kernel and userspace. Although this design decision improves performance significantly, it comes at the cost of reduced security and reliability. A representative example of this category is kHTTPd [13], a kernel-based web server that uses the socket interface. Similar to kHTTPd, TUX [24] is another noteworthy example of an in-kernel network application. TUX achieves greater performance by eliminating the socket layer and pinning the static content it serves in memory. We have adopted several of these ideas in our prototype, although our approach is not kernel-based.

             

            Microkernel designs such as Mach [10] have long appealed to OS designers, pushing core services (such as network stacks) into user processes so that they can be more easily developed, customized, and multiply-instantiated. In this direction, Thekkath et al. [32] prototyped capability-enabled, library-synthesized userspace network stacks implemented on Mach. The Cheetah web server is built on top of an Exokernel [19] library operating system that provides a filesystem and an optimized TCP/IP implementation. Lightweight libOSes enable application developers to exploit domain-specific knowledge and improve performance. Unikernel designs such as MirageOS [25] likewise blend operating-system and application components at compile time, trimming unneeded software elements to achieve extremely small memory footprints – although through static code analysis rather than application-specific specialization.

             

            6. CONCLUSION

            In this paper, we have demonstrated that specialized userspace stacks, built on top of the netmap framework, can vastly improve the performance of scale-out applications. These performance gains sacrifice generality by adopting design principles at odds with contemporary stack design: application-specific cross-layer cost amortizations, synchronous and buffering-free protocol implementations, and an extreme focus on the interactions between processors, caches, and NICs. This approach reflects the widespread adoption of scale-out computing in data centers, which de-emphasizes multifunction hosts in favor of increased large-scale specialization. Our performance results are compelling: a 2–10× improvement for web service, and a roughly 9× improvement for DNS service. Further, these stacks have proven easier to develop and tune than conventional stacks, and their performance improvements are portable across multiple generations of hardware.

             

            General-purpose operating system stacks have been around for a long time and have demonstrated the ability to transcend multiple generations of hardware. We believe the same should be true of special-purpose stacks, but that tuning for particular hardware should be easier. We examined performance on servers manufactured seven years apart, and demonstrated that although the performance bottlenecks now lie in different places, the same design delivered significant benefits on both platforms.

            posted on 2017-01-22 18:01 by clcl
