Introduction
The purpose of this document is to share some ideas that I've developed over the years about how to develop a certain kind of application for which the term "server" is only a weak approximation. More accurately, I'll be writing about a broad class of programs that are designed to handle very large numbers of discrete messages or requests per second. Network servers most commonly fit this definition, but not all programs that do are really servers in any sense of the word. For the sake of simplicity, though, and because "High-Performance Request-Handling Programs" is a really lousy title, we'll just say "server" and be done with it.
I will not be writing about "mildly parallel" applications, even though multitasking within a single program is now commonplace. The browser you're using to read this probably does some things in parallel, but such low levels of parallelism really don't introduce many interesting challenges. The interesting challenges occur when the request-handling infrastructure itself is the limiting factor on overall performance, so that improving the infrastructure actually improves performance. That's not often the case for a browser running on a gigahertz processor with a gigabyte of memory doing six simultaneous downloads over a DSL line. The focus here is not on applications that sip through a straw but on those that drink from a firehose, on the very edge of hardware capabilities where how you do it really does matter.
Some people will inevitably take issue with some of my comments and suggestions, or think they have an even better way. Fine. I'm not trying to be the Voice of God here; these are just methods that I've found to work for me, not only in terms of their effects on performance but also in terms of their effects on the difficulty of debugging or extending code later. Your mileage may vary. If something else works better for you that's great, but be warned that almost everything I suggest here exists as an alternative to something else that I tried once only to be disgusted or horrified by the results. Your pet idea might very well feature prominently in one of these stories, and innocent readers might be bored to death if you encourage me to start telling them. You wouldn't want to hurt them, would you?
The rest of this article is going to be centered around what I'll call the Four Horsemen of Poor Performance:
- Data copies
- Context switches
- Memory allocation
- Lock contention
There will also be a catch-all section at the end, but these are the biggest performance-killers. If you can handle most requests without copying data, without a context switch, without going through the memory allocator and without contending for locks, you'll have a server that performs well even if it gets some of the minor parts wrong.
Data Copies
This could be a very short section, for one very simple reason: most people have learned this lesson already. Everybody knows data copies are bad; it's obvious, right? Well, actually, it probably only seems obvious because you learned it very early in your computing career, and that only happened because somebody started putting out the word decades ago. I know that's true for me, but I digress. Nowadays it's covered in every school curriculum and in every informal how-to. Even the marketing types have figured out that "zero copy" is a good buzzword.
Despite the after-the-fact obviousness of copies being bad, though, there still seem to be nuances that people miss. The most important of these is that data copies are often hidden and disguised. Do you really know whether any code you call in drivers or libraries does data copies? It's probably more than you think. Guess what "Programmed I/O" on a PC refers to. An example of a copy that's disguised rather than hidden is a hash function, which has all the memory-access cost of a copy and also involves more computation. Once it's pointed out that hashing is effectively "copying plus" it seems obvious that it should be avoided, but I know at least one group of brilliant people who had to figure it out the hard way. If you really want to get rid of data copies, either because they really are hurting performance or because you want to put "zero-copy operation" on your hacker-conference slides, you'll need to track down a lot of things that really are data copies but don't advertise themselves as such.
The tried and true method for avoiding data copies is to use indirection, and pass buffer descriptors (or chains of buffer descriptors) around instead of mere buffer pointers. Each descriptor typically consists of the following:
- A pointer and length for the whole buffer.
- A pointer and length, or offset and length, for the part of the buffer that's actually filled.
- Forward and back pointers to other buffer descriptors in a list.
- A reference count.
Now, instead of copying a piece of data to make sure it stays in memory, code can simply increment a reference count on the appropriate buffer descriptor. This can work extremely well under some conditions, including the way that a typical network protocol stack operates, but it can also become a really big headache. Generally speaking, it's easy to add buffers at the beginning or end of a chain, to add references to whole buffers, and to deallocate a whole chain at once. Adding in the middle, deallocating piece by piece, or referring to partial buffers will each make life increasingly difficult. Trying to split or combine buffers will simply drive you insane.
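As a concrete illustration, here is a minimal sketch of such a buffer descriptor in C11; the type and function names are my own invention for this example, not taken from any particular protocol stack:

```c
#include <stddef.h>
#include <stdatomic.h>

/* One descriptor per underlying buffer; requests pass chains of these
 * around instead of copying the data itself. */
struct buf_desc {
    char            *base;      /* pointer and length for the whole buffer  */
    size_t           size;
    size_t           offset;    /* filled portion: base+offset ..           */
    size_t           len;       /* .. base+offset+len                       */
    struct buf_desc *next;      /* links to other descriptors in a chain    */
    struct buf_desc *prev;
    atomic_uint      refcnt;    /* reference count                          */
};

/* Keep data "alive" by taking a reference instead of copying it. */
static inline void buf_hold(struct buf_desc *b)
{
    atomic_fetch_add(&b->refcnt, 1);
}

static inline void buf_release(struct buf_desc *b,
                               void (*destroy)(struct buf_desc *))
{
    /* atomic_fetch_sub returns the old value; 1 means we dropped it to 0. */
    if (atomic_fetch_sub(&b->refcnt, 1) == 1)
        destroy(b);             /* last reference gone; free the buffer     */
}
```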
I don't actually recommend using this approach for everything, though. Why not? Because it gets to be a huge pain when you have to walk through descriptor chains every time you want to look at a header field. There really are worse things than data copies. I find that the best thing to do is to identify the large objects in a program, such as data blocks, make sure those get allocated separately as described above so that they don't need to be copied, and not sweat too much about the other stuff.
This brings me to my last point about data copies: don't go overboard avoiding them. I've seen way too much code that avoids data copies by doing something even worse, like forcing a context switch or breaking up a large I/O request. Data copies are expensive, and when you're looking for places to avoid redundant operations they're one of the first things you should look at, but there is a point of diminishing returns. Combing through code and then making it twice as complicated just to get rid of that last few data copies is usually a waste of time that could be better spent in other ways.
Context Switches
Whereas everyone thinks it's obvious that data copies are bad, I'm often surprised by how many people totally ignore the effect of context switches on performance. In my experience, context switches are actually behind more total "meltdowns" at high load than data copies; the system starts spending more time going from one thread to another than it actually spends within any thread doing useful work. The amazing thing is that, at one level, it's totally obvious what causes excessive context switching. The #1 cause of context switches is having more active threads than you have processors. As the ratio of active threads to processors increases, the number of context switches also increases - linearly if you're lucky, but often exponentially. This very simple fact explains why multi-threaded designs that have one thread per connection scale very poorly. The only realistic alternative for a scalable system is to limit the number of active threads so it's (usually) less than or equal to the number of processors. One popular variant of this approach is to use only one thread, ever; while such an approach does avoid context thrashing, and avoids the need for locking as well, it is also incapable of achieving more than one processor's worth of total throughput and thus remains beneath contempt unless the program will be non-CPU-bound (usually network-I/O-bound) anyway.
The first thing that a "thread-frugal" program has to do is figure out how it's going to make one thread handle multiple connections at once. This usually implies a front end that uses select/poll, asynchronous I/O, signals or completion ports, with an event-driven structure behind that. Many "religious wars" have been fought, and continue to be fought, over which of the various front-end APIs is best. Dan Kegel's C10K paper is a good resource in this area. Personally, I think all flavors of select/poll and signals are ugly hacks, and therefore favor either AIO or completion ports, but it actually doesn't matter that much. They all - except maybe select() - work reasonably well, and don't really do much to address the matter of what happens past the very outermost layer of your program's front end.
The simplest conceptual model of a multi-threaded event-driven server has a queue at its center; requests are read by one or more "listener" threads and put on queues, from which one or more "worker" threads will remove and process them. Conceptually, this is a good model, but all too often people actually implement their code this way. Why is this wrong? Because the #2 cause of context switches is transferring work from one thread to another. Some people even compound the error by requiring that the response to a request be sent by the original thread - guaranteeing not one but two context switches per request. It's very important to use a "symmetric" approach in which a given thread can go from being a listener to a worker to a listener again without ever changing context. Whether this involves partitioning connections between threads or having all threads take turns being listener for the entire set of connections seems to matter a lot less.
Usually, it's not possible to know how many threads will be active even one instant into the future. After all, requests can come in on any connection at any moment, or "background" threads dedicated to various maintenance tasks could pick that moment to wake up. If you don't know how many threads are active, how can you limit how many are active? In my experience, one of the most effective approaches is also one of the simplest: use an old-fashioned counting semaphore which each thread must hold whenever it's doing "real work". If the thread limit has already been reached then each listen-mode thread might incur one extra context switch as it wakes up and then blocks on the semaphore, but once all listen-mode threads have blocked in this way they won't continue contending for resources until one of the existing threads "retires", so the system effect is negligible. More importantly, this method handles maintenance threads - which sleep most of the time and therefore don't count against the active thread count - more gracefully than most alternatives.
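A minimal sketch of the semaphore trick, assuming POSIX semaphores; process_one_event() is a hypothetical function standing in for whatever "real work" your server does:

```c
#include <semaphore.h>

#define MAX_ACTIVE 4                /* e.g. the number of processors         */

static sem_t active_slots;

extern void process_one_event(int fd);  /* hypothetical "real work" function */

void init_limit(void)
{
    sem_init(&active_slots, 0, MAX_ACTIVE);
}

/* Each listener/worker thread wraps its "real work" like this. */
void handle_ready_connection(int fd)
{
    sem_wait(&active_slots);        /* may cost one extra switch, then block */
    process_one_event(fd);          /* do the actual request processing      */
    sem_post(&active_slots);        /* let another thread become active      */
}
```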
Once the processing of requests has been broken up into two stages (listener and worker) with multiple threads to service the stages, it's natural to break up the processing even further into more than two stages. In its simplest form, processing a request thus becomes a matter of invoking stages successively in one direction, and then in the other (for replies). However, things can get more complicated; a stage might represent a "fork" between two processing paths which involve different stages, or it might generate a reply (e.g. a cached value) itself without invoking further stages. Therefore, each stage needs to be able to specify "what should happen next" for a request. There are three possibilities, represented by return values from the stage's dispatch function:
- The request needs to be passed on to another stage (an ID or pointer in the return value).
- The request has been completed (a special "request done" return value)
- The request was blocked (a special "request blocked" return value). This is equivalent to the previous case, except that the request is not freed and will be continued later from another thread.
Note that, in this model, queuing of requests is done within stages, not between stages. This avoids the common silliness of constantly putting a request on a successor stage's queue, then immediately invoking that successor stage and dequeuing the request again; I call that lots of queue activity - and locking - for nothing.
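Here is one way the driving loop might look; the names and return-value encoding are illustrative, not from any particular codebase. The key point is that a request keeps moving through stages on the same thread until it completes or blocks, and queuing (if any) happens inside a stage rather than between stages:

```c
/* Possible outcomes of a stage's dispatch function. */
enum stage_result {
    STAGE_DONE    = -1,   /* request completed; free it                      */
    STAGE_BLOCKED = -2    /* parked; another thread will resume it later     */
    /* any value >= 0 is the ID of the next stage to invoke                  */
};

struct request;                               /* opaque in this sketch       */
typedef int (*stage_fn)(struct request *);

extern stage_fn stages[];                     /* table of stage dispatchers  */
extern void     free_request(struct request *);

/* Run a request through as many stages as possible on this thread. */
void drive_request(struct request *req, int first_stage)
{
    int next = first_stage;
    while (next >= 0)
        next = stages[next](req);             /* no thread handoff, no queue */
    if (next == STAGE_DONE)
        free_request(req);
    /* STAGE_BLOCKED: leave the request alone; whoever unblocks it calls
     * drive_request() again starting at the stage where it stopped.         */
}
```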
If this idea of separating a complex task into multiple smaller communicating parts seems familiar, that's because it's actually very old. My approach has its roots in the Communicating Sequential Processes concept elucidated by C.A.R. Hoare in 1978, based in turn on ideas from Per Brinch Hansen and Melvin Conway going back to 1963 - before I was born! However, when Hoare coined the term CSP he meant "process" in the abstract mathematical sense, and a CSP process need bear no relation to the operating-system entities of the same name. In my opinion, the common approach of implementing CSP via thread-like coroutines within a single OS thread gives the user all of the headaches of concurrency with none of the scalability.
A contemporary example of the staged-execution idea evolved in a saner direction is Matt Welsh's SEDA. In fact, SEDA is such a good example of "server architecture done right" that it's worth commenting on some of its specific characteristics (especially where those differ from what I've outlined above).
- SEDA's "batching" tends to emphasize processing multiple requests through a stage at once, while my approach tends to emphasize processing a single request through multiple stages at once.
- SEDA's one significant flaw, in my opinion, is that it allocates a separate thread pool to each stage with only "background" reallocation of threads between stages in response to load. As a result, the #1 and #2 causes of context switches noted above are still very much present.
- In the context of an academic research project, implementing SEDA in Java might make sense. In the real world, though, I think the choice can be characterized as unfortunate.
Memory Allocation
Allocating and freeing memory is one of the most common operations in many applications. Accordingly, many clever tricks have been developed to make general-purpose memory allocators more efficient. However, no amount of cleverness can make up for the fact that the very generality of such allocators inevitably makes them far less efficient than the alternatives in many cases. I therefore have three suggestions for how to avoid the system memory allocator altogether.
Suggestion #1 is simple preallocation. We all know that static allocation is bad when it imposes artificial limits on program functionality, but there are many other forms of preallocation that can be quite beneficial. Usually the reason comes down to the fact that one trip through the system memory allocator is better than several, even when some memory is "wasted" in the process. Thus, if it's possible to assert that no more than N items could ever be in use at once, preallocation at program startup might be a valid choice. Even when that's not the case, preallocating everything that a request handler might need right at the beginning might be preferable to allocating each piece as it's needed; aside from the possibility of allocating multiple items contiguously in one trip through the system allocator, this often greatly simplifies error-recovery code. If memory is very tight then preallocation might not be an option, but in all but the most extreme circumstances it generally turns out to be a net win.
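As a sketch of the "grab everything up front" idea, one might bundle all the pieces a request handler needs into a single structure and allocate them in one trip through the allocator; the type and sizes below are invented for illustration:

```c
#include <stdlib.h>

/* Everything one request handler might need, allocated in one shot
 * instead of piecemeal as each field is first used. */
struct request_ctx {
    char   header[512];
    char   scratch[4096];
    struct request_ctx *next;   /* also handy later for lookaside lists */
};

struct request_ctx *ctx_alloc(void)
{
    /* One allocator trip instead of several, and error handling
     * happens in exactly one place. */
    return calloc(1, sizeof(struct request_ctx));
}
```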
Suggestion #2 is to use lookaside lists for objects that are allocated and freed frequently. The basic idea is to put recently-freed objects onto a list instead of actually freeing them, in the hope that if they're needed again soon they need merely be taken off the list instead of being allocated from system memory. As an additional benefit, transitions to/from a lookaside list can often be implemented to skip complex object initialization/finalization.
It's generally undesirable to have lookaside lists grow without bound, never actually freeing anything even when your program is idle. Therefore, it's usually necessary to have some sort of periodic "sweeper" task to free inactive objects, but it would also be undesirable if the sweeper introduced undue locking complexity or contention. A good compromise is therefore a system in which a lookaside list actually consists of separately locked "old" and "new" lists. Allocation is done preferentially from the new list, then from the old list, and from the system only as a last resort; objects are always freed onto the new list. The sweeper thread operates as follows:
- Lock both lists.
- Save the head for the old list.
- Make the (previously) new list into the old list by assigning list heads.
- Unlock.
- Free everything on the saved old list at leisure.
Objects in this sort of system are only actually freed when they have not been needed for at least one full sweeper interval, but always less than two. Most importantly, the sweeper does most of its work without holding any locks to contend with regular threads. In theory, the same approach can be generalized to more than two stages, but I have yet to find that useful.
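A minimal sketch of the two-list scheme, reusing the request_ctx type from the earlier preallocation example and plain pthread mutexes; this illustrates the idea rather than being production code:

```c
#include <pthread.h>
#include <stdlib.h>

static struct request_ctx *new_list, *old_list;
static pthread_mutex_t new_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t old_lock = PTHREAD_MUTEX_INITIALIZER;

/* Freed objects always go onto the "new" list. */
void ctx_free(struct request_ctx *c)
{
    pthread_mutex_lock(&new_lock);
    c->next  = new_list;
    new_list = c;
    pthread_mutex_unlock(&new_lock);
}

/* Allocate preferentially from "new", then "old", then the system. */
struct request_ctx *ctx_alloc_cached(void)
{
    struct request_ctx *c = NULL;
    pthread_mutex_lock(&new_lock);
    if (new_list) { c = new_list; new_list = c->next; }
    pthread_mutex_unlock(&new_lock);
    if (!c) {
        pthread_mutex_lock(&old_lock);
        if (old_list) { c = old_list; old_list = c->next; }
        pthread_mutex_unlock(&old_lock);
    }
    return c ? c : ctx_alloc();              /* system as a last resort */
}

/* Runs once per sweep interval. */
void sweeper(void)
{
    pthread_mutex_lock(&new_lock);
    pthread_mutex_lock(&old_lock);
    struct request_ctx *victims = old_list; /* save the old list head   */
    old_list = new_list;                    /* new list becomes old     */
    new_list = NULL;
    pthread_mutex_unlock(&old_lock);
    pthread_mutex_unlock(&new_lock);

    while (victims) {                       /* free at leisure, holding */
        struct request_ctx *n = victims->next;  /* no locks at all      */
        free(victims);
        victims = n;
    }
}
```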
One concern with using lookaside lists is that the list pointers might increase object size. In my experience, most of the objects that I'd use lookaside lists for already contain list pointers anyway, so it's kind of a moot point. Even if the pointers were only needed for the lookaside lists, though, the savings in terms of avoided trips through the system memory allocator (and object initialization) would more than make up for the extra memory.
Suggestion #3 actually has to do with locking, which we haven't discussed yet, but I'll toss it in anyway. Lock contention is often the biggest cost in allocating memory, even when lookaside lists are in use. One solution is to maintain multiple private lookaside lists, such that there's absolutely no possibility of contention for any one list. For example, you could have a separate lookaside list for each thread. One list per processor can be even better, due to cache-warmth considerations, but only works if threads cannot be preempted. The private lookaside lists can even be combined with a shared list if necessary, to create a system with extremely low allocation overhead.
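A sketch of the per-thread variant using C11 _Thread_local, built on top of the shared two-list scheme above: only the owning thread ever touches its private list, so no lock is needed on the fast path, and the shared lists serve as overflow and fallback.

```c
/* Each thread keeps a small private free list; no locking needed. */
static _Thread_local struct request_ctx *my_list;
static _Thread_local unsigned            my_count;

#define PRIVATE_MAX 64          /* keep the private list from growing huge */

struct request_ctx *ctx_alloc_private(void)
{
    if (my_list) {
        struct request_ctx *c = my_list;
        my_list = c->next;
        my_count--;
        return c;
    }
    return ctx_alloc_cached();  /* fall back to the shared lists           */
}

void ctx_free_private(struct request_ctx *c)
{
    if (my_count < PRIVATE_MAX) {
        c->next = my_list;
        my_list = c;
        my_count++;
    } else {
        ctx_free(c);            /* overflow to the shared "new" list       */
    }
}
```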
Lock Contention
Efficient locking schemes are notoriously hard to design, because of what I call Scylla and Charybdis after the monsters in the Odyssey. Scylla is locking that's too simplistic and/or coarse-grained, serializing activities that can or should proceed in parallel and thus sacrificing performance and scalability; Charybdis is overly complex or fine-grained locking, with space for locks and time for lock operations again sapping performance. Near Scylla are shoals representing deadlock and livelock conditions; near Charybdis are shoals representing race conditions. In between, there's a narrow channel that represents locking which is both efficient and correct...or is there? Since locking tends to be deeply tied to program logic, it's often impossible to design a good locking scheme without fundamentally changing how the program works. This is why people hate locking, and try to rationalize their use of non-scalable single-threaded approaches.
Almost every locking scheme starts off as "one big lock around everything" and a vague hope that performance won't suck. When that hope is dashed, and it almost always is, the big lock is broken up into smaller ones and the prayer is repeated, and then the whole process is repeated, presumably until performance is adequate. Often, though, each iteration increases complexity and locking overhead by 20-50% in return for a 5-10% decrease in lock contention. With luck, the net result is still a modest increase in performance, but actual decreases are not uncommon. The designer is left scratching his head (I use "his" because I'm a guy myself; get over it). "I made the locks finer grained like all the textbooks said I should," he thinks, "so why did performance get worse?"
In my opinion, things got worse because the aforementioned approach is fundamentally misguided. Imagine the "solution space" as a mountain range, with high points representing good solutions and low points representing bad ones. The problem is that the "one big lock" starting point is almost always separated from the higher peaks by all manner of valleys, saddles, lesser peaks and dead ends. It's a classic hill-climbing problem; trying to get from such a starting point to the higher peaks only by taking small steps and never going downhill almost never works. What's needed is a fundamentally different way of approaching the peaks.
The first thing you have to do is form a mental map of your program's locking. This map has two axes:
- The vertical axis represents code. If you're using a staged architecture with non-branching stages, you probably already have a diagram showing these divisions, like the ones everybody uses for OSI-model network protocol stacks.
- The horizontal axis represents data. In every stage, each request should be assigned to a data set with its own resources separate from any other set.
You now have a grid, where each cell represents a particular data set in a particular processing stage. What's most important is the following rule: two requests should not be in contention unless they are in the same data set and the same processing stage. If you can manage that, you've already won half the battle.
Once you've defined the grid, every type of locking your program does can be plotted, and your next goal is to ensure that the resulting dots are as evenly distributed along both axes as possible. Unfortunately, this part is very application-specific. You have to think like a diamond-cutter, using your knowledge of what the program does to find the natural "cleavage lines" between stages and data sets. Sometimes they're obvious to start with. Sometimes they're harder to find, but seem more obvious in retrospect. Dividing code into stages is a complicated matter of program design, so there's not much I can offer there, but here are some suggestions for how to define data sets:
- If you have some sort of a block number or hash or transaction ID associated with requests, you can rarely do better than to divide that value by the number of data sets and use the remainder to pick a set (see the sketch after this list).
- Sometimes, it's better to assign requests to data sets dynamically, based on which data set has the most resources available rather than some intrinsic property of the request. Think of it like multiple integer units in a modern CPU; those guys know a thing or two about making discrete requests flow through a system.
- It's often helpful to make sure that the data-set assignment is different for each stage, so that requests which would contend at one stage are guaranteed not to do so at another stage.
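For instance, with a transaction ID on each request, the data-set (and therefore lock) choice at each stage might be nothing more than a remainder with a per-stage salt, so that two requests colliding at one stage need not collide at the next. The names and constants below are illustrative:

```c
#include <pthread.h>
#include <stdint.h>

#define NUM_STAGES 4
#define NUM_SETS   8

/* One lock per (stage, data set) cell of the grid. */
static pthread_mutex_t cell_lock[NUM_STAGES][NUM_SETS];

/* A different multiplier per stage keeps the assignments from lining up,
 * so contention at one stage doesn't imply contention at the next. */
static const uint32_t stage_salt[NUM_STAGES] = {
    0x9e3779b9u, 0x85ebca6bu, 0xc2b2ae35u, 0x27d4eb2fu
};

void init_cell_locks(void)
{
    for (int s = 0; s < NUM_STAGES; s++)
        for (int d = 0; d < NUM_SETS; d++)
            pthread_mutex_init(&cell_lock[s][d], NULL);
}

static pthread_mutex_t *lock_for(int stage, uint32_t txn_id)
{
    uint32_t set = (txn_id * stage_salt[stage]) % NUM_SETS;
    return &cell_lock[stage][set];
}
```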
If you've divided your "locking space" both vertically and horizontally, and made sure that lock activity is spread evenly across the resulting cells, you can be pretty sure that your locking is in pretty good shape. There's one more step, though. Do you remember the "small steps" approach I derided a few paragraphs ago? It still has its place, because now you're at a good starting point instead of a terrible one. In metaphorical terms you're probably well up the slope on one of the mountain range's highest peaks, but you're probably not at the top of one. Now is the time to collect contention statistics and see what you need to do to improve, splitting stages and data sets in different ways and then collecting more statistics until you're satisfied. If you do all that, you're sure to have a fine view from the mountaintop.
Other Stuff
As promised, I've covered the four biggest performance problems in server design. There are still some important issues that any particular server will need to address, though. Mostly, these come down to knowing your platform/environment:
- How does your storage subsystem perform with larger vs. smaller requests? With sequential vs. random? How well do read-ahead and write-behind work?
- How efficient is the network protocol you're using? Are there parameters or flags you can set to make it perform better? Are there facilities like TCP_CORK, MSG_PUSH, or the Nagle-toggling trick that you can use to avoid tiny messages?
- Does your system support scatter/gather I/O (e.g. readv/writev)? Using these can improve performance and also take much of the pain out of using buffer chains; there's a small sketch after this list.
- What's your page size? What's your cache-line size? Is it worth it to align stuff on these boundaries? How expensive are system calls or context switches, relative to other things?
- Are your reader/writer lock primitives subject to starvation? Of whom? Do your events have "thundering herd" problems? Does your sleep/wakeup have the nasty (but very common) behavior that when X wakes Y a context switch to Y happens immediately even if X still has things to do?
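On the scatter/gather point, here's a minimal sketch of sending a header and a payload that live in separate buffers with a single writev() call and no copy into a contiguous staging buffer; the function name and parameters are mine:

```c
#include <sys/types.h>
#include <sys/uio.h>

/* Send two non-contiguous buffers in one system call. The caller still
 * has to cope with short writes, as with plain write(). */
ssize_t send_two_part(int fd, const void *hdr, size_t hdr_len,
                      const void *payload, size_t payload_len)
{
    struct iovec iov[2] = {
        { .iov_base = (void *)hdr,     .iov_len = hdr_len     },
        { .iov_base = (void *)payload, .iov_len = payload_len },
    };
    return writev(fd, iov, 2);
}
```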
I'm sure I could think of many more questions in this vein. I'm sure you could too. In any particular situation it might not be worthwhile to do anything about any one of these issues, but it's usually worth at least thinking about them. If you don't know the answers - many of which you will not find in the system documentation - find out. Write a test program or micro-benchmark to find the answers empirically; writing such code is a useful skill in and of itself anyway. If you're writing code to run on multiple platforms, many of these questions correlate with points where you should probably be abstracting functionality into per-platform libraries so you can realize a performance gain on that one platform that supports a particular feature.
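For example, a few lines with clock_gettime can give a rough per-call cost for a cheap system call; the same skeleton works for timing lock operations, wakeups, or anything else on the list above. (getppid is used here on the assumption that it isn't cached in user space on your platform.)

```c
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    enum { ITERS = 1000000 };
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++)
        (void)getppid();            /* a cheap system call to measure */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per call\n", ns / ITERS);
    return 0;
}
```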
The "know the answers" theory applies to your own code, too. Figure out what the important high-level operations in your code are, and time them under different conditions. This is not quite the same as traditional profiling; it's about measuring design elements, not actual implementations. Low-level optimization is generally the last resort of someone who screwed up the design.