??xml version="1.0" encoding="utf-8" standalone="yes"?> System I/O can be blocking, or non-blocking synchronous, or non-blocking asynchronous [1, 2]. Blocking I/O means that the calling system does not return control to the caller until the operation is finished. As a result, the caller is blocked and cannot perform other activities during that time. Most important, the caller thread cannot be reused for other request processing while waiting for the I/O to complete, and becomes a wasted resource during that time. For example, a By contrast, a non-blocking synchronous call returns control to the caller immediately. The caller is not made to wait, and the invoked system immediately returns one of two responses: If the call was executed and the results are ready, then the caller is told of that. Alternatively, the invoked system can tell the caller that the system has no resources (no data in the socket) to perform the requested action. In that case, it is the responsibility of the caller may repeat the call until it succeeds. For example, a In a non-blocking asynchronous call, the calling function returns control to the caller immediately, reporting that the requested action was started. The calling system will execute the caller's request using additional system resources/threads and will notify the caller (by callback for example), when the result is ready for processing. For example, a Windows This article investigates different non-blocking I/O multiplexing mechanisms and proposes a single multi-platform design pattern/solution. We hope that this article will help developers of high performance TCP based servers to choose optimal design solution. We also compare the performance of Java, C# and C++ implementations of proposed and existing solutions. We will exclude the blocking approach from further discussion and comparison at all, as it the least effective approach for scalability and performance. In general, I/O multiplexing mechanisms rely on an event demultiplexor [1, 3], an object that dispatches I/O events from a limited number of sources to the appropriate read/write event handlers. The developer registers interest in specific events and provides event handlers, or callbacks. The event demultiplexor delivers the requested events to the event handlers. Two patterns that involve event demultiplexors are called Reactor and Proactor [1]. The Reactor patterns involve synchronous I/O, whereas the Proactor pattern involves asynchronous I/O. In Reactor, the event demultiplexor waits for events that indicate when a file descriptor or socket is ready for a read or write operation. The demultiplexor passes this event to the appropriate handler, which is responsible for performing the actual read or write. In the Proactor pattern, by contrast, the handler—or the event demultiplexor on behalf of the handler—initiates asynchronous read and write operations. The I/O operation itself is performed by the operating system (OS). The parameters passed to the OS include the addresses of user-defined data buffers from which the OS gets data to write, or to which the OS puts data read. The event demultiplexor waits for events that indicate the completion of the I/O operation, and forwards those events to the appropriate handlers. For example, on Windows a handler could initiate async I/O (overlapped in Microsoft terminology) operations, and the event demultiplexor could wait for IOCompletion events [1]. 
The implementation of this classic asynchronous pattern is based on an asynchronous OS-level API, and we will call this implementation the "system-level" or "true" async, because the application fully relies on the OS to execute the actual I/O.

An example will help you understand the difference between Reactor and Proactor. We will focus on the read operation here, as the write implementation is similar. In Reactor, the handler is notified when the socket is ready and then performs the read itself; in Proactor (true async), the handler initiates the read and is notified when the read has completed, with the data already placed in its buffer.

Current practice

The open-source C++ development framework ACE [1, 3], developed by Douglas Schmidt et al., offers a wide range of platform-independent, low-level concurrency support classes (threading, mutexes, etc.). On the top level it provides two separate groups of classes: implementations of the ACE Reactor and the ACE Proactor. Although both of them are based on platform-independent primitives, these tools offer different interfaces.

The ACE Proactor gives much better performance and robustness on MS Windows, as Windows provides a very efficient async API based on operating-system-level support [4, 5]. Unfortunately, not all operating systems provide full, robust async OS-level support. For instance, many Unix systems do not. Therefore, the ACE Reactor is the preferable solution on UNIX (currently UNIX does not have robust async facilities for sockets). As a result, to achieve the best performance on each system, developers of networked applications need to maintain two separate code bases: an ACE Proactor based solution on Windows and an ACE Reactor based solution for Unix-based systems.

As we mentioned, the true async Proactor pattern requires operating-system-level support. Due to the differing nature of event handler and operating-system interaction, it is difficult to create common, unified external interfaces for both the Reactor and Proactor patterns. That, in turn, makes it hard to create a fully portable development framework and encapsulate the interface and OS-related differences.

Proposed solution

In this section, we will propose a solution to the challenge of designing a portable framework for the Proactor and Reactor I/O patterns. To demonstrate this solution, we will transform a Reactor demultiplexor I/O solution into an emulated async I/O by moving the read/write operations from the event handlers into the demultiplexor (this is the "emulated async" approach).

As we can see, by adding functionality to the demultiplexor I/O pattern, we were able to convert the Reactor pattern to a Proactor pattern. In terms of the amount of work performed, this approach is exactly the same as the Reactor pattern. We simply shifted responsibilities between different actors. There is no performance degradation because the amount of work performed is still the same; the work was simply performed by different actors. Both the standard/classic Reactor and the proposed emulated Proactor go through the same steps (wait for an event, read the data, dispatch to the user handler, process the data); they simply divide those steps differently between the demultiplexor and the handler.

With an operating system that does not provide an async I/O API, this approach allows us to hide the reactive nature of the available socket APIs and to expose a fully proactive async interface. This allows us to create a fully portable, platform-independent solution with a common external interface.

TProactor

The proposed solution (TProactor) was developed and implemented at Terabit P/L [6]. The solution has two alternative implementations, one in C++ and one in Java.
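Before looking at TProactor itself, here is a minimal sketch, in C, of the Reactor-to-emulated-Proactor conversion just described. This is not TProactor code; the names (connection_t, on_read_completed()) are illustrative assumptions. The point is only that the demultiplexor now performs the read itself and hands the handler a completion event with the data, instead of a readiness event.

#include <sys/types.h>
#include <sys/select.h>
#include <unistd.h>

#define CONN_BUF_SIZE 4096

typedef struct connection {
    int  fd;
    char buf[CONN_BUF_SIZE];
} connection_t;

/* User-supplied completion handler: receives data that has ALREADY been read.
   From the handler's point of view this looks like asynchronous I/O. */
extern void on_read_completed(connection_t *c, char *data, ssize_t len);

/* Emulated-async demultiplexor step for one connection:
   wait for readiness internally, perform the read internally,
   then dispatch a "read completed" event to the handler. */
static int emulated_async_read_step(connection_t *c)
{
    fd_set rs;
    FD_ZERO(&rs);
    FD_SET(c->fd, &rs);

    if (select(c->fd + 1, &rs, NULL, NULL, NULL) < 0)   /* wait for event (demultiplexor job) */
        return -1;

    ssize_t n = read(c->fd, c->buf, sizeof(c->buf));    /* read data (now the demultiplexor's job) */
    if (n < 0)
        return -1;

    on_read_completed(c, c->buf, n);                    /* dispatch a completion event, Proactor style */
    return 0;
}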
The C++ version was built using ACE cross-platform low-level primitives and has a common, unified async proactive interface on all platforms.

The main TProactor components are the Engine and WaitStrategy interfaces. Engine manages the async operation lifecycle. WaitStrategy manages concurrency strategies. WaitStrategy depends on Engine and the two always work in pairs. Interfaces between Engine and WaitStrategy are strongly defined.

Engines and waiting strategies are implemented as pluggable class-drivers (for the full list of all implemented Engines and corresponding WaitStrategies, see Appendix I). TProactor is a highly configurable solution. It internally implements three engines (POSIX AIO, SUN AIO and Emulated AIO) and hides six different waiting strategies, based on an asynchronous kernel API (for POSIX this is not efficient right now due to internal POSIX AIO API problems) and the synchronous Unix select(), poll(), /dev/poll (Solaris 5.8+), port_get() (Solaris 5.10), RealTime (RT) signals (Linux 2.4+), epoll (Linux 2.6), and kqueue (FreeBSD) APIs. TProactor conforms to the standard ACE Proactor implementation interface. That makes it possible to develop a single cross-platform solution (POSIX/MS-WINDOWS) with a common (ACE Proactor) interface.

With a set of mutually interchangeable "lego-style" Engines and WaitStrategies, a developer can choose the appropriate internal mechanism (engine and waiting strategy) at run time by setting appropriate configuration parameters. These settings may be specified according to specific requirements, such as the number of connections, scalability, and the targeted OS. If the operating system supports an async API, a developer may use the true async approach; otherwise the user can opt for an emulated async solution built on different sync waiting strategies. All of those strategies are hidden behind the emulated async façade.

For an HTTP server running on Sun Solaris, for example, the /dev/poll- or port_get()-based engine is the most suitable choice, able to serve a huge number of connections; but for another UNIX solution with a limited number of connections but high throughput requirements, a select()-based engine may be a better approach. Such flexibility cannot be achieved with a standard ACE Reactor/Proactor, due to inherent algorithmic problems of the different wait strategies (see Appendix II).

Performance comparison (JAVA versus C++ versus C#).

In terms of performance, our tests show that emulating from reactive to proactive does not impose any overhead—it can be faster, but not slower. According to our test results, TProactor gives on average up to 10-35% better performance (measured in terms of both throughput and response times) than the reactive model in the standard ACE Reactor implementation on various UNIX/Linux platforms. On Windows it gives the same performance as the standard ACE Proactor.

In addition to C++, we also implemented TProactor in Java. As of JDK version 1.4, Java provides only the sync-based approach that is logically similar to C select() [7, 8]. Java TProactor is based on Java's non-blocking facilities (the java.nio packages), logically similar to the C++ TProactor with a waiting strategy based on select().

Figures 1 and 2 chart the transfer rate in bits/sec versus the number of connections. These charts represent comparison results for a simple echo-server built on the standard ACE Reactor using Red Hat Linux 9.0, TProactor C++ and Java (IBM 1.4 JVM) on Microsoft Windows and Red Hat Linux 9.0, and a C# echo-server running on the Windows operating system. Performance of the native AIO APIs is represented by the "Async"-marked curves, emulated AIO (TProactor) by the "AsyncE" curves, and TP_Reactor by the "Synch" curves. All implementations were bombarded by the same client application—a continuous stream of arbitrary fixed-sized messages via N connections. The full set of tests was performed on the same hardware. Tests on different machines proved that relative results are consistent.

User code example

The following is the skeleton of a simple TProactor-based Java echo-server. In a nutshell, the developer only has to implement the two interfaces: OpRead, with a buffer where TProactor puts its read results, and OpWrite, with a buffer from which TProactor takes data. The developer will also need to implement protocol-specific logic by providing the callbacks onReadCompleted() and onWriteCompleted() in the AsynchHandler interface implementation. Those callbacks will be asynchronously called by TProactor on completion of read/write operations and executed on a thread pool provided by TProactor (the developer doesn't need to write his own pool).

class EchoServerProtocol implements AsynchHandler
{
    AsynchChannel achannel = null;
    ByteBuffer buffer = ByteBuffer.allocate(4096);  // read buffer used by start(); declaration assumed, size illustrative

    EchoServerProtocol( Demultiplexor m, SelectableChannel channel ) throws Exception
    {
        this.achannel = new AsynchChannel( m, this, channel );
    }

    public void start() throws Exception
    {
        // called after construction
        System.out.println( Thread.currentThread().getName() + ": EchoServer protocol started" );
        achannel.read( buffer );
    }

    public void onReadCompleted( OpRead opRead ) throws Exception
    {
        if ( opRead.getError() != null )
        {
            // handle error, do clean-up if needed
            System.out.println( "EchoServer::readCompleted: " + opRead.getError().toString() );
            achannel.close();
            return;
        }

        if ( opRead.getBytesCompleted() <= 0 )
        {
            System.out.println( "EchoServer::readCompleted: Peer closed " + opRead.getBytesCompleted() );
            achannel.close();
            return;
        }

        ByteBuffer buffer = opRead.getBuffer();
        achannel.write( buffer );
    }

    public void onWriteCompleted( OpWrite opWrite ) throws Exception
    {
        // logically similar to onReadCompleted
        ...
    }
}

IOHandler is a TProactor base class. AsynchHandler and Multiplexor, among other things, internally execute the wait strategy chosen by the developer.

Conclusion

TProactor provides a common, flexible, and configurable solution for multi-platform, high-performance communications development. All of the problems and complexities mentioned in Appendix II are hidden from the developer. It is clear from the charts that C++ is still the preferable approach for high performance communication solutions, but Java on Linux comes quite close.
However, the overall Java performance was weakened by poor results on Windows. One reason for that may be that the Java 1.4 nio package is based on a select()-style API. It is true that the Java NIO package is a kind of Reactor pattern based on a select()-style API (see [7, 8]). Java NIO allows you to write your own select()-style provider (the equivalent of TProactor waiting strategies). Looking at the Java NIO implementation for Windows (it is enough to examine the import symbols in jdk1.5.0\jre\bin\nio.dll), we can conclude that Java NIO 1.4.2 and 1.5.0 for Windows is based on the WSAEventSelect() API. That is better than select(), but slower than IOCompletionPorts for a significant number of connections. Should the 1.5 version of Java's nio be based on IOCompletionPorts, that should improve performance. If Java NIO used IOCompletionPorts, then a conversion of the Proactor pattern to the Reactor pattern would have to be made inside nio.dll. Although such a conversion is more complicated than the Reactor-to-Proactor conversion, it can be implemented within the frames of the Java NIO interfaces (this is the topic of a next article, but we can provide the algorithm). At this time, no TProactor performance tests were done on JDK 1.5.

Note. All tests for Java are performed on "raw" buffers (java.nio.ByteBuffer) without data processing.

Taking into account the latest activities to develop robust AIO on Linux [9], we can conclude that the Linux kernel API (the io_xxxx set of system calls) should be more scalable in comparison with the POSIX standard, but is still not portable. In that case, a TProactor with a new Engine/WaitStrategy pair based on native Linux AIO can easily be implemented to overcome the portability issues and to cover Linux native AIO with the standard ACE Proactor interface.

Appendix I

Engines and waiting strategies implemented in TProactor:

POSIX_AIO engine (true async, aio_read()/aio_write()), wait strategies:
  aio_suspend() (POSIX-compliant UNIX, not robust)
  waiting for an RT signal (POSIX, not robust)
  callback function (SGI IRIX, Linux, not robust)

SUN_AIO engine (true async, aio_read()/aio_write()), wait strategy:
  aio_wait() (SUN, not robust)

Emulated Async engine (non-blocking read()/write()), wait strategies:
  select() (generic POSIX)
  poll() (mostly all POSIX implementations)
  /dev/poll (SUN)
  Linux RT signals (Linux)
  kqueue (FreeBSD)

Appendix II

All sync waiting strategies can be divided into two groups: those that report readiness only when it changes, and those that report readiness at any time (select(), poll(), /dev/poll). Let us describe some common logical problems for those groups:

Resources

[1] Douglas C. Schmidt, Stephen D. Huston, "C++ Network Programming", 2002, Addison-Wesley, ISBN 0-201-60464-7
[2] W. Richard Stevens, "UNIX Network Programming", vol. 1 and 2, 1999, Prentice Hall, ISBN 0-13-490012-X
[3] Douglas C. Schmidt, Michael Stal, Hans Rohnert, Frank Buschmann, "Pattern-Oriented Software Architecture: Patterns for Concurrent and Networked Objects, Volume 2", Wiley & Sons, NY, 2000
[4] INFO: Socket Overlapped I/O Versus Blocking/Non-blocking Mode. Q181611. Microsoft Knowledge Base Articles.
[5] Microsoft MSDN. I/O Completion Ports. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/fs/i_o_completion_ports.asp
[6] TProactor (ACE compatible Proactor). www.terabit.com.au
[7] JavaDoc java.nio.channels. http://java.sun.com/j2se/1.4.2/docs/api/java/nio/channels/package-summary.html
[8] JavaDoc java.nio.channels.spi Class SelectorProvider. http://java.sun.com/j2se/1.4.2/docs/api/java/nio/channels/spi/SelectorProvider.html
[9] Linux AIO development. http://lse.sourceforge.net/io/aio.html and http://archive.linuxsymposium.org/ols2003/Proceedings/All-Reprints/Reprint-Pulavarty-OLS2003.pdf

See Also: Ian Barile, "I/O Multiplexing & Scalable Socket Servers", February 2004, DDJ
Further reading on event handling: http://www.cs.wustl.edu/~schmidt/ACE-papers.html
The Adaptive Communication Environment: http://www.cs.wustl.edu/~schmidt/ACE.html
Terabit Solutions: http://terabit.com.au/solutions.php

About the authors

Alex Libman has been programming for 15 years. During the past 5 years his main area of interest has been pattern-oriented multi-platform networked programming using C++ and Java. He is a big fan of and contributor to ACE. Vlad Gilbourd works as a computer consultant, but wishes to spend more time listening to jazz :) As a hobby, he started and runs the www.corporatenews.com.au website.

from: http://www.artima.com/articles/io_design_patterns.html (November 25, 2005)
The basic idea of my server framework below:
One connection vs. one thread in a worker thread pool; each worker thread runs completionWorkerRoutine.
An acceptor thread is dedicated to accepting sockets, associating them with the IOCP, and calling WSARecv to post a Recv completion packet to the IOCP.
completionWorkerRoutine has the following responsibilities:
1. Handle the request; when the workers are busy, increase the number of completionWorkerThreads (but not beyond maxThreads), and post the next Recv completion packet to the IOCP.
2. On wait timeout, check whether the pool is idle and how many completionWorkerThreads there currently are; when idle, keep the pool or shrink it down to minThreads.
3. Manage the lifecycle of every accepted socket. Here the system's keepalive probes are used; if you want to implement an application-level "heartbeat" instead, just change QSS_SIO_KEEPALIVE_VALS_TIMEOUT back to the system default (2 hours).
Below, let's walk through IOCP together with the source code:
socketserver.h
#ifndef __Q_SOCKET_SERVER__
#define __Q_SOCKET_SERVER__
#include <winsock2.h>
#include <mstcpip.h>
#define QSS_SIO_KEEPALIVE_VALS_TIMEOUT 30*60*1000
#define QSS_SIO_KEEPALIVE_VALS_INTERVAL 5*1000
#define MAX_THREADS 100
#define MAX_THREADS_MIN 10
#define MIN_WORKER_WAIT_TIMEOUT 20*1000
#define MAX_WORKER_WAIT_TIMEOUT 60*MIN_WORKER_WAIT_TIMEOUT
#define MAX_BUF_SIZE 1024
/* CSocketLifecycleCallback is invoked when an accepted socket is created and when the socket is closed or hits an error */
typedef void (*CSocketLifecycleCallback)(SOCKET cs,int lifecycle);//lifecycle: 0:OnAccepted, -1:OnClose. Note: at OnClose the socket may no longer be usable; it may have been closed abnormally or hit some other error.
/* protocol handler callback */
typedef int (*InternalProtocolHandler)(LPWSAOVERLAPPED overlapped);//return -1:SOCKET_ERROR
typedef struct Q_SOCKET_SERVER SocketServer;
DWORD initializeSocketServer(SocketServer ** ssp,WORD passive,WORD port,CSocketLifecycleCallback cslifecb,InternalProtocolHandler protoHandler,WORD minThreads,WORD maxThreads,long workerWaitTimeout);
DWORD startSocketServer(SocketServer *ss);
DWORD shutdownSocketServer(SocketServer *ss);
#endif
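As a quick usage sketch of the API declared above (my own illustrative example, not part of the original post): a main() that starts the server in non-passive mode on port 9999, with the built-in echo handler selected by passing a NULL protoHandler and a trivial lifecycle callback. The thread-pool and timeout values are arbitrary.

/* example_main.c - illustrative usage of the SocketServer API above */
#include <stdio.h>
#include "socketserver.h"

static void onLifecycle(SOCKET cs, int lifecycle)
{
    printf("socket %d %s\n", (int)cs, lifecycle == 0 ? "accepted" : "closed");
}

int main(void)
{
    SocketServer *ss = NULL;

    /* passive=0: run the accept loop on the calling thread.
       NULL protoHandler selects the internal echo handler.
       10..50 worker threads, 20 s worker wait timeout (arbitrary values). */
    if (!initializeSocketServer(&ss, 0, 9999, onLifecycle, NULL, 10, 50, 20 * 1000))
        return 1;

    if (!startSocketServer(ss))   /* blocks here, accepting connections */
        return 1;

    shutdownSocketServer(ss);
    return 0;
}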
qsocketserver.c (below, QSocketServer is abbreviated as qss, and the corresponding OVERLAPPED structure as qssOl).
#include "socketserver.h"
#include "stdio.h"
typedef struct {
WORD passive;//daemon
WORD port;
WORD minThreads;
WORD maxThreads;
volatile long lifecycleStatus;//0-created,1-starting, 2-running,3-stopping,4-exitKeyPosted,5-stopped
long workerWaitTimeout;//wait timeout
CRITICAL_SECTION QSS_LOCK;
volatile long workerCounter;
volatile long currentBusyWorkers;
volatile long CSocketsCounter;//reference count of accepted sockets
CSocketLifecycleCallback cslifecb;
InternalProtocolHandler protoHandler;
WORD wsaVersion;//=MAKEWORD(2,0);
WSADATA wsData;
SOCKET server_s;
SOCKADDR_IN serv_addr;
HANDLE iocpHandle;
}QSocketServer;
typedef struct {
WSAOVERLAPPED overlapped;
SOCKET client_s;
SOCKADDR_IN client_addr;
WORD optCode;
char buf[MAX_BUF_SIZE];
WSABUF wsaBuf;
DWORD numberOfBytesTransferred;
DWORD flags;
}QSSOverlapped;
DWORD acceptorRoutine(LPVOID);
DWORD completionWorkerRoutine(LPVOID);
static void adjustQSSWorkerLimits(QSocketServer *qss){
/*adjust size and timeout.*/
/*if(qss->maxThreads <= 0) {
qss->maxThreads = MAX_THREADS;
} else if (qss->maxThreads < MAX_THREADS_MIN) {
qss->maxThreads = MAX_THREADS_MIN;
}
if(qss->minThreads > qss->maxThreads) {
qss->minThreads = qss->maxThreads;
}
if(qss->minThreads <= 0) {
if(1 == qss->maxThreads) {
qss->minThreads = 1;
} else {
qss->minThreads = qss->maxThreads/2;
}
}
if(qss->workerWaitTimeout<MIN_WORKER_WAIT_TIMEOUT)
qss->workerWaitTimeout=MIN_WORKER_WAIT_TIMEOUT;
if(qss->workerWaitTimeout>MAX_WORKER_WAIT_TIMEOUT)
qss->workerWaitTimeout=MAX_WORKER_WAIT_TIMEOUT; */
}
typedef struct{
QSocketServer * qss;
HANDLE th;
}QSSWORKER_PARAM;
static WORD addQSSWorker(QSocketServer *qss,WORD addCounter){
WORD res=0;
if(qss->workerCounter<qss->minThreads||(qss->currentBusyWorkers==qss->workerCounter&&qss->workerCounter<qss->maxThreads)){
DWORD threadId;
QSSWORKER_PARAM * pParam=NULL;
int i=0;
EnterCriticalSection(&qss->QSS_LOCK);
if(qss->workerCounter+addCounter<=qss->maxThreads)
for(;i<addCounter;i++)
{
pParam=malloc(sizeof(QSSWORKER_PARAM));
if(pParam){
pParam->th=CreateThread(NULL,0,(LPTHREAD_START_ROUTINE)completionWorkerRoutine,pParam,CREATE_SUSPENDED,&threadId);
pParam->qss=qss;
ResumeThread(pParam->th);
qss->workerCounter++,res++;
}
}
LeaveCriticalSection(&qss->QSS_LOCK);
}
return res;
}
static void SOlogger(const char * msg,SOCKET s,int clearup){
perror(msg);
if(s>0)
closesocket(s);
if(clearup)
WSACleanup();
}
static int _InternalEchoProtocolHandler(LPWSAOVERLAPPED overlapped){
QSSOverlapped *qssOl=(QSSOverlapped *)overlapped;
printf("numOfT:%d,WSARecvd:%s,\n",qssOl->numberOfBytesTransferred,qssOl->buf);
//Sleep(500);
return send(qssOl->client_s,qssOl->buf,qssOl->numberOfBytesTransferred,0);
}
DWORD initializeSocketServer(SocketServer ** ssp,WORD passive,WORD port,CSocketLifecycleCallback cslifecb,InternalProtocolHandler protoHandler,WORD minThreads,WORD maxThreads,long workerWaitTimeout){
QSocketServer * qss=malloc(sizeof(QSocketServer));
qss->passive=passive>0?1:0;
qss->port=port;
qss->minThreads=minThreads;
qss->maxThreads=maxThreads;
qss->workerWaitTimeout=workerWaitTimeout;
qss->wsaVersion=MAKEWORD(2,0);
qss->lifecycleStatus=0;
InitializeCriticalSection(&qss->QSS_LOCK);
qss->workerCounter=0;
qss->currentBusyWorkers=0;
qss->CSocketsCounter=0;
qss->cslifecb=cslifecb,qss->protoHandler=protoHandler;
if(!qss->protoHandler)
qss->protoHandler=_InternalEchoProtocolHandler;
adjustQSSWorkerLimits(qss);
*ssp=(SocketServer *)qss;
return 1;
}
DWORD startSocketServer(SocketServer *ss){
QSocketServer * qss=(QSocketServer *)ss;
if(qss==NULL||InterlockedCompareExchange(&qss->lifecycleStatus,1,0))
return 0;
qss->serv_addr.sin_family=AF_INET;
qss->serv_addr.sin_port=htons(qss->port);
qss->serv_addr.sin_addr.s_addr=INADDR_ANY;//inet_addr("127.0.0.1");
if(WSAStartup(qss->wsaVersion,&qss->wsData)){
/* Side note: when WSAStartup is called it actually starts an extra thread, which exits on its own a little later. Not sure what WSACleanup does... */
SOlogger("WSAStartup failed.\n",0,0);
return 0;
}
qss->server_s=socket(AF_INET,SOCK_STREAM,IPPROTO_IP);
if(qss->server_s==INVALID_SOCKET){
SOlogger("socket failed.\n",0,1);
return 0;
}
if(bind(qss->server_s,(LPSOCKADDR)&qss->serv_addr,sizeof(SOCKADDR_IN))==SOCKET_ERROR){
SOlogger("bind failed.\n",qss->server_s,1);
return 0;
}
if(listen(qss->server_s,SOMAXCONN)==SOCKET_ERROR)/* A word about the backlog: many people don't know what to set it to; I have seen 1, 5, 50 and 100, and some say a large value wastes resources. Passing SOMAXCONN here does not mean Windows will literally use SOMAXCONN; rather, "If set to SOMAXCONN, the underlying service provider responsible for socket s will set the backlog to a maximum reasonable value." In practice, different operating systems support TCP backlog queues differently, so it is better to let the OS decide the value. Servers like Apache use:
#ifndef DEFAULT_LISTENBACKLOG
#define DEFAULT_LISTENBACKLOG 511
#endif
*/
{
SOlogger("listen failed.\n",qss->server_s,1);
return 0;
}
qss->iocpHandle=CreateIoCompletionPort(INVALID_HANDLE_VALUE,NULL,0,/*NumberOfConcurrentThreads-->*/qss->maxThreads);
//initialize worker for completion routine.
addQSSWorker(qss,qss->minThreads);
qss->lifecycleStatus=2;
{
QSSWORKER_PARAM * pParam=malloc(sizeof(QSSWORKER_PARAM));
pParam->qss=qss;
pParam->th=NULL;
if(qss->passive){
DWORD threadId;
pParam->th=CreateThread(NULL,0,(LPTHREAD_START_ROUTINE)acceptorRoutine,pParam,0,&threadId);
}else
return acceptorRoutine(pParam);
}
return 1;
}
DWORD shutdownSocketServer(SocketServer *ss){
QSocketServer * qss=(QSocketServer *)ss;
if(qss==NULL||InterlockedCompareExchange(&qss->lifecycleStatus,3,2)!=2)
return 0;
closesocket(qss->server_s/*listen-socket*/);//..other accepted-sockets associated with the listen-socket will not be closed,except WSACleanup is called..
if(qss->CSocketsCounter==0)
qss->lifecycleStatus=4,PostQueuedCompletionStatus(qss->iocpHandle,0,-1,NULL);
WSACleanup();
return 1;
}
DWORD acceptorRoutine(LPVOID ss){
QSSWORKER_PARAM * pParam=(QSSWORKER_PARAM *)ss;
QSocketServer * qss=pParam->qss;
HANDLE curThread=pParam->th;
QSSOverlapped *qssOl=NULL;
SOCKADDR_IN client_addr;
int client_addr_leng=sizeof(SOCKADDR_IN);
SOCKET cs;
free(pParam);
while(1){
printf("accept starting.....\n");
cs/*Accepted-socket*/=accept(qss->server_s,(LPSOCKADDR)&client_addr,&client_addr_leng);
if(cs==INVALID_SOCKET)
{
printf("accept failed:%d\n",GetLastError());
break;
}else{//SO_KEEPALIVE / SIO_KEEPALIVE_VALS: use the system's keepalive probes ("heartbeat"). On Linux: setsockopt with SOL_TCP: TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT
struct tcp_keepalive alive,aliveOut;
int so_keepalive_opt=1;
DWORD outDW;
if(!setsockopt(cs,SOL_SOCKET,SO_KEEPALIVE,(char *)&so_keepalive_opt,sizeof(so_keepalive_opt))){
alive.onoff=TRUE;
alive.keepalivetime=QSS_SIO_KEEPALIVE_VALS_TIMEOUT;
alive.keepaliveinterval=QSS_SIO_KEEPALIVE_VALS_INTERVAL;
if(WSAIoctl(cs,SIO_KEEPALIVE_VALS,&alive,sizeof(alive),&aliveOut,sizeof(aliveOut),&outDW,NULL,NULL)==SOCKET_ERROR){
printf("WSAIoctl SIO_KEEPALIVE_VALS failed:%d\n",GetLastError());
break;
}
}else{
printf("setsockopt SO_KEEPALIVE failed:%d\n",GetLastError());
break;
}
}
CreateIoCompletionPort((HANDLE)cs,qss->iocpHandle,cs,0);
if(qssOl==NULL){
qssOl=malloc(sizeof(QSSOverlapped));
}
qssOl->client_s=cs;
qssOl->wsaBuf.len=MAX_BUF_SIZE,qssOl->wsaBuf.buf=qssOl->buf,qssOl->numberOfBytesTransferred=0,qssOl->flags=0;//initialize WSABuf.
memset(&qssOl->overlapped,0,sizeof(WSAOVERLAPPED));
{
DWORD lastErr=GetLastError();
int ret=0;
SetLastError(0);
ret=WSARecv(cs,&qssOl->wsaBuf,1,&qssOl->numberOfBytesTransferred,&qssOl->flags,&qssOl->overlapped,NULL);
if(ret==0||(ret==SOCKET_ERROR&&GetLastError()==WSA_IO_PENDING)){
InterlockedIncrement(&qss->CSocketsCounter);//increment the accepted-socket reference count.
if(qss->cslifecb)
qss->cslifecb(cs,0);
qssOl=NULL;
}
if(!GetLastError())
SetLastError(lastErr);
}
printf("accept flags:%d ,cs:%d.\n",GetLastError(),cs);
}//end while.
if(qssOl)
free(qssOl);
if(qss)
shutdownSocketServer((SocketServer *)qss);
if(curThread)
CloseHandle(curThread);
return 1;
}
static int postRecvCompletionPacket(QSSOverlapped * qssOl,int SOErrOccurredCode){
int SOErrOccurred=0;
DWORD lastErr=GetLastError();
SetLastError(0);
//SOCKET_ERROR:-1,WSA_IO_PENDING:997
if(WSARecv(qssOl->client_s,&qssOl->wsaBuf,1,&qssOl->numberOfBytesTransferred,&qssOl->flags,&qssOl->overlapped,NULL)==SOCKET_ERROR
&&GetLastError()!=WSA_IO_PENDING)//this case lastError maybe 64, 10054
{
SOErrOccurred=SOErrOccurredCode;
}
if(!GetLastError())
SetLastError(lastErr);
if(SOErrOccurred)
printf("worker[%d] postRecvCompletionPacket SOErrOccurred=%d,preErr:%d,postedErr:%d\n",GetCurrentThreadId(),SOErrOccurred,lastErr,GetLastError());
return SOErrOccurred;
}
DWORD completionWorkerRoutine(LPVOID ss){
QSSWORKER_PARAM * pParam=(QSSWORKER_PARAM *)ss;
QSocketServer * qss=pParam->qss;
HANDLE curThread=pParam->th;
QSSOverlapped * qssOl=NULL;
DWORD numberOfBytesTransferred=0;
ULONG_PTR completionKey=0;
int postRes=0,handleCode=0,exitCode=0,SOErrOccurred=0;
free(pParam);
while(!exitCode){
SetLastError(0);
if(GetQueuedCompletionStatus(qss->iocpHandle,&numberOfBytesTransferred,&completionKey,(LPOVERLAPPED *)&qssOl,qss->workerWaitTimeout)){
if(completionKey==-1&&qss->lifecycleStatus>=4)
{
printf("worker[%d] completionKey -1:%d \n",GetCurrentThreadId(),GetLastError());
if(qss->workerCounter>1)
PostQueuedCompletionStatus(qss->iocpHandle,0,-1,NULL);
exitCode=1;
break;
}
if(numberOfBytesTransferred>0){
InterlockedIncrement(&qss->currentBusyWorkers);
addQSSWorker(qss,1);
handleCode=qss->protoHandler((LPWSAOVERLAPPED)qssOl);
InterlockedDecrement(&qss->currentBusyWorkers);
if(handleCode>=0){
SOErrOccurred=postRecvCompletionPacket(qssOl,1);
}else
SOErrOccurred=2;
}else{
printf("worker[%d] numberOfBytesTransferred==0 ***** closesocket servS or cs *****,%d,%d ,ol is:%d\n",GetCurrentThreadId(),GetLastError(),completionKey,qssOl==NULL?0:1);
SOErrOccurred=3;
}
}else{ //GetQueuedCompletionStatus rtn FALSE, lastError 64 ,995[timeout worker thread exit.] ,WAIT_TIMEOUT:258
if(qssOl){
SOErrOccurred=postRecvCompletionPacket(qssOl,4);
}else {
printf("worker[%d] GetQueuedCompletionStatus F:%d \n",GetCurrentThreadId(),GetLastError());
if(GetLastError()!=WAIT_TIMEOUT){
exitCode=2;
}else{//wait timeout
if(qss->lifecycleStatus!=4&&qss->currentBusyWorkers==0&&qss->workerCounter>qss->minThreads){
EnterCriticalSection(&qss->QSS_LOCK);
if(qss->lifecycleStatus!=4&&qss->currentBusyWorkers==0&&qss->workerCounter>qss->minThreads){
qss->workerCounter--;//until qss->workerCounter decrease to qss->minThreads
exitCode=3;
}
LeaveCriticalSection(&qss->QSS_LOCK);
}
}
}
}//end GetQueuedCompletionStatus.
if(SOErrOccurred){
if(qss->cslifecb)
qss->cslifecb(qssOl->client_s,-1);
/*if(qssOl)*/{
closesocket(qssOl->client_s);
free(qssOl);
}
if(InterlockedDecrement(&qss->CSocketsCounter)==0&&qss->lifecycleStatus>=3){
//for qss workerSize,PostQueuedCompletionStatus -1
qss->lifecycleStatus=4,PostQueuedCompletionStatus(qss->iocpHandle,0,-1,NULL);
exitCode=4;
}
}
qssOl=NULL,numberOfBytesTransferred=0,completionKey=0,SOErrOccurred=0;//reset for the next loop iteration.
}//end while.
//last to do
if(exitCode!=3){
int clearup=0;
EnterCriticalSection(&qss->QSS_LOCK);
if(!--qss->workerCounter&&qss->lifecycleStatus>=4){//clearup QSS
clearup=1;
}
LeaveCriticalSection(&qss->QSS_LOCK);
if(clearup){
DeleteCriticalSection(&qss->QSS_LOCK);
CloseHandle(qss->iocpHandle);
free(qss);
}
}
CloseHandle(curThread);
return 1;
}
------------------------------------------------------------------------------------------------------------------------
Identifying and handling LastError correctly with IOCP is the tricky part, so pay attention to the structure of the while loop in my completionWorkerRoutine.
The structure is as follows:

while(!exitCode){
    if(completionKey==-1){...break;}
    if(GetQueuedCompletionStatus){/* in this branch, as long as the OVERLAPPED you posted was not NULL, the completion you get here is usable. */
        if(numberOfBytesTransferred>0){
            /* handle the request here, and remember to keep re-posting your OVERLAPPED */
        }else{
            /* here the client or the server may have called closesocket(the socket), but OVERLAPPED is not NULL, as long as what you posted was not NULL! */
        }
    }else{/* in this branch, although GetQueuedCompletionStatus returned FALSE, that does not mean OVERLAPPED is necessarily NULL. In particular, when OVERLAPPED is not NULL, do not assume that a LastError means the current socket is useless or that a fatal error occurred; with lastError 995, for example, the socket may still be perfectly usable and you should not close it. */
        if(OVERLAPPED is not NULL){
            /* in this case just keep posting (re-issue the WSARecv), and check for errors after posting. */
        }else{
        }
    }
    if(socket error occurred){
    }
    prepare for next while.
}

This was written in haste, so errors and omissions are hard to avoid; corrections and comments are welcome, thanks!
There is still room to improve the performance of this model.
from:
http://www.shnenglu.com/adapterofcoms/archive/2010/06/26/118781.aspx
How the TCP three-way handshake works: the initiator sends a packet with SYN=1, ACK=0 to the receiver, requesting a connection; this is the first handshake. If the receiver accepts the connection, it sends back a packet with SYN=1, ACK=1, telling the initiator that it can communicate and asking it to send an acknowledgment; this is the second handshake. Finally, the initiator sends a packet with SYN=0, ACK=1, telling the receiver that the connection has been confirmed; this is the third handshake. After that, the TCP connection is established and communication begins.

*SYN: synchronize flag. The Synchronize Sequence Numbers field is valid. This flag is only valid during the three-way handshake that establishes the TCP connection. It tells the server side of the connection to check the sequence number, which is the initial sequence number of the TCP connection (generally the client's). The TCP sequence number can be viewed as a 32-bit counter ranging from 0 to 4,294,967,295. Every byte exchanged over a TCP connection is sequence-numbered. The sequence number field in the TCP header contains the sequence number of the first byte in the TCP segment.

*ACK: acknowledgment flag. The Acknowledgement Number field is valid. In most cases this flag is set. The acknowledgment number in the TCP header (w+1, Figure-1) is the next expected sequence number, and at the same time indicates that the remote system has successfully received all preceding data.

*RST: reset flag. Used to reset the corresponding TCP connection.

*URG: urgent flag. The urgent pointer field is valid.

*PSH: push flag. When this flag is set, the receiver does not queue the data but passes it to the application as quicklyly as it can. This flag is always set when handling interactive connections such as telnet or rlogin.

*FIN: finish flag. A packet with this flag set is used to end a TCP session, but the corresponding port remains open, ready to receive subsequent data.
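As a small illustration of the flag combinations described above (my own sketch, not from the original post), here is a C helper that classifies a segment's SYN/ACK bits against the three handshake steps. The flag bit values are the standard positions in the TCP header's flags byte.

#include <stdio.h>

/* Standard TCP flag bit positions in the flags byte of the TCP header. */
#define TCP_FIN 0x01
#define TCP_SYN 0x02
#define TCP_RST 0x04
#define TCP_PSH 0x08
#define TCP_ACK 0x10
#define TCP_URG 0x20

/* Classify a segment's flags byte against the three-way handshake steps. */
static const char *handshake_step(unsigned char flags)
{
    int syn = flags & TCP_SYN;
    int ack = flags & TCP_ACK;

    if (syn && !ack)  return "step 1: SYN (connection request)";
    if (syn && ack)   return "step 2: SYN+ACK (request accepted)";
    if (!syn && ack)  return "step 3 or later: ACK";
    return "not part of the handshake";
}

int main(void)
{
    printf("%s\n", handshake_step(TCP_SYN));            /* client -> server */
    printf("%s\n", handshake_step(TCP_SYN | TCP_ACK));  /* server -> client */
    printf("%s\n", handshake_step(TCP_ACK));            /* client -> server */
    return 0;
}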
The TCP protocol itself is reliable, but that does not mean an application that sends data over TCP is automatically reliable. Whether blocking or not, the size that send() reports does not tell you how much data the peer has actually recv()'d.

In blocking mode, the send() call copies the data the application wants to send into the send buffer, sends it, and returns after it is acknowledged. Because of the send buffer, the observable behaviour is: if the free space in the send buffer is larger than the amount requested, send() returns immediately while the data is transmitted to the network; otherwise, send() transmits the part of the data that the buffer cannot hold and waits for the peer's acknowledgment before returning (the receiver acknowledges as soon as the data reaches its receive buffer; it does not have to wait for the application to call recv()).

In non-blocking mode, send() merely copies the data into the protocol stack's buffer. If the available space is not enough, it copies as much as it can and returns the number of bytes copied; if the available space is 0, it returns -1 and sets errno to EAGAIN.

On Linux you can check the system's default send buffer sizes with sysctl -a | grep net.ipv4.tcp_wmem:

net.ipv4.tcp_wmem = 4096 16384 81920

There are three values here. The first is the minimum number of bytes allocated for a socket's send buffer. The second is the default (it is overridden by net.core.wmem_default), up to which the buffer can grow when the system is not heavily loaded. The third is the maximum number of bytes for the send buffer (it is overridden by net.core.wmem_max).

According to actual tests, if you change net.ipv4.tcp_wmem manually the changed values are used; otherwise, by default, the protocol stack usually allocates memory according to net.core.wmem_default and net.core.wmem_max.

An application should adjust the send buffer size in the program according to the characteristics of the application:
socklen_t sendbuflen = 0;
socklen_t len = sizeof(sendbuflen);

getsockopt(clientSocket, SOL_SOCKET, SO_SNDBUF, (void*)&sendbuflen, &len);
printf("default,sendbuf:%d\n", sendbuflen);

sendbuflen = 10240;
setsockopt(clientSocket, SOL_SOCKET, SO_SNDBUF, (void*)&sendbuflen, len);
getsockopt(clientSocket, SOL_SOCKET, SO_SNDBUF, (void*)&sendbuflen, &len);
printf("now,sendbuf:%d\n", sendbuflen);

Note that although the send buffer is set to 10k here, the protocol stack actually doubles it and uses 20k.
------------------- Case analysis ---------------------

In real applications, if the sender sends in non-blocking mode, then because of network congestion or a slow receiver, what typically happens is that the sending application appears to have sent 10k of data, but only 2k has actually been delivered to the peer's buffer, and 8k is still sitting in the local send buffer (not yet sent, or sent but not yet acknowledged by the receiver). At that moment the receiving application can read 2k of data. Suppose the receiving application has called recv() and fetched 1k of data for processing; if, at this instant, one of the following happens, the two sides behave as follows:

A. The sending application, believing it has finished send()ing the 10k, closes the socket:
As the active closer, the sending host's connection enters the FIN_WAIT1 half-closed state (waiting for the peer's ack), and the 8k of data in the send buffer is not discarded; it will still be delivered to the peer. If the receiving application keeps calling recv(), it will receive the remaining 8k (provided it does so before the sender's FIN_WAIT1 state times out) and then be told that the peer socket has closed (recv() returns 0). At that point it should close its own end.

B. The sending application calls send() again with another 8k of data:
If the send buffer is 20k, the free space is 20-8=12k, which is more than the requested 8k, so send() copies the data and returns 8192 immediately.
If the send buffer is 12k, the free space is only 12-8=4k, so send() returns 4096. Seeing that the return value is smaller than the requested size, the application can treat the buffer as full; it must then block (or use select() to wait for the next "socket writable" notification). If the application ignores this and immediately calls send() again, it will get -1 (on Linux, errno=EAGAIN).

C. The receiving application closes the socket after processing the 1k of data:
The receiving host becomes the active closer and its connection enters the FIN_WAIT1 half-closed state (waiting for the peer's ack). The sending application will then get a "socket readable" signal (usually select() reports the socket as readable), but on reading it will find that recv() returns 0; it should then call close() to close the socket (which sends the ack to the peer).
If the sending application does not handle this readable signal and keeps calling send() instead, two cases have to be distinguished. If send() is called after the sender has received the RST flag, send() returns -1 and errno is set to ECONNRESET, meaning the peer connection is gone; it is also said that the process receives a SIGPIPE signal, whose default action is to terminate the process; if that signal is ignored, send() returns -1 with errno set to EPIPE (unverified). If send() is called before the RST flag arrives, it works as usual.
The above applies to non-blocking send(). If send() is a blocking call and happens to be blocked (for example sending one huge buffer that exceeds the send buffer) when the peer socket is closed, send() returns the number of bytes successfully sent; if send() is called again after that, it behaves as described above.

D. The network is cut off at a switch or router:
After processing the 1k of data it already has, the receiving application will keep reading the remaining 1k from its buffer and then see no more data to read; this situation has to be handled by the application, typically by setting a maximum select() wait time and treating the socket as unusable once that time is exceeded.
The sending application will keep trying to send the remaining data to the network but never gets an acknowledgment, so the free space in its send buffer stays at 0; this situation also has to be handled by the application.
If you do not want the application itself to handle these timeout situations, they can also be handled by the TCP protocol itself; see the following sysctl settings:

net.ipv4.tcp_keepalive_intvl
net.ipv4.tcp_keepalive_probes
net.ipv4.tcp_keepalive_time
Original article: http://xufish.blogbus.com/logs/40537344.html
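Case B above is exactly what a robust sender has to code for. Here is a small illustrative C helper (my own sketch, not from the original posts) that sends a whole buffer over a non-blocking socket, handling short writes and EAGAIN by waiting for writability with select():

#include <errno.h>
#include <sys/types.h>
#include <sys/select.h>
#include <sys/socket.h>

/* Send exactly len bytes on a non-blocking socket.
   Returns 0 on success, -1 on error. */
static int send_all(int fd, const char *buf, size_t len)
{
    size_t sent = 0;

    while (sent < len) {
        ssize_t n = send(fd, buf + sent, len - sent, 0);
        if (n > 0) {
            sent += (size_t)n;            /* partial send: keep going */
            continue;
        }
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            /* Send buffer is full: wait until the socket is writable again. */
            fd_set ws;
            FD_ZERO(&ws);
            FD_SET(fd, &ws);
            if (select(fd + 1, NULL, &ws, NULL, NULL) < 0 && errno != EINTR)
                return -1;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;                     /* interrupted by a signal: retry */
        return -1;                        /* ECONNRESET, EPIPE, ... */
    }
    return 0;
}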
Everyone is familiar with applications of the HTTP protocol, since we browse plenty of things on the web every day, and we all know that HTTP is a fairly simple protocol. Every time I used a download tool such as Thunder and clicked that "download all links with Thunder" option, it felt rather magical.
Later I realized that implementing such download features is actually not hard: just send a request according to the HTTP protocol, then analyse the received data, and if the page contains link markers such as href you can go one level deeper and download those too. The most widely used version of HTTP at the moment is 1.1; to understand it thoroughly, read RFC 2616. I am afraid of RFC documents myself, so go read it on your own ^_^
The source code is as follows:
/******* HTTP client program httpclient.c ************/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <errno.h>
#include <unistd.h>
#include <netinet/in.h>
#include <limits.h>
#include <netdb.h>
#include <arpa/inet.h>
#include <ctype.h>
////////////////////////////// httpclient.c start //////////////////////////////////////////
/********************************************
Purpose: find the first matching character, searching from the right-hand end of the string
********************************************/
char * Rstrchr(char * s, char x) {
int i = strlen(s);
if(!(*s)) return 0;
while(s[i-1]) if(strchr(s + (i - 1), x)) return (s + (i - 1)); else i--;
return 0;
}
/********************************************
Purpose: convert a string to all lowercase
********************************************/
void ToLowerCase(char * s) {
while(s && *s) {*s=tolower(*s);s++;}
}
/**************************************************************
Purpose: parse the web host address and port out of the string src, and get the file the user wants to download
***************************************************************/
void GetHost(char * src, char * web, char * file, int * port) {
char * pA;
char * pB;
memset(web,
0, sizeof(web));
memset(file, 0, sizeof(file));
*port = 0;
if(!(*src)) return;
pA = src;
if(!strncmp(pA, "http://", strlen("http://"))) pA = src+strlen("http://");
else if(!strncmp(pA, "https://", strlen("https://"))) pA = src+strlen("https://");
pB = strchr(pA, '/');
if(pB)
{
memcpy(web, pA, strlen(pA) - strlen(pB));
if(pB+1) {
memcpy(file, pB + 1, strlen(pB) - 1);
file[strlen(pB) - 1] = 0;
}
}
else memcpy(web, pA, strlen(pA));
if(pB)
web[strlen(pA) - strlen(pB)] = 0;
else web[strlen(pA)] = 0;
pA = strchr(web, ':');
if(pA)
*port = atoi(pA + 1);
else *port =
80;
}
int main(int
argc, char *argv[])
{
int sockfd;
char buffer[1024];
struct sockaddr_in server_addr;
struct hostent *host;
int portnumber,nbytes;
char host_addr[256];
char host_file[1024];
char local_file[256];
FILE * fp;
char request[1024];
int send,
totalsend;
int i;
char * pt;
if(argc!=2)
{
fprintf(stderr,"Usage:%s web-address\a\n",argv[0]);
exit(1);
}
printf("parameter.1
is: %s\n", argv[1]);
ToLowerCase(argv[1]);/* convert the argument to all lowercase */
printf("lowercase
parameter.1 is: %s\n",
argv[1]);
GetHost(argv[1], host_addr, host_file, &portnumber);/* parse out the host address, port, file name, etc. */
printf("webhost:%s\n", host_addr);
printf("hostfile:%s\n", host_file);
printf("portnumber:%d\n\n", portnumber);
if((host=gethostbyname(host_addr))==NULL)/* resolve the host's IP address */
{
fprintf(stderr,"Gethostname error, %s\n", strerror(errno));
exit(1);
}
/* the client creates the sockfd descriptor */
if((sockfd=socket(AF_INET,SOCK_STREAM,0))==-1)/* create the socket */
{
fprintf(stderr,"Socket Error:%s\a\n",strerror(errno));
exit(1);
}
/* the client fills in the server's address data */
bzero(&server_addr,sizeof(server_addr));
server_addr.sin_family=AF_INET;
server_addr.sin_port=htons(portnumber);
server_addr.sin_addr=*((struct in_addr
*)host->h_addr);
/* the client initiates the connection request */
if(connect(sockfd,(struct sockaddr *)(&server_addr),sizeof(struct sockaddr))==-1)/* connect to the web site */
{
fprintf(stderr,"Connect Error:%s\a\n",strerror(errno));
exit(1);
}
sprintf(request,
"GET /%s HTTP/1.1\r\nAccept:
*/*\r\nAccept-Language: zh-cn\r\n\
User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)\r\n\
Host: %s:%d\r\nConnection: Close\r\n\r\n", host_file,
host_addr, portnumber);
printf("%s", request);/*准备requestQ将要发送给L*/
/*取得真实的文件名*/
if(host_file && *host_file)
pt = Rstrchr(host_file, '/');
else pt = 0;
memset(local_file,
0, sizeof(local_file));
if(pt && *pt) {
if((pt
+ 1) && *(pt+1)) strcpy(local_file,
pt + 1);
else memcpy(local_file,
host_file, strlen(host_file)
- 1);
}
else if(host_file
&& *host_file) strcpy(local_file, host_file);
else strcpy(local_file, "index.html");
printf("local
filename to write:%s\n\n",
local_file);
/* send the http request */
send = 0;totalsend
= 0;
nbytes=strlen(request);
while(totalsend <
nbytes) {
send = write(sockfd, request +
totalsend, nbytes - totalsend);
if(send==-1) {printf("send error!%s\n",
strerror(errno));exit(0);}
totalsend+=send;
printf("%d bytes send OK!\n",
totalsend);
}
fp = fopen(local_file, "a");
if(!fp) {
printf("create file error! %s\n", strerror(errno));
return 0;
}
printf("\nThe
following is the response header:\n");
i=0;
/* q接成功了,接收http响应Qresponse */
while((nbytes=read(sockfd,buffer,1))==1)
{
if(i <
4) {
if(buffer[0] == '\r' || buffer[0] == '\n') i++;
else i = 0;
printf("%c", buffer[0]);/*把http头信息打印在屏幕?/
}
else {
fwrite(buffer, 1, 1, fp);/*httpM信息写入文g*/
i++;
if(i%1024
== 0) fflush(fp);/*?K时存盘一?/
}
}
fclose(fp);
/* l束通讯 */
close(sockfd);
exit(0);
}
zj@zj:~/C_pram/practice/http_client$ ls
httpclient httpclient.c
zj@zj:~/C_pram/practice/http_client$ ./httpclient http://www.baidu.com/
parameter.1 is: http://www.baidu.com/
lowercase parameter.1 is: http://www.baidu.com/
webhost:www.baidu.com
hostfile:
portnumber:80

GET / HTTP/1.1
Accept: */*
Accept-Language: zh-cn
User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)
Host: www.baidu.com:80
Connection: Close

local filename to write:index.html

163 bytes send OK!

The following is the response header:
HTTP/1.1 200 OK
Date: Wed, 29 Oct 2008 10:41:40 GMT
Server: BWS/1.0
Content-Length: 4216
Content-Type: text/html
Cache-Control: private
Expires: Wed, 29 Oct 2008 10:41:40 GMT
Set-Cookie: BAIDUID=A93059C8DDF7F1BC47C10CAF9779030E:FG=1; expires=Wed, 29-Oct-38 10:41:40 GMT; path=/; domain=.baidu.com
P3P: CP=" OTI DSP COR IVA OUR IND COM "
zj@zj:~/C_pram/practice/http_client$ ls
httpclient httpclient.c index.html
If no file name is specified, the site's default home page is downloaded ^_^.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netdb.h>

#define HTTPPORT 80

char* head =
    "GET /u2/76292/ HTTP/1.1\r\n"
    "Accept: */*\r\n"
    "Accept-Language: zh-cn\r\n"
    "Accept-Encoding: gzip, deflate\r\n"
    "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; CIBA; TheWorld)\r\n"
    "Host: blog.chinaunix.net\r\n"
    "Connection: Keep-Alive\r\n\r\n";
int connect_URL(char *domain, int port)
{
    int sock;
    struct hostent * host;
    struct sockaddr_in server;

    host = gethostbyname(domain);
    if (host == NULL)
    {
        printf("gethostbyname error\n");
        return -2;
    }
    // printf("HostName: %s\n", host->h_name);
    // printf("IP Address: %s\n", inet_ntoa(*((struct in_addr *)host->h_addr)));

    sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0)
    {
        printf("invalid socket\n");
        return -1;
    }
    memset(&server, 0, sizeof(struct sockaddr_in));
    memcpy(&server.sin_addr, host->h_addr_list[0], host->h_length);
    server.sin_family = AF_INET;
    server.sin_port = htons(port);
    return (connect(sock, (struct sockaddr *)&server, sizeof(struct sockaddr)) < 0) ? -1 : sock;
}
int main()
{
    int sock;
    int n;
    char buf[100];
    char *domain = "blog.chinaunix.net";
    FILE *fp;

    fp = fopen("test.txt", "wb");       /* open the output file for writing */
    if(NULL == fp){
        printf("can't open output file!\n");
        return -1;
    }
    sock = connect_URL(domain, HTTPPORT);
    if (sock < 0){
        printf("connect err\n");
        return -1;
    }
    send(sock, head, strlen(head), 0);
    while(1)
    {
        n = recv(sock, buf, sizeof(buf), 0);
        if(n < 1) break;
        fwrite(buf, 1, n, fp);          /* save the received http data */
    }
    fclose(fp);
    close(sock);
    printf("bye!\n");
    return 0;
}
Here I simply save the data to the local hard disk; you can modify and build on this. You can work out the contents of the head request yourself by capturing packets with Wireshark.

The detailed process of an HTTP request
Let's look at everything that happens after we type http://www.mycompany.com:8080/mydir/index.html into a browser.

First, HTTP is an application-layer protocol. A protocol at this layer is just a communication convention: because the two parties need to communicate, they must agree on a specification in advance.

1. Connection. When we enter such a request, a socket connection must first be established. Since a socket is established from an IP address and a port, there is a DNS resolution step beforehand that turns www.mycompany.com into an IP address; if the URL does not contain a port number, the protocol's default port is used.

The DNS process works like this: when the network is configured on our local machine we fill in a DNS server, so the host sends the name to that configured DNS server. If it can resolve the name it returns the IP; otherwise it forwards the resolution request to a higher-level DNS server. The whole DNS system can be seen as a tree, and the request keeps travelling toward the root until a result is obtained. Now that we have the target IP and port number, we can open the socket connection.
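(The two client programs above perform this resolution step with gethostbyname(). As a side note, here is a minimal sketch of my own, not part of the original code, that does the same host-name-to-IP lookup with the more modern getaddrinfo(), which also accepts the port/service in the same call; the host name is just the example used in the text and may not actually resolve.)

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>

/* Resolve a host name to a dotted-quad IPv4 string; returns 0 on success. */
int resolve_host(const char *hostname, const char *service, char *ipstr, size_t len)
{
    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family   = AF_INET;        /* IPv4 only, to keep the sketch short */
    hints.ai_socktype = SOCK_STREAM;    /* TCP */

    int err = getaddrinfo(hostname, service, &hints, &res);
    if (err != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(err));
        return -1;
    }
    struct sockaddr_in *addr = (struct sockaddr_in *)res->ai_addr;
    inet_ntop(AF_INET, &addr->sin_addr, ipstr, len);
    freeaddrinfo(res);
    return 0;
}

int main(void)
{
    char ip[INET_ADDRSTRLEN];
    /* www.mycompany.com:8080 is just the example URL from the text */
    if (resolve_host("www.mycompany.com", "8080", ip, sizeof(ip)) == 0)
        printf("resolved to %s\n", ip);
    return 0;
}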
2. Request. Once the connection is established, we start sending the request to the web server. The request is generally a GET or POST command (POST is used to pass FORM parameters). The format of a GET command is: GET path/filename HTTP/1.0
The file name identifies the file being accessed, and HTTP/1.0 indicates the HTTP version used by the web browser. Now the GET command can be sent:
GET /mydir/index.html HTTP/1.0
3. Response. The web server receives the request and processes it. It searches its document space for the file index.html in the subdirectory mydir. If the file is found, the web server sends its content to the corresponding web browser.
To inform the browser, the web server first sends some HTTP header information, then the actual content (the HTTP body); the HTTP headers and the HTTP body are separated by a blank line. Common HTTP headers include:
- HTTP/1.0 200 OK — the first line of the web server's response, listing the HTTP version the server is running and the response code. The code "200 OK" means the request completed.
- MIME-Version: 1.0 — indicates the MIME version.
- Content-Type — a very important header; it indicates the MIME type of the HTTP body. For example, Content-Type: text/html indicates that the transmitted data is an HTML document.
- Content-Length — indicates the length of the HTTP body, in bytes.
4. Close the connection. After the response ends, the web browser and the web server must disconnect, so that other web browsers can establish connections with the web server.
Next, let's analyze in detail the path the packets take as they travel through the network.

In the layered structure of a network, the layers depend on each other in a strictly one-way fashion. A "service" is the abstract concept describing the relationship between layers: the set of operations each layer provides to the layer immediately above it. The lower layer is the service provider; the upper layer is the user requesting the service. Services take the form of primitives, such as system calls or library functions. A system call is a service primitive that the operating system kernel provides to network applications or higher-level protocols. Layer n of the network must always provide a more complete service to layer n+1 than layer n-1 does; otherwise layer n has no reason to exist.

The transport layer implements "end-to-end" communication and introduces the notion of inter-process communication across networks; it also has to handle error control, flow control, data ordering (segment ordering), connection management, and so on, and provides different service modes for this. Transport-layer services are usually provided through system calls, in the form of sockets. For a client, to establish a socket connection you call functions such as socket(), bind(), connect(), and then you can send data through send().
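Condensed to its essentials, that client-side call sequence is socket() → connect() → send()/recv() → close(); an explicit bind() is normally unnecessary for a client, because the kernel picks a local port during connect(). Below is a minimal sketch of my own of that sequence (the address 127.0.0.1 and port 80 are placeholders, not from the article):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    /* 1. create the socket */
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    /* 2. connect to the server (bind() is implicit: the kernel picks a source port) */
    struct sockaddr_in server;
    memset(&server, 0, sizeof(server));
    server.sin_family = AF_INET;
    server.sin_port   = htons(80);                       /* placeholder port    */
    server.sin_addr.s_addr = inet_addr("127.0.0.1");     /* placeholder address */
    if (connect(sock, (struct sockaddr *)&server, sizeof(server)) < 0) {
        perror("connect"); close(sock); return 1;
    }

    /* 3. exchange data */
    const char *req = "GET / HTTP/1.0\r\n\r\n";
    send(sock, req, strlen(req), 0);
    char buf[512];
    ssize_t n = recv(sock, buf, sizeof(buf), 0);
    if (n > 0) fwrite(buf, 1, n, stdout);

    /* 4. close the connection */
    close(sock);
    return 0;
}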
Now let's follow the packets as they travel through the network:

Application layer
First, at the application layer, based on the current need and action and in accordance with the application-layer protocol, we determine the content of the data to send. We put that data into a buffer, and it forms the application-layer message, the data.
Transport layer
This data is then sent through the transport layer, for example using the TCP protocol, so it is handed to the transport layer for processing. Here the message is stamped with a transport header, which mainly contains the port numbers and TCP's various control information; this information is directly available, because the port has to be specified through the interface. This forms TCP's unit of transfer, the segment. TCP is an end-to-end protocol. Using this information, such as the sequence number and acknowledgement number in the TCP header, the sender keeps sending and waiting for acknowledgements: after sending a segment it starts a timer, and only when the acknowledgement arrives does it send the next one; if the timer expires without an acknowledgement, it retransmits. On the receiving side, corrupted data is discarded, which likewise makes the sender time out and retransmit. Through the TCP protocol, the generation of the send sequence is controlled and continually adjusted, achieving flow control and data integrity.
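To illustrate just the "send, start a timer, retransmit on timeout" idea described above, here is a simplified stop-and-wait sketch of my own over UDP, using select() as the timer. This is not how TCP itself is implemented inside an application; it only mimics the mechanism (the peer address 127.0.0.1:9000 is a placeholder):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/select.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Send one datagram and wait up to 1 second for an acknowledgement;
 * retransmit up to max_retries times if no ACK arrives in time. */
int send_with_retransmit(int sock, const struct sockaddr_in *peer,
                         const char *data, size_t len, int max_retries)
{
    char ack[16];
    for (int attempt = 0; attempt <= max_retries; attempt++) {
        sendto(sock, data, len, 0, (const struct sockaddr *)peer, sizeof(*peer));

        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(sock, &readfds);
        struct timeval timeout = { 1, 0 };      /* the "timer": 1 second */

        /* wait for the ACK or for the timer to expire */
        if (select(sock + 1, &readfds, NULL, NULL, &timeout) > 0) {
            if (recv(sock, ack, sizeof(ack), 0) > 0)
                return 0;                       /* acknowledged: the next segment may be sent */
        }
        /* timeout: fall through and retransmit */
    }
    return -1;                                  /* gave up after max_retries retransmissions */
}

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    struct sockaddr_in peer;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(9000);                    /* placeholder port    */
    peer.sin_addr.s_addr = inet_addr("127.0.0.1");    /* placeholder address */

    if (send_with_retransmit(sock, &peer, "hello", 5, 3) < 0)
        printf("no ACK after retransmissions, giving up\n");
    close(sock);
    return 0;
}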
Network layer
The segments to be sent are then handed to the network layer, where they are packaged with a network-layer header that contains the source and destination IP addresses; the unit of data sent at this layer is called a packet. The network layer is responsible for transmitting such packets across the network: how they traverse routers and finally reach the destination address. Here, based on the destination IP address, the next-hop router's address has to be looked up. First, on the local machine, the local routing table is consulted; on Windows, running route print shows the current routing table, which has sections such as: Active Routes, Default Route, Persistent Routes.
The whole lookup process goes like this:
(1) From the destination address, derive the destination network number. If it is in the same local network, the packet can be sent directly.
(2) If not, query the routing table to find a route.
(3) If no explicit route is found, the routing table will still contain a default gateway, also called the default route. IP uses the default gateway address to hand the data to the next designated router, so the gateway may itself be a router, or it may only be the gateway through which the internal network sends data to one particular router.
(4) When a router receives the data, it again looks up a route for the destination host or network; if it still finds no route, the packet is sent to that router's default gateway address. The packet also carries a maximum hop count; if that value is exceeded the packet is dropped, which prevents it from being forwarded forever. When a router receives a packet it only looks at the network-layer header, i.e. the destination IP; that is why we say it works at the network layer, and transport-layer data is transparent to it.
If none of the steps above succeed, the datagram cannot be delivered. If the undeliverable datagram originated on the local machine, a "host unreachable" or "network unreachable" error is generally returned to the application that generated it.
Take the routing table of a host running Windows as an example and look at the route lookup process:
======================================================================
Active Routes:
Network Destination Netmask Gateway Interface Metric
0.0.0.0 0.0.0.0 192.168.1.2 192.168.1.101 10
127.0.0.0 255.0.0.0 127.0.0.1 127.0.0.1 1
192.168.1.0 255.255.255.0 192.168.1.101 192.168.1.101 10
192.168.1.101 255.255.255.255 127.0.0.1 127.0.0.1 10
192.168.1.255 255.255.255.255 192.168.1.101 192.168.1.101 10
224.0.0.0 240.0.0.0 192.168.1.101 192.168.1.101 10
255.255.255.255 255.255.255.255 192.168.1.101 192.168.1.101 1
Default Gateway: 192.168.1.2
Network Destination — the destination network segment.
Netmask — the subnet mask.
Gateway — the IP of the next-hop router's entry point. Through interface and gateway, a router defines one hop of the link to the next router; normally interface and gateway are in the same network segment.
Interface — the outgoing IP of this router toward that destination. For a personal PC this is usually its network card, identified by that card's IP address (of course a PC can also have several network cards).
The concept of a gateway is mainly about interaction between different subnets: when hosts A and B in two subnets want to communicate, A first sends the data to its local gateway, the gateway sends it on to the gateway of B's subnet, and that gateway delivers it to B.
Default gateway: when a packet's destination segment is not in your routing records, where should your router send that packet? The gateway of the default route is determined by the default gateway of your connection, i.e. the value we normally configure in the network connection settings.
Usually interface and gateway are inside one subnet. For a router, which may have several interfaces, when a packet arrives it looks for a matching entry by Network Destination; if one is found, interface indicates which interface of the router the packet should leave through, and gateway represents the gateway address of that subnet.
First entry:  0.0.0.0 0.0.0.0 192.168.1.2 192.168.1.101 10
0.0.0.0 stands for the default route. This routing record means: when I receive a packet whose destination segment is not in my routing records, I send that packet through the interface 192.168.1.101 to the address 192.168.1.2; that address is an interface of the next router, so the packet can be handed over to the next router for processing and is no longer my concern. The line quality (Metric) of this record is 10. When several entries match, the one with the smaller Metric value is chosen.
Third entry:  192.168.1.0 255.255.255.0 192.168.1.101 192.168.1.101 10
This is the routing record for a directly connected segment: what the router does with a packet destined for a directly connected segment. In this case the record's interface and gateway are the same. When I receive a packet whose destination segment is 192.168.1.0, I send the packet straight out through the interface 192.168.1.101, because that port is directly connected to the 192.168.1.0 segment. The line quality of this record is 10 (since interface and gateway are the same, the packet is delivered directly to the destination address without being handed to another router).
In general there are just these two cases: either the destination address is in the same subnet as the current router interface or it is not. If it is, send directly and there is no need to forward to a router; otherwise the packet must be forwarded to the next router for further processing.
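A rough sketch of that matching rule in code (my own illustration, reusing the two routing-table rows shown above): a destination matches a row when (destination & Netmask) equals the Network Destination; real stacks prefer the most specific (longest) netmask and use the Metric only to break ties, which is what the comparison below does.

#include <stdio.h>
#include <stdint.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* one row of the routing table shown above */
struct route_entry {
    uint32_t destination;   /* Network Destination */
    uint32_t netmask;       /* Netmask             */
    uint32_t gateway;       /* Gateway             */
    uint32_t interface_ip;  /* Interface           */
    int      metric;        /* Metric              */
};

/* return the best matching entry, or NULL if the table has no match */
const struct route_entry *lookup_route(const struct route_entry *table, int n, uint32_t dest)
{
    const struct route_entry *best = NULL;
    for (int i = 0; i < n; i++) {
        /* a row matches when (dest & Netmask) == Network Destination */
        if ((dest & table[i].netmask) == table[i].destination) {
            if (best == NULL
                || ntohl(table[i].netmask) > ntohl(best->netmask)   /* prefer the more specific route */
                || (table[i].netmask == best->netmask && table[i].metric < best->metric))
                best = &table[i];
        }
    }
    return best;
}

int main(void)
{
    struct route_entry table[] = {
        { inet_addr("0.0.0.0"),     inet_addr("0.0.0.0"),       inet_addr("192.168.1.2"),   inet_addr("192.168.1.101"), 10 },
        { inet_addr("192.168.1.0"), inet_addr("255.255.255.0"), inet_addr("192.168.1.101"), inet_addr("192.168.1.101"), 10 },
    };
    uint32_t dest = inet_addr("192.168.1.55");   /* a host on the directly connected segment */
    const struct route_entry *r = lookup_route(table, 2, dest);
    if (r) {
        struct in_addr gw;
        gw.s_addr = r->gateway;
        printf("next hop gateway: %s, metric %d\n", inet_ntoa(gw), r->metric);
    }
    return 0;
}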
After the next-hop IP address has been found, we still need its MAC address, which goes into the link-layer header as link-layer data. This is where the ARP protocol comes in. The process is as follows: look up the ARP cache (on Windows, run arp -a to see the current ARP cache contents). If it contains the MAC address for that IP, it is returned directly. Otherwise an ARP request is issued; the request contains the source's IP and MAC address as well as the destination IP address, and is broadcast on the local network. Every host checks whether its own IP matches the destination IP in the request; the host that matches replies with its own MAC address and at the same time saves the requester's IP/MAC pair. In this way the MAC address for the target IP is obtained.
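A tiny sketch of the cache-lookup step just described (the table contents are invented for illustration; on a miss a real stack would broadcast the ARP request instead of merely reporting it):

#include <stdio.h>
#include <string.h>

/* one cached ip -> mac binding, as shown by `arp -a` */
struct arp_entry {
    char ip[16];            /* dotted-quad IP address    */
    unsigned char mac[6];   /* corresponding MAC address */
};

/* return the cached MAC for ip, or NULL on a cache miss
 * (on a miss the stack would broadcast an ARP request) */
const unsigned char *arp_lookup(const struct arp_entry *cache, int n, const char *ip)
{
    for (int i = 0; i < n; i++)
        if (strcmp(cache[i].ip, ip) == 0)
            return cache[i].mac;
    return NULL;
}

int main(void)
{
    struct arp_entry cache[] = {
        { "192.168.1.2",   {0x00,0x1a,0x2b,0x3c,0x4d,0x5e} },   /* invented entries */
        { "192.168.1.101", {0x00,0x11,0x22,0x33,0x44,0x55} },
    };
    const unsigned char *mac = arp_lookup(cache, 2, "192.168.1.2");
    if (mac)
        printf("192.168.1.2 is-at %02x:%02x:%02x:%02x:%02x:%02x\n",
               mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
    else
        printf("cache miss: would broadcast an ARP request\n");
    return 0;
}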
Link layer
The MAC address and link-layer control information are added to the packet, forming a frame. Under the link-layer protocol, the frame carries out the data transfer between adjacent nodes: establishing the link, controlling the transmission rate, and ensuring data integrity.

Physical layer
The physical line is only responsible for carrying the data, bit by bit, from the host to the next destination.
When the next destination receives the data, it takes it off the physical layer and then unpacks it layer by layer (physical layer → link layer → network layer) and performs the processing described above; then the network layer, link layer and physical layer package the data up again and pass it on toward the next address.
In the process above you can see there is a routing-table lookup, and the construction of that routing table relies on routing algorithms. In other words, routing algorithms are really only used between routers to update and maintain their routing tables; the actual data transfer does not execute the algorithm, it only consults the routing table. This concept is also important, and the commonly used routing algorithms need to be understood. The TCP protocol as a whole is fairly complex, somewhat similar to the link-layer protocols, and it contains some very important mechanisms and concepts that deserve careful study, such as sequence numbers and acknowledgements, flow control, the retransmission mechanism, and the send/receive windows.
Basic TCP/IP model and concepts

Physical layer
Devices: repeaters and hubs. At this layer, data received on one port is forwarded to every port.

Link layer
Protocols: SDLC (Synchronous Data Link Control), etc.
Because of the MAC address table, collisions are largely avoided: a switch knows from the destination MAC address which port the data should be forwarded to, instead of forwarding it to every port the way a hub does. That is why a switch can partition collision domains.
Network layer
Four main protocols:
Internet Protocol (IP): responsible for addressing and routing packets between hosts and networks.
Address Resolution Protocol (ARP): obtains the hardware addresses of hosts on the same physical network.
Internet Control Message Protocol (ICMP): sends messages and reports errors in packet delivery.
Internet Group Management Protocol (IGMP): used by IP hosts to report their multicast group membership to local multicast routers.
Devices at this layer include layer-3 switches and routers.
Transport layer
Two important protocols: TCP and UDP.
The port concept: TCP/UDP uses IP addresses to identify hosts on the network and port numbers to identify application processes; that is, TCP/UDP identifies an application process by the host IP address plus the port number assigned to that process. A port number is a 16-bit unsigned integer; TCP port numbers and UDP port numbers form two independent sequences. Although they are independent, if TCP and UDP both provide some well-known service, the two protocols usually choose the same port number. That is purely for convenience, not a requirement of the protocols themselves. Using port numbers, several processes on one host can use the transport services of TCP/UDP at the same time, and this communication is end to end; its data is carried by IP but is independent of the path the IP datagrams take. In network communication, a triple uniquely identifies an application process globally: (protocol, local address, local port number).
In other words, TCP and UDP can use the same port number.
As you can see, the 5-tuple (protocol, source port, source IP, destination port, destination IP) completely identifies one network connection.
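As a small sketch of that idea (the field names are my own, not from any particular stack), a connection table keyed on this 5-tuple is what lets a host tell apart many flows that all target the same local port:

#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>

/* the 5-tuple that uniquely identifies one network connection */
struct five_tuple {
    uint8_t  protocol;    /* e.g. 6 = TCP, 17 = UDP         */
    uint32_t src_ip;      /* source IP (network byte order) */
    uint16_t src_port;    /* source port                    */
    uint32_t dst_ip;      /* destination IP                 */
    uint16_t dst_port;    /* destination port               */
};

/* two packets belong to the same connection iff their 5-tuples match */
int same_connection(const struct five_tuple *a, const struct five_tuple *b)
{
    return a->protocol == b->protocol &&
           a->src_ip   == b->src_ip   && a->src_port == b->src_port &&
           a->dst_ip   == b->dst_ip   && a->dst_port == b->dst_port;
}

int main(void)
{
    /* invented placeholder addresses: TCP 192.168.1.3:54321 -> 192.168.1.2:80 */
    struct five_tuple c1 = { 6, inet_addr("192.168.1.3"), 54321,
                             inet_addr("192.168.1.2"), 80 };
    struct five_tuple c2 = c1;
    c2.src_port = 54322;    /* same hosts, same service, different client port: a different connection */
    printf("same connection? %s\n", same_connection(&c1, &c2) ? "yes" : "no");
    return 0;
}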
Application layer
Over TCP: Telnet, FTP, SMTP, DNS, HTTP
Over UDP: RIP, NTP (Network Time Protocol) and DNS (DNS also uses TCP), SNMP, TFTP
References:
Viewing the local routing table: http://hi.baidu.com/thusness/blog/item/9c18e5bf33725f0818d81f52.html
Internet transport-layer protocols: http://www.cic.tsinghua.edu.cn/jdx/book6/3.htm (Computer Networks, Xie Xiren)