As the receive-flow section showed, every packet passes through tcp_v4_do_rcv before it is read into user space, and that is where it gets queued on the receive queue.
Within that function we analyze only how packets are handled once the connection is established, that is, the tcp_rcv_established function.
tcp_rcv_established splits packet processing into two classes: the fast path and the slow path. The split exists to
speed up processing. In the normal case packets arrive in order and the network is stable, so the fast path can
place a packet directly on the receive queue. Every other case goes through the slow path.
The protocol stack implements this with header prediction: each tcp sock has a pred_flags member, and that member is the basis for the decision.
- static inline void __tcp_fast_path_on(struct tcp_sock *tp, u32 snd_wnd)
- {
- tp->pred_flags = htonl((tp->tcp_header_len << 26) |
- ntohl(TCP_FLAG_ACK) |
- snd_wnd);
- }
As the code shows, header prediction depends on the header length field, the ACK flag, and the advertised window. In other words, if any flag
other than ACK and PSH is set, the segment cannot be handled on the fast path. This implies the following:
1. Either data is flowing in one direction only (we are the receiver and are not transmitting any data),
or, if we are also sending data, the window advertised by the other end is constant.
The latter means that we have not transmitted any data from our side for
quite some time but are receiving data from the other end, so the receive window advertised by the other end stays constant.
2. Other than PSH|ACK, no flag is set in the TCP header (ACK is set on every TCP segment).
If any other flag is set, such as URG, FIN, SYN, ECN, RST, or CWR, we
know that something important needs attention and we must move to the slow path.
3. The header length is unchanged. If the TCP header length remains the same,
no TCP option has been added or removed, and we can safely assume that
nothing important needs attention, provided the two conditions above also hold.
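The prediction word can be illustrated in user space. The sketch below is not kernel code: it works in host byte order for readability, and the names `sk_pred_flags`, `sk_flag_word`, `sk_header_predicted` and the `SK_*` constants are hypothetical. It assumes the fourth 32-bit word of the TCP header is laid out as data offset (4 bits), reserved (4 bits), flags (8 bits), window (16 bits), and that the window value fits in 16 bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Host-byte-order images of bits in the 4th 32-bit word of a TCP header:
 * [doff:4][reserved:4][flags:8][window:16]. Names are illustrative only. */
#define SK_FLAG_FIN  0x00010000u
#define SK_FLAG_PSH  0x00080000u
#define SK_FLAG_ACK  0x00100000u
#define SK_RESERVED  0x0F000000u
/* Bits that take part in header prediction: everything except PSH and
 * the reserved nibble. */
#define SK_HP_BITS   (~(SK_RESERVED | SK_FLAG_PSH))

/* Build the expected word from the header length in bytes and the peer's
 * window. Shifting bytes by 26 equals shifting 32-bit words by 28. */
static uint32_t sk_pred_flags(uint32_t header_len_bytes, uint32_t snd_wnd)
{
    return (header_len_bytes << 26) | SK_FLAG_ACK | snd_wnd;
}

/* Build the word an incoming header would carry. */
static uint32_t sk_flag_word(uint32_t doff_words, uint32_t flags, uint32_t window)
{
    return (doff_words << 28) | flags | window;
}

/* True when the incoming header matches the prediction. */
static bool sk_header_predicted(uint32_t flag_word, uint32_t pred)
{
    return (flag_word & SK_HP_BITS) == pred;
}
```

A single 32-bit comparison therefore checks the header length, the flag set, and the window at once, which is what makes the test cheap enough to run on every segment.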
Conditions under which the fast path is enabled
- static inline void tcp_fast_path_check(struct sock *sk)
- {
- struct tcp_sock *tp = tcp_sk(sk);
-
- if (skb_queue_empty(&tp->out_of_order_queue) &&
- tp->rcv_wnd &&
- atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf &&
- !tp->urg_data)
- tcp_fast_path_on(tp);
- }
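1. There are no out-of-order packets.
2. The receive window is not 0.
3. Receive buffer space is still available.
4. There is no urgent data.
Otherwise processing stays on the slow path. A newly established connection also starts on the slow path.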
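The four conditions can be restated as a small stand-alone predicate. This is a sketch under assumptions: `struct rx_state` and `fast_path_allowed` are hypothetical names, and the fields are simplified stand-ins for the corresponding socket members, not the kernel's data structures.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical view of the receiver state that matters here. */
struct rx_state {
    size_t   ofo_queue_len; /* segments waiting in the out-of-order queue */
    uint32_t rcv_wnd;       /* current receive window */
    int      rmem_alloc;    /* receive memory currently charged to the socket */
    int      rcvbuf;        /* receive buffer limit */
    uint16_t urg_data;      /* non-zero while urgent data is pending */
};

/* Header prediction may be switched on only when all four conditions hold. */
static bool fast_path_allowed(const struct rx_state *s)
{
    return s->ofo_queue_len == 0 &&
           s->rcv_wnd != 0 &&
           s->rmem_alloc < s->rcvbuf &&
           s->urg_data == 0;
}
```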
Triggers for moving from the fast path to the slow path (pred_flags is cleared to 0 on entering the slow path):
1. An out-of-order packet is received in tcp_data_queue.
2. The buffer is exhausted in tcp_prune_queue and packets start being dropped.
3. An urgent pointer is encountered in tcp_urgent_check.
4. The advertised window sent from tcp_select_window drops to 0.
Triggers for moving from the slow path to the fast path:
1. We have read past an urgent byte in tcp_recvmsg(). Once an urgent byte has been seen, we remain
in slow path mode until that byte has been received, because urgent data is handled on the slow path in
tcp_rcv_established().
2. In tcp_data_queue, the out-of-order queue has been fully processed because the gap was filled, and tcp_fast_path_check is run.
3. The advertised window is updated in tcp_ack_update_window().
Fast path processing flow
A. Check whether the fast path can be taken
- if ((tcp_flag_word(th) & TCP_HP_BITS) == tp->pred_flags &&
- TCP_SKB_CB(skb)->seq == tp->rcv_nxt) {
TCP_HP_BITS masks out the PSH flag (along with the reserved bits) before the comparison. The fast path is taken only when header prediction succeeds and the packet
arrives in order, that is, when its first sequence number is the next one expected.
- int tcp_header_len = tp->tcp_header_len;
-
-
-
-
-
-
-
- if (tcp_header_len == sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) {
-
- if (!tcp_parse_aligned_timestamp(tp, th))
- goto slow_path;
-
-
- if ((s32)(tp->rx_opt.rcv_tsval - tp->rx_opt.ts_recent) < 0)
- goto slow_path;
-
-
-
-
-
-
- }
This code uses the timestamp option to perform the PAWS (Protection Against Wrapped Sequence numbers) check.
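The comparison relies on signed 32-bit subtraction so that it stays correct when the timestamp counter wraps. A minimal sketch of that idiom follows; `ts_is_older` is a hypothetical helper name, and the cast from `uint32_t` to `int32_t` assumes the usual two's-complement behavior found on common compilers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when the incoming timestamp is older than the last one
 * recorded. The unsigned difference is reinterpreted as signed, so a
 * wrapped counter still compares correctly as long as the two values are
 * less than 2^31 ticks apart. */
static bool ts_is_older(uint32_t rcv_tsval, uint32_t ts_recent)
{
    return (int32_t)(rcv_tsval - ts_recent) < 0;
}
```

On the fast path an older timestamp does not drop the segment directly; it only diverts the segment to the slow path, where the full PAWS check runs.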
If the incoming segment consists of a TCP header only (a pure ACK is sent when there is no data to piggyback on, or when the receiver has detected out-of-order data):
- if (len <= tcp_header_len) {
- if (len == tcp_header_len) {
-
-
-
-
- if (tcp_header_len ==
- (sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
- tp->rcv_nxt == tp->rcv_wup)
- tcp_store_ts_recent(tp);
-
-
-
-
- tcp_ack(sk, skb, 0);
- __kfree_skb(skb);
- tcp_data_snd_check(sk);
- return 0;
- } else {
- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
- goto discard;
- }
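The main steps are:
1. Save the peer's most recent timestamp with tcp_store_ts_recent. The preceding if check shows that, between two timestamp echoes, TCP always echoes the timestamp of the first packet to arrive.
rcv_wup is reset to rcv_nxt only when a segment is sent (which is when the timestamp is echoed), so once the first packet after the previous echo has been received, rcv_nxt has advanced but
rcv_wup has not, and later packets do not call this function to save their timestamps.
2. Process the ACK with tcp_ack. This function is very complex and covers congestion control, acknowledgment handling, and more.
3. Check whether data is waiting to be sent with tcp_data_snd_check.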
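The interplay between rcv_nxt and rcv_wup in step 1 can be traced with a toy model. This is a sketch, not kernel code: `struct ts_state`, `on_data_segment`, and `on_ack_sent` are hypothetical names, and the model ignores everything except the three fields involved.

```c
#include <stdint.h>

/* Hypothetical, reduced receiver state. */
struct ts_state {
    uint32_t rcv_nxt;   /* next sequence number expected */
    uint32_t rcv_wup;   /* rcv_nxt at the time the last segment was sent */
    uint32_t ts_recent; /* timestamp that will be echoed to the peer */
};

/* An in-order data segment arrives. Its timestamp is recorded only if
 * nothing has been received since our last transmission. */
static void on_data_segment(struct ts_state *s, uint32_t len, uint32_t tsval)
{
    if (s->rcv_nxt == s->rcv_wup)
        s->ts_recent = tsval;
    s->rcv_nxt += len;
}

/* We transmit a segment (for example an ACK): the window update point
 * catches up with rcv_nxt, so the next arrival is "first" again. */
static void on_ack_sent(struct ts_state *s)
{
    s->rcv_wup = s->rcv_nxt;
}
```

With two segments arriving back to back before any ACK goes out, only the first segment's timestamp is kept, which matches the behavior described in step 1.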
If the segment carries data:
- } else {
- int eaten = 0;
- int copied_early = 0;
-
- if (tp->copied_seq == tp->rcv_nxt &&
- len - tcp_header_len <= tp->ucopy.len) {
- #ifdef CONFIG_NET_DMA
- if (tcp_dma_try_early_copy(sk, skb, tcp_header_len)) {
- copied_early = 1;
- eaten = 1;
- }
- #endif /* if called in process context and the sock is owned by the user */
- if (tp->ucopy.task == current &&
- sock_owned_by_user(sk) && !copied_early) {
-
- __set_current_state(TASK_RUNNING);
-
- if (!tcp_copy_to_iovec(sk, skb, tcp_header_len))
- eaten = 1;
- }
- if (eaten) {
-
-
-
-
- if (tcp_header_len ==
- (sizeof(struct tcphdr) +
- TCPOLEN_TSTAMP_ALIGNED) &&
- tp->rcv_nxt == tp->rcv_wup)
- tcp_store_ts_recent(tp);
-
- tcp_rcv_rtt_measure_ts(sk, skb);
-
- __skb_pull(skb, tcp_header_len);
- tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITSTOUSER);
- }
- if (copied_early)
- tcp_cleanup_rbuf(sk, skb->len);
- }
- if (!eaten) {
- if (tcp_checksum_complete_user(sk, skb))
- goto csum_error;
-
-
-
-
-
- if (tcp_header_len ==
- (sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
- tp->rcv_nxt == tp->rcv_wup)
- tcp_store_ts_recent(tp);
-
- tcp_rcv_rtt_measure_ts(sk, skb);
-
- if ((int)skb->truesize > sk->sk_forward_alloc)
- goto step5;
-
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITS);
-
-
- __skb_pull(skb, tcp_header_len);
-
- __skb_queue_tail(&sk->sk_receive_queue, skb);
- skb_set_owner_r(skb, sk);
- tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
- }
-
- tcp_event_data_recv(sk, skb);
-
- if (TCP_SKB_CB(skb)->ack_seq != tp->snd_una) {
-
- tcp_ack(sk, skb, FLAG_DATA);
- tcp_data_snd_check(sk);
- if (!inet_csk_ack_scheduled(sk))
- goto no_ack;
- }
-
- if (!copied_early || tp->rcv_nxt != tp->rcv_wup)
- __tcp_ack_snd_check(sk, 0);
- no_ack:
- #ifdef CONFIG_NET_DMA
- if (copied_early)
- __skb_queue_tail(&sk->sk_async_wait_queue, skb);
- else
- #endif
-
- if (eaten)
- __kfree_skb(skb);
- else
- sk->sk_data_ready(sk, 0);
- return 0;
- }
The tcp_event_data_recv function
- static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- struct inet_connection_sock *icsk = inet_csk(sk);
- u32 now;
-
- inet_csk_schedule_ack(sk);
-
- tcp_measure_rcv_mss(sk, skb);
-
- tcp_rcv_rtt_measure(tp);
-
- now = tcp_time_stamp;
-
- if (!icsk->icsk_ack.ato) {
-
-
-
- tcp_incr_quickack(sk);
- icsk->icsk_ack.ato = TCP_ATO_MIN;
- } else {
- int m = now - icsk->icsk_ack.lrcvtime;
-
- if (m <= TCP_ATO_MIN / 2) {
-
- icsk->icsk_ack.ato = (icsk->icsk_ack.ato >> 1) + TCP_ATO_MIN / 2;
- } else if (m < icsk->icsk_ack.ato) {
- icsk->icsk_ack.ato = (icsk->icsk_ack.ato >> 1) + m;
- if (icsk->icsk_ack.ato > icsk->icsk_rto)
- icsk->icsk_ack.ato = icsk->icsk_rto;
- } else if (m > icsk->icsk_rto) {
-
-
-
- tcp_incr_quickack(sk);
- sk_mem_reclaim(sk);
- }
- }
- icsk->icsk_ack.lrcvtime = now;
-
- TCP_ECN_check_ce(tp, skb);
-
- if (skb->len >= 128)
- tcp_grow_window(sk, skb);
- }
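The delayed-ACK timeout (ato) update in the middle of this function can be followed more easily in isolation. The sketch below mirrors the branch structure shown above, under assumptions: time is counted in abstract ticks, `SK_ATO_MIN` is an illustrative constant of 40 ticks, and `struct ack_state` and `ato_update` are hypothetical names. The `quick` counter merely stands in for the call to tcp_incr_quickack.

```c
#include <stdint.h>

#define SK_ATO_MIN 40u /* assumed minimum delayed-ACK timeout, in ticks */

/* Hypothetical, reduced delayed-ACK state. */
struct ack_state {
    uint32_t ato;      /* current delayed-ACK timeout estimate */
    uint32_t lrcvtime; /* time the previous data segment arrived */
    uint32_t quick;    /* how many times quick-ACK mode was boosted */
};

static void ato_update(struct ack_state *a, uint32_t now, uint32_t rto)
{
    if (!a->ato) {
        /* First data segment: ACK quickly and start from the minimum. */
        a->quick++;
        a->ato = SK_ATO_MIN;
    } else {
        uint32_t m = now - a->lrcvtime; /* inter-arrival time */

        if (m <= SK_ATO_MIN / 2) {
            /* Segments arrive back to back: decay toward the minimum. */
            a->ato = (a->ato >> 1) + SK_ATO_MIN / 2;
        } else if (m < a->ato) {
            /* Moderate gap: average it in, never exceeding the RTO. */
            a->ato = (a->ato >> 1) + m;
            if (a->ato > rto)
                a->ato = rto;
        } else if (m > rto) {
            /* Long silence: the sender probably restarted its window. */
            a->quick++;
        }
    }
    a->lrcvtime = now;
}
```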
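rcv_ssthresh is a threshold on the current receive window size, and its initial value is set to rcv_wnd. It works together with rcv_wnd.
When the local socket receives a segment and certain conditions are met, rcv_ssthresh is increased. The next time a segment is sent and its TCP header is built,
the current receive window must be advertised to the peer. rcv_wnd is updated at that point, and its value may not exceed rcv_ssthresh.
Together the two make the sliding window grow slowly.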
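The relationship between the two values can be sketched as follows. This is a deliberately simplified model, not the kernel's window selection logic: `struct win_state`, `grow_ssthresh`, and `select_window` are hypothetical names, and the growth increment and free space are passed in as plain numbers rather than derived from buffer accounting.

```c
#include <stdint.h>

/* Hypothetical, reduced window state. */
struct win_state {
    uint32_t rcv_ssthresh; /* ceiling for the advertised window */
    uint32_t rcv_wnd;      /* window most recently advertised */
    uint32_t window_clamp; /* hard upper bound on the window */
};

static uint32_t min_u32(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}

/* Called when incoming data justifies a larger window. */
static void grow_ssthresh(struct win_state *w, uint32_t incr)
{
    w->rcv_ssthresh = min_u32(w->rcv_ssthresh + incr, w->window_clamp);
}

/* Called when a header is built: the advertised window is limited both by
 * the free buffer space and by rcv_ssthresh. */
static uint32_t select_window(struct win_state *w, uint32_t free_space)
{
    w->rcv_wnd = min_u32(free_space, w->rcv_ssthresh);
    return w->rcv_wnd;
}
```

Even when plenty of buffer space is free, the advertised window follows rcv_ssthresh upward one increment at a time, which produces the gradual growth described above.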
__tcp_ack_snd_check decides how the ACK is sent
-
-
-
- static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
- {
- struct tcp_sock *tp = tcp_sk(sk);
-
-
- if (((tp->rcv_nxt - tp->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss
-
-
-
- && __tcp_select_window(sk) >= tp->rcv_wnd) ||
-
- tcp_in_quickack_mode(sk) ||
-
- (ofo_possible && skb_peek(&tp->out_of_order_queue))) {
-
- tcp_send_ack(sk);
- } else {
-
- tcp_send_delayed_ack(sk);
- }
- }
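The logic is straightforward. An ACK is sent immediately when more than one MSS of data is waiting to be acknowledged and the newly computed window is no smaller than the current rcv_wnd, when the socket is in quickack mode, or when out-of-order data is queued. Otherwise a delayed ACK is scheduled.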
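The same decision can be written as a pure function over the values it reads. This is a sketch: `struct ack_inputs` and `choose_ack_mode` are hypothetical names, and `new_window` stands for the result of the window computation that the kernel performs inline.

```c
#include <stdbool.h>
#include <stdint.h>

enum ack_mode { ACK_NOW, ACK_DELAYED };

/* Hypothetical snapshot of everything the decision depends on. */
struct ack_inputs {
    uint32_t rcv_nxt;         /* next sequence number expected */
    uint32_t rcv_wup;         /* rcv_nxt when the last segment was sent */
    uint32_t rcv_mss;         /* estimated MSS of the peer */
    uint32_t new_window;      /* window we would advertise now */
    uint32_t rcv_wnd;         /* window currently advertised */
    bool     quickack;        /* socket is in quick-ACK mode */
    bool     ofo_possible;    /* caller allows the out-of-order test */
    bool     ofo_queue_nonempty;
};

static enum ack_mode choose_ack_mode(const struct ack_inputs *in)
{
    /* More than one full-sized segment is unacknowledged and the window
     * would not shrink. */
    bool full_frame = (in->rcv_nxt - in->rcv_wup) > in->rcv_mss &&
                      in->new_window >= in->rcv_wnd;

    if (full_frame || in->quickack ||
        (in->ofo_possible && in->ofo_queue_nonempty))
        return ACK_NOW;
    return ACK_DELAYED;
}
```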
One question arises here. Consider the case where the ucopy path delivers exactly the data the application asked for, that is, in a single pass through
- if (tp->copied_seq == tp->rcv_nxt &&
- len - tcp_header_len <= tp->ucopy.len) {
the second condition holds with equality, len - tcp_header_len == tp->ucopy.len. Execution then continues with eaten set to 1, so the function ends by freeing the skb and never
calls sk_data_ready. Suppose the call chain is as follows:
tcp_recvmsg-> sk_wait_data -> sk_wait_event -> release_sock -> __release_sock-> sk_backlog_rcv-> tcp_rcv_established
Then, even though the user now has the requested data, nothing signals that fact before tcp_rcv_established returns.
- #define sk_wait_event(__sk, __timeo, __condition) \
- ({ int __rc; \
- release_sock(__sk); \
- __rc = __condition; \
- if (!__rc) { \
- *(__timeo) = schedule_timeout(*(__timeo)); \
- } \
- lock_sock(__sk); \
- __rc = __condition; \
- __rc; \
- })
After control returns to sk_wait_event, __condition is !skb_queue_empty(&sk->sk_receive_queue), so schedule_timeout is still called
to wait. This clearly wastes time. The condition should account for the case where the requested data has already been read in full, rather than deciding whether to wait by looking at the receive queue alone.
Next, the slow path:
- slow_path:
- if (len < (th->doff << 2) || tcp_checksum_complete_user(sk, skb))
- goto csum_error;
-
-
-
-
-
- res = tcp_validate_incoming(sk, skb, th, 1);
- if (res <= 0)
- return -res;
-
- step5:
- if (th->ack)
- tcp_ack(sk, skb, FLAG_SLOWPATH);
-
- tcp_rcv_rtt_measure_ts(sk, skb);
-
-
- tcp_urg(sk, skb, th);
-
-
- tcp_data_queue(sk, skb);
-
- tcp_data_snd_check(sk);
- tcp_ack_snd_check(sk);
- return 0;
First, look at tcp_validate_incoming, which checks the validity of an incoming packet before slow path processing.
-
-
-
- static int tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
- struct tcphdr *th, int syn_inerr)
- {
- struct tcp_sock *tp = tcp_sk(sk);
-
-
- if (tcp_fast_parse_options(skb, th, tp) && tp->rx_opt.saw_tstamp &&
- tcp_paws_discard(sk, skb)) {
- if (!th->rst) {
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_PAWSESTABREJECTED);
- tcp_send_dupack(sk, skb);
- goto discard;
- }
-
- }
-
-
- if (!tcp_sequence(tp, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq)) {
-
-
-
-
-
-
- if (!th->rst)
- tcp_send_dupack(sk, skb);
- goto discard;
- }
-
-
- if (th->rst) {
- tcp_reset(sk);
- goto discard;
- }
-
-
-
-
- tcp_replace_ts_recent(tp, TCP_SKB_CB(skb)->seq);
-
-
-
-
- if (th->syn && !before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
- if (syn_inerr)
- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONSYN);
- tcp_reset(sk);
- return -1;
- }
-
- return 1;
-
- discard:
- __kfree_skb(skb);
- return 0;
- }
Step 1: the PAWS check, tcp_paws_discard
- static inline int tcp_paws_discard(const struct sock *sk,
- const struct sk_buff *skb)
- {
- const struct tcp_sock *tp = tcp_sk(sk);
- return ((s32)(tp->rx_opt.ts_recent - tp->rx_opt.rcv_tsval) > TCP_PAWS_WINDOW &&
- get_seconds() < tp->rx_opt.ts_recent_stamp + TCP_PAWS_24DAYS &&
- !tcp_disordered_ack(sk, skb));
- }
PAWS discards a packet when all of the following hold:
1. The last recorded timestamp (ts_recent) exceeds the timestamp carried by the incoming TCP segment
by more than TCP_PAWS_WINDOW (= 1). A segment stamped just 1 clock tick
earlier than a segment that has already arrived is therefore still acceptable,
since reordering may have caused the later segment to arrive first.
2. Fewer than 24 days have elapsed since the timestamp was last stored.
3. tcp_disordered_ack returns 0.
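The three conditions combine into one predicate, sketched below in user space. The names `paws_discard`, `SK_PAWS_WINDOW`, and `SK_PAWS_24DAYS` are hypothetical; wall-clock time is passed in as seconds, and the result of the disordered-ACK test is supplied by the caller as a boolean instead of being computed here.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_PAWS_WINDOW 1                  /* tolerated timestamp lag, in ticks */
#define SK_PAWS_24DAYS (60 * 60 * 24 * 24) /* 24 days, in seconds */

/* Returns true when the segment should be rejected by PAWS. */
static bool paws_discard(uint32_t ts_recent, uint32_t rcv_tsval,
                         int64_t now_sec, int64_t ts_recent_stamp,
                         bool disordered_ack)
{
    /* 1. The segment's timestamp lags ts_recent by more than the window.
     *    Signed subtraction keeps the test valid across wraparound. */
    bool too_old = (int32_t)(ts_recent - rcv_tsval) > SK_PAWS_WINDOW;
    /* 2. ts_recent itself is still fresh enough to be trusted. */
    bool ts_recent_valid = now_sec < ts_recent_stamp + SK_PAWS_24DAYS;
    /* 3. The segment is not a harmless duplicate ACK. */
    return too_old && ts_recent_valid && !disordered_ack;
}
```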
The following is reproduced from the CU forum: http://linux.chinaunix.net/bbs/viewthread.php?tid=1130308
In practice, Linux carries out PAWS protection through the following call chain:
tcp_rcv_established
|
|-->tcp_paws_discard
|
|-->tcp_disordered_ack
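The key point is that the local side uses tcp_disordered_ack to judge a segment that has just arrived. The function's decision logic can be summarized as follows.
Premise: the TS value of the received segment indicates that wraparound has occurred.
a) If the segment is not a pure ACK, drop it. The data it carries is clearly old and is not what the local side currently expects (see the section on the rationale for PAWS).
b) If the segment is not one the local side expects to receive, drop it. The reason is obvious.
c) If the segment is a pure ACK but not a duplicate ACK (one triggered by later local data arriving correctly), drop it. It is clearly an old ACK, not one sent for each segment received after a lost segment in order to speed up local retransmission.
d) If the segment is a duplicate ACK and its TS value lags by more than the local retransmission RTO for the lost segment, drop it. By then the local side has already retransmitted the segment that caused this duplicate ACK, so the duplicate ACK can be discarded directly.
e) If the segment is a duplicate ACK and less than one RTO has passed, it must not be dropped. It shows that the peer received segments sent after the lost one, and the local side has to process this ACK (for example by retransmitting immediately).
One important concept here is that, once a TS problem appears, a pure ACK and a data segment carrying an ACK differ significantly. The latter can be dropped at once: between a given seq in one window and the same seq in the next window, the window must have changed, so the recorded ts_recent must also have been updated, and PAWS can safely discard the segment. A pure ACK cannot simply be dropped, because the following situation is legitimate. Suppose the local receive buffer is very large and the peer's sequence numbers wrap around quickly. After a local segment is lost, the peer sends a duplicate ACK for every subsequent segment it receives. The ack_seq of each duplicate ACK is the sequence number of the lost segment, while its seq is already a repeated sequence number from after the wrap. Whether it is the duplicate ACK from after the wrap or an earlier duplicate ACK with the same seq, the local side must process it (start retransmission immediately) and cannot simply drop it.
Step 2: check that the packet's sequence numbers are valid. If the check fails, tcp_send_dupack sends a duplicate acknowledgment (when the RST flag is not set).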
- static inline int tcp_sequence(struct tcp_sock *tp, u32 seq, u32 end_seq)
- {
- return !before(end_seq, tp->rcv_wup) &&
- !after(seq, tp->rcv_nxt + tcp_receive_window(tp));
- }
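Given when rcv_wup is updated (in tcp_select_window, when an ACK is sent), all data before sequence number rcv_wup has already been acknowledged, so the end sequence number of the packet being checked must not
fall before that value. At the same time, its start sequence number must not lie beyond the right edge of the receive window.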
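The acceptance test can be reproduced outside the kernel with wraparound-safe comparisons. In this sketch `seq_before`, `seq_after`, and `seq_acceptable` are hypothetical names, the receive window is passed in as a plain number, and the signed cast assumes two's-complement behavior.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when a precedes b in 32-bit sequence space. */
static bool seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* True when a follows b in 32-bit sequence space. */
static bool seq_after(uint32_t a, uint32_t b)
{
    return seq_before(b, a);
}

/* A segment [seq, end_seq) is acceptable when it does not end before the
 * last window update point and does not start beyond the right edge of
 * the receive window. */
static bool seq_acceptable(uint32_t seq, uint32_t end_seq,
                           uint32_t rcv_wup, uint32_t rcv_nxt,
                           uint32_t rcv_window)
{
    return !seq_before(end_seq, rcv_wup) &&
           !seq_after(seq, rcv_nxt + rcv_window);
}
```

Note that the test is deliberately loose: a segment that merely touches either edge passes here and is sorted out later, when the data is queued.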
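Step 3: if RST is set, call tcp_reset to handle it.
Step 4: update ts_recent.
Step 5: check SYN. No data is sent between a retransmitted SYN and the original SYN, so the two SYNs carry the same sequence number. If that does not hold, the connection is reset.
Next comes the main subject, tcp_data_queue, which is where the packet data is actually processed.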
- static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
- {
- struct tcphdr *th = tcp_hdr(skb);
- struct tcp_sock *tp = tcp_sk(sk);
- int eaten = -1;
-
- if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
- goto drop;
-
- __skb_pull(skb, th->doff * 4);
-
- TCP_ECN_accept_cwr(tp, skb);
-
- if (tp->rx_opt.dsack) {
- tp->rx_opt.dsack = 0;
- tp->rx_opt.eff_sacks = tp->rx_opt.num_sacks;
- }
If the packet is exactly the next data expected, it can be copied directly to user space (when a user buffer exists and is usable). Otherwise it is queued on the receive queue.
-
-
-
-
- if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt) {
- if (tcp_receive_window(tp) == 0)
- goto out_of_window;
-
-
- if (tp->ucopy.task == current &&
- tp->copied_seq == tp->rcv_nxt && tp->ucopy.len &&
- sock_owned_by_user(sk) && !tp->urg_data) {
- int chunk = min_t(unsigned int, skb->len,
- tp->ucopy.len);
-
- __set_current_state(TASK_RUNNING);
-
- local_bh_enable();
- if (!skb_copy_datagram_iovec(skb, 0, tp->ucopy.iov, chunk)) {
- tp->ucopy.len -= chunk;
- tp->copied_seq += chunk;
- eaten = (chunk == skb->len && !th->fin);
- tcp_rcv_space_adjust(sk);
- }
- local_bh_disable();
- }
-
- if (eaten <= 0) {
- queue_and_out:
- if (eaten < 0 &&
-
- tcp_try_rmem_schedule(sk, skb->truesize))
- goto drop;
-
- skb_set_owner_r(skb, sk);
- __skb_queue_tail(&sk->sk_receive_queue, skb);
- }
- tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
- if (skb->len)
- tcp_event_data_recv(sk, skb);
- if (th->fin)
- tcp_fin(skb, sk, th);
-
- if (!skb_queue_empty(&tp->out_of_order_queue)) {
- tcp_ofo_queue(sk);
-
-
-
-
-
- if (skb_queue_empty(&tp->out_of_order_queue))
- inet_csk(sk)->icsk_ack.pingpong = 0;
- }
-
- if (tp->rx_opt.num_sacks)
- tcp_sack_remove(tp);
-
- tcp_fast_path_check(sk);
-
- if (eaten > 0)
- __kfree_skb(skb);
- else if (!sock_flag(sk, SOCK_DEAD))
- sk->sk_data_ready(sk, 0);
- return;
- }
Now look at tcp_ofo_queue, which processes the out-of-order queue.
-
-
-
- static void tcp_ofo_queue(struct sock *sk)
- {
- struct tcp_sock *tp = tcp_sk(sk);
- __u32 dsack_high = tp->rcv_nxt;
- struct sk_buff *skb;
-
- while ((skb = skb_peek(&tp->out_of_order_queue)) != NULL) {
-
- if (after(TCP_SKB_CB(skb)->seq, tp->rcv_nxt))
- break;
-
- if (before(TCP_SKB_CB(skb)->seq, dsack_high)) {
- __u32 dsack = dsack_high;
- if (before(TCP_SKB_CB(skb)->end_seq, dsack_high))
- dsack_high = TCP_SKB_CB(skb)->end_seq;
- tcp_dsack_extend(sk, TCP_SKB_CB(skb)->seq, dsack);
- }
-
- if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
- SOCK_DEBUG(sk, "ofo packet was already received \n");
- __skb_unlink(skb, &tp->out_of_order_queue);
- __kfree_skb(skb);
- continue;
- }
- SOCK_DEBUG(sk, "ofo requeuing : rcv_next %X seq %X - %X\n",
- tp->rcv_nxt, TCP_SKB_CB(skb)->seq,
- TCP_SKB_CB(skb)->end_seq);
-
- __skb_unlink(skb, &tp->out_of_order_queue);
- __skb_queue_tail(&sk->sk_receive_queue, skb);
- tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
- if (tcp_hdr(skb)->fin)
- tcp_fin(skb, sk, tcp_hdr(skb));
- }
- }
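The draining loop can be modeled with a plain array standing in for the queue. This sketch leaves out DSACK generation and FIN handling and keeps only the sequence-number logic; `struct seg`, `seq_after`, and `ofo_drain` are hypothetical names, and the array is assumed to be sorted by starting sequence number.

```c
#include <stddef.h>
#include <stdint.h>

/* A queued segment covering [seq, end_seq). */
struct seg {
    uint32_t seq;
    uint32_t end_seq;
};

/* True when a follows b in 32-bit sequence space. */
static int seq_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/* Walk the queue from the head. Segments that start beyond rcv_nxt stop
 * the walk (a gap remains). Segments that end at or before rcv_nxt are
 * duplicates and are dropped. Everything else is delivered and advances
 * rcv_nxt. Returns how many entries were removed from the queue. */
static size_t ofo_drain(const struct seg *q, size_t n,
                        uint32_t *rcv_nxt, size_t *delivered)
{
    size_t i = 0;

    *delivered = 0;
    while (i < n) {
        if (seq_after(q[i].seq, *rcv_nxt))
            break;
        if (seq_after(q[i].end_seq, *rcv_nxt)) {
            *rcv_nxt = q[i].end_seq;
            (*delivered)++;
        }
        i++;
    }
    return i;
}
```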
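In the DSACK handling here, why is dsack still used as the right edge even when dsack is greater than end_seq?
The remaining code belongs to tcp_data_queue and handles segments that do not start at rcv_nxt: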
-
- if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
-
- NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_DELAYEDACKLOST);
- tcp_dsack_set(sk, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);
-
- out_of_window:
- tcp_enter_quickack_mode(sk);
- inet_csk_schedule_ack(sk);
- drop:
- __kfree_skb(skb);
- return;
- }
-
-
- if (!before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt + tcp_receive_window(tp)))
- goto out_of_window;
-
- tcp_enter_quickack_mode(sk);
-
- if (before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
-
- SOCK_DEBUG(sk, "partial packet: rcv_next %X seq %X - %X\n",
- tp->rcv_nxt, TCP_SKB_CB(skb)->seq,
- TCP_SKB_CB(skb)->end_seq);
-
- tcp_dsack_set(sk, TCP_SKB_CB(skb)->seq, tp->rcv_nxt);
-
-
-
-
- if (!tcp_receive_window(tp))
- goto out_of_window;
- goto queue_and_out;
- }
- TCP_ECN_check_ce(tp, skb);
-
-
- if (tcp_try_rmem_schedule(sk, skb->truesize))
- goto drop;
-
-
- tp->pred_flags = 0;
- inet_csk_schedule_ack(sk);
-
- SOCK_DEBUG(sk, "out of order segment: rcv_next %X seq %X - %X\n",
- tp->rcv_nxt, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);
-
- skb_set_owner_r(skb, sk);
-
- if (!skb_peek(&tp->out_of_order_queue)) {
-
- if (tcp_is_sack(tp)) {
- tp->rx_opt.num_sacks = 1;
- tp->rx_opt.dsack = 0;
- tp->rx_opt.eff_sacks = 1;
- tp->selective_acks[0].start_seq = TCP_SKB_CB(skb)->seq;
- tp->selective_acks[0].end_seq =
- TCP_SKB_CB(skb)->end_seq;
- }
- __skb_queue_head(&tp->out_of_order_queue, skb);
- } else {
- struct sk_buff *skb1 = tp->out_of_order_queue.prev;
- u32 seq = TCP_SKB_CB(skb)->seq;
- u32 end_seq = TCP_SKB_CB(skb)->end_seq;
-
- if (seq == TCP_SKB_CB(skb1)->end_seq) {
- __skb_queue_after(&tp->out_of_order_queue, skb1, skb);
-
- if (!tp->rx_opt.num_sacks ||
- tp->selective_acks[0].end_seq != seq)
- goto add_sack;
-
-
- tp->selective_acks[0].end_seq = end_seq;
- return;
- }
-
-
-
- do {
- if (!after(TCP_SKB_CB(skb1)->seq, seq))
- break;
- } while ((skb1 = skb1->prev) !=
- (struct sk_buff *)&tp->out_of_order_queue);
-
-
- if (skb1 != (struct sk_buff *)&tp->out_of_order_queue &&
- before(seq, TCP_SKB_CB(skb1)->end_seq)) {
- if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
-
- __kfree_skb(skb);
- tcp_dsack_set(sk, seq, end_seq);
- goto add_sack;
- }
- if (after(seq, TCP_SKB_CB(skb1)->seq)) {
-
- tcp_dsack_set(sk, seq,
- TCP_SKB_CB(skb1)->end_seq);
- } else {
- skb1 = skb1->prev;
- }
- }
-
- __skb_queue_after(&tp->out_of_order_queue, skb1, skb);
-
-
- while ((skb1 = skb->next) !=
- (struct sk_buff *)&tp->out_of_order_queue &&
- after(end_seq, TCP_SKB_CB(skb1)->seq)) {
- if (before(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
- tcp_dsack_extend(sk, TCP_SKB_CB(skb1)->seq,
- end_seq);
- break;
- }
- __skb_unlink(skb1, &tp->out_of_order_queue);
- tcp_dsack_extend(sk, TCP_SKB_CB(skb1)->seq,
- TCP_SKB_CB(skb1)->end_seq);
- __kfree_skb(skb1);
- }
-
- add_sack:
- if (tcp_is_sack(tp))
-
- tcp_sack_new_ofo_skb(sk, seq, end_seq);
- }
- }
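Taken as a whole, tcp_data_queue sorts each segment into one of a few cases before acting on it. The sketch below captures only that classification, in the order the checks appear in the code. `enum seg_class` and `classify_segment` are hypothetical names, and queueing, memory accounting, and SACK bookkeeping are all omitted.

```c
#include <stdint.h>

enum seg_class {
    SEG_EMPTY,         /* carries no data: dropped */
    SEG_IN_ORDER,      /* starts exactly at rcv_nxt */
    SEG_DUPLICATE,     /* entirely before rcv_nxt: DSACK, then dropped */
    SEG_OUT_OF_WINDOW, /* beyond the window, or the window is zero */
    SEG_PARTIAL,       /* overlaps rcv_nxt: the new part is queued */
    SEG_OUT_OF_ORDER   /* inside the window but after a gap */
};

static int seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

static int seq_after(uint32_t a, uint32_t b)
{
    return seq_before(b, a);
}

static enum seg_class classify_segment(uint32_t seq, uint32_t end_seq,
                                       uint32_t rcv_nxt, uint32_t rcv_window)
{
    if (seq == end_seq)
        return SEG_EMPTY;
    if (seq == rcv_nxt)
        return rcv_window ? SEG_IN_ORDER : SEG_OUT_OF_WINDOW;
    if (!seq_after(end_seq, rcv_nxt))
        return SEG_DUPLICATE;
    if (!seq_before(seq, rcv_nxt + rcv_window))
        return SEG_OUT_OF_WINDOW;
    if (seq_before(seq, rcv_nxt))
        return rcv_window ? SEG_PARTIAL : SEG_OUT_OF_WINDOW;
    return SEG_OUT_OF_ORDER;
}
```

Only the last case goes to the out-of-order queue and clears pred_flags, which is the first of the fast-path-to-slow-path triggers listed earlier.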