1 Overview
     One way epoll differs from select/poll is that it is made up of a group
of system calls:
     int epoll_create(int size);
     int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
     int epoll_wait(int epfd, struct epoll_event *events,
                       int maxevents, int timeout);
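     The epoll system calls were introduced in Linux 2.5.44. They were
designed to address the shortcomings of the traditional select/poll system
calls, and the design differs from them substantially. The drawbacks of
select/poll are:
     1. Every call has to copy its arguments in from user space again.
     2. Every call has to scan all the file descriptors again.
     3. Every call has to add the current process to the wait queue of each
file descriptor at the start, and remove it from all of those wait queues
again before it returns.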
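     A minimal sketch of the calling pattern behind drawbacks 1 and 2,
assuming Linux/POSIX poll() and a handful of pipes standing in for real
connections. The helper name poll_find_ready and the constant NFDS are
invented for this illustration and do not come from the original article.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

#define NFDS 8  /* illustrative number of monitored pipes */

/*
 * Creates NFDS pipes, makes exactly one of them readable, and returns the
 * index that poll() reports as ready (or -1 on error).
 */
int poll_find_ready(int ready_index)
{
    int pipes[NFDS][2];
    struct pollfd pfds[NFDS];
    int created = 0;
    int found = -1;

    assert(ready_index >= 0 && ready_index < NFDS);

    for (int i = 0; i < NFDS; i++) {
        if (pipe(pipes[i]) < 0)
            goto out;
        created++;
        pfds[i].fd = pipes[i][0];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }

    if (write(pipes[ready_index][1], "x", 1) != 1)
        goto out;

    /* The whole array is handed to the kernel on every single call. */
    if (poll(pfds, NFDS, 1000) > 0) {
        /* The caller then scans every entry to find the ready ones. */
        for (int i = 0; i < NFDS; i++) {
            if (pfds[i].revents & POLLIN) {
                found = i;
                break;
            }
        }
    }

out:
    for (int i = 0; i < created; i++) {
        close(pipes[i][0]);
        close(pipes[i][1]);
    }
    return found;
}
```

     Only one descriptor is ready here, yet both the kernel and the caller
still walk all NFDS entries, which is the cost the list above describes.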
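     In real applications select/poll may be monitoring a very large number
of file descriptors. If only a small fraction of them is reported on each
call, select/poll is inefficient. The idea behind epoll is to split the
single select/poll operation into one epoll_create, several epoll_ctl calls
and one epoll_wait. In addition, the kernel adds a file system,
"eventpollfs", for epoll operations. Each epoll instance, which monitors one
or more file descriptors, has a corresponding inode in eventpollfs, and its
main state is kept in a struct eventpoll. The important information about
each monitored file is kept in a struct epitem. One eventpoll therefore
corresponds to many epitems.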
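     Because epoll_create and epoll_ctl have already saved the user-space
information in the kernel, repeated calls to epoll_wait do not copy the
arguments again, do not rescan the file descriptors, and do not add the
current process to, and remove it from, the wait queue of every monitored
file. This avoids the three drawbacks listed above.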
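     The split into create, ctl and wait can be sketched from user space as
follows. This is only an illustration under the assumption of a Linux system
with <sys/epoll.h>; the function name epoll_wait_rounds is made up for the
example, and a pipe stands in for a real connection.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Registers the read end of a pipe ONCE, then waits on it `rounds` times.
 * Returns the number of successful wake-ups, or -1 on error.
 */
int epoll_wait_rounds(int rounds)
{
    int pfd[2];
    int epfd;
    int wakeups = 0;
    int result = -1;
    struct epoll_event ev;

    assert(rounds > 0);

    if (pipe(pfd) < 0)
        return -1;

    epfd = epoll_create(1);           /* size is only a hint, must be > 0 */
    if (epfd < 0)
        goto close_pipe;

    /* The interest set is copied into the kernel here, a single time. */
    ev.events = EPOLLIN;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) < 0)
        goto close_epoll;

    for (int i = 0; i < rounds; i++) {
        struct epoll_event out;
        char byte;

        if (write(pfd[1], "x", 1) != 1)
            goto close_epoll;
        /* No descriptor set is passed in: the kernel already has it. */
        if (epoll_wait(epfd, &out, 1, 1000) != 1 || out.data.fd != pfd[0])
            goto close_epoll;
        if (read(pfd[0], &byte, 1) != 1)
            goto close_epoll;
        wakeups++;
    }
    result = wakeups;

close_epoll:
    close(epfd);
close_pipe:
    close(pfd[0]);
    close(pfd[1]);
    return result;
}
```

     Calling epoll_wait_rounds(3), for instance, performs one registration
and three waits, in contrast to select/poll, where the set would be passed
in three times.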
     Next, let's look at how they are implemented:
2 Key data structures:
/* Wrapper struct used by poll queueing */
struct ep_pqueue {
         poll_table pt;
         struct epitem *epi;
};
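     This structure is similar to struct poll_wqueues in select/poll. Because
epoll has to keep a lot of state in the kernel, a bare callback function
pointer is no longer enough, so a new structure, struct epitem, is introduced
here.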
/*
 * Each file descriptor added to the eventpoll interface will
 * have an entry of this type linked to the hash.
 */
struct epitem {
         /*
          * RB-Tree node used to link this structure to the eventpoll rb-tree.
          * Note: this node places the epitem in the eventpoll's red-black tree.
          */
         struct rb_node rbn;
         /*
          * List header used to link this structure to the eventpoll ready list.
          * Note: doubly linked list node; the epitem is on this list once it
          * has ready events.
          */
         struct list_head rdllink;
         /*
          * The file descriptor information this item refers to.
          * Note: identifies the monitored file descriptor for this item.
          */
         struct epoll_filefd ffd;
         /*
          * Number of active wait queue attached to poll operations.
          * Note: how many wait queues this item is currently attached to.
          */
         int nwait;
         /*
          * List containing poll wait queues.
          * Note: doubly linked list of the wait queue entries of the monitored
          * file; its role is similar to the poll_table in select/poll.
          */
         struct list_head pwqlist;
         /*
          * The "container" of this item.
          * Note: points to the eventpoll; many epitems belong to one eventpoll.
          */
         struct eventpoll *ep;
         /*
          * The structure that describe the interested events and the source fd.
          * Note: the events the caller is interested in, together with the
          * caller's data for this descriptor.
          */
         struct epoll_event event;
         /*
          * Used to keep track of the usage count of the structure. This avoids
          * that the structure will disappear from underneath our processing.
          * Note: reference count.
          */
         atomic_t usecnt;
         /*
          * List header used to link this item to the "struct file" items list.
          * Note: doubly linked list node that links the item to the struct file
          * of the monitored descriptor, whose f_ep_links list holds every
          * epoll item that monitors that file.
          */
         struct list_head fllink;
         /*
          * List header used to link the item to the transfer list.
          * Note: doubly linked list node for the transfer queue.
          */
         struct list_head txlink;
         /*
          * This is used during the collection/transfer of events to userspace
          * to pin items empty events set.
          * Note: the state found for the descriptor, recorded while events are
          * collected and transferred.
          */
         unsigned int revents;
};
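     struct epitem is how an epoll node keeps track of the file descriptors
associated with it. The items are organized in a red-black tree (the kernel
comment above still says "hash", but the link field is an rb-tree node). Why
they have to be stored is explained in detail below. Each epitem corresponds
to exactly one monitored file descriptor.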
struct eventpoll {
         /*
          * Protect access to this structure.
          * Note: read-write lock.
          */
         rwlock_t lock;
         /*
          * This semaphore is used to ensure that files are not removed
          * while epoll is using them. This is read-held during the event
          * collection loop and it is write-held during the file cleanup
          * path, the epoll file exit code and the ctl operations.
          * Note: read-write semaphore.
          */
         struct rw_semaphore sem;
         /* Wait queue used by sys_epoll_wait() */
         wait_queue_head_t wq;
         /* Wait queue used by file->poll() */
         wait_queue_head_t poll_wait;
         /*
          * List of ready file descriptors.
          * Note: queue of the items whose events have completed.
          */
         struct list_head rdllist;
         /*
          * RB-Tree root used to store monitored fd structs.
          * Note: holds the file descriptors monitored by this epoll instance.
          */
         struct rb_root rbr;
};
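     This structure holds the extended information of an epoll file
descriptor and is stored in the private_data field of the file structure. It
corresponds one-to-one with the epoll file node. An epoll file node usually
monitors several file descriptors, so one eventpoll structure corresponds to
many epitem structures.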
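     The one-to-many relationship and the role of the ready list can be
modelled in user space. The sketch below is a deliberately simplified
stand-in, not the kernel code: plain singly linked lists replace the
red-black tree and list_head, there is no locking, and every model_* name is
invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

struct model_eventpoll;

/* One per monitored descriptor, loosely mirroring struct epitem. */
struct model_epitem {
    int fd;                           /* monitored descriptor */
    int on_ready_list;                /* non-zero while linked as ready */
    struct model_eventpoll *ep;       /* back pointer to the container */
    struct model_epitem *next_all;    /* stand-in for the rbn tree link */
    struct model_epitem *next_ready;  /* stand-in for rdllink */
};

/* One per epoll instance, loosely mirroring struct eventpoll. */
struct model_eventpoll {
    struct model_epitem *all;         /* stand-in for rbr */
    struct model_epitem *ready;       /* stand-in for rdllist */
};

/* Register a descriptor: one eventpoll gains one more epitem. */
void model_insert(struct model_eventpoll *ep, struct model_epitem *epi, int fd)
{
    assert(ep != NULL && epi != NULL);
    epi->fd = fd;
    epi->on_ready_list = 0;
    epi->ep = ep;
    epi->next_ready = NULL;
    epi->next_all = ep->all;
    ep->all = epi;
}

/* What a wake-up would do: link the item as ready, at most once. */
void model_mark_ready(struct model_epitem *epi)
{
    assert(epi != NULL && epi->ep != NULL);
    if (epi->on_ready_list)
        return;
    epi->on_ready_list = 1;
    epi->next_ready = epi->ep->ready;
    epi->ep->ready = epi;
}

/* The waiter drains only the ready list, never the full set. */
int model_collect(struct model_eventpoll *ep, int *fds, int max)
{
    int n = 0;

    assert(ep != NULL && fds != NULL);
    while (ep->ready != NULL && n < max) {
        struct model_epitem *epi = ep->ready;

        ep->ready = epi->next_ready;
        epi->next_ready = NULL;
        epi->on_ready_list = 0;
        fds[n++] = epi->fd;
    }
    return n;
}
```

     In this model, marking one of three registered items as ready makes
model_collect() return just that item, without touching the other two.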
     So where does epoll keep its wait entries? See below:
/* Wait structure used by the poll hooks */
struct eppoll_entry {
         /* List header used to link this structure to the "struct epitem" */
         struct list_head llink;
         /* The "base" pointer is set to the container "struct epitem" */
         void *base;
         /*
          * Wait queue item that will be linked to the target file wait
          * queue head.
          */
         wait_queue_t wait;
         /* The wait queue head that linked the "wait" wait queue item */
         wait_queue_head_t *whead;
};
     The structure epoll uses for a wait queue entry differs only slightly
from struct poll_table_entry in select/poll. For comparison:
struct poll_table_entry {
         struct file * filp;
         wait_queue_t wait;
         wait_queue_head_t * wait_address;
};
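     Because an epitem corresponds to one monitored file, the base pointer
gives convenient access to the information about that file. And because one
file may be waited on through more than one wait queue entry, llink chains
the entries that belong to the same epitem together.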
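     How a wake-up callback can get from the embedded wait node back to the
monitored item can be imitated in user space. The sketch assumes a
container_of-style pointer adjustment followed by the base pointer; all
model_* names and the MODEL_CONTAINER_OF macro are invented here, and the
types are simplified stand-ins rather than the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a wait queue item: it only carries a callback. */
struct model_wait_node {
    void (*func)(struct model_wait_node *node);
};

/* Stand-in for struct epitem. */
struct model_item {
    int fd;        /* monitored descriptor */
    int wakeups;   /* how many callbacks have reached this item */
};

/* Stand-in for struct eppoll_entry. */
struct model_entry {
    void *base;                    /* owning item, like eppoll_entry.base */
    struct model_wait_node wait;   /* embedded node, like eppoll_entry.wait */
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define MODEL_CONTAINER_OF(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* The callback receives only the wait node and works its way back. */
static void model_callback(struct model_wait_node *node)
{
    struct model_entry *entry =
        MODEL_CONTAINER_OF(node, struct model_entry, wait);
    struct model_item *item = entry->base;

    item->wakeups++;
}

/* Attach one entry to an item, as one poll hook invocation would. */
void model_attach(struct model_entry *entry, struct model_item *item)
{
    assert(entry != NULL && item != NULL);
    entry->base = item;
    entry->wait.func = model_callback;
}

/* Simulate a wake-up of the queue that the node is linked on. */
void model_wake(struct model_wait_node *node)
{
    assert(node != NULL && node->func != NULL);
    node->func(node);
}
```

     Two entries attached to the same item both lead back to it, which is
the situation llink exists to keep track of.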
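3 Implementation of epoll_create
     epoll_create() creates an inode in the eventpollfs file system. The work
is done by ep_getfd(), which first calls ep_eventpoll_inode() to create an
inode, then calls d_alloc() to allocate a dentry for the inode, and finally
ties the file, dentry and inode together.
     After ep_getfd() has run, epoll_create() calls ep_file_init(), which
allocates the eventpoll structure and stores its pointer in the file
structure, so that the eventpoll is associated with the file.
     Note that the size argument of epoll_create() is only a hint. As long as
it is greater than 0, it places no limit on the number of file descriptors
that can be associated with this epoll inode.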
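     The behaviour of the size argument can be checked from user space with a
short sketch, assuming a Linux system where epoll_create() rejects
non-positive sizes with EINVAL, as its manual page documents. The helper
names epoll_create_errno and check_size_hint are invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Returns 0 if epoll_create(size) succeeds, otherwise the errno value. */
int epoll_create_errno(int size)
{
    int epfd = epoll_create(size);

    if (epfd < 0)
        return errno;
    close(epfd);
    return 0;
}

/* Usage: non-positive sizes are rejected, any positive size is accepted. */
void check_size_hint(void)
{
    assert(epoll_create_errno(0) == EINVAL);
    assert(epoll_create_errno(-1) == EINVAL);
    assert(epoll_create_errno(1) == 0);
    assert(epoll_create_errno(100000) == 0);
}
```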
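4 Implementation of epoll_ctl
     epoll_ctl carries out a set of operations, such as associating a file
with an inode of the eventpollfs file system. The eventpoll structure matters
here: it is stored in file->private_data and records the important
information of the eventpollfs inode. Its member rbr holds all the file
descriptors monitored by this epoll file node, organized as a red-black tree,
a structure in which lookups are very efficient.
     epoll_ctl first calls ep_find() to look up the epitem in the red-black
tree of the eventpoll, then chooses an operation according to the op
argument. If op is EPOLL_CTL_ADD, the epitem is normally not found in the
tree, so ep_insert() is called to create an epitem and insert it into the
tree.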
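     The effect of that lookup is visible from user space through the error
codes of epoll_ctl(). The following sketch assumes the errno values
documented in the Linux manual page (EEXIST for a duplicate add, ENOENT for
modifying or deleting an unregistered descriptor); ctl_errno and
ctl_sequence are names invented for the illustration.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Applies op with an EPOLLIN interest set; returns 0 or the errno value. */
int ctl_errno(int epfd, int op, int fd)
{
    struct epoll_event ev;

    ev.events = EPOLLIN;
    ev.data.fd = fd;
    return epoll_ctl(epfd, op, fd, &ev) == 0 ? 0 : errno;
}

/*
 * Runs MOD, ADD, ADD, DEL, DEL on one pipe descriptor and stores each
 * outcome in results[0..4]. Returns 0, or -1 if the setup failed.
 */
int ctl_sequence(int results[5])
{
    int pfd[2];
    int epfd;

    assert(results != NULL);

    if (pipe(pfd) < 0)
        return -1;
    epfd = epoll_create(1);
    if (epfd < 0) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }

    results[0] = ctl_errno(epfd, EPOLL_CTL_MOD, pfd[0]); /* not registered */
    results[1] = ctl_errno(epfd, EPOLL_CTL_ADD, pfd[0]); /* first add */
    results[2] = ctl_errno(epfd, EPOLL_CTL_ADD, pfd[0]); /* duplicate add */
    results[3] = ctl_errno(epfd, EPOLL_CTL_DEL, pfd[0]); /* remove */
    results[4] = ctl_errno(epfd, EPOLL_CTL_DEL, pfd[0]); /* already removed */

    close(epfd);
    close(pfd[0]);
    close(pfd[1]);
    return 0;
}
```

     The sequence yields ENOENT, 0, EEXIST, 0, ENOENT: an add succeeds only
when the descriptor is not yet known to the epoll instance, and the other
operations succeed only when it is.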
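     ep_insert() first allocates an epitem object, initializes it and puts it
into the red-black tree. The function also has to register on the wait queue
of the monitored file. This step is done by the following code: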
     init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
     ...
     revents = tfile->f_op->poll(tfile, &epq.pt);
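     The function first calls init_poll_funcptr() to register the callback
ep_ptable_queue_proc, which is executed when f_op->poll is called.
ep_ptable_queue_proc allocates an epoll wait queue entry, eppoll_entry, links
it into the wait queue of the monitored file, and also links it into the list
of the epitem. In addition, it registers a wait queue callback,
ep_poll_callback. When the file operation completes, ep_poll_callback() is
invoked as part of the wake-up; it puts the epitem on the ready list of the
eventpoll and wakes up the waiting process.
     If f_op->poll shows that the operation on the monitored file has already
completed, the epitem is put on the ready list right away and the processes
waiting for it are woken up immediately.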
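     The "already ready at registration time" case can be observed from user
space. This is a small sketch, assuming Linux epoll and a pipe as the
monitored file; the function name ready_count_after_add is invented for the
example.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Registers the read end of a pipe and then asks epoll, without blocking,
 * how many events are ready. If write_first is non-zero, one byte is
 * written BEFORE EPOLL_CTL_ADD. Returns the epoll_wait() result, or -1 on
 * a setup error.
 */
int ready_count_after_add(int write_first)
{
    int pfd[2];
    int epfd;
    int n = -1;
    struct epoll_event ev, out;

    if (pipe(pfd) < 0)
        return -1;
    epfd = epoll_create(1);
    if (epfd < 0)
        goto close_pipe;

    if (write_first && write(pfd[1], "x", 1) != 1)
        goto close_epoll;

    ev.events = EPOLLIN;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) < 0)
        goto close_epoll;

    /* Timeout 0: report what is already ready, never sleep. */
    n = epoll_wait(epfd, &out, 1, 0);
    assert(n <= 1);                   /* maxevents is 1 */

close_epoll:
    close(epfd);
close_pipe:
    close(pfd[0]);
    close(pfd[1]);
    return n;
}
```

     With data written before the add, the very first non-blocking
epoll_wait() already reports one event. Without it, the call reports none.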
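5 Implementation of epoll_wait
     epoll_wait waits for file operations to complete and returns them.
     Its core is ep_poll(), which checks in a for loop whether the ready list
of the eventpoll contains any completed events. If it does, the results are
returned. If not, it calls schedule_timeout() to sleep until the process is
woken up again or the timeout expires.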
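     The timeout path can be seen from user space as well. The sketch below
assumes Linux epoll and an idle pipe, so nothing ever becomes ready; the
function name wait_on_idle_pipe is invented for the illustration.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Waits up to timeout_ms milliseconds on an epoll instance whose only
 * registered descriptor never becomes readable. Returns the epoll_wait()
 * result, or -1 on a setup error.
 */
int wait_on_idle_pipe(int timeout_ms)
{
    int pfd[2];
    int epfd;
    int n = -1;
    struct epoll_event ev, out;

    assert(timeout_ms >= 0);          /* -1 would block forever here */

    if (pipe(pfd) < 0)
        return -1;
    epfd = epoll_create(1);
    if (epfd < 0)
        goto close_pipe;

    ev.events = EPOLLIN;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) < 0)
        goto close_epoll;

    /* Nothing is ready, so the caller sleeps until the timeout expires. */
    n = epoll_wait(epfd, &out, 1, timeout_ms);

close_epoll:
    close(epfd);
close_pipe:
    close(pfd[0]);
    close(pfd[1]);
    return n;
}
```

     A return value of 0 means the timeout expired with an empty ready list,
provided no signal interrupted the wait.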
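6 Performance analysis
     The epoll mechanism was designed around the shortcomings of select/poll.
Through the newly introduced eventpollfs file system, epoll copies its
arguments into the kernel once and does not copy them again on every poll.
Splitting the operation into epoll_create, epoll_ctl and epoll_wait avoids
traversing all monitored file descriptors again and again. Moreover, once the
process that called epoll is woken up, it only has to take the completed
events from the ready list of the eventpoll, so the cost of finding a
completed event drops from O(N) in the number of monitored descriptors to
O(1).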
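     However, the performance gain of epoll depends on a precondition: the
number of monitored file descriptors is very large, and only a few of them
complete on each call. Whether epoll improves efficiency noticeably therefore
depends on the actual workload, and this needs further testing.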

Reposted from http://www.freecity.cn/agent/thread.do?id=LinuxDev-48b24eba-c4e53e6f2d89ff3cb039f2c4ed4102e9

from:

http://blog.chinaunix.net/u2/67780/showart_2064403.html

posted on 2010-04-24 22:41 by chatler, category: Socket
