
            woaidongmao

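            All articles here are collected from other people's blogs. I don't prefix the titles with "-[repost]" because it looks ugly; please bear with me!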

            Writing a Web Spider in C to Search for Email Addresses on the Web

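                You may often need to search the Internet for specific content, such as collecting a large number of email addresses. A search engine like Google cannot do this kind of specialized job, so let's write a program in C. It keeps fetching pages from the web, extracts the email addresses that appear on them, and saves them. Like a spider, it crawls from one web page to the next and never stops searching for email addresses.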

            Of course, this is only a proof-of-concept program and has not been optimized.
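
            To make the idea of "extracting the email addresses that appear on a page" concrete, here is a minimal sketch of scanning left and right from an `@` sign. The helper names `is_addr_char` and `first_email` are hypothetical and do not appear in the original program. The accepted character set (letters, digits, `.`, `-`, `_`) follows the scanner in the source listing, but the validity check is an assumption and is simpler than the original's.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Characters accepted on either side of '@': letters, digits, '.', '-', '_'. */
static int is_addr_char(unsigned char c)
{
    return isalnum(c) || c == '.' || c == '-' || c == '_';
}

/*
 * Hypothetical helper: copy the first address-like token found in text
 * into out (capacity cap, always NUL-terminated).
 * Returns 1 if a token was found, 0 otherwise.
 */
static int first_email(const char *text, char *out, size_t cap)
{
    const char *at;

    assert(text != NULL && out != NULL && cap > 0);
    out[0] = '\0';
    for (at = strchr(text, '@'); at != NULL; at = strchr(at + 1, '@')) {
        const char *begin = at;
        const char *end = at + 1;
        size_t len;

        /* Walk left, never stepping before the start of the buffer. */
        while (begin > text && is_addr_char((unsigned char)begin[-1]))
            begin--;
        /* Walk right until the first character outside the accepted set. */
        while (is_addr_char((unsigned char)*end))
            end++;

        len = (size_t)(end - begin);
        /* Require text on both sides of '@' and room in the output buffer. */
        if (begin < at && end > at + 1 && len < cap) {
            memcpy(out, begin, len);
            out[len] = '\0';
            return 1;
        }
    }
    return 0;
}
```

            Calling `first_email(html, buf, sizeof buf)` on a page buffer fills `buf` with the first candidate address. Bounding the leftward walk by the start of the buffer is a deliberate choice in this sketch, so the scan cannot read before the page text.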

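            The flow chart of the program's main function is as follows:
            clip_image002
            That is: parse the command-line arguments, add each web page address to the linked list as a root node, then process the nodes starting from the head of the list.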

            The whole list is processed sibling nodes first; the flow chart is as follows:
            clip_image004

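            Then the child nodes of each node are processed; the flow chart is as follows:
            clip_image006
            Recursion is used here, of course: the child nodes are processed with the same loop that handles the whole list.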
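
            The traversal order described above (all siblings on a level first, then recursion into each child list) can be sketched as follows. The `struct node` type and the `visit` and `walk` functions are hypothetical simplifications for illustration; the real program fetches and parses a page where this sketch only records a label.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, stripped-down node: a label plus the two links. */
struct node {
    const char *name;
    int handled;
    struct node *brother;   /* next node on the same level */
    struct node *child;     /* first node on the next level */
};

/* Append the node's name to trace (capacity cap) and mark it handled. */
static void visit(struct node *n, char *trace, size_t cap)
{
    assert(n != NULL && trace != NULL);
    /* +2 leaves room for the separating space and the terminating NUL. */
    if (strlen(trace) + strlen(n->name) + 2 <= cap) {
        strcat(trace, n->name);
        strcat(trace, " ");
    }
    n->handled = 1;
}

/* Pass 1 handles every sibling on this level; pass 2 recurses into each child list. */
static void walk(struct node *first, char *trace, size_t cap)
{
    struct node *cur;

    for (cur = first; cur != NULL; cur = cur->brother)
        if (!cur->handled)
            visit(cur, trace, cap);

    for (cur = first; cur != NULL; cur = cur->brother)
        if (cur->child != NULL)
            walk(cur->child, trace, cap);
}
```

            With two roots `A` and `B`, where `A` has children `a1` and `a2` and `B` has child `b1`, `walk` records `A B a1 a2 b1`. Note that this is not a strict breadth-first order for deeper trees: after a level's siblings are handled, each child list is fully explored before the next sibling's child list.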

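            /************ About this document ********************************************
            *filename: Writing a web spider in C to search for email addresses on the web
            *purpose: a prototype of an email address search program
            *wrote by: zhoulifa(zhoulifa@163.com) 周立發(http://zhoulifa.bokee.com)
            Linux enthusiast, Linux evangelist, SOHO developer, most skilled in C
            *date time:2006-08-31 21:00:00
            *Note: anyone may freely copy the code and use these documents, including for commercial purposes,
            * but please comply with the GPL
            *Hope: that more and more people contribute their efforts to the advancement of science and technology
            *********************************************************************/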

            While running, the program builds a tree-shaped linked list structure, shown in the following diagram:
            clip_image008

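            At startup the program parses its arguments and adds each one as a root page node; if there is more than one argument, the root page has sibling nodes.
            Then, starting from the root node, it processes every node on that level, adding the links found on each node's page as that node's children. Once the current level is done, it moves on to the level of child nodes.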
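
            As an illustration of how such a first-child / next-sibling tree can be built, here is a minimal sketch. The `struct page` type and the functions `new_page`, `append_sibling`, `add_child`, `count_pages` and `free_pages` are hypothetical names introduced for this example, and the URLs in the usage note are placeholders. The original program stores host, port, directory and page separately, while this sketch keeps a single URL string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical simplified node: one URL string plus the two links. */
struct page {
    char *url;
    struct page *brother;   /* next page on the same level */
    struct page *child;     /* first page linked from this one */
};

/* Allocate a node holding a private copy of url; returns NULL on failure. */
static struct page *new_page(const char *url)
{
    struct page *p = calloc(1, sizeof *p);

    if (p == NULL)
        return NULL;
    p->url = malloc(strlen(url) + 1);
    if (p->url == NULL) {
        free(p);
        return NULL;
    }
    strcpy(p->url, url);
    return p;
}

/* Append a new node to the end of the sibling list whose head is *head. */
static struct page *append_sibling(struct page **head, const char *url)
{
    struct page *p;

    assert(head != NULL && url != NULL);
    p = new_page(url);
    if (p == NULL)
        return NULL;
    while (*head != NULL)
        head = &(*head)->brother;
    *head = p;
    return p;
}

/* A child is simply a sibling appended to the parent's child list. */
static struct page *add_child(struct page *parent, const char *url)
{
    assert(parent != NULL);
    return append_sibling(&parent->child, url);
}

/* Count every node reachable from the sibling list starting at p. */
static size_t count_pages(const struct page *p)
{
    size_t n = 0;

    for (; p != NULL; p = p->brother)
        n += 1 + count_pages(p->child);
    return n;
}

/* Release the whole tree, children before their parent. */
static void free_pages(struct page *p)
{
    while (p != NULL) {
        struct page *next = p->brother;

        free_pages(p->child);
        free(p->url);
        free(p);
        p = next;
    }
}
```

            Starting from `struct page *roots = NULL;`, each command-line address would be added with `append_sibling(&roots, url)`, and each link found on a page with `add_child(page, link)`. Appending through a pointer to the head pointer avoids a separate global tail pointer, which is one possible design and not necessarily the one the original code uses.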

            The source code is as follows:
            [code]
            #include <sys/types.h>
            #include <sys/stat.h>
            #include <fcntl.h>
            #include <sys/mman.h>
            #include <unistd.h>
            #include <stdio.h>
            #include <string.h>
            #include <stdlib.h>
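            #include <sys/socket.h>   /* socket, connect */
            #include <sys/select.h>   /* select, fd_set */
            #include <netinet/in.h>   /* struct sockaddr_in, htons */
            #include <netdb.h>
            #include <strings.h>      /* bzero, strncasecmp */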
            #include <errno.h>
            #include <locale.h>

            #define USERAGENT "Wget/1.10.2"
            #define ACCEPT "*/*"
            #define ACCEPTLANGUAGE "zh-cn,zh;q=0.5"
            #define ACCEPTENCODING "gzip,deflate"
            #define ACCEPTCHARSET "gb2312,utf-8;q=0.7,*;q=0.7"
            #define KEEPALIVE "300"
            #define CONNECTION "keep-alive"
            #define CONTENTTYPE "application/x-www-form-urlencoded"

            #define MAXFILENAME 14
            #define DEBUG 1

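            typedef struct webnode {
                    char * host;                 /* host serving the page */
                    int port;                    /* port used by the web server */
                    char * dir;                  /* directory containing the page */
                    char * page;                 /* page file name */
                    char * file;                 /* local file name the page is saved to */
                    char IsHandled;              /* whether the page has been processed */
                    struct webnode * brother;    /* pointer to the sibling list */
                    struct webnode * child;      /* pointer to the child list */
            } WEBNODE;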

            struct sockaddr_in server_addr;
            int sockfd = 0, dsend = 0, totalsend = 0, nbytes = 0, reqn = 0, i = 0, j = 0, ret = 0;
            struct hostent *host;
            char request[409600] = "", buffer[1024] = "", httpheader[1024] = "";
            int FileNumber = 0;
            char e[2] = "@/";
            WEBNODE * NodeHeader, * NodeTail, * NodeCurr;
            char * mapped_mem;

            int GetHost(char * , char ** , char ** , int * , char ** ); /**/
            void AnalyzePage(WEBNODE *); /**/
            void AddInitNode(char *, char *, int, char * ); /**/
            void HandleInitNode(WEBNODE *); /**/
            void DisplayNode(WEBNODE *); /**/
            void HandOneNode(WEBNODE *); /**/
            void DoneWithList(int); /**/
            void DoOnce(); /**/
            void ConnectWeb(void); /**/
            void SendRequest(void); /**/
            void ReceiveResponse(void); /**/
            void GetEmail(char * ); /**/
            void GetLink(char * ); /**/
            void GetBeforePos(char * , char ** ); /**/
            void GetAfterPos(char * , char ** ); /**/
            void AddChildNode(WEBNODE * , char * ); /**/
            void GetAfterPosWithSlash(char * , char ** ); /**/
            void GetMemory(char ** , int ); /**/
            int IsExistWeb(WEBNODE * , char * , char * , int , char * ); /**/
            void Rstrchr(char * , int , char ** ); /**/
            int GetLocalAgent(char * UserAgent, char * Accept, char * AcceptLanguage, char * AcceptEncoding, char * AcceptCharset, char * KeepAlive, char * Connection, char * ContentType); /**/

            /**************************************************************
            Purpose: set some fixed values for the HTTP header fields
            ***************************************************************/
            int GetLocalAgent(char * UserAgent, char * Accept, char * AcceptLanguage, char * AcceptEncoding, char * AcceptCharset, char * KeepAlive, char * Connection, char * ContentType)
            {
              memcpy(UserAgent, USERAGENT, strlen(USERAGENT));
              memcpy(Accept, ACCEPT, strlen(ACCEPT));
              memcpy(AcceptLanguage, ACCEPTLANGUAGE, strlen(ACCEPTLANGUAGE));
              memcpy(AcceptEncoding, ACCEPTENCODING, strlen(ACCEPTENCODING));
              memcpy(AcceptCharset, ACCEPTCHARSET, strlen(ACCEPTCHARSET));
              memcpy(KeepAlive, KEEPALIVE, strlen(KEEPALIVE));
              memcpy(Connection, CONNECTION, strlen(CONNECTION));
              memcpy(ContentType, CONTENTTYPE, strlen(CONTENTTYPE));
              return 0;
            }

            /**************************************************************
            Purpose: find the last occurrence of character x in string s and point d at it (d is set to 0 if not found)
            ***************************************************************/
            void Rstrchr(char * s, int x, char ** d)
            {
                    int len = strlen(s) - 1;
                    while(len >= 0)        {
                            if(x == s[len]) {(*d) = s + len; return;}
                            len--;
                    }
                    (*d) = 0;
            }

            /**************************************************************
            Purpose: connect to a web server
            ***************************************************************/
            void ConnectWeb(void) { /* connect to web server */
              /* create a socket descriptor */
              if((sockfd=socket(PF_INET,SOCK_STREAM,0))==-1)
              {
                fprintf(stderr,"\tSocket Error:%s\a\n",strerror(errno));
                exit(1);
              }

              /* bind address */
              bzero(&server_addr, sizeof(server_addr));
              server_addr.sin_family = AF_INET;
              server_addr.sin_port = htons(NodeCurr->port);
              server_addr.sin_addr = *((struct in_addr *)host->h_addr);

              /* connect to the server */
              if(connect(sockfd, (struct sockaddr *)(&server_addr), sizeof(struct sockaddr)) == -1)
              {
                fprintf(stderr, "\tConnect Error:%s\a\n", strerror(errno));
                exit(1);
              }
            }

            /**************************************************************
            Purpose: send the HTTP request to the web server
            ***************************************************************/
            void SendRequest(void) { /* send my http-request to web server */
              dsend = 0;totalsend = 0;
              nbytes=strlen(request);
              while(totalsend < nbytes) {
                dsend = write(sockfd, request + totalsend, nbytes - totalsend);
                if(dsend==-1)  {fprintf(stderr, "\tsend error!%s\n", strerror(errno));exit(0);}
                totalsend+=dsend;
                fprintf(stdout, "\n\tRequest.%d %d bytes send OK!\n", reqn, totalsend);
              }
            }

            /**************************************************************
            Purpose: receive the HTTP response from the web server
            ***************************************************************/
            void ReceiveResponse(void) { /* get response from web server */
              fd_set writefds;
              struct timeval tival;
              int retry = 0;
              FILE * localfp = NULL;

              i=0; j = 0;
            __ReCeive:
              FD_ZERO(&writefds);
              tival.tv_sec = 10;
              tival.tv_usec = 0;
              if(sockfd > 0) FD_SET(sockfd, &writefds);
              else {fprintf(stderr, "\n\tError, socket is negative!\n"); exit(0);}

              ret = select(sockfd + 1, &writefds, NULL, NULL, &tival);
              if(ret ==0 ) {
                if(retry++ < 10) goto __ReCeive;
              }
              if(ret <= 0) {fprintf(stderr, "\n\tError while receiving!\n"); exit(0);}

              if(FD_ISSET(sockfd, &writefds))    {
                memset(buffer, 0, 1024);
                memset(httpheader, 0, 1024);
                if((localfp = fopen(NodeCurr->file, "w")) == NULL) {if(DEBUG) fprintf(stderr, "create file '%s' error\n", NodeCurr->file); return;}
                /* receive data from web server */
                while((nbytes=read(sockfd,buffer,1))==1)
                {
                  if(i < 4)  { /* still reading the HTTP header */
                    if(buffer[0] == '\r' || buffer[0] == '\n')  i++;
                    else i = 0;
                    if(j < (int)sizeof(httpheader) - 1) httpheader[j++] = buffer[0]; /* never overflow; keep the header NUL-terminated */
                  }
                  else  { /* reading the HTTP body */
                    fprintf(localfp, "%c", buffer[0]); /* write the body to the local file */
                    //fprintf(stdout, "%c", buffer[0]); /* print content on the screen */
                    i++;
                  }
                }
                fclose(localfp);
              }
            }

            /**************************************************************
            Purpose: perform one HTTP request
            ***************************************************************/
            void DoOnce() { /* send and receive */
              ConnectWeb(); /* connect to the web server */

              /* send a request */
              SendRequest();

              /* receive a response message from web server */
              ReceiveResponse();

              close(sockfd); /* each request uses its own connection, so the socket can be closed after receiving */
            }

            /**************************************************************
            Purpose: perform the HTTP request, optionally printing the request and the response header
            ***************************************************************/
            void DoneWithList(int flag) {
              if(flag) fprintf(stdout, "\tRequest.%d is:\n%s", ++reqn, request);

              DoOnce();

              if(flag) fprintf(stdout, "\n\tThe following is the response header:\n%s", httpheader);
            }

            /**************************************************************
            Purpose: parse the host and port out of string src, and obtain the file and directory
            ***************************************************************/
            int GetHost(char * src, char ** web, char ** file, int * port, char ** dir)  {
              char * pA, * pB, * pC;
              int len;

              *port = 0;
              if(!(*src))  return -1;
              pA = src;
              if(!strncmp(pA, "http://", strlen("http://")))  pA = src+strlen("http://");
              /* else if(!strncmp(pA, "https://", strlen("https://")))  pA = src+strlen("https://"); */
              else return 1;
              pB = strchr(pA, '/');
              if(pB)  {
                len = strlen(pA) - strlen(pB);
                GetMemory(web, len);
                memcpy((*web), pA, len);
                if(*(pB+1))  {
                  Rstrchr(pB + 1, '/', &pC);
                  if(pC) len = strlen(pB + 1) - strlen(pC);
                  else len = 0;
                  if(len > 0) {
                    GetMemory(dir, len);
                    memcpy((*dir), pB + 1, len);

                    if(*(pC + 1)) { /* test the character after the slash, not the pointer (which is never null) */
                      len = strlen(pC + 1);
                      GetMemory(file, len);
                      memcpy((*file), pC + 1, len);
                    }
                    else {
                      len = 1;
                      GetMemory(file, len);
                      memcpy((*file), e, len);
                    }
                  }
                  else {
                    len = 1;
                    GetMemory(dir, len);
                    memcpy((*dir), e + 1, len);

                    len = strlen(pB + 1);
                    GetMemory(file, len);
                    memcpy((*file), pB + 1, len);
                  }
                }
                else {
                  len = 1;
                  GetMemory(dir, len);
                  memcpy((*dir), e + 1, len);

                  len = 1;
                  GetMemory(file, len);
                  memcpy((*file), e, len);
                }
              }
              else  {
                len = strlen(pA);
                GetMemory(web, len);
                memcpy((*web), pA, strlen(pA));
                len = 1;
                GetMemory(dir, len);
                memcpy((*dir), e + 1, len);
                len = 1;
                GetMemory(file, len);
                memcpy((*file), e, len);
              }

              pA = strchr((*web), ':');
              if(pA)  {*port = atoi(pA + 1); *pA = '\0';} /* strip ":port" so the host name can be resolved */
              else *port = 80;

              return 0;
            }

            /*********************************************************************
            *filename: mailaddrsearch.c
            *purpose: a web spider written in C that searches for email addresses on the web
            *tidied by: zhoulifa(zhoulifa@163.com) 周立發(http://zhoulifa.bokee.com)
            Linux enthusiast, Linux evangelist, SOHO developer, most skilled in C
            *date time:2006-08-31 21:00:00
            *Note: anyone may freely copy the code and use these documents, including for commercial purposes,
            * but please comply with the GPL
            *Thanks to: www.gd-linux.org, Guangdong Linux Public Service Technical Support Center
            *********************************************************************/

            int main(int argc, char ** argv)
            {
                    int WebPort;
                    char * WebHost = 0, * PageAddress = 0, * WebDir = 0;

                    if(argc < 2) {if(DEBUG) fprintf(stdout, "Command error, you should input like this:\n\t%s WebPageAddress1 WebPageAddress2 WebPageAddress3 ...", argv[0]); exit(0);}

                    NodeHeader = NodeTail = NodeCurr = 0;
                    //setlocale(LC_ALL, "zh_CN.gb2312");
                    for(i = 1; i < argc; i++)        {
                            ret = GetHost(argv[i], &WebHost, &PageAddress, &WebPort, &WebDir); /* Get web page info */
                            if(ret)        {if(DEBUG) fprintf(stdout, "GetHost error from '%s'\n", argv[i]); exit(0);}
                            AddInitNode(WebHost, PageAddress, WebPort, WebDir); /* add this page to chain */
                    }
                    free(WebHost); free(PageAddress);free(WebDir);
                    if(DEBUG)        {
                            fprintf(stdout, "\nDisplay.%5d:", FileNumber);
                            DisplayNode(NodeHeader); /* display every node */
                    }
                    HandleInitNode(NodeHeader); /* handle every page */
                    return 0;
            }

            /**************************************************************
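            Purpose: analyze a downloaded web page
            ***************************************************************/
            void AnalyzePage(WEBNODE * node)
            {
                    int fd;
                    int flength = 0;
                    fd = open(node->file, O_RDWR); /* writable, so a terminating NUL can be appended */
                    if(fd == -1)        goto __AnalyzeDone;
                    flength = lseek(fd, 0, SEEK_END);
                    if(flength < 0 || write(fd, "\0", 1) != 1)        {close(fd); goto __AnalyzeDone;}
                    flength++; /* the mapping includes the appended NUL */
                    mapped_mem = mmap(0, flength, PROT_READ, MAP_PRIVATE, fd, 0);
                    close(fd); /* the mapping stays valid after the descriptor is closed */
                    if(mapped_mem == MAP_FAILED)        goto __AnalyzeDone;
                    GetEmail(mapped_mem);
                    GetLink(mapped_mem);
                    munmap(mapped_mem, flength);
            __AnalyzeDone:
                    node->IsHandled = 1;
                    remove(node->file);
            }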

            /**************************************************************
            Purpose: add a sibling node at the root level
            ***************************************************************/
            void AddInitNode(char * Host, char * Page, int Port, char * Dir)
            {
                    WEBNODE * NewNode;
                    char filename[MAXFILENAME + 1] = "";

                    if(NodeHeader == NULL) NewNode = NodeHeader = (WEBNODE *)malloc(sizeof(WEBNODE));
                    else NodeTail->brother = NewNode = (WEBNODE *)malloc(sizeof(WEBNODE));
                    memset(NewNode, 0, sizeof(WEBNODE));
                    NewNode->host = (char *)malloc(strlen(Host) + 1);
                    memset(NewNode->host, 0, strlen(Host) + 1);
                    NewNode->page = (char *)malloc(strlen(Page) + 1);
                    memset(NewNode->page, 0, strlen(Page) + 1);
                    NewNode->dir = (char *)malloc(strlen(Dir) + 1);
                    memset(NewNode->dir, 0, strlen(Dir) + 1);
                    NewNode->file = (char *)malloc(MAXFILENAME + 1);
                    memset(NewNode->file, 0, MAXFILENAME + 1);
                    strcpy(NewNode->host, Host);
                    strcpy(NewNode->page, Page);
                    strcpy(NewNode->dir, Dir);
                    sprintf(filename, "file%05d.html", FileNumber++);
                    strcpy(NewNode->file, filename);
                    NewNode->port = Port;
                    NewNode->IsHandled = 0;
                    NewNode->brother = 0;
                    NewNode->child = 0;
                    NodeTail = NewNode;
            }

            /**************************************************************
            Purpose: process the root nodes
            ***************************************************************/
            void HandleInitNode(WEBNODE * node)
            {
                    WEBNODE * CurrentNode = 0;
                    CurrentNode = node;
                    if(CurrentNode)        {
                            while(CurrentNode)        {
                                    if(CurrentNode->IsHandled == 0)        {
                                            HandOneNode(CurrentNode);
                                            if(DEBUG)        {
                                                    fprintf(stdout, "\nDisplay.%5d:", FileNumber);
                                                    DisplayNode(NodeHeader); /* display every node */
                                            }
                                    }
                                    CurrentNode = CurrentNode->brother;
                            }
                            CurrentNode = node;
                            while(CurrentNode)        {
                                    if(CurrentNode->child && CurrentNode->child->IsHandled == 0)        {
                                            HandleInitNode(CurrentNode->child);
                                    }
                                    CurrentNode = CurrentNode->brother;
                            }
                    }
            }

            /**************************************************************
            Purpose: display information for all nodes
            ***************************************************************/
            void DisplayNode(WEBNODE * NodeHeader)
            {
                    WEBNODE * TempNode;
                    TempNode = NodeHeader;
                    fprintf(stdout, "\n");
                    while(TempNode) {
                            if(!strcmp(TempNode->dir, "/"))        fprintf(stdout, "\t%s:%d%s%s => %s %d\n", TempNode->host, TempNode->port, TempNode->dir, strcmp(TempNode->page, "@")?TempNode->page:"", TempNode->file, TempNode->IsHandled);
                            else        fprintf(stdout, "\t%s:%d/%s/%s => %s %d\n", TempNode->host, TempNode->port, TempNode->dir, strcmp(TempNode->page, "@")?TempNode->page:"", TempNode->file, TempNode->IsHandled);
                            TempNode = TempNode->brother;
                    }
                    TempNode = NodeHeader;
                    while(TempNode) {
                            if(TempNode->child)        DisplayNode(TempNode->child);
                            TempNode = TempNode->brother;
                    }
            }

            /**************************************************************
            Purpose: process a single node
            ***************************************************************/
            void HandOneNode(WEBNODE * node)
            {
                    char UserAgent[1024] = "", Accept[1024] = "", AcceptLanguage[1024] = "", AcceptEncoding[1024] = "", AcceptCharset[1024] = "", KeepAlive[1024] = "", Connection[1024] = "", ContentType[1024] = "";

                    NodeCurr = node;
                    if((host=gethostbyname(NodeCurr->host))==NULL) /* get ip address by domain */
                    {
                            if(DEBUG)  fprintf(stderr,"\tGethostname '%s' error, %s\n", NodeCurr->host, strerror(errno));
                            exit(1);
                    }
                    GetLocalAgent(UserAgent, Accept, AcceptLanguage, AcceptEncoding, AcceptCharset, KeepAlive, Connection, ContentType); /* Get client browser information */

                    if(strcmp(NodeCurr->dir, "/"))        sprintf(request, "GET /%s/%s HTTP/1.0\r\nHost: %s\r\nUser-Agent: %s\r\nAccept: %s\r\nConnection: %s\r\n\r\n", NodeCurr->dir, strcmp(NodeCurr->page, "@")?NodeCurr->page:"", NodeCurr->host, UserAgent, Accept, Connection);
                    else        sprintf(request, "GET %s%s HTTP/1.0\r\nHost: %s\r\nUser-Agent: %s\r\nAccept: %s\r\nConnection: %s\r\n\r\n", NodeCurr->dir, strcmp(NodeCurr->page, "@")?NodeCurr->page:"", NodeCurr->host, UserAgent, Accept, Connection);
                    DoneWithList(1);
                    AnalyzePage(NodeCurr);
            }

            /**************************************************************
            Purpose: extract email addresses from string src and save them to a file
            ***************************************************************/
            void GetEmail(char * src)
            {
                    char * pa, * pb, * pc, *pd;
                    char myemail[1024] = "";
                    FILE * mailfp = NULL;
                    if((mailfp = fopen("email.txt", "a+")) == NULL)        return;
                    pa = src;
                    while((pb = strchr(pa, '@')))        {
                            GetBeforePos(pb, &pc);
                            GetAfterPos(pb, &pd);
                            if(pc && pd && (strlen(pc) > (strlen(pd) + 3)))        {
                                    memset(myemail, 0, 1024);
                                    memcpy(myemail, pc, strlen(pc) - strlen(pd));
                                    if(strcmp(NodeCurr->dir, "/")) fprintf(mailfp, "%s\thttp://%s/%s/%s\n", myemail, NodeCurr->host, NodeCurr->dir, strcmp(NodeCurr->page, "@")?NodeCurr->page:"");
                                    else  fprintf(mailfp, "%s\thttp://%s%s%s\n", myemail, NodeCurr->host, NodeCurr->dir, strcmp(NodeCurr->page, "@")?NodeCurr->page:"");
                                    if(*(pd + 1))        pa = pd + 1;
                                    else break;
                            }
                            else if(*(pb + 1))        pa = pb + 1;
                            else        break;
                    }
                    fclose(mailfp);
            }

            /**************************************************************
            Purpose: scan backward from src over letters, digits, etc., i.e. find the part of an email address before the @
            ***************************************************************/
            void GetBeforePos(char * src, char ** d)
            {
                    char * x;
                    if(src - 1)        x = src - 1;
                    else {*d = 0; return ;}
                    while(x)        {
                            if(*x >= 'a' && *x <= 'z') {x--; continue;}
                            else if(*x >= 'A' && *x <= 'Z') {x--; continue;}
                            else if(*x >= '0' && *x <= '9') {x--; continue;}
                            else if(*x == '.' || *x == '-' || *x == '_') {x--; continue;}
                            else {break;}
                    }
                    x++;
                    if(x) *d = x;
                    else *d = 0;
            }

            /**************************************************************
            Purpose: scan forward from src over letters, digits, etc., i.e. find the part of an email address after the @
            ***************************************************************/
            void GetAfterPos(char * src, char ** d)
            {
                    char * x;
                    if(src + 1)        x = src + 1;
                    else {*d = 0; return ;}
                    while(x)        {
                            if(*x >= 'a' && *x <= 'z') {x++; continue;}
                            else if(*x >= 'A' && *x <= 'Z') {x++; continue;}
                            else if(*x >= '0' && *x <= '9') {x++; continue;}
                            else if(*x == '.' || *x == '-' || *x == '_') {x++; continue;}
                            else {break;}
                    }
                    if(x) *d = x;
                    else *d = 0;
            }

             

            /**************************************************************
            Purpose: scan forward from src over letters, digits and URL punctuation, i.e. find the end of an unquoted web page address
            ***************************************************************/
            void GetAfterPosWithSlash(char * src, char ** d)
            {
                    char * x;
                    if(src)        x = src;
                    else {*d = 0; return ;}
                    while(x)        {
                            if(*x >= 'a' && *x <= 'z') {x++; continue;}
                            else if(*x >= 'A' && *x <= 'Z') {x++; continue;}
                            else if(*x >= '0' && *x <= '9') {x++; continue;}
                            else if(*x == '.' || *x == '-' || *x == '_' || *x == '=') {x++; continue;}
                            else if(*x == ':' || *x == '/' || *x == '?' || *x == '&') {x++; continue;}
                            else {break;}
                    }
                    if(x) *d = x;
                    else *d = 0;
            }

            /**************************************************************
            Purpose: allocate len bytes (plus a terminating NUL) for myanchor, zero-filled
            ***************************************************************/
            void GetMemory(char ** myanchor, int len)
            {
                    if(!(*myanchor))        (*myanchor) = (char *)malloc(len + 1);
                    else        (*myanchor) = (char *)realloc((void *)(*myanchor), len + 1);
                    memset((*myanchor), 0, len + 1);
            }

            /**************************************************************
            Purpose: extract the page links from src and add them as children of the current node
            ***************************************************************/
            void GetLink(char * src)
            {
                    char * pa, * pb, * pc;
                    char * myanchor = 0;
                    int len = 0;

                    pa = src;
                    do {
                            if((pb = strstr(pa, "href='")))        {
                                    pc = strchr(pb + 6, '\'');
                                    if(!pc)        goto __returnLink; /* unterminated attribute value */
                                    len = strlen(pb + 6) - strlen(pc);
                                    GetMemory(&myanchor, len);
                                    memcpy(myanchor, pb + 6, len);
                            }
                            else if((pb = strstr(pa, "href=\"")))        {
                                    pc = strchr(pb + 6, '"');
                                    if(!pc)        goto __returnLink; /* unterminated attribute value */
                                    len = strlen(pb + 6) - strlen(pc);
                                    GetMemory(&myanchor, len);
                                    memcpy(myanchor, pb + 6, len);
                            }
                            else if((pb = strstr(pa, "href=")))        {
                                    GetAfterPosWithSlash(pb + 5, &pc);
                                    len = strlen(pb + 5) - strlen(pc);
                                    GetMemory(&myanchor, len);
                                    memcpy(myanchor, pb + 5, len);
                            }
                            else {goto __returnLink ;}
            /*
                            if(DEBUG)        {
                                    if(strcmp(NodeCurr->dir, "/"))        fprintf(stdout, "%s\thttp://%s/%s/%s\n", myanchor, NodeCurr->host, NodeCurr->dir, strcmp(NodeCurr->page, "`")?NodeCurr->page:"");
                                    else        fprintf(stdout, "%s\thttp://%s%s%s\n", myanchor, NodeCurr->host, NodeCurr->dir, strcmp(NodeCurr->page, "`")?NodeCurr->page:"");
                            }
            */
                            if(strlen(myanchor) > 0)        AddChildNode(NodeCurr, myanchor);
                            if(*pc)        pa = pc + 1; /* continue scanning after this link */
                            else        pa = 0; /* reached the end of the page */
                    }while(pa);
            __returnLink:
                    free(myanchor); /* AddChildNode copies what it needs; free(NULL) is a no-op */
                    return;
            }

            /**************************************************************
            Purpose: add a child node to the current node
            ***************************************************************/
            void AddChildNode(WEBNODE * node, char * src)
            {
                    int WebPort, len;
                    char * WebHost = 0, * PageAddress = 0, * WebDir = 0, * pC = 0;
                    WEBNODE * NewNode;
                    char filename[MAXFILENAME + 1] = "";
                    char IsFromRoot = 0;

                    if(!src)        return;
                    if(!strncasecmp(src, "mailto:", strlen("mailto:")))        return ;
                    if(strstr(src, ".css"))        return;
                    if(strstr(src, ".xml"))        return;
                    if(strstr(src, ".ico"))        return;
                    if(strstr(src, ".jpg"))        return;
                    if(strstr(src, ".gif"))        return;
                    if(strstr(src, "javascript:"))        return;
                    if(strstr(src, "+"))        return;

                    ret = GetHost(src, &WebHost, &PageAddress, &WebPort, &WebDir);
                    if(ret)        {
                            len = strlen(node->host);
                            GetMemory(&WebHost, len);
                            strcpy(WebHost, node->host);

                            WebPort = node->port;

                            IsFromRoot = !strncmp(src, "/", 1);
                            if(IsFromRoot && *(src + 1))        Rstrchr(src + 1, '/', &pC);
                            else if(!IsFromRoot)        Rstrchr(src, '/', &pC);
                            else        pC = 0;

                            if(pC)        {
                                    if(IsFromRoot)        len = strlen(src + 1) - strlen(pC);
                                    else        len = strlen(src) - strlen(pC) + strlen(node->dir) + 1;
                                    GetMemory(&WebDir, len);
                                    if(IsFromRoot)        memcpy(WebDir, src + 1, len);
                                    else        {memcpy(WebDir, node->dir, strlen(node->dir)); strcat(WebDir, "/"); memcpy(WebDir + strlen(node->dir) + 1, src, strlen(src) - strlen(pC));}

                                    /* test the character after the last '/', not the pointer (which is never NULL) */
                                    if(*(pC + 1))        {
                                            len = strlen(pC + 1);
                                            GetMemory(&PageAddress, len);
                                            strcpy(PageAddress, pC + 1);
                                    }
                                    else        {
                                            len = 1;
                                            GetMemory(&PageAddress, len);
                                            memcpy(PageAddress, e, len);
                                    }
                            }
                            else        {
                                    if(IsFromRoot)        {
                                            len = 1;
                                            GetMemory(&WebDir, len);
                                            memcpy(WebDir, e + 1, len);

                                            len = strlen(src + 1);
                                            GetMemory(&PageAddress, len);
                                            memcpy(PageAddress, src + 1, len);
                                    }
                                    else        {
                                            len = strlen(node->dir);
                                            GetMemory(&WebDir, len);
                                            memcpy(WebDir, node->dir, len);

                                            len = strlen(src);
                                            GetMemory(&PageAddress, len);
                                            memcpy(PageAddress, src, len);
                                    }
                            }
                    }
                    ret = IsExistWeb(NodeHeader, WebHost, PageAddress, WebPort, WebDir);
                    if(ret) goto __ReturnAdd;

                    if(node->child == NULL)        NewNode = node->child = (WEBNODE *)malloc(sizeof(WEBNODE));
                    else NodeTail->brother = NewNode = (WEBNODE *)malloc(sizeof(WEBNODE));
                    memset(NewNode, 0, sizeof(WEBNODE));
                    NewNode->host = (char *)malloc(strlen(WebHost) + 1);
                    memset(NewNode->host, 0, strlen(WebHost) + 1);
                    NewNode->page = (char *)malloc(strlen(PageAddress) + 1);
                    memset(NewNode->page, 0, strlen(PageAddress) + 1);
                    NewNode->dir = (char *)malloc(strlen(WebDir) + 1);
                    memset(NewNode->dir, 0, strlen(WebDir) + 1);
                    NewNode->file = (char *)malloc(MAXFILENAME + 1);
                    memset(NewNode->file, 0, MAXFILENAME + 1);
                    strcpy(NewNode->host, WebHost);
                    strcpy(NewNode->page, PageAddress);
                    strcpy(NewNode->dir, WebDir);
                    /* bounded write so a large FileNumber cannot overflow filename */
                    snprintf(filename, sizeof(filename), "file%05d.html", FileNumber++);
                    strcpy(NewNode->file, filename);
                    NewNode->port = WebPort;
                    NewNode->IsHandled = 0;
                    NewNode->brother = 0;
                    NewNode->child = 0;
                    NodeTail = NewNode;
            __ReturnAdd:
                    free(WebHost); free(PageAddress); free(WebDir);
            }

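            /**************************************************************
            Purpose: check whether a page is already in the list.
            Returns 1 if it is found among the siblings of node,
            2 if it is found among their descendants, 0 otherwise.
            ***************************************************************/
            int IsExistWeb(WEBNODE * node, char * host, char * page, int port, char * dir)
            {
                    WEBNODE * t;
                    int ret;        /* local result of the recursive call */
                    t = node;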
                    while(t)        {
                            if(!strcmp(t->host, host) && !strcmp(t->page, page) && t->port == port && !strcmp(t->dir, dir)) return 1;
                            t = t->brother;
                    }
                    t = node;
                    while(t)        {
                            if(t->child)        {
                                    ret = IsExistWeb(t->child, host, page, port, dir);
                                    if(ret)        return 2;
                            }
                            t = t->brother;
                    }
                    return 0;
            }
            [/code]
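
The relative-link branch of the function that precedes IsExistWeb is dense, so here is a minimal standalone sketch of the same idea. It is not the author's code: `split_relative_link` and its parameters are hypothetical names, and it assumes that `Rstrchr` finds the last '/' and that a root-level page gets the directory "/". It uses fixed caller-supplied buffers instead of `GetMemory`, and it leaves an empty page name empty.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of the original program: split a relative
 * link `src`, found on a page whose directory is `base_dir`, into a
 * directory part and a page part.
 * Returns 0 on success, -1 if one of the output buffers is too small. */
int split_relative_link(const char *src, const char *base_dir,
                        char *dir, size_t dir_size,
                        char *page, size_t page_size)
{
    const char *path = src;
    const char *slash;
    int from_root;
    int n;

    assert(src && base_dir && dir && page);

    from_root = (src[0] == '/');
    if (from_root)
        path = src + 1;                 /* skip the leading '/' */

    slash = strrchr(path, '/');         /* last '/' separates dir from page */

    if (slash) {
        int dir_len = (int)(slash - path);
        if (from_root)                  /* "/a/b/c.html" -> dir "a/b" */
            n = snprintf(dir, dir_size, "%.*s", dir_len, path);
        else                            /* "a/c.html" -> dir "<base_dir>/a" */
            n = snprintf(dir, dir_size, "%s/%.*s", base_dir, dir_len, path);
        if (n < 0 || (size_t)n >= dir_size)
            return -1;
        n = snprintf(page, page_size, "%s", slash + 1);
    } else {
        /* No directory in the link: use "/" (assumed) or the parent's dir. */
        n = snprintf(dir, dir_size, "%s", from_root ? "/" : base_dir);
        if (n < 0 || (size_t)n >= dir_size)
            return -1;
        n = snprintf(page, page_size, "%s", path);
    }
    if (n < 0 || (size_t)n >= page_size)
        return -1;
    return 0;
}
```

Called with the link "img/pic.html" and the parent directory "docs", this sketch yields the directory "docs/img" and the page "pic.html". For "/a/b/c.html" it yields "a/b" and "c.html", whatever the parent directory is.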

            Compile the program:

            QUOTE:

            gcc mailaddrsearch.c -o mailsearcher


            Then run it with a URL as the argument:

            QUOTE:

            ./mailsearcher http://zhoulifa.bokee.com/5531748.html


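            The program first finds the e-mail addresses on the page http://zhoulifa.bokee.com/5531748.html and saves them to the file email.txt in the current directory, one record per line, each consisting of the e-mail address and the page on which it appeared. It then parses the links on that page, adds each link to the list as a child node, and processes the child nodes in the same way.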
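
To make the record format concrete, here is a minimal sketch of the extraction step. It is an illustration rather than the routine used in the program: `save_emails` and `is_addr_char` are hypothetical names, the character set accepted around the '@' is a simplifying assumption and not the full e-mail grammar, and a single space between the address and the page is also assumed.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Characters accepted in this simplified address pattern (an assumption). */
static int is_addr_char(int c)
{
    return isalnum(c) || c == '.' || c == '_' || c == '-';
}

/* Hypothetical helper: scan `text` for address-like strings and write one
 * "address page" record per line to `out`. Returns the number of records. */
int save_emails(const char *text, const char *page_url, FILE *out)
{
    int count = 0;
    const char *at = text;

    assert(text && page_url && out);

    while ((at = strchr(at, '@')) != NULL) {
        const char *start = at;
        const char *end = at + 1;

        /* Grow the candidate to the left and to the right of the '@'. */
        while (start > text && is_addr_char((unsigned char)start[-1]))
            start--;
        while (is_addr_char((unsigned char)*end))
            end++;
        /* Drop trailing dots such as the full stop that ends a sentence. */
        while (end > at + 1 && end[-1] == '.')
            end--;

        /* Require a local part and a domain that contains a dot. */
        if (start < at && end > at + 1 &&
            memchr(at + 1, '.', (size_t)(end - at - 1)) != NULL) {
            fprintf(out, "%.*s %s\n", (int)(end - start), start, page_url);
            count++;
        }
        at = end;   /* continue after this candidate */
    }
    return count;
}
```

A caller would open email.txt in append mode with `fopen` and pass the downloaded page text together with its URL.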

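            This is only a sample program and is far from complete. To make it practical it needs to be more efficient, for example by using epoll for I/O multiplexing (on 2.4 kernels, which lack epoll, select would have to be used instead). Another option is multithreading over the child nodes, with one thread handling each node.
            If you are not familiar with I/O multiplexing, see my article at http://zhoulifa.bokee.com/5345930.html, which contains source code for various kinds of TCP network servers on Linux.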
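
As a rough idea of what the epoll suggestion involves, here is a minimal Linux-only sketch that waits on several descriptors at once. It is not taken from the program: `wait_readable` is a hypothetical helper, and a real crawler would more likely keep one epoll instance open for its whole run and register each socket as it connects.

```c
#include <assert.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 16

/* Hypothetical helper: register `nfds` descriptors with a fresh epoll
 * instance, wait up to `timeout_ms` milliseconds, and store the readable
 * ones in `ready` (which needs room for min(nfds, MAX_EVENTS) entries).
 * Returns the number of readable descriptors, or -1 on error. */
int wait_readable(const int *fds, int nfds, int *ready, int timeout_ms)
{
    struct epoll_event ev, events[MAX_EVENTS];
    int epfd, i, n;

    assert(fds && ready && nfds > 0);

    epfd = epoll_create(nfds);      /* the size is only a hint, must be > 0 */
    if (epfd < 0)
        return -1;

    for (i = 0; i < nfds; i++) {
        memset(&ev, 0, sizeof ev);
        ev.events = EPOLLIN;        /* interested in "data to read" */
        ev.data.fd = fds[i];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[i], &ev) < 0) {
            close(epfd);
            return -1;
        }
    }

    n = epoll_wait(epfd, events, MAX_EVENTS, timeout_ms);
    for (i = 0; i < n; i++)
        ready[i] = events[i].data.fd;

    close(epfd);
    return n;
}
```

With one socket per page being fetched, a single thread could call this in a loop and read only from the sockets that are reported ready, instead of blocking on each download in turn.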


            posted on 2008-12-28 04:00 肥仔  Category: Network Programming
