posts - 297, comments - 15, trackbacks - 0

這里是維基百科對網絡爬蟲的詞條頁面。網絡爬蟲以叫網絡蜘蛛，網絡機器人，這是一個程序，其會自動的通過網絡抓取互聯網上的網頁，這種技術一般可能用來檢查你的站點上所有的鏈接是否是都是有效的。當然，更為高級的技術是把網頁中的相關數據保存下來，可以成為搜索引擎。

從技相來說，實現抓取網頁可能并不是一件很困難的事情，困難的事情是對網頁的分析和整理，那是一件需要有輕量智能，需要大量數學計算的程序才能做的事情。下面一個簡單的流程：

在這里，我們只是說一下如何寫一個網頁抓取程序。

首先我們先看一下，如何使用命令行的方式來找開網頁。

telnet somesite.com 80
GET /index.html HTTP/1.0
按回車兩次

使用telnet就是告訴你其實這是一個socket的技術，并且使用HTTP的協議，如 GET方法來獲得網頁，當然，接下來的事你就需要解析HTML文法，甚至還需要解析Javascript，因為現在的網頁使用Ajax的越來越多了，而很多網頁內容都是通過Ajax技術加載的，因為，只是簡單地解析HTML文件在未來會遠遠不夠。當然，在這里，只是展示一個非常簡單的抓取，簡單到只能做為一個例子，下面這個示例的偽代碼：

取網頁
for each 鏈接 in 當前網頁所有的鏈接
{
if(如果本鏈接是我們想要的 || 這個鏈接從未訪問過)
{
處理對本鏈接
把本鏈接設置為已訪問
}
}

require “rubygems”
require “mechanize”
class Crawler < WWW::Mechanize
attr_accessor :callback
INDEX = 0
DOWNLOAD = 1
PASS = 2
def initialize
super
init
@first = true
self.user_agent_alias = “Windows IE 6″
end
def init
@visited = []
end
def remember(link)
@visited << link
end
def perform_index(link)
self.get(link)
if(self.page.class.to_s == “WWW::Mechanize::Page”)
links = self.page.links.map {|link| link.href } - @visited
links.each do |alink|
start(alink)
end
end
end
def start(link)
return if link.nil?
if(!@visited.include?(link))
action = @callback.call(link)
if(@first)
@first = false
perform_index(link)
end
case action
when INDEX
perform_index(link)
when DOWNLOAD
self.get(link).save_as(File.basename(link))
when PASS
puts “passing on #{link}”
end
end
end
def get(site)
begin
puts “getting #{site}”
@visited << site
super(site)
rescue
puts “error getting #{site}”
end
end
end

上面的代碼就不必多說了，大家可以去試試。下面是如何使用上面的代碼：

require “crawler”
x = Crawler.new
callback = lambda do |link|
if(link =~/\\.(zip|rar|gz|pdf|doc)
x.remember(link)
return Crawler::PASS
elsif(link =~/\\.(jpg|jpeg)/)
return Crawler::DOWNLOAD
end
return Crawler::INDEX;
end
x.callback = callback
x.start(”http://somesite.com”)

下面是一些和網絡爬蟲相關的開源網絡項目

arachnode.net is a .NET crawler written in C# using SQL 2005 and Lucene and is released under the GNU General Public License.
DataparkSearch is a crawler and search engine released under the GNU General Public License.
GNU Wget is a command-line-operated crawler written in C and released under the GPL. It is typically used to mirror Web and FTP sites.
GRUB is an open source distributed search crawler that Wikia Search ( http://wikiasearch.com ) uses to crawl the web.
Heritrix is the Internet Archive’s archival-quality crawler, designed for archiving periodic snapshots of a large portion of the Web. It was written in Java.
ht://Dig includes a Web crawler in its indexing engine.
HTTrack uses a Web crawler to create a mirror of a web site for off-line viewing. It is written in C and released under the GPL.
ICDL Crawler is a cross-platform web crawler written in C++ and intended to crawl Web sites based on

from:
http://coolshell.cn/?p=27

posted on 2010-02-18 21:54 chatler 閱讀(724) 評論(0) 編輯收藏引用所屬分類: SearchEngine

只有注冊用戶登錄后才能發表評論。




網站導航: 博客園 IT新聞 BlogJava 博問 Chat2DB 管理

<

2025年11月

>

日

一

二

三

四

五

六

26

27

28

29

30

31

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

1

2

3

4

5

6

常用鏈接

留言簿(10)

隨筆分類(307)

隨筆檔案(297)

algorithm

andytan
algorithm, linux, os, network,etc
EXACT STRING MATCHING ALGORITHMS
httperf -- a web perf test tool
Java多線程
編程夜未眠
布薩空間
結構之法
沈一峰 google技術博客
小兵的窩

Books_Free_Online

Book Fire Center

C++

Bjarne Stroustrup's C++ Style and Technique FAQ
boyplayee column
C Plus Plus
CPP Reference
LearnC++Website
Welcome to Bjarne Stroustrup's homepage!

database

mydear Database
mysql高性能筆記

Linux

獨孤閣

Linux shell

linux
飛翔

linux socket

linux socket programming
sock programming

misce

cloudward
感覺這個博客還是不錯，雖然做的東西和我不大相關，覺得看看還是有好處的

network

nginx

OSS

Google Android
Android is a software stack for mobile devices that includes an operating system, middleware and key applications. This early look at the Android SDK provides the tools and APIs necessary to begin developing applications on the Android platform using the Java programming language.
os161 file list

overall

linux related
linux_overall
loop_in_nodes
tiaot
Ubuntu Zone
陳皓專欄
享受編程的樂趣

青青草原综合久久大伊人导航_色综合久久天天综合_日日噜噜夜夜狠狠久久丁香五月_热久久这里只有精品