Technically speaking, fetching a web page is probably not a very difficult task. The difficult part is analyzing and organizing the pages you fetch: that is something only a program with a degree of intelligence and a great deal of mathematical computation can do. Below is a simple walkthrough.
Here, we will only talk about how to write a web-page crawler.
First, let's look at how to open a web page from the command line:
```
telnet somesite.com 80
GET /index.html HTTP/1.0
```

Then press Enter twice (the blank line tells the server the request is complete).
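What telnet does here can be reproduced with Ruby's standard socket library. A minimal sketch (`somesite.com` above is a placeholder host, so the code is written as a reusable function instead):

```ruby
require "socket"

# Open a TCP connection and speak HTTP by hand -- exactly what telnet does.
# The request line is followed by a blank line ("press Enter twice").
# HTTP/1.0 closes the connection after responding, so reading to EOF works.
def http_get(host, port, path)
  sock = TCPSocket.new(host, port)
  sock.write("GET #{path} HTTP/1.0\r\nHost: #{host}\r\n\r\n")
  response = sock.read
  sock.close
  response
end

# Example (requires network access):
# puts http_get("example.com", 80, "/index.html")
```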
Using telnet like this shows you that, underneath, this is just socket technology plus the HTTP protocol, e.g. the GET method, used to retrieve a page. Of course, the next step is parsing the HTML, and perhaps even the JavaScript, because more and more pages use Ajax these days and a lot of page content is loaded via Ajax; simply parsing the HTML file will be far from enough in the future. Here, though, we only show a very simple fetch, simple enough to serve only as an example. Pseudocode for the example:
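As a taste of the HTML-parsing step, here is a deliberately naive, stdlib-only sketch that pulls `href` values out of anchor tags with a regular expression. A real crawler should use a proper HTML parser (e.g. the Nokogiri gem), since regexes break on malformed or unusual markup; this is only an illustration on invented sample HTML:

```ruby
# Toy link extraction: find href="..." inside <a ...> tags.
html  = '<p><a href="/news.html">news</a> <a href="/pic.jpg">pic</a></p>'
links = html.scan(/<a\s[^>]*href="([^"]+)"/i).flatten
# links => ["/news.html", "/pic.jpg"]
```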
```
fetch the page
for each link in all links on the current page {
    if (the link is one we want || the link has never been visited) {
        process the link
        mark the link as visited
    }
}
```
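The pseudocode above can be sketched in a few lines of Ruby. To keep it self-contained, an in-memory hash stands in for the real web (the `pages` hash below is invented for the demo); each key is a URL and each value is the list of links found on that page:

```ruby
# A fake "web": URL => links on that page (demo data, not real pages).
pages = {
  "/"  => ["/a", "/b"],
  "/a" => ["/b", "/c"],
  "/b" => [],
  "/c" => ["/a"],
}

visited = []
crawl = lambda do |url|
  return if visited.include?(url)   # skip already-visited links
  visited << url                    # mark as visited
  pages.fetch(url, []).each { |l| crawl.call(l) }  # process each link
end

crawl.call("/")
# visited => ["/", "/a", "/b", "/c"] -- each page handled exactly once
```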
```ruby
require "rubygems"
require "mechanize"

class Crawler < WWW::Mechanize
  attr_accessor :callback
  INDEX    = 0
  DOWNLOAD = 1
  PASS     = 2

  def initialize
    super
    init
    @first = true
    self.user_agent_alias = "Windows IE 6"
  end

  def init
    @visited = []
  end

  def remember(link)
    @visited << link
  end

  def perform_index(link)
    self.get(link)
    if self.page.class.to_s == "WWW::Mechanize::Page"
      # Collect the page's links, minus those we have already visited.
      links = self.page.links.map { |l| l.href } - @visited
      links.each do |alink|
        start(alink)
      end
    end
  end

  def start(link)
    return if link.nil?
    unless @visited.include?(link)
      action = @callback.call(link)
      if @first
        @first = false
        perform_index(link)
      end
      case action
      when INDEX
        perform_index(link)
      when DOWNLOAD
        self.get(link).save_as(File.basename(link))
      when PASS
        puts "passing on #{link}"
      end
    end
  end

  def get(site)
    begin
      puts "getting #{site}"
      @visited << site
      super(site)
    rescue
      puts "error getting #{site}"
    end
  end
end
```
The code above needs no further explanation; feel free to try it out. Here is how to use it:
```ruby
require "crawler"

x = Crawler.new
callback = lambda do |link|
  if link =~ /\.(zip|rar|gz|pdf|doc)/
    x.remember(link)
    return Crawler::PASS
  elsif link =~ /\.(jpg|jpeg)/
    return Crawler::DOWNLOAD
  end
  return Crawler::INDEX
end
x.callback = callback
x.start("http://somesite.com")
```
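The callback's classification logic can be exercised on its own, without the crawler or any network access. The sketch below restates it with the same constants (the URLs are made-up placeholders):

```ruby
# Same action constants as the Crawler class.
INDEX, DOWNLOAD, PASS = 0, 1, 2

classify = lambda do |link|
  case link
  when /\.(zip|rar|gz|pdf|doc)/ then PASS      # archives/documents: skip
  when /\.(jpg|jpeg)/           then DOWNLOAD  # images: save to disk
  else                               INDEX     # everything else: recurse
  end
end

classify.call("http://somesite.com/a.zip")  # => PASS
classify.call("http://somesite.com/p.jpg")  # => DOWNLOAD
classify.call("http://somesite.com/")       # => INDEX
```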
涓嬮潰鏄竴浜涘拰緗戠粶鐖櫕鐩稿叧鐨勫紑婧愮綉緇滈」鐩?/p>
From: http://coolshell.cn/?p=27