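Original article: http://my.chinaunix.net/space.php?uid=24488136&do=blog&id=64821
While browsing a bookstore I came across a shelf of books on search engines. I flipped through a few and found them quite interesting, so when I got home I searched Baidu and Google for how to build a search engine myself. The open-source nutch-1.0 looked good, so I learned how to configure it. I ran into a few problems along the way, but they were resolved quickly.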

Runtime environment:
Linux **-desktop 2.6.32-25-generic #44-Ubuntu SMP Fri Sep 17 20:26:08 UTC 2010 i686 GNU/Linux ubuntu 10.04
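1. Install the JDK
Ubuntu 10.04 ships its own JDK (called OpenJDK), so I simply used that one. You can install it from the Synaptic Package Manager. After installation you will find the following three entries under /usr/lib/jvm (two of them are symlinks to java-6-openjdk). Of course, you can also download the latest official JDK directly.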
├── default-java -> java-6-openjdk
├── java-1.6.0-openjdk -> java-6-openjdk
└── java-6-openjdk
2. Install and configure Tomcat. On Ubuntu 10.04 the packaged Tomcat version is tomcat6; I also installed the management package tomcat6-admin.
apt-get install tomcat6 tomcat6-admin
After installing Tomcat, run /etc/init.d/tomcat6 start to start the Tomcat server. Enter "http://localhost:8080" in a browser; if the page shows "It works", the Tomcat server is running.
It works !
If you're seeing this page via a web browser, it means you've setup Tomcat successfully. Congratulations!
This is the default Tomcat home page. It can be found on the local
filesystem at: /var/lib/tomcat6/webapps/ROOT/index.html
Tomcat6 veterans might be pleased to learn that this system instance of
Tomcat is installed with CATALINA_HOME in /usr/share/tomcat6 and
CATALINA_BASE in /var/lib/tomcat6, following the rules from
/usr/share/doc/tomcat6-common/RUNNING.txt.gz.
You might consider installing the following packages, if you haven't
already done so:
tomcat6-docs: This package installs a web application that allows to browse the Tomcat 6 documentation locally. Once installed, you can access it by clicking here.
tomcat6-examples: This package installs a web application that allows to access the Tomcat 6 Servlet and JSP examples. Once installed, you can access it by clicking here.
tomcat6-admin: This package installs two web applications that can help managing this Tomcat instance. Once installed, you can access the manager webapp and the host-manager webapp.
NOTE: For security reasons, using the manager webapp is restricted to users with role "manager". The host-manager webapp is restricted to users with role "admin". Users are defined in /etc/tomcat6/tomcat-users.xml.
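A user must be configured before you can enter the management interface. Edit /var/lib/tomcat6/conf/tomcat-users.xml
For security reasons I deleted the default user tomcat and added my own user, for example hinutch, with a password, for example 3838438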
<?xml version='1.0' encoding='utf-8'?>
<tomcat-users>
<role rolename="manager"/>
<role rolename="admin"/>
<user username="hinutch" password="3838438" roles="admin,manager"/>
</tomcat-users>
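A quick way to confirm that the edited file grants the role the manager page asks for is a small text check like the one below. This is only a sketch: the function name user_has_role and the temporary copy of the file are assumptions for illustration, and the match is plain grep, not a real XML parser.

```shell
#!/bin/sh
# user_has_role FILE USER ROLE: succeeds if FILE has a <user> element for
# USER whose roles attribute lists ROLE. Plain text matching, assuming one
# <user> element per line and names without regex metacharacters.
user_has_role() {
    grep -E "<user [^>]*username=\"$2\"" "$1" |
        grep -Eq "roles=\"([^\"]*,)?$3(,[^\"]*)?\""
}

# Demo on a temporary copy of the file shown in this article (an assumption;
# point it at your real tomcat-users.xml instead).
users_xml=$(mktemp)
cat > "$users_xml" <<'EOF'
<?xml version='1.0' encoding='utf-8'?>
<tomcat-users>
<role rolename="manager"/>
<role rolename="admin"/>
<user username="hinutch" password="3838438" roles="admin,manager"/>
</tomcat-users>
EOF
user_has_role "$users_xml" hinutch manager && echo "hinutch has the manager role"
```

If the check fails for your own user, the manager webapp will most likely keep rejecting the login even after a restart.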
You should now be able to enter the management interface. If not, restart the Tomcat service with /etc/init.d/tomcat6 restart
The management interface looks like this:
Tomcat Web Application Manager
3. Install Nutch 1.0
Download nutch-1.0.tar.gz from http://www.apache.org/dyn/closer.cgi/nutch/
apache-nutch-1.2-bin.zip 25-Sep-2010 05:38 164M
apache-nutch-1.2-bin.zip.asc 25-Sep-2010 05:37 203
apache-nutch-1.2-src.tar.gz 25-Sep-2010 05:37 50M GZIP compressed document
apache-nutch-1.2-src.tar.gz.asc 25-Sep-2010 05:37 203 GZIP compressed document
apache-nutch-1.2-src.zip 25-Sep-2010 05:37 51M
apache-nutch-1.2-src.zip.asc 25-Sep-2010 05:37 203
nutch-0.9.tar.gz 05-Apr-2007 10:17 68M GZIP compressed document
nutch-0.9.tar.gz.asc 05-Apr-2007 10:17 186 GZIP compressed document
nutch-1.0.tar.gz 28-Mar-2009 04:12 83M GZIP compressed document
nutch-1.0.tar.gz.asc 28-Mar-2009 04:12 197 GZIP compressed document
Extract the archive. On my machine the contents are:
├── bin
├── build.xml
├── CHANGES.txt
├── conf
├── crawled
├── default.properties
├── docs
├── KEYS
├── lib
├── LICENSE.txt
├── logs
├── NOTICE.txt
├── nutch-1.0.jar
├── nutch-1.0.job
├── nutch-1.0.war
├── plugins
├── README.txt
├── src
├── url.txt (created by me)
└── webapps
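First, create a text file named "url.txt" in the root of the extracted Nutch directory (you can choose any name). It holds the URLs of the sites you want to crawl.
My extraction root is /home/**/nutch-1.0; I created url.txt there with this content: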
http://bbs.chinaunix.net/
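The seed-file step above can be sketched as a small shell function. This is an illustration only: the helper name write_seed_file and the temporary directory used in the demo are assumptions, so substitute your real Nutch root.

```shell
#!/bin/sh
# write_seed_file DIR URL...: hypothetical helper that writes one seed URL
# per line into DIR/url.txt (the file name used in this article).
write_seed_file() (
    dir=$1
    shift
    mkdir -p "$dir" || exit 1
    # printf repeats the format once for every remaining argument
    printf '%s\n' "$@" > "$dir/url.txt"
)

# Demo in a throwaway directory (an assumption; use your Nutch root instead).
demo_dir=$(mktemp -d)
write_seed_file "$demo_dir" "http://bbs.chinaunix.net/"
cat "$demo_dir/url.txt"
```

Keeping one URL per line makes it easy to add more seed sites later by passing more arguments.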
Next, update the configuration file crawl-urlfilter.txt. Open "conf/crawl-urlfilter.txt":
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept hosts in MY.DOMAIN.NAME
+^http://bbs.chinaunix.net/ (this is the line to change; it matches the content of url.txt)
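To get a feel for how an ordered list of + and - rules treats a URL, the sketch below emulates it in shell: the first rule whose pattern matches decides, and a URL matching no rule is rejected. It is a rough approximation under stated assumptions: grep -E stands in for the Java regular expressions Nutch uses, the function name url_accepted is made up for this example, and the demo rules file leaves out the backreference rule shown above because POSIX extended regex does not guarantee backreferences.

```shell
#!/bin/sh
# url_accepted RULES_FILE URL: succeeds if the first matching rule starts
# with "+", fails if it starts with "-" or if no rule matches at all.
url_accepted() {
    ua_rules=$1
    ua_url=$2
    while IFS= read -r ua_rule; do
        case $ua_rule in
            ''|'#'*) continue ;;             # skip blank lines and comments
        esac
        ua_sign=${ua_rule%"${ua_rule#?}"}    # first character: + or -
        ua_pattern=${ua_rule#?}              # rest of the line is the regex
        if printf '%s\n' "$ua_url" | grep -Eq -- "$ua_pattern"; then
            [ "$ua_sign" = "+" ]
            return
        fi
    done < "$ua_rules"
    return 1
}

# Demo rules file (temporary, for illustration only).
rules_file=$(mktemp)
cat > "$rules_file" <<'EOF'
# accept hosts in MY.DOMAIN.NAME
+^http://bbs.chinaunix.net/
# skip everything else
-.
EOF
url_accepted "$rules_file" "http://bbs.chinaunix.net/forum-1.html" && echo accepted
url_accepted "$rules_file" "http://www.example.com/" || echo rejected
```

Note that the dots in the pattern are unescaped, so they match any character; writing bbs\.chinaunix\.net is stricter if that matters for your crawl.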
Then open nutch-site.xml and modify it as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>http.agent.name</name>
<value>my nutch agent</value> (this value, highlighted in red in the original post, can be any name you choose)
</property>
<property>
<name>http.agent.version</name>
<value>1.0</value>
</property>
</configuration>
Then run the web spider to crawl the pages. In /home/**/nutch-1.0 (the Nutch root directory), enter the following command:
./bin/nutch crawl url.txt -dir crawled -depth 4 -topN 100 -threads 4
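-dir = crawled    directory where the downloaded data is stored; created automatically if it does not exist
-depth = 4        crawl depth of 4 (link levels followed from the seed URLs)
-topN = 100       fetch at most the top 100 qualifying pages at each level
-threads = 4      number of fetcher threads to start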
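When experimenting with different depths or page limits, it can help to keep the options in variables and print the command before running it. The sketch below does only that; the variable names and the crawl_command function are assumptions for illustration, while the flags themselves are the ones used above.

```shell
#!/bin/sh
# Values mirror the command in this article; change them to suit your crawl.
SEEDS=url.txt      # seed file with one URL per line
OUT_DIR=crawled    # created by the crawl if it does not exist
DEPTH=4            # number of link levels to follow from the seeds
TOP_N=100          # maximum pages fetched at each level
THREADS=4          # fetcher threads

# crawl_command: prints the command line instead of running it, so it can be
# inspected first and then run by hand from the Nutch root directory.
crawl_command() {
    printf './bin/nutch crawl %s -dir %s -depth %s -topN %s -threads %s\n' \
        "$SEEDS" "$OUT_DIR" "$DEPTH" "$TOP_N" "$THREADS"
}

crawl_command
```

Printing first is a deliberate choice here: a crawl can run for a long time, so a typo in a path is cheaper to catch before it starts.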
The spider prints a lot of output while it runs. When the crawl finishes, you will find that the crawled directory has been created, containing several subdirectories.
├── crawldb
├── index
├── indexes
├── linkdb
└── segments
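A crawl that was interrupted may leave some of these directories out. The following sketch checks for the five names listed above; the function name check_crawl_dir is an assumption, and the check only looks at directory names, not at whether the index inside is usable.

```shell
#!/bin/sh
# check_crawl_dir DIR: reports which of the subdirectories listed in this
# article are missing from DIR; succeeds only if all five exist.
check_crawl_dir() {
    ccd_missing=0
    for ccd_sub in crawldb index indexes linkdb segments; do
        if [ ! -d "$1/$ccd_sub" ]; then
            echo "missing: $1/$ccd_sub"
            ccd_missing=1
        fi
    done
    return "$ccd_missing"
}

# Usage from the Nutch root directory:
#   check_crawl_dir crawled && echo "crawl output looks complete"
```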
4. Deploy the Nutch project in Tomcat
Copy the nutch-1.0.war file from the Nutch root directory into /var/lib/tomcat6/webapps, then visit http://localhost:8080; Tomcat will unpack it.
root@**-desktop:/var/lib/tomcat6/webapps# ls
nutch-1.0 nutch-1.0.war ROOT
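The copy step can be wrapped in a small function that checks both paths first, which avoids silently copying into the wrong place. This is a minimal sketch; deploy_war is a hypothetical helper name, not something shipped with Nutch or Tomcat.

```shell
#!/bin/sh
# deploy_war WAR WEBAPPS_DIR: hypothetical helper that copies a .war file
# into a Tomcat webapps directory after checking that both paths exist.
deploy_war() {
    if [ ! -f "$1" ]; then
        echo "no such war file: $1" >&2
        return 1
    fi
    if [ ! -d "$2" ]; then
        echo "no such webapps directory: $2" >&2
        return 1
    fi
    cp "$1" "$2/"
}

# Usage with the paths from this article (needs write access to webapps):
#   deploy_war nutch-1.0.war /var/lib/tomcat6/webapps
```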
The nutch-1.0 folder contains:
├── anchors.jsp
├── ca
├── cached.jsp
├── cluster.jsp
├── de
├── en
├── es
├── explain.jsp
├── fi
├── fr
├── hu
├── img
├── include
├── index.jsp
├── it
├── jp
├── META-INF
├── more.jsp
├── ms
├── nl
├── pl
├── pt
├── refine-query-init.jsp
├── refine-query.jsp
├── search.jsp
├── sh
├── sr
├── sv
├── text.jsp
├── th
├── WEB-INF (the contents of this folder need to be modified)
└── zh
Edit WEB-INF/classes/nutch-site.xml under this directory as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>searcher.dir</name>
<value>/home/**/nutch-1.0/crawled</value>
</property>
</configuration>
The value above must be changed to the spider's download directory.
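Since a wrong searcher.dir is an easy mistake to make, a quick check that the configured path really contains an index can save some guessing. The sketch below pulls the value out with sed and tests for an index subdirectory. It assumes each tag sits on its own line, as in the listing above; the function name searcher_dir and the /tmp/crawled path in the demo file are made up for illustration.

```shell
#!/bin/sh
# searcher_dir FILE: prints the <value> on the line that follows
# <name>searcher.dir</name>. Text matching only, not a real XML parser.
searcher_dir() {
    sed -n '/<name>searcher\.dir<\/name>/{
        n
        s/.*<value>\(.*\)<\/value>.*/\1/p
    }' "$1"
}

# Demo on a temporary file (an assumption; use your deployed
# WEB-INF/classes/nutch-site.xml instead).
site_xml=$(mktemp)
cat > "$site_xml" <<'EOF'
<configuration>
<property>
<name>searcher.dir</name>
<value>/tmp/crawled</value>
</property>
</configuration>
EOF
dir=$(searcher_dir "$site_xml")
if [ -d "$dir/index" ]; then
    echo "index found under $dir"
else
    echo "no index directory under $dir"
fi
```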
5. Search with Nutch
Enter http://localhost:8080/nutch-1.0 in a browser and the Nutch search page appears.
Then type what you want to find into the search box, for example linux, and you will see:
Results 1-1 (of 1 result in total):
Forum Home - China's largest Linux/Unix technical community - IT people's online community - bbs.ChinaUnix.net
... Unix operating systems ← Linux forum RSS feed ... by CU admin Linux Times home Linux ...
http://bbs.chinaunix.net/ (cached) (explain) (anchors)
That completes the whole process.
------------------------------------------------
Problems encountered along the way
------------------------------------------------
1. An error saying JAVA_HOME cannot be found
Solution: edit /etc/environment and add JAVA_HOME:
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"
JAVA_HOME="/usr/lib/jvm/java-6-openjdk"
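Changes to /etc/environment are normally picked up only at the next login, so it is worth checking what the current shell actually sees. The sketch below tests whether a given directory looks like a usable Java home; the function name java_home_ok is an assumption for this example.

```shell
#!/bin/sh
# java_home_ok DIR: succeeds if DIR is non-empty and contains an
# executable bin/java.
java_home_ok() {
    [ -n "$1" ] && [ -x "$1/bin/java" ]
}

if java_home_ok "${JAVA_HOME:-}"; then
    echo "JAVA_HOME is usable: $JAVA_HOME"
else
    echo "JAVA_HOME is unset or has no bin/java"
fi
```

If the second message appears right after editing the file, logging out and back in is usually enough.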
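2. Pages were crawled, but searches return nothing
Solution: besides the changes above, one more file needs attention: /home/**/nutch-1.0/conf/nutch-default.xml. Find the section below and modify it accordingly: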
<!-- searcher properties -->
<property>
<name>searcher.dir</name>
<value>/home/**/nutch-1.0/crawled</value> (this must be the path where the crawl data is stored)
<description>
Sometimes results still do not appear, in which case you also need to run:
/etc/init.d/tomcat6 restart
Haha, that's all!
posted on 2011-05-04 13:34 by 漂漂, category: linux