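While browsing in a bookstore I came across a shelf of books on search, all of them about search engines. I flipped through a few and found them quite interesting, so once I got home I searched Baidu and Google for how to build a search engine myself. The open-source nutch-1.0 looked good, so I learned how to set it up. I ran into a few problems along the way, but they were solved quickly.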

Environment: Ubuntu 10.04, the OpenJDK that comes with it, tomcat6 and nutch-1.0.
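1. Install the JDK
Ubuntu 10.04 provides its own JDK (called OpenJDK), so I simply used that one. It can be installed from the Synaptic package manager. Once it is installed you will find three folders under /usr/lib/jvm. You can of course download the latest official JDK instead.
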
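2. Install and configure Tomcat
On Ubuntu 10.04 the packaged Tomcat version is tomcat6. I also installed the management package tomcat6-admin.
After installing Tomcat, run /etc/init.d/tomcat6 start to start the server. Open "http://localhost:8080" in a browser; if the page says "It works", the Tomcat server is running: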
It works!

If you're seeing this page via a web browser, it means you've setup Tomcat successfully. Congratulations!

This is the default Tomcat home page. It can be found on the local filesystem at: /var/lib/tomcat6/webapps/ROOT/index.html

Tomcat6 veterans might be pleased to learn that this system instance of Tomcat is installed with CATALINA_HOME in /usr/share/tomcat6 and CATALINA_BASE in /var/lib/tomcat6, following the rules from /usr/share/doc/tomcat6-common/RUNNING.txt.gz.

You might consider installing the following packages, if you haven't already done so:

tomcat6-docs: This package installs a web application that allows to browse the Tomcat 6 documentation locally. Once installed, you can access it by clicking here.
tomcat6-examples: This package installs a web application that allows to access the Tomcat 6 Servlet and JSP examples. Once installed, you can access it by clicking here.
tomcat6-admin: This package installs two web applications that can help managing this Tomcat instance. Once installed, you can access the manager webapp and the host-manager webapp.

NOTE: For security reasons, using the manager webapp is restricted to users with role "manager". The host-manager webapp is restricted to users with role "admin". Users are defined in /etc/tomcat6/tomcat-users.xml.

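A user has to be configured before the management interface can be opened. Edit /var/lib/tomcat6/conf/tomcat-users.xml and add a user with the required roles.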
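A minimal sketch of such a user entry, assuming a single account that holds both the "manager" and "admin" roles named on the Tomcat welcome page. The user name `tomcat`, the password `CHANGE_ME` and the sample file name are placeholders, not values from my setup. The script only writes a sample file into the current directory; merging it into the real tomcat-users.xml is left as a manual step.

```shell
# Write a sample tomcat-users.xml into the current directory.
# SAMPLE, the user name and the password are placeholders (assumptions).
SAMPLE="${SAMPLE:-./tomcat-users.xml.sample}"

cat > "$SAMPLE" <<'EOF'
<?xml version='1.0' encoding='utf-8'?>
<tomcat-users>
  <!-- "manager" unlocks the manager webapp, "admin" the host-manager webapp -->
  <role rolename="manager"/>
  <role rolename="admin"/>
  <user username="tomcat" password="CHANGE_ME" roles="manager,admin"/>
</tomcat-users>
EOF

echo "wrote $SAMPLE"
```

Pick a real password before copying the entries over.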
Now you can get into the management interface. If it does not work, restart the Tomcat service with /etc/init.d/tomcat6 restart.
The management interface looks like this:

Tomcat Web Application Manager
Message: OK

Applications
Path | Display Name | Running | Sessions | Commands
/ | | true | 0 | Start Stop Reload Undeploy
/host-manager | Tomcat Manager Application | true | 0 | Start Stop Reload Undeploy

3. Install nutch 1.0
Download nutch-1.0.tar.gz from http://www.apache.org/dyn/closer.cgi/nutch/. The mirror lists these files:
apache-nutch-1.2-bin.zip 25-Sep-2010 05:38 164M
apache-nutch-1.2-bin.zip.asc 25-Sep-2010 05:37 203
apache-nutch-1.2-src.tar.gz 25-Sep-2010 05:37 50M GZIP compressed document
apache-nutch-1.2-src.tar.gz.asc 25-Sep-2010 05:37 203 GZIP compressed document
apache-nutch-1.2-src.zip 25-Sep-2010 05:37 51M
apache-nutch-1.2-src.zip.asc 25-Sep-2010 05:37 203
nutch-0.9.tar.gz 05-Apr-2007 10:17 68M GZIP compressed document
nutch-0.9.tar.gz.asc 05-Apr-2007 10:17 186 GZIP compressed document
nutch-1.0.tar.gz 28-Mar-2009 04:12 83M GZIP compressed document
nutch-1.0.tar.gz.asc 28-Mar-2009 04:12 197 GZIP compressed document
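Extract the archive. On my machine it ended up under /home/**/nutch-1.0, which is the Nutch root directory used in the steps below.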
First, create a text file named "url.txt" in the root of the extracted Nutch directory (the name is up to you). It holds the URLs of the sites you want to crawl.
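A minimal sketch of creating that seed file. `NUTCH_HOME` and its default value are assumptions, and the single seed URL is the site that appears in the search results later in this post.

```shell
# NUTCH_HOME is an assumption: point it at your extracted nutch-1.0 directory.
NUTCH_HOME="${NUTCH_HOME:-$PWD/nutch-1.0}"
mkdir -p "$NUTCH_HOME"   # only so the sketch also runs on a machine without Nutch

# One seed URL per line.
printf '%s\n' 'http://bbs.chinaunix.net/' > "$NUTCH_HOME/url.txt"
cat "$NUTCH_HOME/url.txt"
```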
Next, update the URL filter configuration: open "conf/crawl-urlfilter.txt" and edit it so that the URLs you want to crawl are accepted.
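As far as I know, the stock crawl-urlfilter.txt in Nutch 1.0 contains an accept rule with the placeholder MY.DOMAIN.NAME, and the usual edit is to swap that placeholder for the domain being crawled. The sketch below does that with sed. `CONF_DIR` and the domain chinaunix.net are assumptions, and the file is seeded with the stock rule when it does not exist so that the sketch runs on its own.

```shell
# CONF_DIR is an assumption: the conf directory of your Nutch install.
CONF_DIR="${CONF_DIR:-$PWD/nutch-1.0/conf}"
FILTER="$CONF_DIR/crawl-urlfilter.txt"
mkdir -p "$CONF_DIR"

# Seed the stock accept rule if the file is missing (standalone runs only).
[ -f "$FILTER" ] || printf '%s\n' '+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/' > "$FILTER"

# Replace the placeholder with the domain to crawl; a .bak copy is kept.
sed -i.bak 's/MY\.DOMAIN\.NAME/chinaunix.net/' "$FILTER"
grep -n 'chinaunix' "$FILTER"
```

Lines starting with + accept matching URLs and lines starting with - reject them, so the edited rule lets the crawler follow any host under the chosen domain.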
Then open the nutch-site.xml file and edit it.
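A minimal sketch of what such an edit can look like, assuming the goal is to give the crawler an identity: as far as I know, the Nutch 1.0 fetcher refuses to run while http.agent.name is empty. The agent name, the description and the sample file name are made up for illustration.

```shell
# Write a sample nutch-site.xml into the current directory.
# SITE_SAMPLE and both property values are placeholders (assumptions).
SITE_SAMPLE="${SITE_SAMPLE:-./nutch-site.xml.sample}"

cat > "$SITE_SAMPLE" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>my-test-crawler</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>learning nutch</value>
  </property>
</configuration>
EOF

echo "wrote $SITE_SAMPLE"
```

Settings in nutch-site.xml override the defaults in nutch-default.xml, which is why local changes normally go here.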
Then run the crawler to fetch the pages. The crawl command is run from /home/**/nutch-1.0 (the Nutch root directory).
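A sketch of a one-step crawl, assuming the seed file is url.txt and the output directory is crawled, as in this post. The depth and topN values and the `NUTCH_HOME` default are assumptions chosen for a small test run. The script only prints the command when bin/nutch is not found.

```shell
# NUTCH_HOME, DEPTH and TOPN are assumptions; adjust them to your setup.
NUTCH_HOME="${NUTCH_HOME:-$PWD/nutch-1.0}"
DEPTH=3    # how many link levels to follow from the seed URLs
TOPN=50    # maximum number of pages fetched per level
CRAWL_CMD="bin/nutch crawl url.txt -dir crawled -depth $DEPTH -topN $TOPN"

if [ -x "$NUTCH_HOME/bin/nutch" ]; then
  # Run from the Nutch root so the relative paths resolve.
  (cd "$NUTCH_HOME" && bin/nutch crawl url.txt -dir crawled -depth "$DEPTH" -topN "$TOPN")
else
  echo "bin/nutch not found under $NUTCH_HOME; would run: $CRAWL_CMD"
fi
```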
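The crawler prints a lot of output while it runs. Once the crawl has finished you will find that a crawled directory has been created, with several subdirectories inside it.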
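A small check for the crawl output, assuming the subdirectory names I would expect from a Nutch 1.0 one-step crawl (crawldb, index, indexes, linkdb, segments). `check_crawl_dir` is a hypothetical helper and `CRAWL_DIR` is a placeholder for the crawled directory.

```shell
# Report which expected subdirectories exist in a crawl output directory.
# The name list is an assumption about Nutch 1.0 one-step crawl output.
check_crawl_dir() {
  dir="$1"
  missing=0
  for sub in crawldb index indexes linkdb segments; do
    if [ -d "$dir/$sub" ]; then
      echo "ok       $sub"
    else
      echo "missing  $sub"
      missing=1
    fi
  done
  return "$missing"   # 0 when everything is present, 1 otherwise
}

check_crawl_dir "${CRAWL_DIR:-./crawled}" || echo "crawl output looks incomplete"
```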
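4. Deploy the nutch web application in Tomcat
Copy the nutch-1.0.war file from the Nutch root directory into /var/lib/tomcat6/webapps, then visit http://localhost:8080 again. Tomcat unpacks the archive into a nutch-1.0 directory.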
Then edit WEB-INF/classes/nutch-site.xml inside the unpacked directory.
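A sketch of the web application side, assuming the point of this edit is to tell the search page where the crawl output lives. As far as I know that location is read from the searcher.dir property. `CRAWL_DIR` is a placeholder for the absolute path of the crawled directory, and the sample file name is made up.

```shell
# CRAWL_DIR is a placeholder: use the absolute path of your crawled directory.
CRAWL_DIR="${CRAWL_DIR:-/path/to/nutch-1.0/crawled}"
WEBAPP_SAMPLE="${WEBAPP_SAMPLE:-./webapp-nutch-site.xml.sample}"

# Unquoted heredoc so that $CRAWL_DIR is expanded into the file.
cat > "$WEBAPP_SAMPLE" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>searcher.dir</name>
    <value>$CRAWL_DIR</value>
  </property>
</configuration>
EOF

echo "wrote $WEBAPP_SAMPLE"
```

After copying the property into WEB-INF/classes/nutch-site.xml, restart Tomcat with /etc/init.d/tomcat6 restart so the web application rereads its configuration.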
5. Search with nutch
Open http://localhost:8080/nutch-1.0 in a browser and the Nutch search page appears. A search then returns results like this one:

論壇首頁 - 中國最大的Linux/Unix技術社區 - IT人的網上社區 - bbs.ChinaUnix.net
... Unix操作系統 ← Linux論壇 RSS訂閱 ... by CU管理員 Linux時代首頁 Linux ...
http://bbs.chinaunix.net/ (cached) (explain) (anchors)

------------------------------------------------
| Problems I ran into along the way |
------------------------------------------------
1. An error saying JAVA_HOME cannot be found
Solution: edit the /etc/environment file and add a JAVA_HOME entry.
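A sketch that works out a candidate JAVA_HOME line, assuming java is already on the PATH. The fallback path /usr/lib/jvm/java-6-openjdk is my assumption for the OpenJDK 6 package on Ubuntu 10.04; check /usr/lib/jvm on your own machine. Nothing is written to /etc/environment; the line is only printed.

```shell
# Derive the JDK directory from the java binary on PATH, if there is one.
if command -v java >/dev/null 2>&1; then
  JAVA_BIN="$(readlink -f "$(command -v java)")"
  JAVA_HOME_GUESS="${JAVA_BIN%/bin/java}"
  # Some JDK layouts keep java under jre/bin, so drop a trailing /jre too.
  JAVA_HOME_GUESS="${JAVA_HOME_GUESS%/jre}"
else
  JAVA_HOME_GUESS="/usr/lib/jvm/java-6-openjdk"   # assumed default location
fi

# /etc/environment takes plain KEY="value" lines, without "export".
ENV_LINE="JAVA_HOME=\"$JAVA_HOME_GUESS\""
echo "$ENV_LINE"
```

Append the printed line to /etc/environment with root privileges, then log out and back in so that it takes effect.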
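2. Pages were crawled, but searching returns nothing
Solution: besides the changes above, one more file needs attention: /home/**/nutch-1.0/conf/nutch-default.xml. Find the following section and adjust it accordingly: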
|
Sometimes the results still do not show up, and you also have to run:
|
Haha, that's all there is to it!
