背景:先前在生產(chǎn)庫(kù)上配置了oracle監(jiān)控,每5分鐘嘗試連接一次數(shù)據(jù)庫(kù),若連接失敗則通過(guò)nagios+fetion自動(dòng)報(bào)警,此配置參考文章:http://ylw6006.blog.51cto.com/470441/787496
早晨收到報(bào)警信息后,登陸數(shù)據(jù)庫(kù)執(zhí)行ps -ef查看oracle的后臺(tái)進(jìn)程都在,使用conn /as sysdba的方式登陸數(shù)據(jù)庫(kù),提示連接到空閑的實(shí)例,使用easy connect 方式連接則報(bào)oracle實(shí)例無(wú)法分配內(nèi)存,從報(bào)錯(cuò)提示上看,就像oracle數(shù)據(jù)庫(kù)實(shí)例未打開(kāi)的狀態(tài)!分析alert日志不斷出現(xiàn)如下錯(cuò)誤信息:
Process J002 died, see its trace file
kkjcre1p: unable to spawn jobq slave process
Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_cjq0_18577.trc:
trace文件摘要信息如下:
*** 2012-07-26 10:20:31.068
Process J002 is dead (pid=13857 req_ver=1136 cur_ver=1136 state=KSOSP_SPAWNED).
*** 2012-07-26 10:20:32.069
Process J002 is dead (pid=13876 req_ver=1594 cur_ver=1594 state=KSOSP_SPAWNED).
google查詢一番后,發(fā)現(xiàn)大部分描述和oracle的進(jìn)程數(shù)設(shè)置有關(guān),又或者是內(nèi)存不足引起!于是在oracle 10g環(huán)境下測(cè)試,線上數(shù)據(jù)庫(kù)環(huán)境為11.2.0.3
一:設(shè)置processes初始化參數(shù)值為20,重啟數(shù)據(jù)庫(kù)后,已經(jīng)占用19個(gè)進(jìn)程
- SQL> select count(*) from v$process;
-
- COUNT(*)
- ----------
- 19
-
- SQL> show parameter process;
-
- NAME TYPE VALUE
- ------------------------------------ ----------- ------------------------------
- aq_tm_processes integer 0
- db_writer_processes integer 1
- gcs_server_processes integer 0
- job_queue_processes integer 10
- log_archive_max_processes integer 2
- processes integer 20
新的會(huì)話連接,則報(bào)連接到空閑的實(shí)例,alert日志則出現(xiàn)相應(yīng)的報(bào)錯(cuò)
- [root@db1 ~]# su - oracle
- [oracle@db1 ~]$ sqlplus /nolog
- SQL*Plus: Release 10.2.0.1.0 - Production on Wed Jul 4 13:50:22 2012
- Copyright (c) 1982, 2005, Oracle. All rights reserved.
-
- SQL> conn /as sysdba
- Connected to an idle instance.
-
- [oracle@db1 dbs]$ tail -f /u01/app/oracle/admin/db1/bdump/alert_db1.log
- Wed Jul 4 13:52:23 2012
- ksvcreate: Process(q000) creation failed
- Wed Jul 4 13:52:35 2012
- Process q001 died, see its trace file
- Wed Jul 4 13:52:35 2012
- ksvcreate: Process(q001) creation failed
- Wed Jul 4 13:52:37 2012
- Process m000 died, see its trace file
- Wed Jul 4 13:52:37 2012
- ksvcreate: Process(m000) creation failed
二:于是想到監(jiān)控oracle的進(jìn)程和會(huì)話數(shù)來(lái)進(jìn)一步確定問(wèn)題
1:首先要對(duì)用戶進(jìn)行顯示授權(quán),否則后面創(chuàng)建存儲(chǔ)過(guò)程編譯將會(huì)報(bào)錯(cuò)
- SQL> grant select on V_$SESSION tbo hr;
- Grant succeeded.
-
- SQL> grant select on V_$PROCESS to hr;
- Grant succeeded.
2:建表,用來(lái)存儲(chǔ)結(jié)果
- SQL> create table session_monitor(time timestamp,session_count number,process_count number);
- Table created.
3:創(chuàng)建存儲(chǔ)過(guò)程,將數(shù)據(jù)插入表
- SQL> create or replace procedure proc_session
- 2 is
- 3 v_session number(8);
- 4 v_process number(8);
- 5 begin
- 6 select count(*) into v_session from v$session;
- 7 select count(*) into v_process from v$process;
- 8 insert into session_monitor values (sysdate,v_session,v_process);
- 9 commit;
- 10 end proc_session;
- Procedure created.
4:創(chuàng)建任務(wù)
- SQL> var job number;
- SQL> begin
- 2 sys.dbms_job.submit(job => :job,
- 3 what => 'proc_session();',
- 4 next_date => sysdate,
- 5 interval => 'sysdate+2/1440');
- 6* end;
- PL/SQL procedure successfully completed.
5:測(cè)試效果
- SQL> exec proc_session;
- PL/SQL procedure successfully completed.
-
- SQL> alter session set nls_date_format='YYYY-MM-DD-HH24:MI:SS';
- Session altered.
-
- SQL> select * from session_monitor;
-
- TIME SESSION_COUNT PROCESS_COUNT
- ---------------------------------------- ------------- -------------
- 26-JUL-12 03.02.12.000000 PM 140 155
- 26-JUL-12 03.02.14.000000 PM 141 157
-
- SQL> select job,next_date from user_jobs where what='proc_session();';
-
- JOB NEXT_DATE
- ---------- -------------------
- 145 2012-07-26-15:04:14
-
- SQL> select * from session_monitor;
-
- TIME SESSION_COUNT PROCESS_COUNT
- ---------------------------------------- ------------- -------------
- 26-JUL-12 03.04.14.000000 PM 87 94
- 26-JUL-12 03.02.12.000000 PM 140 155
- 26-JUL-12 03.02.14.000000 PM 141 157
6:如果要?jiǎng)h除任務(wù),則運(yùn)行下列的命令,145代表user_jobs視圖中的job列
- SQL> begin
- 2 dbms_job.remove(145);
- 3 end;
- PL/SQL procedure successfully completed.
三:使用nagios+fetion,定時(shí)去監(jiān)控會(huì)話和進(jìn)程數(shù)
1:創(chuàng)建監(jiān)控腳本,該腳本放任務(wù)計(jì)劃中運(yùn)行,每2分鐘自動(dòng)執(zhí)行
- [root@server240 libexec]# cat session_oracle.sh
- #!/bin/sh
- rm -rf /tmp/session_oracle.log
- export ORACLE_HOME=/u01/app/oracle/product/11.2.0/db1
- /u01/app/oracle/product/11.2.0/db1/bin/sqlplus hr/hr@192.168.1.240:1521/orcl <<EOF
- set echo off
- set feedback off
- spool /tmp/session_oracle.log
- alter session set nls_date_format='YYYY-MM-DD:HH24:MI:SS';
- select session_count from (select * from session_monitor order by time desc ) where rownum=1;
- select process_count from (select * from session_monitor order by time desc ) where rownum=1;
- spool off
- set echo on
- set feedback on EOF
2:創(chuàng)建第二腳本,用來(lái)處理前面監(jiān)控腳本的日志輸出,將結(jié)果返回給監(jiān)控服務(wù)器
- [root@server240 ~]# cat /tmp/session_oracle.log
SQL> alter session set nls_date_format='YYYY-MM-DD:HH24:MI:SS';
SQL> select session_count from (select * from session_monitor order by time desc ) where rownum=1; - SESSION_COUNT
-------------
138
SQL> select process_count from (select * from session_monitor order by time desc ) where rownum=1; - PROCESS_COUNT
-------------
153
SQL> spool off -
- [root@server240 libexec]# cat check_oracle_session.sh
- #!/bin/sh
- STATE_OK=0
- STATE_CRITICAL=2
-
- if [ -f /tmp/session_oracle.log ];then
- SESSION=$(grep -A 2 'SESSION_COUNT' /tmp/session_oracle.log |tail -1|sed 's/[ ][ ]*//g')
- PROCESS=$(grep -A 2 'PROCESS_COUNT' /tmp/session_oracle.log |tail -1|sed 's/[ ][ ]*//g')
- else
- echo "something wrong,please check monitor script"
- exit $STATE_CRITICAL
- fi
-
- if [ $SESSION -gt 500 ] || [ $PROCESS -gt 500 ];then
- echo "Current session is $SESSION,process is $PROCESS "
- exit $STATE_CRITICAL
- else
- echo "Current session is $SESSION,process is $PROCESS "
- exit $STATE_OK
- fi
四:實(shí)際效果
[root@server198 ~]# /usr/local/nagios/libexec/check_nrpe -H 192.168.1.240 -c check_oracle_session
Current session is 138,process is 153

五:后期觀察結(jié)果,估計(jì)和內(nèi)存問(wèn)題相關(guān)
[oracle@server240 ~]$ sar -r |grep 10:20
10:20:02 AM 3481492 21195164 85.89 675584 13661448 3683012 413552 10.10
[oracle@server240 ~]$ sar -r |grep 04:40
04:40:01 PM 2076748 22599908 91.58 734088 14581728 4048864 47700 1.16