lprof

How to catch the problem next time?

We had a slow DB for a whole week. We started with AWR reports. I like to look at the load profile first, but for our hard parsing problem it did not really help. The time model section had more information (ballooning ‘parse time elapsed’), but not enough to catch the increase in ‘parse time cpu’. In awrdiff_0418_0427_dbsidxxx.html the data for ‘parse time cpu’ is listed somewhere far below, buried among hundreds of other metrics.

[screenshot awrdiff_parse_time_cpu: the ‘parse time cpu’ row from the awrdiff report]

Wow, we had a 1184% increase in ‘parse time cpu’. But that information is buried in an AWR diff report generated long after the fact.
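
If you already know which statistic to chase, you do not have to hunt through the diff report at all. A minimal sketch, assuming Diagnostics Pack licensing for the dba_hist_* views and a single instance, pulls the per-snapshot deltas of ‘parse time cpu’ directly:

-- hypothetical sketch: per-snapshot delta of 'parse time cpu' from AWR history
select s.snap_id,
       to_char(s.end_interval_time, 'YYYY-MM-DD HH24:MI') snap_time,
       st.value - lag(st.value) over (order by s.snap_id) parse_cpu_cs
from   dba_hist_snapshot s
join   dba_hist_sysstat  st
  on   st.snap_id         = s.snap_id
 and   st.dbid            = s.dbid
 and   st.instance_number = s.instance_number
where  st.stat_name = 'parse time cpu'
order  by s.snap_id;
-- value is a cumulative counter in centiseconds; deltas go negative after an instance restart

But that still assumes you already know which metric to look for.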

Wouldn’t it be nice to see the relevant/main metrics in real time?

In my load test script (hog) I combine information from several sources. I wanted to see some of the AIX specific output from lparstat (app, physc) together with CPU related info from ASH (sampled) and sysstat (instrumented)… it would be nice to have similar information every few seconds.
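
Stripped to its bones, the idea is just a shell loop that prints one OS sample and one DB sample per pass. A minimal sketch of that idea only (not the real lprof; the awk field positions and the 5 second ASH window are illustrative and depend on the AIX level):

#!/bin/sh
# toy sampling loop: one lparstat line plus one ASH 'ON CPU' count per pass
while :
do
  lparstat 5 1 | tail -1 | awk '{printf "%s %s ", $5, $8}'   # physc, app (positions vary)
  sqlplus -s / as sysdba <<EOF
set heading off feedback off
-- ON CPU session samples over ~5 seconds, divided by 5 = AAS on CPU
select round(count(*)/5, 2)
from   v\$active_session_history
where  session_state = 'ON CPU'
and    sample_time > systimestamp - interval '5' second;
EOF
done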

I also wanted something generic enough to catch issues with a wide range of resources… CPU, disk or network.

I did not want it to spill too much information all over my screen, so I made it easy to pick and choose metrics.

I wanted to see results every few seconds, allowing me to observe changes in resource usage in real time.

I wanted it lightweight, able to run in a simple terminal (no GUI).

I wanted it to run without modification on both AIX and Linux.

Script lprof was born.

Linux or AIX> lprof

Oracle load profile sampler
Usage
        lprof <sleep seconds>
Output (any/all of these)
        time    date +%H%M%S
        app     lparstat available physical processors in the shared pool, or vmstat r (when lparstat is not available)
        physc   lparstat number of physical processors consumed, or vmstat pc (AIX) or vmstat id (on Linux)
        busy    lparstat lbusy (percentage of logical processor utilization), or vmstat busy (us+sy)
        cpu_os% v$osstat, round((&BUSY_TIME_2-&BUSY_TIME_1)/100/&SLEEP,2)/&CPU_COUNT*100
        cpu_os  v$osstat, (&BUSY_TIME_2-&BUSY_TIME_1)/100/&SLEEP
        cpu_db  v$sys_time_model, stat_name = 'DB CPU'/&SLEEP
        cpu_bgr v$sys_time_model, stat_name = 'background cpu time'/&SLEEP
        lio_k   v$sysstat where name='session logical reads' [1000]
        pio     v$sysstat where name='physical reads'
        rmbps   v$sysstat where name='physical read total bytes' [MB]
        wmbps   v$sysstat where name='physical write total bytes' [MB]
        riops   v$sysstat where name='physical read total IO requests'
        wiops   v$sysstat where name='physical write total IO requests'
        seqr_ms v$system_event, 'db file sequential read'
        execs   v$sysstat where name='execute count'
        prs_tot v$sysstat where name='parse count (total)'
        prs_h   v$sysstat where name='parse count (hard)'
        prs_ela v$sysstat where name='parse time elapsed'
        cpu_prs v$sysstat where name='parse time cpu'
        bstcl_k v$sysstat where name='bytes sent via SQL*Net to client'                
        brfcl_k v$sysstat where name='bytes received via SQL*Net from client'  
        rtrcl   v$sysstat where name='SQL*Net roundtrips to/from client'               
        bstdbl  v$sysstat where name='bytes sent via SQL*Net to dblink'                
        brfdbl  v$sysstat where name='bytes received via SQL*Net from dblink'  
        rtrdbl  v$sysstat where name='SQL*Net roundtrips to/from dblink'               
        load    v$osstat, LOAD average for last minute
        aas     v$active_session_history sessions with session_state = 'ON CPU' in last sample
        ash_cpu v$active_session_history with session_state = 'ON CPU'
        ash_prs v$active_session_history with session_state = 'ON CPU' and IN_PARSE
        ash_prh v$active_session_history with session_state = 'ON CPU' and IN_HARD_PARSE
        ash_lit v$active_session_history count of SQLs with the most literals in last sample, not normalized for time
        ash_tm  v$active_session_history latest sample_time
        aas1    v$active_session_history aas for most active event
        ash_1st v$active_session_history most active event
        aas2    v$active_session_history aas for 2nd most active event
        ash_2nd v$active_session_history 2nd most active event
        sleep   actual time between measurements, can increase on a busy system
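
Most of these columns are per-second rates: the script reads a cumulative counter, remembers it, and next time around divides the difference by the sleep time, which is what the &BUSY_TIME_1/&BUSY_TIME_2 style substitution variables above hint at. A simplified sqlplus sketch of the same technique for one counter (dbms_lock.sleep needs an explicit grant):

-- read a cumulative counter, sleep, read it again, divide by elapsed seconds
column v1 new_value STAT_1 noprint
select value v1 from v$sysstat where name = 'parse time cpu';
exec dbms_lock.sleep(5)
-- 'parse time cpu' is in centiseconds, hence /100: the result is CPU cores spent parsing
select round((value - &STAT_1)/100/5, 2) cpu_prs
from   v$sysstat where name = 'parse time cpu';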

It is my way to have a first look at a DB; it helps me get a feel for what is going on in the environment.

The script was developed and tested on AIX and Linux with Oracle 11gR2; you may need to make minor adjustments for other OS/DB versions.

uname -vM
7 IBM,8286-42A
 
uname -a
Linux ol6-112-rac1.localdomain 2.6.39-200.24.1.el6uek.x86_64 #1 SMP Sat Jun 23 02:39:07 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production

Use case #1

On Linux I used hog to generate parse load and watched the changes in the number of parses per second and the CPU used for parsing.
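
hog itself is my own script and is not listed here; as a hypothetical stand-in, any loop executing statements with changing literals will drive hard parses up, e.g.:

-- hypothetical parse-load generator: each iteration has different SQL text,
-- so every execution is a hard parse
declare
  n number;
begin
  for i in 1 .. 100000 loop
    execute immediate 'select /* parse_load */ ' || i || ' from dual' into n;
  end loop;
end;
/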

Use case #2

On AIX I used the same hog script to generate parse load and watched the changes in the number of parses per second and the CPU used for parsing. I also watched ASH show the session using 100% CPU, and observed the AIX specific core based CPU utilization reported by instrumentation (cpu_db, cpu_prs).

Use case #3

We had a similar episode: for several hours the DB maxed out the CPU on the LPAR, and it was caused by hard parsing. This time I was able to catch it in real time using lprof.

AIX host> lprof 5

Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
Waits between measurements [sec] 5
System configuration: type=Shared mode=Uncapped smt=8 lcpu=128 mem=131072MB psize=12 ent=10.00 
Online Virtual CPUs                        : 16
We are using lparstat on AIX

time    app     physc   aas     aas1    ash_1st aas2    ash_2nd execs   prs_tot prs_h   lio_k   cpu_db  cpu_prs rmbps   wmbps   riops   wiops   seqr_ms
160428  0.00    11.60   110.60  35.39   cursor: 26.54   CPU     3140    908     11      511     8.5     4.64    498.5   11.7    2380    639     17.69   
160452  0.00    11.65   78.80   22.06   cursor: 18.91   CPU     3469    823     11      516     8.9     5.32    408.0   12.4    3166    609     8.17    
160515  0.00    11.45   190.00  55.10   library 45.60   CPU     2959    833     9       508     8.1     5.22    508.3   11.8    2361    624     15.06   
160539  0.00    11.62   151.20  51.41   cursor: 37.80   CPU     3070    810     14      514     8.6     4.99    466.4   10.8    2354    584     12.50   
160601  0.00    11.26   38.00   14.82   cursor: 9.12    CPU     2931    817     10      564     8.3     4.65    490.2   11.3    2394    611     15.00   
160626  0.00    11.45   80.60   33.05   cursor: 18.54   CPU     2724    762     9       530     8.6     4.77    492.7   11.2    2396    520     13.79   
160651  0.00    11.39   40.40   15.35   cursor: 9.70    CPU     2462    766     9       531     8.7     4.29    487.0   8.3     2109    406     20.06   
160717  0.00    11.67   82.60   23.95   cursor: 20.65   CPU     2743    827     13      575     8.4     4.60    512.0   9.3     2133    469     19.51   
160742  0.00    11.71   80.20   28.87   cursor: 20.85   CPU     3235    803     12      614     8.7     3.78    485.6   10.1    2052    455     21.03   
160808  0.24    11.33   121.40  38.85   cursor: 27.92   CPU     3051    809     8       491     8.7     4.23    439.5   10.8    2986    585     10.64  

# next day...normal...

time    app     physc   aas     aas1    ash_1st aas2    ash_2nd execs   prs_tot prs_h   lio_k   cpu_db  cpu_prs rmbps   wmbps   riops   wiops   seqr_ms
165733  2.93    8.84    17.80   6.41    CPU     4.81    SQL*Net 2005    612     14      655     4.5     .07     814.4   3.2     5214    255     2.23    
165743  2.29    9.46    17.40   6.61    CPU     4.87    SQL*Net 2240    617     11      873     4.6     .04     673.5   3.5     6298    264     1.76    
165754  1.85    9.96    22.60   7.68    CPU     6.33    SQL*Net 2356    709     15      789     4.4     .07     773.5   3.2     4439    207     1.28

The LPAR used 11+ cores (physc) and there was no core resource left in the shared pool (app=0). Of those 11 cores, around 8 were used by the DB (cpu_db), and around half of that was spent on parsing (cpu_prs)… bad. Also look at the high ‘db file sequential read’ times (seqr_ms), a side effect of CPU starvation.
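
Reading the first sample row: physc 11.60 with cpu_db 8.5 means the DB burned about 8.5/11.6 ≈ 73% of the consumed cores, and cpu_prs 4.64 means 4.64/8.5 ≈ 55% of the DB CPU went to parsing. Compare with the next day, where cpu_prs dropped to .07, under 2% of cpu_db.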

File lprof.txt has the script with sample test runs, enjoy.

References:
Load profile
SQL Dashboard V2.1 (beta)