Cputoolkit on lpar

While researching CPU behavior on AIX/POWER/lpar I found Karl Arao’s article on cores vs threads, v2 vs x2. The article refers to the cputoolkit tool from https://karlarao.wordpress.com/scripts-resources/.

Soon I realized I needed to compare CPU usage from several sources. This approach allowed me to observe both time spent on CPU and core-based usage. I ended up creating my own script (hog) to generate load and capture CPU-related information.

I was wondering how hog compares to cputoolkit. I added the driver sql from cputoolkit, so I was expecting similar results.

--lotslios, borrowed from Karl Arao's cputoolkit, I think he got it from a script by Tanel Poder
--calibrated where clause for around 0.3 sec run time on test lpar
declare a number;
begin
  select /*+ cputoolkit ordered use_nl(b) use_nl(c) use_nl(d) full(a) full(b) full(c) full(d) */ count(*) into a
  from sys.obj$ a, sys.obj$ b, sys.obj$ c, sys.obj$ d
  where a.owner# = b.owner# and b.owner# = c.owner# and c.owner# = d.owner# and rownum <= 3000000;
end;

In cputoolkit the data is gathered by sampling v$sql every few seconds; the tool picks the last sample from each set. I was wondering if all logical IOs (LIOs) are captured/reported this way.
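
Something along these lines would approximate that sampling (a sketch of mine, not cputoolkit’s actual query):

select sql_id, executions, buffer_gets,
       round(buffer_gets / nullif(executions, 0)) lios_exec,
       round(elapsed_time / 1e6 / nullif(executions, 0), 2) elap_exec
  from v$sql
 where sql_text like 'select /*+ cputoolkit%';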

In hog I calibrate the sessions to run for a set time and capture the number of transactions (the number of executions of the driver sqls) and the LIOs from v$mystat on exit. This approach guarantees exact counts and also makes for a smoother transition between sets, since I do not have to kill Oracle sessions.
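
At exit each session can read its own counter from v$mystat; a minimal sketch of the capture (the stat choice mirrors what hog reports):

select sn.name, ms.value
  from v$mystat ms, v$statname sn
 where sn.statistic# = ms.statistic#
   and sn.name = 'session logical reads';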

I opted to report data normalized by time. This way I can easily compare information from ASH, instrumentation, and the OS.

I picked an lpar with 10 virtual processors, smt=4, 40 logical CPUs. During the tests there was enough available core capacity in the shared pool to fully utilize all processors.

cputoolkit

sh cputoolkit 1 1 50 $ORACLE_SID

The workload starts at 179526 LIOs/sec for the 1st session. I see an almost linear increase up until around 10 sessions (makes sense, we have 10 virtual processors). After that we see diminishing increases from additional sessions. The maximum workload is around 2.3 million LIOs/sec at 40 sessions. I calculated LIOs/sec by multiplying LIOS_ELAP by the number of sessions. It matches ‘Logical Reads Per Sec’ from loadprof.sql, which is recorded by cputoolkit during the test. I uploaded the logs in cputoolkit_031717.zip

cputoolkit_lpar_throughput

The 1st session can produce 179526 LIOs per second. By the time we run 40 concurrent sessions, per-session performance degrades to 57910 LIOs/sec.
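
A quick cross-check: 57910 LIOs/sec × 40 sessions ≈ 2.3 million LIOs/sec, which matches the maximum workload above.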

At the sql level, as we run more concurrent sessions each sql runs a bit longer; ELAP_EXEC is up from 0.35 to 0.92 seconds with 40 sessions. This means fewer executions by each session; EXEC/sec/CPU is down from 2.37 to 1.09 executions per second.

cputoolkit_session_throughput

You can see all data points in cputoolkit.xlsx

For some reason LIOs per execution (LIOS_EXEC column in cputoolkit.xlsx) is slightly decreasing. That would mean our driver sql executes with fewer logical IOs… not likely. I suspect a measurement error from the periodic sampling of buffer_gets in v$sql. Data points from Karl also show decreasing LIOS_EXEC, resulting in a downward-sloping LIOs/Elap even when running fewer sessions than cores.

But the real issue on AIX/POWER/lpar is that cpu_time in v$sql does not represent time spent on CPU. It is core capacity used. Therefore CPU WAIT can’t be calculated by subtracting cpu_time from elapsed_time.
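
The naive math looks innocent enough; here is a sketch of the calculation that silently breaks on AIX/POWER (the v$sql columns are real, the assumption that cpu_time means wall-clock time on CPU is not):

select sql_id,
       round(elapsed_time / 1e6, 1) ela_sec,
       round(cpu_time / 1e6, 1) cpu_sec,  -- core capacity on AIX/POWER, not time on CPU
       round((elapsed_time - cpu_time) / 1e6, 1) cpu_wait_sec_naive
  from v$sql
 where sql_id = '&sql_id';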

hog

hog lio nobind '0 50 1'

The workload starts at 179000 LIOs/sec for the 1st session. Similarly to cputoolkit, I see an almost linear increase up until around 10 sessions. After that there are diminishing returns from additional sessions. The maximum workload is 2.2 million LIOs/sec at 40 sessions. I calculated LIOs/sec from slio_k, which is thousands of ‘session logical reads’ from v$mystat. It matches the test results from cputoolkit.
hog_lpar_throughput

The 1st session can deliver 179000 LIOs/sec, which degrades to 67650 LIOs/sec by the time we run 40 concurrent sessions.

Workload graphs beyond 20 sessions are jerky. I think it is from the POWER processor switching between SMT levels to run more sessions, i.e. when a processor changes from SMT2 to SMT4 there will be a slight degradation in performance until all threads are utilized… check out the utilization sheet in power7.xlsx to see how that can happen.

At the sql level, as we run more concurrent sessions each sql runs a bit longer: res_ms (brown line) is up from 357 msec to 932 msec with 40 sessions (very similar to the 0.35 to 0.92 seconds with cputoolkit). This means fewer sql executions by each hog: hog throughput (tx_s/hog, blue line) is down from 2.8 to 1.07 executions per second with 40 concurrent hogs running.
hog_session_throughput

With hog I collect enough information to calculate the data reported by cputoolkit. “time” from pprof is the sum of time spent on CPU by Oracle sessions. Assuming the sqls fully utilize the sessions, it is equivalent to the CPU used by the driver sqls (see the hog sheet in cputoolkit.xlsx, time -> CPUSECS). ELAP_EXEC (brown) is the duration of the test divided by the number of sql executions. From these I was able to calculate CPU WAIT (gray line). Waits start to show up after all 40 logical CPUs are saturated; beyond that, elapsed time starts to diverge from CPU used.

cpu_wait.jpg
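
The math itself is simple; a hedged sketch with my own variable names (not hog’s actual columns):

-- cpusecs   = sum of pprof "time" for the hog sessions (time on CPU, not core capacity)
-- elap_exec = test duration / number of sql executions
-- cpu_exec  = cpusecs / number of sql executions
-- cpu_wait  = elap_exec - cpu_exec
select :elap_exec - (:cpusecs / :executions) cpu_wait_per_exec from dual;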

Converting/substituting core-based CPU usage to time spent on CPU allowed me to calculate CPU WAIT on AIX/POWER/lpar. I obtained time spent on CPU from pprof. I could use ASH ‘ON CPU’, but ASH would continue to grow beyond 40 as we run more hogs. On the other hand, pprof CPU usage is maxed out at 40, since we have 40 logical CPUs providing a maximum of 40 seconds of CPU time each wall-clock second.

LIOS_EXEC on the hog sheet is flat, while it is slightly decreasing with cputoolkit. I think we have some collection problem with cputoolkit; that might be the reason the throughput growth is not perfectly linear even when the number of sessions is less than the number of cores.

With hog, up until around 10 sessions each process gets 0.63 core resource. Yet response time/performance also degrades slightly even under 10 sessions. I wonder if some friction inside the DB or at the OS/lpar level kicks in. In theory I would expect (more/perfectly) linear growth, even with our logical-read-heavy sql, up until all processors are running single threaded. This calls for more tests; at this point I do not know the reason.

In the past I used a PL/SQL loop to drive up the load:

declare a number := 1; begin for i in 1..40000 loop a := ( a + i )/11; end loop; end;

With that load type I measured a more linear performance increase with flatter response times until 10 hogs, check out lpar_test sheet in cputoolkit.xlsx.

I started by testing cputoolkit on AIX/POWER/lpar. I noted problems with CPU WAIT because cpu_time is on a different scale… it is not time on AIX/POWER. Then I moved on to test with hog, updated to use the same sql to drive the load. It produced similar results for lpar throughput and sql response time. Hog collects both core capacity used and time spent on CPU by Oracle sessions. With that information I was able to calculate CPU WAIT on AIX/POWER/lpar.

Soft thoughts

Learning

There is always the next level.

I am still learning. What I wrote here mostly exists in bits & pieces on the Internet. Not much new… but it was new for me. During my research I focused on CPU utilization on AIX/POWER/lpar. I felt the CPU used by the DB somehow was not accounted for in AWR reports, nor reflected in CPU utilization at the OS level.

I built on findings by Marcin Przepiorowski in his article Oracle on AIX – where’s my cpu time?. I also benefited from Jeremy Schneider’s excellent summary Understanding CPU on AIX Power SMT Systems. Another great source was IBM AIX POWER7 CPU Usage and Throughput by ksun. Applying the information allowed me to explain the low CPU utilization of the machine/lpar and the low CPU usage numbers of the DB. It also led me to the root cause of slow DB performance.

Why Write?

Writing my experience down forced me to think it over. I reached a whole new level of understanding. I hope you will learn something by reading this. I also hope someone out there will point out mistakes in my posts; that will give me another push to dig deeper.

Rule #1

The DB does not operate in thin air. It executes queries on behalf of the applications. It responds to those requests by using resources from the OS/HW. Neither side is controlled by the DBA. Yet we are frequently called upon to fix slowness in the DB.

Rule #1: the DBA is guilty. The application is slow because of the DB unless proven otherwise. The DB is using too much resource on the OS? The DBA is guilty again. But how can a DBA know the app and the OS/HW? In today’s specialized world you can’t cover it all. So the people supporting the app and the OS are your friends. They can help you find unusual patterns of activity affecting the DB. Do not expect them to find you, you need to reach out… you are lucky if they are approachable.

Management may not recognize this; they think pushing the DBA harder is the best way to solve the problem. This can quickly become a distraction: you may be asked to participate in fire calls and report progress every few hours. That time would be better spent thinking, or consulting with the few people relevant to the problem. All hands on deck can be counterproductive. Ultimately the client is telling you what to do: you may let them know it is not helping, but I see no point sweating this too much.

Engaging too many “experts” and feeding them information is another sure way to waste time. Until we know what hit us any help is welcome; we just need to balance it somehow.

Our job

As DBAs we are responsible for keeping the DB humming. If something unusual is going on you need to recognize it. Do yourself a favor: know your DB. Know the numbers in your load profile. Know the type of workload the application is requesting. Understand the limitations of the HW the DB is running on. Basic stuff, like 1000 sql executions/sec, 10 hard parses/sec, 2msec average db file sequential reads (which can’t possibly come from physical disk), max 10000 IOPS for the SAN, max 11 seconds of CPU core capacity on the lpar for the main DB (which is on a different scale than the 128 CPU seconds the OS thinks the lpar has)…

Dead-ends

So far I have not posted about dead-ends/side tracks. Those are a necessary evil. There is no way to know they are dead-ends until you pursue them. You may conclude faster that they are irrelevant (victims instead of root causes) by learning from others.

Some of the pitfalls we had:
– using only AWR reports to explain slowness… not everything is in there
– trying to shoehorn the problem into one of the seemingly possible scenarios based on data in AWR reports
– picking the wrong day (with less load) as baseline to “prove” slowness was from increased workload
– picking the wrong day (with a sudden jump in the number of soft parses) as baseline to “prove” slowness was from increased soft parsing
– in the heat of the battle, making changes on “expert” advice without verifying or understanding
– falling into the disk IO/speed trap on a machine with CPU starvation
– allowing yourself to be distracted by constant/useless conference calls
– not following through with application owners: explaining why the DB can’t cope, working together on a solution

At the end…

Our week-long DB slowness ended before we knew what hit us. At the time a lot of attention was devoted to the batches. We thought they were the reason day time activity was slower. The vendor stopped churning through the whole data set and started to process only a subset of accounts (the ones that matter). The fastest sql is the one not running 🙂

Little did we know that the batches were the victims of hard parsing by other sessions. Around the same time they fixed the batch, the sqls stopped coming. So we were back to normal. A couple of months later I tracked down the application responsible. By that time management had moved on and was not interested in pursuing it further.

The same application still puts out bursts of similar sqls a few times a year. During my research I created a tool (lprof) to catch hard parsing as it happens. A couple of months ago the tool helped me identify the issue on the spot. It was “just” a 6-hour burst during business hours; it did not interfere with overnight batches, and users did not really notice it (there were no complaints). I notified management and they acknowledged it. I am out of the environment by now, but this blog may help DBAs the next time it happens.

Thanks for reading.

lprof

How to catch the problem next time?

We had a slow DB for a whole week. We started with AWR reports. I like to look at the load profile first. For our hard parsing problem it did not really help. The Time Model section had more information (ballooning parse time elapsed), but not enough to catch the increase in ‘parse time cpu’. In awrdiff_0418_0427_dbsidxxx.html the data for ‘parse time cpu’ is listed somewhere below, among hundreds of other metrics.

awrdiff_parse_time_cpu

Wow, we had a 1184% increase in ‘parse time cpu’. But that information is buried in an AWR report generated long after the fact.

Wouldn’t it be nice to see the relevant/main metrics in real time?

In my load test script (hog) I combine information from several sources. I wanted to see some of the AIX specific output from lparstat (app, physc) together with CPU related info from ASH (sampled) and sysstat (instrumented)… it would be nice to have similar information every few seconds.

I also wanted something generic enough to catch issues with a wide range of resources… CPU, disk or network.

I did not want it to spill too much information onto my screen, so I made it easy to pick and choose metrics.

I wanted to see results every few seconds, allowing me to observe changes in resource usage in real time.

I wanted it lightweight, working in a simple terminal (no GUI).

I wanted it to run without modification on both AIX and Linux.

Script lprof was born.
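
Under the hood it boils down to delta sampling of cumulative counters; a minimal sketch of the idea (simplified, not the actual lprof code):

declare
  v1 number;
  v2 number;
begin
  select value into v1 from v$sysstat where name = 'parse count (hard)';
  dbms_lock.sleep(5);  -- 11g; needs execute privilege on dbms_lock
  select value into v2 from v$sysstat where name = 'parse count (hard)';
  dbms_output.put_line('prs_h/sec: ' || round((v2 - v1) / 5, 1));
end;
/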

Linux or AIX> lprof

Oracle load profile sampler
Usage
        lprof <sleep seconds>
Output (any/all of these)
        time    date +%H%M%S
        app     lparstat available physical processors in the shared pool or vmstat r (when lparstat is not avail)
        physc   lparstat number of physical processors consumed or vmstat pc (AIX) or vmstat id (on linux)
        busy    lparstat lbusy (percentage of logical processor utilization) or vmstat busy (us+sy) 
        cpu_os% v$osstat, round((&BUSY_TIME_2-&BUSY_TIME_1)/100/&SLEEP,2)/&CPU_COUNT*100
        cpu_os  v$osstat, (&BUSY_TIME_2-&BUSY_TIME_1)/100/&SLEEP
        db_cpu  v$sys_time_model, stat_name = 'DB CPU'/&SLEEP
        cpu_bgr v$sys_time_model, stat_name = 'background cpu time'/&SLEEP
        lio_k   v$sysstat where name='session logical reads' [1000]
        pio     v$sysstat where name='physical reads'
        rmbps   v$sysstat where name='physical read total bytes' [MB]
        wmbps   v$sysstat where name='physical write total bytes' [MB]
        riops   v$sysstat where name='physical read total IO requests'
        wiops   v$sysstat where name='physical write total IO requests'
        seqr_ms v$system_event, 'db file sequential read'
        execs   v$sysstat where name='execute count'
        prs_tot v$sysstat where name='parse count (total)'
        prs_h   v$sysstat where name='parse count (hard)'
        prs_ela v$sysstat where name='parse time elapsed'
        cpu_prs v$sysstat where name='parse time cpu'
        bstcl_k v$sysstat where name='bytes sent via SQL*Net to client'                
        brfcl_k v$sysstat where name='bytes received via SQL*Net from client'  
        rtrcl   v$sysstat where name='SQL*Net roundtrips to/from client'               
        bstdbl  v$sysstat where name='bytes sent via SQL*Net to dblink'                
        brfdbl  v$sysstat where name='bytes received via SQL*Net from dblink'  
        rtrdbl  v$sysstat where name='SQL*Net roundtrips to/from dblink'               
        load    v$osstat, LOAD average for last minute
        aas     v$active_session_history sessions with session_state = 'ON CPU' in last sample
        ash_cpu v$active_session_history with session_state = 'ON CPU'
        ash_prs v$active_session_history with session_state = 'ON CPU' and IN_PARSE
        ash_prh v$active_session_history with session_state = 'ON CPU' and IN_HARD_PARSE
        ash_lit v$active_session_history count sql-s with most literals in last sample, not normalized for time
        ash_tm  v$active_session_history latest sample_time
        aas1    v$active_session_history aas for most active event
        ash_1st v$active_session_history most active event
        aas2    v$active_session_history aas for 2nd most active event
        ash_2nd v$active_session_history 2nd most active event
        sleep   actual time between measurements, can increase on busy environment

It is my way of having a first look at a DB; it helps me feel out what is going on in the environment.

Use case #1

On Linux I used hog to generate parse load and watched changes in the number of parses per second and the CPU used for parsing.

Use case #2

On AIX I used the same hog script to generate parse load and watched changes in the number of parses per second and the CPU used for parsing. I also watched how ASH shows the session using 100% CPU, and observed the AIX specific core-based CPU utilization reported by instrumentation (cpu_db, cpu_prs).

Use case #3

We had a similar episode. For several hours the DB maxed out the CPU on the lpar. It was from hard parsing. I was able to catch it in real time using lprof.

AIX host> lprof 5

Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
Waits between measurements [sec] 5
System configuration: type=Shared mode=Uncapped smt=8 lcpu=128 mem=131072MB psize=12 ent=10.00 
Online Virtual CPUs                        : 16
We are using lparstat on AIX

time    app     physc   aas     aas1    ash_1st aas2    ash_2nd execs   prs_tot prs_h   lio_k   cpu_db  cpu_prs rmbps   wmbps   riops   wiops   seqr_ms
160428  0.00    11.60   110.60  35.39   cursor: 26.54   CPU     3140    908     11      511     8.5     4.64    498.5   11.7    2380    639     17.69   
160452  0.00    11.65   78.80   22.06   cursor: 18.91   CPU     3469    823     11      516     8.9     5.32    408.0   12.4    3166    609     8.17    
160515  0.00    11.45   190.00  55.10   library 45.60   CPU     2959    833     9       508     8.1     5.22    508.3   11.8    2361    624     15.06   
160539  0.00    11.62   151.20  51.41   cursor: 37.80   CPU     3070    810     14      514     8.6     4.99    466.4   10.8    2354    584     12.50   
160601  0.00    11.26   38.00   14.82   cursor: 9.12    CPU     2931    817     10      564     8.3     4.65    490.2   11.3    2394    611     15.00   
160626  0.00    11.45   80.60   33.05   cursor: 18.54   CPU     2724    762     9       530     8.6     4.77    492.7   11.2    2396    520     13.79   
160651  0.00    11.39   40.40   15.35   cursor: 9.70    CPU     2462    766     9       531     8.7     4.29    487.0   8.3     2109    406     20.06   
160717  0.00    11.67   82.60   23.95   cursor: 20.65   CPU     2743    827     13      575     8.4     4.60    512.0   9.3     2133    469     19.51   
160742  0.00    11.71   80.20   28.87   cursor: 20.85   CPU     3235    803     12      614     8.7     3.78    485.6   10.1    2052    455     21.03   
160808  0.24    11.33   121.40  38.85   cursor: 27.92   CPU     3051    809     8       491     8.7     4.23    439.5   10.8    2986    585     10.64  

# next day...normal...

time    app     physc   aas     aas1    ash_1st aas2    ash_2nd execs   prs_tot prs_h   lio_k   cpu_db  cpu_prs rmbps   wmbps   riops   wiops   seqr_ms
165733  2.93    8.84    17.80   6.41    CPU     4.81    SQL*Net 2005    612     14      655     4.5     .07     814.4   3.2     5214    255     2.23    
165743  2.29    9.46    17.40   6.61    CPU     4.87    SQL*Net 2240    617     11      873     4.6     .04     673.5   3.5     6298    264     1.76    
165754  1.85    9.96    22.60   7.68    CPU     6.33    SQL*Net 2356    709     15      789     4.4     .07     773.5   3.2     4439    207     1.28

The lpar used 11+ cores (physc); there was no core resource left in the shared pool (app=0). Of the 11 cores, around 8 were used by the DB (cpu_db). Around half of that was used for parsing (cpu_prs)… bad. Also, look at the high ‘db file sequential read’ times (seqr_ms), the side effect of CPU starvation.

File lprof.txt has the script with sample test runs, enjoy.

References:
Load profile
SQL Dashboard V2.1 (beta)

Root cause part 2

In the previous post I narrowed down the root cause to hard parsing. In this post we will track down the hard parsing sqls.

The AWR report (and DBA_HIST_SQLSTAT) may not have such information: each sql may execute only once (or a few times), so they are not big enough to make it into AWR.

But the ASH report’s “Top SQL using literals” section showed two FORCE_MATCHING_SIGNATUREs (FMS) with many sql versions. The DB was executing the same sqls, different only in the literals passed in.

We can also find those FMSs using custom scripts; see how I do that in rootcause.txt. There are 2 FMSs with distinct sql counts of 90+ thousand on 4/27. They started to show up around 4/23, the same time parse CPU shot up.
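
A sketch of how such FMSs can be spotted in the cursor cache (my own simplified query, not the one in rootcause.txt):

select force_matching_signature,
       count(distinct sql_id) distinct_sqls
  from v$sql
 where force_matching_signature > 0
 group by force_matching_signature
having count(distinct sql_id) > 100
 order by distinct_sqls desc;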

We can also find parse related information in ASH. We can even calculate time spent on logical CPUs for parsing. ASH also has the session count, the account used, the application connecting, the machine initiating the connections, etc… a gold mine of information. With such detail we can track down the source the parsing sqls were executed from.
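
A hedged sketch of the kind of ASH math involved (the *10 assumes the usual 10 second sample interval of DBA_HIST_ACTIVE_SESS_HISTORY):

select count(*) * 10 cpu_hard_parse_sec
  from dba_hist_active_sess_history
 where session_state = 'ON CPU'
   and in_hard_parse = 'Y'
   and sample_time > systimestamp - interval '1' day;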

Let’s look at how parse CPU from the 2 FMSs compares with the CPU used for all parses and the CPU used by all the sqls in sqlstat. Please see the query “–parse_cpu_fms” in rootcause.txt.

Data points are also listed in the “rootcause” sheet of mining_awr.xlsx, or you can zoom in:

parse_cpu_fms.jpg

CPU_HARD_PARSE_SEC	ASH session_state='ON CPU' and IN_HARD_PARSE='Y' 
CPU_SEC_FMS		ASH session_state='ON CPU' for the 2 FMSs
CPU_HARD_PARSE_SEC_FMS	ASH session_state='ON CPU' and IN_HARD_PARSE='Y' for the 2 FMSs

Most of the DB hard parse CPU (CPU_HARD_PARSE_SEC) came from hard parsing the sqls with the 2 FMSs (CPU_HARD_PARSE_SEC_FMS). The sqls with literals used most of their CPU for parsing, since the CPU used during the FMSs’ execution (CPU_SEC_FMS) is the same as CPU_HARD_PARSE_SEC_FMS.

For some reason 4/29 was different. That day parse related CPU usage disappeared; instead, CPU was used for executing the sqls with the FMSs (CPU_SEC_FMS, yellow line). Any guesses what happened that day? I may write about it in the future.

According to ASH we had 30 sessions executing the 2 FMSs. The sessions were connected from 4 machines; please see the query “–parsing_sessions” in rootcause.txt. In my previous post Are we maxed on CPU? we saw OS load increase by 30 on the slow day; these are the sessions responsible for the load increase.

During the execution of sqls related to the 2 FMSs the sessions were constantly hard parsing. That is a CPU-intensive task; see the query “–session_state by FORCE_MATCHING_SIGNATURE”.

FORCE_MATCHING_SIGNATURE SESSION EVENT                          I I TOTAL_TIME        CPU    WAITING
------------------------ ------- ------------------------------ - - ---------- ---------- ----------
    10458055140908993565 ON CPU                                 N N      59000      59000          0 batch related
    17710121666287808037 ON CPU                                 Y Y      47000      47000          0 hard parsing FMS
     6747409576012680712 ON CPU                                 Y Y      44270      44270          0 hard parsing FMS
      201958791624409026 ON CPU                                 N N      11100      11100          0
    16583812627080328788 WAITING db file sequential read        N N      13210          0      13210
     1409798687556705245 WAITING read by other session          N N      52440          0      52440

Hard parsing sessions, together with the sqls in the batch, increased CPU usage up until the DB used all core capacity available to the lpar.

I am declaring hard parsing of sqls with literals as the root cause.

During my discovery I did a lot of research on (hard) parsing; it is no accident you will see parse_hard & parse_soft hog types in my load testing “hog” script. In a future post I may demonstrate how to generate (and catch) a parse heavy DB load.
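
For a taste, hard parse load can be generated with nothing more than dynamic sql and a changing literal (a toy sketch assuming cursor_sharing=EXACT; hog’s parse_hard type is more involved):

declare
  n number;
begin
  for i in 1 .. 10000 loop
    -- each statement text is unique because of the literal, forcing a hard parse
    execute immediate 'select count(*) from dual where dummy = ''' || i || '''' into n;
  end loop;
end;
/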

At the beginning of my research I ruled out parsing:
– the number of parses (soft and hard) did not change much: soft parses decreased by 12%, hard parses increased by 19%
– the machine was running at around 40% CPU utilization on the slow week, so I was thinking we had plenty of unused CPU capacity

I was wrong. Even without batches the load was around 60 on a machine with 128 logical CPUs/threads. On POWER8 that translates to close to 90% core utilization while the OS thinks the CPU is around 50% utilized. CPU utilization was therefore underreported and misleading.

But it gets worse: the lpar was not able to get full cores for all 16 virtual processors; only around 14 cores were available from the shared CPU pool, so we were beyond 90% utilization of the core capacity available to the lpar.

How is it possible that such a small change in the number of hard parses maxed out the CPUs?
The extra 200 thousand hard parses on the slow day showed up as an increase from 11.4 to 13.7 hard parses per second; does not look like much, right?
It turns out the problem was not the number of hard parses but the type of sql we parsed. I won’t post the sql to protect the innocent. Each one was a complex, 20K-character sql, and the DB processed close to 200 thousand of them per day on our slow week.

On our slow week, on top of normal DB activity and the overnight batches, an application started to bomb the DB with those complex sqls with literals. The 30 hard parsing sessions increased the load on the machine. The POWER8 processor turned on more threads to accommodate the load. Oracle sessions (and the sqls in those sessions) had slower response times because the POWER8 processor was running at a higher SMT level.

The 2nd biggest DB Time increase (after ‘hard parse elapsed’) came from waits on ‘db file sequential read’. A machine with CPU starvation serves disk requests more slowly, especially when the data is coming from cache/memory. I may show in later posts how I reproduced slower disk reads while maxing out the CPU on a test lpar.

During my analysis I got distracted a bit by the intricacies of AIX/POWER. I wasted some time on dead ends, and I also learned a lot along the way. I hope you will benefit and your next root cause analysis can conclude in days instead of months.

An abbreviated version of my experience is available in powerpoint format, Oracle on aix v4.pptx.

Root cause part 1

In “What happened to the batch?” I noted the increased load from the DB. Where was the extra load coming from?

Most metrics are the same or lower in the DB load profile for the slow day. In the AWR diff report (awrdiff_0418_0427_dbsidxxx.html) parsing is down by 12% but hard parsing is up by 19%, from 11.4/sec to 13.7/sec. The diff report does not compare the “Instance Efficiency Percentages” section; for that we have to go back to the AWR reports. Between awr_0418_dbsidxxx.html and awr_0427_dbsidxxx.html, “% Non-Parse CPU” changed from 92% to 46%… more than half of the CPU was used for parsing on 4/28.

Let’s visualize the CPU used for parsing and by the DB. Sheet “whole db” in mining_awr.xlsx has both. I also included the CPU used by all sqls found in sqlstat… which is not all the sqls executed by the DB… please remind me if you know the article explaining what goes into AWR with the snapshots.

parse_cpu

DBCPU_sec	DBA_HIST_SYS_TIME_MODEL, 'DB CPU'
PARSE_CPU_S	DBA_HIST_SYSSTAT, 'parse time cpu'
SQL_CPU_SEC	DBA_HIST_SQLSTAT, sum(CPU_TIME_DELTA)

Zoom in to see it better: parse_cpu.jpg

Around 4/23 parse CPU (PARSE_CPU_S) shot up and became a big part of the overall CPU use by the whole DB (DBCPU_sec). It calmed down for the weekend just to go back up for the whole week of 4/25. Also, it looks like the combined CPU usage from DBA_HIST_SQLSTAT by all sqls (SQL_CPU_SEC) and parse CPU from DBA_HIST_SYSSTAT can’t push overall CPU utilization (DBCPU_sec) higher than 10000 seconds for any 15 min snapshot… core-based DB CPU usage is maxed out at around 11 seconds per wall-clock second (=10000/15/60).

The biggest DB Time gain on the slow day (4/28) came from parse elapsed. See the data points in mining_awr.xlsx, sheet “whole db”. In this picture I am showing the correlation between ‘parse time elapsed’ and ‘parse time cpu’.

parse_elapsed.jpeg

We have around 25000 seconds of parse related elapsed time per 15 minute snap… that translates to 2.4 million seconds of added DB time from parsing (=25000*4*24) for the day (the length of my AWR reports). It matches the ‘hard parse elapsed time’ increase between awr_0418_dbsidxxx.html and awr_0427_dbsidxxx.html in the Time Model Statistics section.

Check out Variable Load Impact On Oracle Database Hard Parsing by Craig Shallahamer; it explains a scenario with ballooning parse elapsed time.

Next I will investigate where those hard parsing sqls came from.

What happened to the batch?

On our slow week I was primarily concerned about the batch. It was reaching into the day and it was several days behind. We thought it might be the reason for the slowness during business hours.

The main sql in the batch did not change: we had the same execution plan and a similar data set for 4/18 and 4/27. Yet on the morning of 4/27 it finished in 255 minutes instead of 180 minutes, a 40% increase.

Please see the queries I used in this post in Whathappennedtothebatch.txt.

I checked execution statistics in DBA_HIST_SQLSTAT for the dominant sql in the batch. It had more and slower physical reads, hence the increase in IOWAIT_SEC. But the CPU used (CPU_SEC) remained about the same; CPU was much smaller relative to elapsed time.

Further evidence of a disk problem is the increase in synchronous read times. I list ‘db file sequential read’ from DBA_HIST_SYSTEM_EVENT in mining_awr.xlsx, sheet “whole db”, column SEQR_ms. It is around 2.5 msec on the 18th; it fluctuates between 4 and 14 msec on the morning of the 27th during batch hours.

At this point I was thinking the batches were slower because of more physical reads and also slower disks.

But analyzing ASH data showed a different picture. The sql was running on CPU much longer (85% of the time) than waiting on disk. This is not obvious from looking at the CPU used (CPU_SEC) in sqlstat. No surprise there, we should know better by now: the CPU numbers in sqlstat are core based (on a different/smaller scale).

Per the ASH data the dominant sql was all about CPU!
This realization did not come easy. I researched disk IO, checked the volume of reads/writes, IOPS, single/multi block reads, etc.
I used Karl Arao’s scripts from https://karlarao.wordpress.com/scripts-resources/ to decide on the day I would use as baseline. On 4/18 we had normal DB performance with similar or bigger disk IO; this helped me avoid the “more disk IO made us slower” trap.

Load profile from awrdiff_0418_0427_dbsidxxx.html

diff_load_profile

We do have 29% more “Read IO requests” per second (IOPS), but it is well within the normal range for the SAN; 5000 read requests/sec is OK for this environment. On our slow day (4/27), with a similar disk IO profile, ‘db file sequential read’ times increased from 2.75ms to 6.11ms. This allowed me to eliminate increased IO load as the reason for disk and DB slowness; that was a major side track at the beginning of my analysis. What happened with the disks outside of the DB (on the lpar, on the shared SAN) is not known, so the possibility they caused problems is there; I just do not have data on it.

The sql in the batch is mostly running on CPU, so I have a better candidate to explain the slowness.

 --4/18
 SESSION EVENT                                                            COUNT(*)*10          %
 ------- ---------------------------------------------------------------- ----------- ----------
 WAITING latch free                                                             29500         17
 ON CPU                                                                        143000         83

 --4/27
 SESSION EVENT                                                            COUNT(*)*10          %
 ------- ---------------------------------------------------------------- ----------- ----------
 WAITING db file sequential read                                                20000         15
 ON CPU                                                                        112630         85

Let’s crosscheck the CPU data from ASH with instrumentation. The above counts are from sampled ASH data for the sql in the batch running on logical CPUs for an hour. Let’s convert that to core-based usage (another round of fuzzy math) so we can compare it with the CPU used by the same sql in sqlstat (CPU_SEC in Whathappennedtothebatch.txt). Assuming we mostly keep all 4 threads of each core busy while running with SMT4… we have around 35000 (=143000/4) core seconds for the hour. That is around 9000 seconds of CPU usage per 15 min by the sqls in the batch. I think the batch was the main CPU user in the DB in the morning hours of 4/18.
Let’s check DB CPU usage for the whole DB from the time model: DBCPU_sec on the “whole db” sheet of mining_awr.xlsx; the data points there are from 15 min snaps. We see around 10000 seconds at the time of the morning batch.
We have a fairly good match for CPU used between the instrumented core-based data and the sampled ASH data.

During batch hours OS load jumped from 65 on 4/18 to around 90 on 4/27. Please check OS_LOAD in arewemaxedoncpu.jpg or in mining_awr.xls, sheet “arewemaxedoncpu”. The extra demand is coming from the DB: CPU_ORA_DEMAND correlates with OS load. The DB had 30 more CPU heavy sessions during our slow batch day.

According to power8.xls, “utilization” sheet, with 65 processes we are running with SMT4. Adding 30 more switches half of the cores to SMT8. Relative thread response time increases from 2.11 to 4 when all cores switch to the higher SMT level. With half of them remaining on SMT4, on average we can expect a 0.95-fold (=4/2.1/2) increase in response time. That is about 45% (0.45=0.95/2.1) slower than running all Oracle batch sessions on threads with SMT4. You can grill me on this fuzzy math; I would just run tests to prove it.

The batch is CPU heavy, therefore changes in CPU behavior directly influence its speed. The higher load on the lpar forced many of the virtual processors to use SMT8 instead of SMT4. That change resulted in higher response times for processes executing on logical processors assigned to threads in SMT8 mode. That can explain the batch finishing 40% slower on 4/27.

BTW, in mining_awr.xlsx, sheet “whole db”, the data points for “app_tx/s” show sql throughput for the sql in the batch: it changed from around 17 to 12 executions/sec… a slowdown of about 41% (17/12 ≈ 1.41), in line with the 40% longer run time.

Later that week we tinkered with DB parameters. We also disabled an sql profile related to the sql in the batch. Those changes completely altered the load profile of the sql, making it harder to diagnose the real root cause. Little did we know there was something else lurking under the hood.

By now the batch is gone from the environment. I could make something similar up and mass execute it (like the batch did); I may do it if you convince me the batch had something to do with the DB slowness. At this point it looks more like the victim, not the root cause.

Request for help:
Is there a way to see what SMT level is used by the processors on the lpar? Currently I am inferring it from OS load…

Are we maxed on CPU?

Depends how we look at it.
Yes, we were maxed out on available core capacity.
No, the lpar did not use all threads/logical CPUs all the time.

In my Mining AWR post I observed a discrepancy in CPU usage numbers between the OS and the DB, even when the DB is the main user on the machine (the machine has another instance, but this DB is the main production DB).
Mystery solved: for the OS CPU usage we were looking at time spent on threads (logical CPUs) by all processes running on the machine; for the DB the numbers reflected core capacity used by the DB.

Core capacity used

According to my Where is my CPU post, on the slow day (4/27) the lpar was using around 14 cores the whole day. The lpar is uncapped; the shared pool has 16 physical cores, obviously not all available to this machine.
Also, out of the 14 cores the DB was using around 11.

Time spent on threads

OS CPU usage from V$OSSTAT, or host CPU in SYSMETRIC, is thread based. So is sampled ASH data from the DB.

The OS numbers include all processes running on the lpar, but I suspect the DB must be the biggest CPU user. ASH may also include CPU WAIT when the machine is short on CPU resources… it can help us track CPU shortage at the thread level.

Let’s look at logical CPU usage for both the OS and the DB.

CPU_ORA_DEMAND		ASH session_state='ON CPU'
CPU_ORA_CONSUMED	dba_hist_sysmetric_summary metric_name='CPU Usage Per Sec'
CPU_OS   		dba_hist_sysmetric_summary metric_name='Host CPU Usage Per Sec'
OS_LOAD   		dba_hist_sysmetric_summary metric_name='Current OS Load'
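
Here is a hedged approximation of how CPU_ORA_DEMAND can be pulled from AWR’s ASH (my simplified version, not the query in arewemaxedoncpu.txt):

select trunc(sample_time, 'HH') snap_hour,
       round(count(*) * 10 / 3600, 1) cpu_ora_demand  -- average active 'ON CPU' sessions, 10 sec samples
  from dba_hist_active_sess_history
 where session_state = 'ON CPU'
 group by trunc(sample_time, 'HH')
 order by 1;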

arewemaxedoncpu

You may need to zoom in to see the details in arewemaxedoncpu.jpg or in mining_awr.xls sheet “arewemaxedoncpu”.
Here is the query arewemaxedoncpu.txt

We can see that the DB is responsible for most of the CPU demand on the machine. As demand goes up, so do OS_LOAD and CPU_OS. Outside of batches demand is slightly lower than OS_CPU; the lpar is holding up. During batches demand is higher than the available CPU capacity, and lpar CPU is maxed out (from batches and other activity combined).
OS_CPU capacity used during batches peaks at around 92… assuming SMT8 and logical CPUs maxed out by the DB, that translates to 11.5 (=92/8) cores used… a good match with the core-based CPU_ORA_CONSUMED.

The incident was almost a year ago. We still have AWR and grid data, but no access to nmon/OS info (I am a DBA with limited OS support). I can account for around 11 cores used by the DB. With the lpar using around 14 cores I am guessing the difference was used by processes outside the DB. Any idea if there is a way to track down what was using the remaining 3 cores (from the sources I have access to… AWR & grid)?