POWER7 vs POWER8

In this post I will compare maximum workloads between lpars using POWER7 and POWER8 processors.

Detailed processor information for the two test lpars:

Processor Type: PowerPC_POWER7
 Processor Implementation Mode: POWER 7
 Processor Version: PV_7_Compat
 Number Of Processors: 10
 Processor Clock Speed: 4228 MHz
 System configuration: type=Shared mode=Uncapped smt=4 lcpu=40 mem=92160MB psize=16 ent=1.00

Processor Type: PowerPC_POWER8
 Processor Implementation Mode: POWER 8
 Processor Version: PV_8_Compat
 Number Of Processors: 8
 Processor Clock Speed: 4157 MHz
 System configuration: type=Shared mode=Uncapped smt=8 lcpu=64 mem=102400MB psize=16 ent=0.80

My expectation for processor performance is based on specint_cpu_explorer from Karl Arao.

I picked the closest processor from the list based on clock speed and type.

ResultPerCore: 54.63
 Operating System: IBM AIX V7.1
 Processor MHz: 4228
 Processor: POWER7+
 System: IBM Power 730 Express (4.2 GHz, 16 core)

ResultPerCore: 79.00
 Operating System: IBM AIX V7.1
 Processor MHz: 4192
 Processor: POWER8
 System: IBM Power E870 (4.19 GHz, 80 core)

Expected workload increase between POWER7 and POWER8 processors based on specint_rates:

79/54.63=1.44

Test results generated by my hog load script for both test lpars:

lpar using POWER7 with 10 cores: 1821 tx/s, 182 tx/s/core
lpar using POWER8 with  8 cores: 2521 tx/s, 315 tx/s/core

Workload ratios between POWER7 and POWER8 when the lpars are maxed out:

lpar 2521/1821 = 1.38
core 315/182 = 1.73

POWER7 vs POWER8

References
hog
Lpar test
Lpar test POWER8

Lpar test POWER8

In lpar test I looked at an lpar running on POWER7 virtual processors. This time I will run the same test on an lpar with POWER8.

In power8.xlsx I calculated these theoretical utilization numbers:

smt  core perf  core use  thread perf  thread use  thread resp
1    1.00       0.50      1.00         0.50        1.00
2    1.45       0.73      0.73         0.36        1.38
4    1.89       0.95      0.47         0.24        2.12
8    2.00       1.00      0.25         0.13        4.00
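
As a sanity check on how the columns relate (my reading of the sheet): thread perf is core perf divided by the number of active threads (at SMT8, 2.00/8 = 0.25), and relative thread response time is its reciprocal (1/0.25 = 4.00).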

My expectation is a twofold workload increase between single-threaded and SMT8 modes.
power8_performance

At the same time each thread will run 4 times slower, a trade-off for more throughput.

In my load test script (hog) I am using a PL/SQL loop to drive up the load on the CPUs:

declare a number; begin a := 1; for i in 1..40000 loop a := ( a + i )/11; end loop; end;

The loop uses no data and no SQL, so there is no variation in execution (no change in execution plan, no switch to direct path reads during full table scans, etc.). Therefore the same hog can be used on different databases to compare performance. I am counting the number of executions (“tx_s”) and the response time of each loop (“res_ms”).
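
For illustration, here is a minimal standalone sketch of the measurement idea (a hypothetical harness, not the actual hog code; hog drives many concurrent sessions and reads the counters on exit):

set serveroutput on
declare
  a  number;
  t0 number;
  n  number := 0;
begin
  t0 := dbms_utility.get_time;                     -- centiseconds
  while dbms_utility.get_time - t0 < 1000 loop     -- run for ~10 seconds
    a := 1;
    for i in 1..40000 loop a := ( a + i )/11; end loop;
    n := n + 1;
  end loop;
  -- tx_s: executions per second; res_ms: milliseconds per loop execution
  dbms_output.put_line('tx_s='||round(n/10,1)||' res_ms='||round(10000/n,1));
end;
/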

lpar with POWER8> hog plsql_loop nobind '0 80 1' | tee hog_plsql_loop_`hostname -s`_${ORACLE_SID}_`date +%m%d%y_%H%M`.log

power8 top activity

Please see the data collected in the lpar test sheet of power8.xlsx.

On this lpar we do not have “app” output from lparstat (I am guessing the sysadmins did not expose it), but vmstat’s physical processors consumed (pc) steadily increased during the test. Core capacity used by the lpar maxed out at around 8; it has 8 virtual processors backed by sufficient core capacity from the shared pool.

DB and session CPU consumption (“db_cpu”/“sess_cpu”) matches the core capacity used (“pc”) by the machine: the database is responsible for the load on the machine.

“sess_cpu” matches “purr”, the core capacity used by all hog processes as captured by ‘pprof’; therefore Oracle session CPU usage is expressed in core capacity.

At the same time, hog process CPU consumption expressed as time spent on CPU (the “time” column) matches the number of hogs. Each hog is designed to max out (run constantly on) a logical CPU/thread.

Up until around 60 hogs “os_cpu” matches “time” from ‘pprof’, so we know OS CPU utilization is time based. The OS is busy running hogs.

At high load “os_cpu” and “ash_cpu” go above the maximum time-based CPU capacity of 64. The same is true for the core-based DB and session CPU used from the time model (“tm_cpu” and “db_cpu”). It looks like these metrics won’t give us true CPU capacity used (time or core based); they mix in some time component after the CPUs are maxed out. All metrics are documented in the hog post.

Let’s see if the test results match my expectations.

Workload increase is linear from 1 to 8 hogs; we have 8 virtual processors.
Workload went up from 1169 tx/s to 1830 tx/s, a 56% increase, when the processor switched from single-threaded to SMT2.
Workload went up from 1169 tx/s to 2521 tx/s, a 2.15-fold increase, when the processor switched from single-threaded to SMT8.
We are close to (slightly better than) the calculated workload increase for POWER8.

Response time started at 6 ms with 1 hog and went up to 25 ms with 64 hogs… a 3.7-fold increase… a bit faster than I expected. The calculation was based on the expected performance gain for a typical workload. Perhaps the workload generated by hog favors POWER8, so I revised the estimated performance increase for POWER8; see the updated numbers in the lpar test sheet of power8.xlsx.

power8 throughput and response time

Looking at the data… for this type of workload I would not recommend running the DB into SMT8 territory (above 32 threads). It won’t provide much additional throughput, and response time gets significantly worse.

Our production DB is running on POWER8 with 16 virtual CPUs. It is time to apply some fuzzy logic. During batches about half of the virtual processors switched from SMT4 to SMT8 (ballpark). The average expected response time increase is 44% (=(4-2.12)/2.12/2). It matches the batch run time increase between the good day (4/18) and the slow day (4/27). You can check the post What happened to the batch? for the details.

In the next post I will compare test results between POWER7 and POWER8.

Cputoolkit on lpar

While researching CPU behavior on AIX/POWER/lpar I found Karl Arao’s article on cores vs threads, v2 vs x2. The article refers to the cputoolkit tool from https://karlarao.wordpress.com/scripts-resources/.

Soon I realized I needed to compare CPU usage from several sources. This approach allowed me to observe both time spent on CPU and core based usage. I ended up creating my own script (hog) to generate load and capture CPU related information.

I was wondering how hog compares to cputoolkit. I added the driver sql from cputoolkit, so I was expecting something similar.

--lotslios, borrowed from Karl Arao's cputoolkit, I think he got it from a script by Tanel Poder
--calibrated where clause for around 0.3 sec run time on test lpar
select /*+ cputoolkit ordered use_nl(b) use_nl(c) use_nl(d) full(a) full(b) full(c) full(d) */ count(*) into a 
from sys.obj$ a, sys.obj$ b, sys.obj$ c, sys.obj$ d 
where a.owner# = b.owner# and b.owner# = c.owner# and c.owner# = d.owner# and rownum <= 3000000;
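
The into clause implies the select runs inside PL/SQL. A minimal hypothetical harness for a single execution (it needs access to sys.obj$, e.g. connected as sysdba) could look like this:

declare
  a number;
begin
  -- one execution of the driver; cputoolkit presumably loops this in each session
  select /*+ cputoolkit ordered use_nl(b) use_nl(c) use_nl(d) full(a) full(b) full(c) full(d) */ count(*) into a
  from sys.obj$ a, sys.obj$ b, sys.obj$ c, sys.obj$ d
  where a.owner# = b.owner# and b.owner# = c.owner# and c.owner# = d.owner# and rownum <= 3000000;
end;
/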

In cputoolkit the data is gathered by sampling v$sql every few seconds; the data from the last sample is then used for each set. I was wondering if all logical IOs (LIOs) are accounted for this way.
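
The sampling approach boils down to a query along these lines (a hypothetical sketch, not the actual cputoolkit query):

select executions, buffer_gets,
       round(buffer_gets / nullif(executions, 0)) lios_exec
from   v$sql
where  sql_text like '%cputoolkit%'      -- find the hinted driver sql
and    sql_text not like '%v$sql%';      -- exclude this monitoring query itself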

In hog I am calibrating the sessions to run for a set time and capturing the number of transactions (executions of the driver sqls) and LIOs from v$mystat on exit. This approach guarantees exact counts and also gives a smoother transition between sets; I do not have to kill Oracle sessions.
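
The exit-time capture is essentially a query like this against the session’s own counters (a sketch; hog also grabs other statistics):

select n.name, s.value
from   v$mystat s, v$statname n
where  s.statistic# = n.statistic#
and    n.name in ('session logical reads', 'execute count');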

I opted to report data normalized on time. This way I can easily compare information from ASH, instrumentation, and OS.

The hog script is one self-contained executable: it is easy to deploy, requires no changes to the db, and connects as sysdba.

I picked an lpar with 10 virtual processors, smt=4, 40 logical CPUs. During the tests there was enough available core capacity in the shared pool to fully utilize all processors.

cputoolkit

sh cputoolkit 1 1 50 $ORACLE_SID

Workload starts at 179526 LIOs/sec for the 1st session. I see an almost linear increase up until around 10 sessions (makes sense, we have 10 virtual processors). After that we see diminishing increases from additional sessions. Max workload is around 2.3 million LIOs/sec at 40 sessions. I calculated LIOs/sec by multiplying LIOS_ELAP by the number of sessions. It matches ‘Logical Reads Per Sec’ from loadprof.sql, which cputoolkit records during the test. I uploaded the logs in cputoolkit_031717.zip

cputoolkit_lpar_throughput

The 1st session can produce 179526 LIOs per second. By the time we run 40 concurrent sessions, performance for each session degrades to 57910 LIOs/sec.

At the sql level, as we run more concurrent sessions each sql runs a bit longer; ELAP_EXEC is up from 0.35 to 0.92 seconds with 40 sessions. This means fewer executions by each session; EXEC/sec/CPU is down from 2.37 to 1.09 executions per second.

cputoolkit_session_throughput

You can see all data points in cputoolkit.xlsx

For some reason LIOs per execution (the LIOS_EXEC column in cputoolkit.xlsx) is slightly decreasing, which is not the case with hog. This would mean our driver sql executes with fewer logical IOs… not likely. I suspect a measurement error from the periodic sampling of buffer_gets from v$sql. Data points from Karl also show decreasing LIOS_EXEC, resulting in a downward-sloping LIOs/Elap even when running fewer sessions than cores.

But the real issue on AIX/POWER/lpar is that cpu_time in v$sql does not represent time spent on CPU; it is core capacity used. Therefore CPU WAIT can’t be calculated by subtracting cpu_time from elapsed_time.

hog

hog lio nobind '0 50 1'

Workload starts at 179000 LIOs/sec for the 1st session. Similarly to cputoolkit, I see an almost linear increase up until around 10 sessions. After that there are diminishing returns from additional sessions. Max workload is 2.2 million LIOs/sec at 40 sessions. I calculated LIOs/sec from slio_k, which is thousands of ‘session logical reads’ from v$mystat. It matches the test results from cputoolkit.
hog_lpar_throughput

The 1st session can deliver 179000 LIOs/sec, which degrades to 67650 LIOs/sec by the time we run 40 concurrent sessions.

Workload graphs beyond 20 sessions are jerky. I think it is from the POWER processor switching between SMT levels to run more sessions; i.e., when a processor changes from SMT2 to SMT4 there will be a slight degradation in performance until all threads are utilized… check out the power7.xlsx utilization sheet to see how that can happen.

At the sql level, as we run more concurrent sessions each sql runs a bit longer: res_ms (brown line) is up from 357 msec to 932 msec with 40 sessions (very similar to the 0.35 to 0.92 seconds with cputoolkit). This means fewer sql executions by each hog: hog throughput (tx_s/hog, blue line) is down from 2.8 to 1.07 executions per second with 40 concurrent hogs running.
hog_session_throughput

For workload and sql response time, hog matches the results from cputoolkit.

With hog I collect enough information to calculate the same information reported by cputoolkit. “time” from pprof is the sum of time spent on CPU by Oracle sessions. Assuming the sqls fully utilize the sessions, it is equivalent to CPU used by the driver sqls (see the hog sheet in cputoolkit.xlsx, time -> CPUSECS). ELAP_EXEC (brown) is the duration of the test divided by the number of sql executions. From these I was able to calculate CPU WAIT (gray line). Waits start to show up after all 40 logical CPUs are saturated; beyond that, elapsed time starts to diverge from CPU used.

cpu_wait.jpg

Converting/substituting core-based CPU usage to time spent on CPU allowed me to calculate CPU WAIT on AIX/POWER/lpar. I obtained time spent on CPU from pprof. I could use ASH ‘ON CPU’, but ASH would continue to grow beyond 40 as we run more hogs. On the other hand, pprof CPU usage maxes out at 40 since we have 40 logical CPUs providing at most 40 seconds of CPU time each wall-clock second.
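
A worked example of the arithmetic: with 50 hogs each demanding a full logical CPU, demand is 50 seconds of CPU per wall-clock second, while time on CPU tops out at 40; the difference, roughly 10 seconds per second, shows up as CPU WAIT.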

I started with testing cputoolkit on AIX/POWER/lpar. I noted problems with CPU WAIT because cpu_time is on a different scale… it is not time on AIX/POWER. Then I tested with hog, which I updated to use the same sql to drive the load. It produced similar results for lpar throughput and sql response time. Hog collects both core capacity used and time spent on CPU by Oracle sessions. With that information I was able to calculate CPU WAIT. This post is another example of why it is important to understand the difference between the two ways CPU is interpreted/reported on AIX/POWER/lpar.

Soft thoughts

Learning

There is always the next level.

I am still learning. What I wrote here mostly exists in bits & pieces on the Internet. Not much new… but it was new for me. During my research I focused on CPU utilization on AIX/POWER/lpar. I felt the CPU used by the DB somehow was not accounted for in AWR reports, nor reflected in CPU utilization at the OS level.

I built on findings by Marcin Przepiorowski in his article Oracle on AIX – where’s my cpu time?. I also benefited from Jeremy Schneider’s excellent summary Understanding CPU on AIX Power SMT Systems. Another great source was IBM AIX POWER7 CPU Usage and Throughput by ksun. Applying the information allowed me to explain the low CPU utilization of the machine/lpar and the low CPU usage numbers reported by the DB. It also led me to the root cause of the slow DB performance.

Why Write?

Writing my experience down forced me to think it over. I reached a whole new level of understanding. I hope you will learn something by reading this. I also hope someone out there will point out mistakes in my posts; that will give me another push to dig deeper.

Rule #1

The DB is not operating in thin air. It executes queries on behalf of the applications. It responds to those requests by using resources from the OS/HW. Neither side is controlled by the DBA. Yet we are frequently called upon to fix slowness in the DB.

Rule #1: the DBA is guilty. The application is slow because of the DB unless proven otherwise. Is the DB using too much resource on the OS? The DBA is guilty again. But how can a DBA know both the app and the OS/HW? In today’s specialized world you can’t cover it all. So the people supporting the app and the OS are your friends. They can help you find unusual patterns of activity affecting the DB. Do not expect them to find you; you need to reach out… you are lucky if they are approachable.

Management may not recognize this; they think pushing the DBA harder is the best way to solve the problem. This can quickly become a distraction: you may be asked to participate in fire calls and report progress every few hours. That time would be better spent on thinking, or consulting with the few people relevant to the problem. All hands on deck can be counterproductive. Ultimately the client is telling you what to do: you may let them know it is not helping, but I see no point sweating this too much.

Engaging too many “experts” and feeding them information is another sure way to waste time. Until we know what hit us any help is welcome; we just need to balance it somehow.

Our job

As DBAs we are responsible for keeping the DB humming. If something unusual is going on you need to recognize it. Do yourself a favor: know your DB. Know the numbers in your load profile. Know the type of workload the application is requesting. Understand the limitations of the HW the DB is running on. Basic stuff, like 1000 sql executions/sec, 10 hard parses/sec, 2 msec average db file sequential reads (which can’t possibly come from physical disk), max 10000 IOPS for the SAN, max 11 seconds of CPU core capacity on the lpar for the main DB (which is on a different scale than the 128 CPU seconds the OS thinks the lpar has)…

Dead-ends

So far I did not post about dead-ends/side tracks. Those are a necessary evil. There is no way to know they are dead-ends until you pursue them. You may conclude faster that they are irrelevant (victims instead of root causes) by learning from others.

Some of the pitfalls we had:
– using only AWR reports to explain slowness… not everything is in there
– trying to shoehorn the problem into one of the seemingly possible scenarios based on data in AWR reports
– picking the wrong day (with less load) as baseline to “prove” slowness was from increased workload
– picking the wrong day (with a sudden jump in the number of soft parses) as baseline to “prove” slowness was from increased soft parsing
– in the heat of the battle, making changes on “expert” advice without verifying or understanding
– falling into the disk IO/speed trap on a machine with CPU starvation
– allowing yourself to be distracted by constant/useless conference calls
– not following through with application owners: explain why the DB can’t cope, work together on a solution

At the end

Our week-long DB slowness ended before we knew what hit us. At that time a lot of attention was devoted to the batches. We thought they were the reason daytime activity was slower. The vendor stopped churning through the whole data set and started to process only the subset of accounts that matter. The fastest sql is the one not running 🙂

Little did we know that the batches were the victims of hard parsing by other sessions. At around the same time they modified the batch, the sqls with literals also stopped coming. So we were back to normal. A couple of months later I tracked down the application responsible. By that time management had moved on and was not interested in pursuing it further.

The same application still puts out bursts of similar sqls a few times a year. During my research I created a tool (lprof) to catch hard parsing as it happens. A couple of months ago the tool helped me identify the issue on the spot. It was “just” a 6-hour burst during business hours; it did not interfere with overnight batches, and users did not really notice it (there were no complaints). I notified management and they acknowledged it. I am out of the environment by now, but this blog may help DBAs the next time it happens.

Thanks for reading.

lprof

How to catch the problem next time?

We had a slow DB for a whole week. We started with AWR reports. I like to look at the load profile first; for our hard parsing problem it did not really help. The Time Model section had more information (ballooning parse time elapsed), but not enough to catch the increase in ‘parse time cpu’. In awrdiff_0418_0427_dbsidxxx.html the data for ‘parse time cpu’ is listed somewhere down among hundreds of other metrics.

awrdiff_parse_time_cpu

Wow, we had a 1184% increase in ‘parse time cpu’. But that information is buried in an AWR report generated long after the fact.

Wouldn’t it be nice to see the relevant/main metrics in real time?

In my load test script (hog) I combine information from several sources. I wanted to see some of the AIX specific output from lparstat (app, physc) together with CPU related info from ASH (sampled) and sysstat (instrumented)… it would be nice to have similar information every few seconds.

I also wanted something generic enough to catch issues with a wide range of resources… CPU, disk or network.

I did not want it to spill too much information on my screen, so I made it easy to pick and choose metrics.

I wanted to see results every few seconds, allowing me to observe changes in resource usage in real time.

I wanted it lightweight, working in a simple terminal (no GUI).

I wanted it to run without modification on both AIX and Linux.

Script lprof was born.

Linux or AIX> lprof

Oracle load profile sampler
Usage
        lprof <sleep seconds>
Output (any/all of these)
        time    date +%H%M%S
        app     lparstat available physical processors in the shared pool or vmstat r (when lparstat is not avail)
        physc   lparstat number of physical processors consumed or vmstat pc (AIX) or vmstat id (on linux)
        busy    lparstat lbusy (percentage of logical processor utilization) or vmstat busy (us+sy)
        cpu_os% v$osstat, round((&BUSY_TIME_2-&BUSY_TIME_1)/100/&SLEEP,2)/&CPU_COUNT*100
        cpu_os  v$osstat, (&BUSY_TIME_2-&BUSY_TIME_1)/100/&SLEEP
        db_cpu  v$sys_time_model, stat_name = 'DB CPU'/&SLEEP
        cpu_bgr v$sys_time_model, stat_name = 'background cpu time'/&SLEEP
        lio_k   v$sysstat where name='session logical reads' [1000]
        pio     v$sysstat where name='physical reads'
        rmbps   v$sysstat where name='physical read total bytes' [MB]
        wmbps   v$sysstat where name='physical write total bytes' [MB]
        riops   v$sysstat where name='physical read total IO requests'
        wiops   v$sysstat where name='physical write total IO requests'
        seqr_ms v$system_event, 'db file sequential read'
        execs   v$sysstat where name='execute count'
        prs_tot v$sysstat where name='parse count (total)'
        prs_h   v$sysstat where name='parse count (hard)'
        prs_ela v$sysstat where name='parse time elapsed'
        cpu_prs v$sysstat where name='parse time cpu'
        bstcl_k v$sysstat where name='bytes sent via SQL*Net to client'                
        brfcl_k v$sysstat where name='bytes received via SQL*Net from client'  
        rtrcl   v$sysstat where name='SQL*Net roundtrips to/from client'               
        bstdbl  v$sysstat where name='bytes sent via SQL*Net to dblink'                
        brfdbl  v$sysstat where name='bytes received via SQL*Net from dblink'  
        rtrdbl  v$sysstat where name='SQL*Net roundtrips to/from dblink'               
        load    v$osstat, LOAD average for last minute
        aas     v$active_session_history sessions with session_state = 'ON CPU' in last sample
        ash_cpu v$active_session_history with session_state = 'ON CPU'
        ash_prs v$active_session_history with session_state = 'ON CPU' and IN_PARSE
        ash_prh v$active_session_history with session_state = 'ON CPU' and IN_HARD_PARSE
        ash_lit v$active_session_history count sql-s with most literals in last sample, not normalized for time
        ash_tm  v$active_session_history latest sample_time
        aas1    v$active_session_history aas for most active event
        ash_1st v$active_session_history most active event
        aas2    v$active_session_history aas for 2nd most active event
        ash_2nd v$active_session_history 2nd most active event
        sleep   actual time between measurements, can increase on busy environment
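
For example, the cpu_os column boils down to two snapshots of v$osstat (BUSY_TIME is in centiseconds); a simplified sketch of the idea:

select value busy_time from v$osstat where stat_name = 'BUSY_TIME';
-- sample twice, &SLEEP seconds apart, then:
-- cpu_os = (busy_time_2 - busy_time_1) / 100 / &SLEEP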

It is my way to have a first look at a DB; it helps me feel out what is going on in the environment.

The script was developed and tested on AIX and Linux with Oracle 11gR2; you may need to make minor adjustments for other OS/DB versions.

uname -vM
7 IBM,8286-42A
 
uname -a
Linux ol6-112-rac1.localdomain 2.6.39-200.24.1.el6uek.x86_64 #1 SMP Sat Jun 23 02:39:07 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production

Use case #1

On Linux I used hog to generate parse load and watched the changes in the number of parses per second and the CPU used for parsing.

Use case #2

On AIX I used the same hog script to generate parse load and watched the changes in the number of parses per second and the CPU used for parsing. I also watched how ASH shows the sessions using 100% CPU, and observed the AIX-specific core-based CPU utilization reported by instrumentation (cpu_db, cpu_prs).

Use case #3

We had a similar episode. For several hours the DB maxed out the CPU on the lpar. It was from hard parsing. I was able to catch it in real time using lprof.

AIX host> lprof 5

Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
Waits between measurements [sec] 5
System configuration: type=Shared mode=Uncapped smt=8 lcpu=128 mem=131072MB psize=12 ent=10.00 
Online Virtual CPUs                        : 16
We are using lparstat on AIX

time    app     physc   aas     aas1    ash_1st aas2    ash_2nd execs   prs_tot prs_h   lio_k   cpu_db  cpu_prs rmbps   wmbps   riops   wiops   seqr_ms
160428  0.00    11.60   110.60  35.39   cursor: 26.54   CPU     3140    908     11      511     8.5     4.64    498.5   11.7    2380    639     17.69   
160452  0.00    11.65   78.80   22.06   cursor: 18.91   CPU     3469    823     11      516     8.9     5.32    408.0   12.4    3166    609     8.17    
160515  0.00    11.45   190.00  55.10   library 45.60   CPU     2959    833     9       508     8.1     5.22    508.3   11.8    2361    624     15.06   
160539  0.00    11.62   151.20  51.41   cursor: 37.80   CPU     3070    810     14      514     8.6     4.99    466.4   10.8    2354    584     12.50   
160601  0.00    11.26   38.00   14.82   cursor: 9.12    CPU     2931    817     10      564     8.3     4.65    490.2   11.3    2394    611     15.00   
160626  0.00    11.45   80.60   33.05   cursor: 18.54   CPU     2724    762     9       530     8.6     4.77    492.7   11.2    2396    520     13.79   
160651  0.00    11.39   40.40   15.35   cursor: 9.70    CPU     2462    766     9       531     8.7     4.29    487.0   8.3     2109    406     20.06   
160717  0.00    11.67   82.60   23.95   cursor: 20.65   CPU     2743    827     13      575     8.4     4.60    512.0   9.3     2133    469     19.51   
160742  0.00    11.71   80.20   28.87   cursor: 20.85   CPU     3235    803     12      614     8.7     3.78    485.6   10.1    2052    455     21.03   
160808  0.24    11.33   121.40  38.85   cursor: 27.92   CPU     3051    809     8       491     8.7     4.23    439.5   10.8    2986    585     10.64  

# next day...normal...

time    app     physc   aas     aas1    ash_1st aas2    ash_2nd execs   prs_tot prs_h   lio_k   cpu_db  cpu_prs rmbps   wmbps   riops   wiops   seqr_ms
165733  2.93    8.84    17.80   6.41    CPU     4.81    SQL*Net 2005    612     14      655     4.5     .07     814.4   3.2     5214    255     2.23    
165743  2.29    9.46    17.40   6.61    CPU     4.87    SQL*Net 2240    617     11      873     4.6     .04     673.5   3.5     6298    264     1.76    
165754  1.85    9.96    22.60   7.68    CPU     6.33    SQL*Net 2356    709     15      789     4.4     .07     773.5   3.2     4439    207     1.28

The lpar used 11+ cores (physc); there was no core resource left in the shared pool (app=0). From the 11 cores around 8 were used by the db (cpu_db). Around half of that was used for parsing (cpu_prs)… bad. Also, look at the high ‘db file sequential read’ times (seqr_ms), a side effect of CPU starvation.

File lprof.txt has the script with sample test runs; enjoy.

References:
Load profile
SQL Dashboard V2.1 (beta)

Root cause part 2

In the previous post I narrowed down the root cause to hard parsing. In this post we will track down the hard parsing sqls.

AWR reports (and DBA_HIST_SQLSTAT) may not have such information: each sql may execute only once (or a few times), so they are not big enough to make it into AWR.

But the ASH report’s “Top SQL using literals” section showed two FORCE_MATCHING_SIGNATUREs (FMS) with many sql versions. The DB was executing the same sqls, differing only in the literals passed in.

We can also find those FMSs using custom scripts; see how I do that in rootcause.txt. There are 2 FMSs with distinct sql counts of 90+ thousand on 4/27. They started to show up around 4/23, the same time parse CPU shot up.

We can also find parse related information in ASH. We can even calculate time spent on logical CPUs for parsing. ASH also has the session count, the account used, the application connecting, the machine initiating the connections, etc… a gold mine of information. With such detail we can track down the source the parsing sqls were executed from.
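
A minimal sketch of that kind of lookup against the AWR copy of ASH (the actual query I used is “–parsing_sessions” in rootcause.txt):

select user_id, machine, program, count(distinct session_id) sessions
from   dba_hist_active_sess_history
where  force_matching_signature in (17710121666287808037, 6747409576012680712)
group  by user_id, machine, program
order  by sessions desc;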

Let’s look at how parse CPU from the 2 FMSs compares with CPU used for all parses and CPU used by all the sqls in sqlstat. Please see the query “–parse_cpu_fms” in rootcause.txt.

Data points are also listed in the “rootcause” sheet of mining_awr.xlsx, or you can zoom in: parse_cpu_fms.jpg

CPU_HARD_PARSE_SEC	ASH session_state='ON CPU' and IN_HARD_PARSE='Y' 
CPU_SEC_FMS		ASH session_state='ON CPU' for the 2 FMSs
CPU_HARD_PARSE_SEC_FMS	ASH session_state='ON CPU' and IN_HARD_PARSE='Y' for the 2 FMSs

Most of the DB hard parse CPU (CPU_HARD_PARSE_SEC) came from hard parsing the sqls with the 2 FMSs (CPU_HARD_PARSE_SEC_FMS). The sqls with literals used most of their CPU for parsing, since the CPU used during the FMSs’ execution (CPU_SEC_FMS) is the same as CPU_HARD_PARSE_SEC_FMS.

For some reason 4/29 was different. That day parse related CPU usage disappeared; instead CPU was used for executing the sqls with the FMSs (CPU_SEC_FMS, yellow line). Any guesses what happened that day? I may write about this in the future.

According to ASH we had 30 sessions executing the 2 FMSs. The sessions were connected from 4 machines; please see the query “–parsing_sessions” in rootcause.txt. In my previous post Are we maxed on CPU? we saw OS load increase by 30 on the slow day; these are the sessions responsible for the load increase.

During the execution of the sqls related to the 2 FMSs the sessions were constantly hard parsing. That is a CPU intensive task; see the query “–session_state by FORCE_MATCHING_SIGNATURE”.

FORCE_MATCHING_SIGNATURE SESSION EVENT                          I I TOTAL_TIME        CPU    WAITING
------------------------ ------- ------------------------------ - - ---------- ---------- ----------
    10458055140908993565 ON CPU                                 N N      59000      59000          0 batch related
    17710121666287808037 ON CPU                                 Y Y      47000      47000          0 hard parsing FMS
     6747409576012680712 ON CPU                                 Y Y      44270      44270          0 hard parsing FMS
      201958791624409026 ON CPU                                 N N      11100      11100          0
    16583812627080328788 WAITING db file sequential read        N N      13210          0      13210
     1409798687556705245 WAITING read by other session          N N      52440          0      52440

Hard parsing sessions, together with the sqls in the batch, increased CPU usage up until the DB used all core capacity available to the lpar.

I am declaring hard parsing of sqls with literals as the root cause.

During my discovery I did a lot of research on (hard) parsing; it is no accident you will see parse_hard & parse_soft hog types in my load testing “hog” script. In a future post I may demonstrate how to generate (and catch) a parse heavy DB load.

At the beginning of my research I ruled out parsing:
– the number of parses (soft and hard) did not change a lot: soft parses decreased by 12%, hard parses increased by 19%
– the machine was running at around 40% CPU utilization on the slow week; I was thinking we had plenty of unused CPU capacity

I was wrong. Even without batches the load was around 60 on a machine with 128 logical CPUs/threads. The OS thinks the CPU is around 50% utilized (60/128 ≈ 47%), but 60 busy threads spread over 16 virtual processors is almost 4 threads per core, and per the power8.xlsx utilization numbers a core running SMT4 is already at ~95% of its capacity, i.e. close to 90% core utilization. CPU utilization was underreported; it was misleading.

But it gets worse: the lpar was not able to get full cores for all 16 virtual processors. Only around 14 cores were available from the shared CPU pool, so we were beyond 90% utilization of the core capacity available to the lpar.

How is it possible that such a small change in the number of hard parses maxed out the CPUs?
The extra 200 thousand hard parses on the slow day showed up as an increase from 11.4 to 13.7 hard parses per second (200,000 extra parses over 86,400 seconds is about 2.3/sec). Does not look like much, right?
It turns out it was not the number of hard parses but the type of sql we parsed that was the problem. I won’t post the sql to protect the innocent. Each one was a complex, 20K-character sql; the DB processed close to 200 thousand of them per day during our slow week.

During our slow week, on top of normal DB activity and the overnight batches, an application started to bomb the DB with those complex sqls with literals. The 30 hard parsing sessions increased the load on the machine. The POWER8 processor turned on more threads to accommodate the load. Oracle sessions (and the sqls in those sessions) had slower response times because the POWER8 processor was running at higher SMT levels.

The 2nd biggest DB Time increase (after ‘hard parse elapsed’) came from waits on ‘db file sequential read’. A machine with CPU starvation serves disk requests more slowly, especially when data is coming from cache/memory. See the case study CPU starvation disguised as an I/O issue (yet another AWR case study) by Savvinov for a similar scenario.

During my analysis I got distracted a bit by the intricacies of AIX/POWER. I wasted some time on dead ends. I also learned a lot along the way. I hope you will benefit and your next root cause analysis can conclude in days instead of months.

An abbreviated version of my experience is available in PowerPoint format: Oracle on aix v4.pptx.

Root cause part 1

In “What happened to the batch?” I noted the increased load from the DB. Where was the extra load coming from?

Most metrics are the same or lower in the DB load profile for the slow day. In the AWR diff report (awrdiff_0418_0427_dbsidxxx.html) parsing is down by 12% but hard parsing is up by 19%, from 11.4/sec to 13.7/sec. The diff report does not compare the “Instance Efficiency Percentages” section; for that we have to go back to the AWR reports. Between awr_0418_dbsidxxx.html and awr_0427_dbsidxxx.html “% Non-Parse CPU” changed from 92% to 46%… more than half of the CPU was used for parsing on 4/27.

Let’s visualize the CPU used for parsing and by the whole db. Sheet “whole db” in mining_awr.xlsx has both. I also included CPU used by all sqls found in sqlstat… which is not all the sqls executed by the DB… please remind me if you know the article explaining what goes into AWR with the snapshots.

parse_cpu

DBCPU_sec	DBA_HIST_SYS_TIME_MODEL, 'DB CPU'
PARSE_CPU_S	DBA_HIST_SYSSTAT, 'parse time cpu'
SQL_CPU_SEC	DBA_HIST_SQLSTAT, sum(CPU_TIME_DELTA)
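
The per-snapshot deltas behind these series follow the usual AWR pattern; a simplified sketch for PARSE_CPU_S (single instance, dbid filter omitted; ‘parse time cpu’ is in centiseconds):

select sn.end_interval_time,
       (s.value - lag(s.value) over (order by s.snap_id)) / 100 parse_cpu_s
from   dba_hist_sysstat s, dba_hist_snapshot sn
where  s.snap_id = sn.snap_id
and    s.instance_number = sn.instance_number
and    s.stat_name = 'parse time cpu'
order  by s.snap_id;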

Zoom in to see it better: parse_cpu.jpg

Around 4/23 parse CPU (PARSE_CPU_S) shot up and became a big part of the overall CPU use by the whole DB (DBCPU_sec). It calmed down for the weekend just to go back up for the whole week of 4/25. Also, it looks like the combined CPU usage from DBA_HIST_SQLSTAT by all sqls (SQL_CPU_SEC) and parse CPU from DBA_HIST_SYSSTAT (PARSE_CPU_S) can’t push overall CPU usage (DBCPU_sec) higher than 10000 seconds for any 15 min snapshot… core-based DB CPU usage is maxed out at around 11 seconds per second (=10000/15/60).

The biggest DB Time gain on the slow day (4/27) came from parse elapsed. See the data points in mining_awr.xlsx, sheet “whole db”. In this picture I am showing the correlation between ‘parse time elapsed’ and ‘parse time cpu’.

parse_elapsed.jpeg

We have around 25000 seconds of parse related elapsed time per 15 minute snap… that translates to 2.4 million seconds of added DB time from parsing (=25000*4*24) for the day (the length of my AWR reports). It matches the ‘hard parse time elapsed’ increase between awr_0418_dbsidxxx.html and awr_0427_dbsidxxx.html in the Time Model Statistics section.

Check out Variable Load Impact On Oracle Database Hard Parsing by Craig Shallahamer; it explains a scenario with ballooning parse elapsed time.

Next I will investigate where those hard parsing sqls came from.

What happened to the batch?

During our slow week I was primarily concerned about the batch. It was reaching into the day and it was several days behind. We thought it might be the reason for the slowness at business hours.

The main sql in the batch did not change: we had the same execution plan and a similar data set for 4/18 and 4/27. Yet on the morning of 4/27 it finished in 255 minutes instead of 180 minutes, a 40% increase.

Please see the queries I used in this post in Whathappennedtothebatch.txt.

I checked the execution statistics in DBA_HIST_SQLSTAT for the dominant sql in the batch. It had more and slower physical reads, hence the increase in IOWAIT_SEC. But CPU used (CPU_SEC) remained about the same. CPU was much smaller relative to elapsed time.

Further evidence of a disk problem is the increase in synchronous read times. I list ‘db file sequential read’ from DBA_HIST_SYSTEM_EVENT in mining_awr.xlsx, sheet “whole db”, column SEQR_ms. It is around 2.5 msec on the 18th; it fluctuates between 4 and 14 msec on the morning of the 27th during batch hours.
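
SEQR_ms comes from snapshot deltas of the event’s waits and wait time; a simplified sketch (single instance, dbid filter omitted):

select sn.end_interval_time,
       round( (e.time_waited_micro - lag(e.time_waited_micro) over (order by e.snap_id)) / 1000
            / nullif(e.total_waits - lag(e.total_waits) over (order by e.snap_id), 0), 2) seqr_ms
from   dba_hist_system_event e, dba_hist_snapshot sn
where  e.snap_id = sn.snap_id
and    e.instance_number = sn.instance_number
and    e.event_name = 'db file sequential read'
order  by e.snap_id;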

At this point I was thinking batches were slower from more physical reads and also from slower disks.

But analyzing ASH data showed a different picture. The sql was running on CPU much longer (85% of the time) than waiting on disk. This is not obvious from looking at the CPU used (CPU_SEC) in sqlstat. No surprise there; we should know better by now: the CPU numbers in sqlstat are core based (on a different/smaller scale).

Per ASH data the dominant sql was all about CPU!
This realization did not come easy. I researched disk IO, checked the volume of reads/writes, IOPS, single/multi block reads, etc.
I used Karl Arao’s scripts from https://karlarao.wordpress.com/scripts-resources/ to decide on the day I would use as baseline. On 4/18 we had normal DB performance with similar or bigger disk IO; this helped me avoid the “more disk IO made us slower” trap.

Load profile from awrdiff_0418_0427_dbsidxxx.html

diff_load_profile

We do have 29% more “Read IO requests” per second (IOPS), but that is well within the normal range for the SAN; 5000 read requests/sec is OK for this environment. On our slow day (4/27), with a similar disk IO profile, ‘db file sequential read’ times increased from 2.75 ms to 6.11 ms. This allowed me to eliminate more IO load as the reason for disk and DB slowness; that was a major side track at the beginning of my analysis. What happened with the disks outside of the DB (on the lpar, on the shared SAN) is not known, so the possibility they caused problems is there; I just do not have data on it.

The sql in the batch is mostly running on CPU; I have a better candidate to explain the slowness.

 --4/18
 SESSION EVENT                                                            COUNT(*)*10          %
 ------- ---------------------------------------------------------------- ----------- ----------
 WAITING latch free                                                             29500         17
 ON CPU                                                                        143000         83

 --4/27
 SESSION EVENT                                                            COUNT(*)*10          %
 ------- ---------------------------------------------------------------- ----------- ----------
 WAITING db file sequential read                                                20000         15
 ON CPU                                                                        112630         85

Let’s crosscheck the CPU data from ASH with instrumentation. The above counts are from sampled ASH data for the sql in the batch running on logical CPUs for an hour. Let’s convert it to core based usage (more fuzzy math) so we can compare with the CPU used by the same sql in sqlstat (CPU_SEC in Whathappennedtothebatch.txt). Assuming we are mostly using CPU on the 4 threads while running with SMT4… we have around 35000 (=143000/4) core seconds for the hour. That is around 9000 seconds of CPU usage per 15 min by the sqls in the batch. I think the batch was the main CPU user in the DB in the morning hours of 4/18.
Let’s check DB CPU usage for the whole DB from the time model: DBCPU_sec on the “whole db” sheet of mining_awr.xlsx; data points there are from 15 min snaps. We see around 10000 seconds at the time of the morning batch.
We have a fairly good match for CPU used between instrumented core based data and sampled ASH data.

During batch hours OS load jumped from 65 on 4/18 to around 90 on 4/27. Please check OS_LOAD in arewemaxedoncpu.jpg or in mining_awr.xls, sheet “arewemaxedoncpu”. The extra demand is coming from the DB; CPU_ORA_DEMAND correlates with OS load. The DB had 30 more CPU heavy sessions during our slow batch day.

According to the power8.xls “utilization” sheet, with 65 processes we are running with SMT4. Adding 30 more switches half of the cores to SMT8. Relative thread response time would increase from 2.12 to 4 if all cores switched to the higher SMT level. With half of them remaining on SMT4, on average we can expect a 0.95 (=4/2.12/2) fold increase in response time. That is about 45% (0.45=0.95/2.12) slower compared to running all Oracle batch sessions on threads with SMT4. You can grill me on this fuzzy math; I would just run tests to prove it.

The batch is CPU heavy, therefore changes in CPU behavior directly influence its speed. The higher load on the lpar forced many of the virtual processors to use SMT8 instead of SMT4. That change resulted in higher response times for processes executing on logical processors assigned to threads in SMT8 mode. That can explain the batch finishing 40% slower on 4/27.

BTW, in mining_awr.xlsx, sheet “whole db”, the data points for “app_tx/s” show sql throughput for the sql in the batch: it dropped from around 17 to 12 executions/sec… each execution took about 41% longer (17/12 ≈ 1.41), in line with the batch slowdown.

Later that week we tinkered with DB parameters. We also disabled an sql profile related to the sql in the batch. Those changes completely altered the load profile of the sql, making it harder to diagnose the real root cause. Little did we know there was something else lurking under the hood.

By now the batch is gone from the environment. I could just make something similar up and mass execute it (like the batch did); I may do it if you convince me the batch had something to do with the DB slowness. At this point it looks more like the victim, not the root cause.

Request for help:
Is there a way to see what SMT level is used by the processors on the lpar? Currently I am inferring it from OS load…

Are we maxed on CPU?

It depends how we look at it.
Yes, we were maxed out on available core capacity.
No, the lpar did not use all threads/logical CPUs all the time.

In my Mining AWR post I observed a discrepancy in CPU usage numbers between the OS and the DB, even though the DB is the main user on the machine (the machine has another instance, but this DB is the main production DB).
Mystery solved: for the OS CPU usage we were looking at time spent on threads (logical CPUs) by all processes running on the machine; for the DB the numbers reflected core capacity used by the DB.

Core capacity used

According to my Where is my CPU post, on the slow day (4/27) the lpar was using around 14 cores the whole day. The lpar is uncapped; the shared pool has 16 physical cores, obviously not all available for this machine.
Also, out of the 14 the DB is using around 11 cores.

Time spent on threads

OS CPU usage from V$OSSTAT, or host CPU in SYSMETRIC, is thread based. So is sampled ASH data from the DB.

The OS numbers include all processes running on the lpar, but I suspect the DB must be the biggest CPU user. ASH may also include CPU WAIT when the machine is short on CPU resources… it can help us track CPU shortage at the thread level.

Let’s look at logical CPU usage for both the OS and the DB.

CPU_ORA_DEMAND		ASH session_state='ON CPU'
CPU_ORA_CONSUMED	dba_hist_sysmetric_summary metric_name='CPU Usage Per Sec'
CPU_OS   		dba_hist_sysmetric_summary metric_name='Host CPU Usage Per Sec'
OS_LOAD   		dba_hist_sysmetric_summary metric_name='Current OS Load'
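
CPU_ORA_DEMAND is just a count of ON CPU samples normalized for time; a simplified sketch against the AWR copy of ASH (10-second samples and 900-second snaps assumed):

select snap_id, round(count(*) * 10 / 900, 1) cpu_ora_demand
from   dba_hist_active_sess_history
where  session_state = 'ON CPU'
group  by snap_id
order  by snap_id;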

arewemaxedoncpu

You may need to zoom in to see the details in arewemaxedoncpu.jpg or in mining_awr.xls sheet “arewemaxedoncpu”.
Here is the query: arewemaxedoncpu.txt

We can see that the DB is responsible for most of the CPU demand on the machine. As demand goes up, so do OS_LOAD and CPU_OS. Outside of batches demand is slightly lower than OS_CPU; the lpar is holding up. Demand is higher than the available CPU capacity during batches; lpar CPU is maxed out then (from batches and other activity combined).
OS_CPU capacity used during batches peaks at around 92… assuming SMT8 and logical CPUs maxed out by the DB, that translates to 11.5 (=92/8) cores used… a good match with the core based CPU_ORA_CONSUMED.

The incident was almost a year ago. We still have AWR and grid data, but no access to nmon/OS info (I am a DBA w/limited OS support). I can account for around 11 cores used by the DB. With the lpar using around 14 cores I am guessing the difference was used by processes outside the DB. Any idea if there is a way to track down what was using the remaining 3 cores (from the sources I have access to… AWR & grid)?

Lpar test

In the previous post I introduced “hog”; let’s put it to work. I tested an lpar with 10 virtual processors, POWER7, smt=4, 40 logical CPUs.

hog plsql_loop nobind '0 50 1'

lpar_test_oem_top_activity

Please see data collected in power7.xlsx.

During my test there was always some spare core capacity available in the shared pool; “app” never reached 0 while “physc” went up.

Core capacity used by the lpar maxed out at around 10; it has 10 virtual processors backed by sufficient core capacity from the shared pool.

DB and session CPU consumption (“db_cpu”/“sess_cpu”) matches the core capacity used (“physc”) by the machine: the database is responsible for the load on the machine.

“sess_cpu” matches “purr”, the core capacity used by all hog processes as captured by ‘pprof’; therefore Oracle session CPU usage is expressed in core capacity.

At the same time, hog process CPU consumption expressed as time spent on CPU (the “time” column) matches the number of hogs. Each hog is designed to max out (run constantly on) a logical CPU/thread.

“os_cpu” matches “time” from ‘pprof’, so we know OS CPU utilization is time based. The OS is busy running hogs.

lpar_test_oem_performance

On OEM’s performance page database utilization is all green: the load is from CPU.

Core based DB CPU utilization is the dark green area; it maxes out at around 10, the number of virtual processors assigned to the lpar.

The difference between time based and core based CPU utilization is the ‘AIX CPU gap’. It is the light green part up to the 40 line. With 40 logical CPUs we have at most 40 CPU seconds per second; the “time” column maxes out at 40 even when more hogs are running.

Within the database, logical CPU usage is included in the ASH data (“ash_cpu”). “ash_cpu” measures demand for CPU; it may go beyond logical CPU consumed when demand is higher than the available logical CPUs. Therefore the light green part above the 40 line represents real ‘CPU WAIT’ on this lpar.

We have a good match between theoretical and test data; please compare the “utilization” and “lpar test” sheets in power7.xlsx. Therefore it should not come as a surprise that the AIX CPU and utilization gap graphs look similar to the calculated graphs in the “utilization” sheet.

The main difference is that the test lpar had 10 virtual processors, so core based consumption levels off at 10 (vs 8 virtual processors in my theoretical lpar). Also, I spawned more than 40 hogs to demonstrate the “CPU WAIT” on AIX.

lpar_test_cpu_gap

lpar_test_utiization_gap

Why did IBM move away from the well known/accepted time based CPU utilization?
To have a more predictable relationship between throughput and CPU capacity used.
According to Understanding CPU utilization on AIX: “The general intent is to provide a measure of CPU utilization wherein there is a linear relationship between the current throughput (e.g., transactions per second) and the CPU utilization being measured for that level of throughput.”
For example, we got 329 tx/s at 20% core utilization, 661 tx/s at around 40%, and around 1000 tx/s at 60%… the data shows a pretty good linear relationship between throughput and core based utilization.

Another aspect of the POWER processor… as more threads become active it can produce more work (throughput goes up), but each thread (the oracle session running on that thread, and the sql in that session) slows down. The 1st is good, but the 2nd can be problematic for those looking for predictable response times for each execution. See the “tx_s” and “res_ms” graphs in the “lpar test” sheet.

After this little detour into IBM POWER/AIX/SMT I will revisit my slow DB problem. A better understanding of the hardware the DB is running on can help explain some of the slowness.