What happened to the batch?

During our slow week I was primarily concerned about the batch. It was running into the business day and it was several days behind. We thought it might be the reason for the slowness during business hours.

The main sql in the batch did not change: we had the same execution plan and a similar data set for 4/18 and 4/27. Yet on the morning of 4/27 it finished in 255 minutes instead of 180 minutes, a 40% increase.

Please see the queries I used in this post in Whathappennedtothebatch.txt.

I checked execution statistics in DBA_HIST_SQLSTAT for the dominant sql in the batch. It had more and slower physical reads, hence the increase in IOWAIT_SEC. But CPU used (CPU_SEC) remained about the same, and CPU was much smaller relative to elapsed time.
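For reference, a minimal sketch of the kind of DBA_HIST_SQLSTAT query behind those numbers (not the exact one from Whathappennedtothebatch.txt; the sql_id is a placeholder and the *_DELTA times are converted from microseconds to seconds):

 select sn.begin_interval_time,
        st.executions_delta                  execs,
        round(st.elapsed_time_delta/1e6)     elapsed_sec,
        round(st.cpu_time_delta/1e6)         cpu_sec,
        round(st.iowait_delta/1e6)           iowait_sec,
        st.disk_reads_delta                  phys_reads
   from dba_hist_sqlstat st
   join dba_hist_snapshot sn
     on sn.snap_id         = st.snap_id
    and sn.dbid            = st.dbid
    and sn.instance_number = st.instance_number
  where st.sql_id = '&sql_id'
  order by sn.begin_interval_time;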

Further evidence of a disk problem is the increase in synchronous read times. I list ‘db file sequential read’ from DBA_HIST_SYSTEM_EVENT in mining_awr.xlsx, sheet “whole db”, column SEQR_ms. It is around 2.5 ms on the 18th; it fluctuates between 4 and 14 ms on the 27th morning during batch hours.
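A sketch of how a SEQR_ms style column can be derived (again, not my exact query; single instance assumed). The counters in DBA_HIST_SYSTEM_EVENT are cumulative, so we need the delta between snapshots:

 select sn.begin_interval_time,
        round( (e.time_waited_micro - lag(e.time_waited_micro) over (order by e.snap_id))
             / nullif(e.total_waits - lag(e.total_waits) over (order by e.snap_id), 0)
             / 1000, 2) seqr_ms
   from dba_hist_system_event e
   join dba_hist_snapshot sn
     on sn.snap_id         = e.snap_id
    and sn.dbid            = e.dbid
    and sn.instance_number = e.instance_number
  where e.event_name = 'db file sequential read'
  order by sn.begin_interval_time;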

At this point I was thinking the batch was slower because of more physical reads and slower disks.

But analyzing ASH data showed a different picture. The sql was running on CPU much longer (85% of the time) than waiting on disk. This is not obvious from looking at the CPU used (CPU_SEC) in sqlstat. No surprise there, we should know it better by now: the CPU numbers in sqlstat are core based (on a different/smaller scale).

Per ASH data the dominant sql was all about CPU!
This realization did not come easy. I researched disk IO, checked on the volume of read/writes, IOPS, single/multi block reads, etc.
I used Karl Arao’s scripts from https://karlarao.wordpress.com/scripts-resources/ to decide which day to use as a baseline. On 4/18 we had normal DB performance with similar or bigger disk IO; this helped me avoid the “more disk IO made us slower” trap.

Load profile from awrdiff_0418_0427_dbsidxxx.html

diff_load_profile

We do have 29% more “Read IO requests” per second (IOPS), but that is well within the normal range for the SAN; 5000 read requests/sec is OK for this environment. On our slow day (4/27), with a similar disk IO profile, ‘db file sequential read’ times increased from 2.75ms to 6.11ms. This allowed me to eliminate increased IO load as the reason for disk and DB slowness; that was a major side track at the beginning of my analysis. What happened to the disks outside of the DB (on the lpar, on the shared SAN) is not known, so the possibility that they caused problems remains; I just do not have data on it.

The sql in the batch is mostly running on CPU, so I have a better candidate to explain the slowness.

 --4/18
 SESSION EVENT                                                            COUNT(*)*10          %
 ------- ---------------------------------------------------------------- ----------- ----------
 WAITING latch free                                                             29500         17
 ON CPU                                                                        143000         83

 --4/27
 SESSION EVENT                                                            COUNT(*)*10          %
 ------- ---------------------------------------------------------------- ----------- ----------
 WAITING db file sequential read                                                20000         15
 ON CPU                                                                        112630         85
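A minimal sketch of how such counts can be pulled from DBA_HIST_ACTIVE_SESS_HISTORY; AWR keeps one ASH sample per 10 seconds, hence the count(*)*10 to approximate DB seconds (the sql_id and time window are placeholders):

 select session_state, event, count(*)*10 db_seconds,
        round(100 * ratio_to_report(count(*)) over ()) pct
   from dba_hist_active_sess_history
  where sql_id = '&sql_id'
    and sample_time between to_date('&start','MM/DD/YYYY HH24:MI')
                        and to_date('&end'  ,'MM/DD/YYYY HH24:MI')
  group by session_state, event
  order by db_seconds desc;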

Let’s crosscheck the CPU data from ASH with instrumentation. The above counts are from sampled ASH data for the sql in the batch running on logical CPUs for an hour. Let’s convert it to core based usage (more fuzzy math) so we can compare with CPU used by the same sql in sqlstat (CPU_SEC in Whathappennedtothebatch.txt). Assuming we are mostly using CPU on the 4 threads while running with SMT4… we have around 35000 (=143000/4) core seconds for the hour. That is around 9000 seconds of CPU usage per 15 min by the sqls in the batch. I think the batch was the main DB CPU user during the morning hours of 4/18.
Let’s check DB CPU usage for the whole DB from the time model: DBCPU_sec on the “whole db” sheet of mining_awr.xlsx; the data points there are from 15 min snaps. We see around 10000 seconds at the time of the morning batch.
We have a fairly good match for CPU used between instrumented core based data and sampled ASH data.
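For reference, a sketch of where DBCPU_sec style data points can come from (not the exact query behind mining_awr.xlsx); ‘DB CPU’ in DBA_HIST_SYS_TIME_MODEL is cumulative and in microseconds:

 select sn.begin_interval_time,
        round( (tm.value - lag(tm.value) over (order by tm.snap_id)) / 1e6 ) dbcpu_sec
   from dba_hist_sys_time_model tm
   join dba_hist_snapshot sn
     on sn.snap_id         = tm.snap_id
    and sn.dbid            = tm.dbid
    and sn.instance_number = tm.instance_number
  where tm.stat_name = 'DB CPU'
  order by sn.begin_interval_time;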

During batch hours OS load jumped from 65 on 4/18 to around 90 on 4/27. Please check OS_LOAD in arewemaxedoncpu.jpg or in mining_awr.xls, sheet “arewemaxedoncpu”. The extra demand is coming from the DB: CPU_ORA_DEMAND correlates with OS load. The DB had 30 more CPU heavy sessions during our slow batch day.

According to power8.xls, “utilization” sheet, with 65 processes we are running with SMT4. Adding 30 more switches half of the cores to SMT8. Relative thread response time increases from 2.11 to 4 when all cores switch to the higher SMT level, a 1.9x (=4/2.11) increase. With only half of the cores switching, on average we can expect roughly a 1.45x (=(1.9+1)/2) response time, that is about 45% slower compared to running all Oracle batch sessions on threads with SMT4. You can grill me on this fuzzy math; I would just run tests to prove it.

The batch is CPU heavy, therefore changes in CPU behavior directly influence its speed. The higher load on the lpar forced many of the virtual processors to use SMT8 instead of SMT4. That change resulted in higher response times for processes executing on logical processors assigned to threads in SMT8 mode. That can explain why the batch finished 40% slower on 4/27.

BTW, in mining_awr.xlsx, sheet “whole db”, the data points for “app_tx/s” show sql throughput for the sql in the batch: it dropped from around 17 to 12 executions/sec… which matches each execution taking roughly 40% longer.

Later that week we tinkered with DB parameters. We also disabled an sql profile related to the sql in the batch. Those changes completely altered the load profile of the sql, making it harder to diagnose the real root cause. Little did we know there was something else lurking under the hood.

By now the batch is gone from the environment. I could make up something similar and mass execute it (like the batch did); I may do that if you convince me the batch had something to do with the DB slowness. At this point it looks more like a victim, not the root cause.

Request for help:
Is there a way to see what SMT level is used by the processors on the lpar? Currently I am inferring from OS load…


Are we maxed on CPU?

Depends on how we look at it.
Yes, we were maxed out on available core capacity.
No, the lpar did not use all threads/logical CPUs all the time.

In my Mining AWR post I observed a discrepancy in CPU usage numbers between the OS and the DB even when the DB is the main user on the machine (the machine has another instance, but this DB is the main production DB).
Mystery solved: for the OS CPU usage we were looking at time spent on threads (logical CPUs) by all processes running on the machine; for the DB the numbers reflected core capacity used by the DB.

Core capacity used

According to my Where is my CPU post, on the slow day (4/27) the lpar was using around 14 cores the whole day. The lpar is uncapped; the shared pool has 16 physical cores, obviously not all of them available to this machine.
Also, out of the 14 the DB is using around 11 cores.

Time spent on threads

OS CPU usage from V$OSSTAT, or host CPU in SYSMETRIC, is thread based. So is sampled ASH data from the DB.

The OS numbers include all processes running on the lpar, but I suspect the DB must be the biggest CPU user. ASH may also include CPU WAIT when the machine is short on CPU resources… it can help us track CPU shortage at the thread level.

Let’s look at logical CPU usage for both the OS and the DB.

CPU_ORA_DEMAND		ASH session_state='ON CPU'
CPU_ORA_CONSUMED	dba_hist_sysmetric_summary metric_name='CPU Usage Per Sec'
CPU_OS   		dba_hist_sysmetric_summary metric_name='Host CPU Usage Per Sec'
OS_LOAD   		dba_hist_sysmetric_summary metric_name='Current OS Load'
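A sketch of how the sysmetric based columns can be pulled per interval (in the spirit of arewemaxedoncpu.txt, not the exact query). Note that ‘CPU Usage Per Sec’ and ‘Host CPU Usage Per Sec’ are reported in centiseconds per second, hence the division by 100; CPU_ORA_DEMAND comes from counting ASH samples as in the earlier section:

 select begin_time,
        max(case when metric_name = 'CPU Usage Per Sec'      then round(average/100,1) end) cpu_ora_consumed,
        max(case when metric_name = 'Host CPU Usage Per Sec' then round(average/100,1) end) cpu_os,
        max(case when metric_name = 'Current OS Load'        then round(average,1)     end) os_load
   from dba_hist_sysmetric_summary
  where metric_name in ('CPU Usage Per Sec','Host CPU Usage Per Sec','Current OS Load')
  group by begin_time
  order by begin_time;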

arewemaxedoncpu

You may need to zoom in to see the details in arewemaxedoncpu.jpg or in mining_awr.xls sheet “arewemaxedoncpu”.
Here is the query: arewemaxedoncpu.txt

We can see that the DB is responsible for most of the CPU demand on the machine. As demand goes up, so do OS_LOAD and CPU_OS. Outside of batches demand is slightly lower than OS_CPU, so the lpar is holding up. During batches demand is higher than the available CPU capacity: lpar CPU is maxed out during batches (from batches and other activity combined).
OS_CPU capacity used during batches peaks at around 92… assuming SMT8 and logical CPUs maxed out by the DB, it translates to 11.5 (=92/8) cores used… a good match with the core based CPU_ORA_CONSUMED.

The incident was almost a year ago. We still have AWR and grid data, but no access to nmon/OS info (I am a DBA with limited OS support). I can account for around 11 cores used by the DB. With the lpar using around 14 cores, I am guessing the difference was used by processes outside the DB. Any idea if there is a way to track down what was using the remaining 3 cores (from the sources I have access to… AWR & grid)?

Lpar test

In the previous post I introduced “hog”; let’s put it to work. I tested an lpar with 10 virtual processors, POWER7, smt=4, 40 logical CPUs.

hog plsql_loop nobind '0 50 1'

lpar_test_oem_top_activity

Please see data collected in power7.xlsx.

During my test there was always some spare core capacity available in the shared pool: “app” never reached 0 while “physc” went up.

Core capacity used by the lpar maxed out at around 10; it has 10 virtual processors backed by sufficient core capacity from the shared pool.

DB and session CPU consumption (“db_cpu”/“sess_cpu”) match the core capacity used by the machine (“physc”); the database is responsible for the load on the machine.

“sess_cpu” matches “purr”, the core capacity used by all hog processes as captured by ‘pprof’; therefore Oracle session CPU usage is expressed in core capacity.

At the same time, hog process CPU consumption expressed as time spent on CPU in the “time” column matches the number of hogs. Each hog is designed to max out (constantly run on) its logical CPU/thread.

“os_cpu” matches “time” from ‘pprof’, so we know OS CPU utilization is time based. The OS is busy running the hogs.

lpar_test_oem_performance

On OEM’s performance page database utilization is all green: the load is from CPU.

Core based DB CPU utilization is the dark green area; it is maxed out at around 10, the number of virtual processors assigned to the lpar.

The difference between time based and core based CPU utilization is the ‘AIX CPU gap’. It is the light green part up until the 40 line. With 40 logical CPUs we have at most 40 CPU seconds per wall clock second, so the “time” column maxes out at 40 even when more hogs are running.

Within the database, logical CPU usage is included in the ASH data (“ash_cpu”). “ash_cpu” measures demand for CPU; it may go beyond logical CPU consumed when demand is higher than the available logical CPUs. Therefore the light green part above the 40 line represents real ‘CPU WAIT’ on this lpar.

We have a good match between theoretical and test data; please compare the “utilization” and “lpar test” sheets in power7.xls. Therefore it should not come as a surprise that the AIX CPU and utilization gap graphs look similar to the calculated graphs in the “utilization” sheet.

The main difference is that the test lpar had 10 virtual processors, so core based consumption levels off at 10 (vs 8 virtual processors on my theoretical lpar). Also, I spawned more than 40 hogs to demonstrate the “CPU Wait” on AIX.

lpar_test_cpu_gap

lpar_test_utiization_gap

Why did IBM move away from the well known/accepted time based CPU utilization?
To have a more predictable relationship between throughput and CPU capacity used.
According to Understanding CPU utilization on AIX “The general intent is to provide a measure of CPU utilization wherein there is a linear relationship between the current throughput (e.g., transactions per second) and the CPU utilization being measured for that level of throughput.”
For example, we got 329 tx/s at 20% core utilization, 661 tx/s at around 40%, and around 1000 tx/s at 60%… the data shows a pretty good linear relationship between throughput and core based utilization.

Another aspect of the POWER processor: as more threads are active it can produce more work (throughput goes up), but each thread (the oracle session running on that thread, and the sql in that session) slows down. The first is good, but the second can be problematic for those looking for predictable response times for each execution. See the “tx_s” and “res_ms” graphs in the “lpar test” sheet.

After this little detour into IBM POWER/AIX/SMT I will revisit my slow DB problem. A better understanding of the hardware the DB is running on can help explain some of the slowness.

hog

In the previous post I calculated an lpar’s core capacity used and the CPU time accumulated on logical CPUs. My calculation was based on the estimated throughput increase of the POWER processor. I was wondering how close my results were to reality.

By this time I was tired of forking processes, manually collecting and summarizing results. I had to automate this: “hog” was born, you can download it here… hog.txt.

“hog” is a korn shell script designed to run either on Linux or on AIX; it should work on both without modification. It can execute several resource intensive processes (hogs) in a controlled way. Different types of hogs are available to generate load on resources like CPU or disk. It can bind hogs to specific CPUs or let the OS scheduler decide how to spread the load. It collects and reports CPU related metrics (and more). It is one self contained script for simple deployment.

You will get a readme when you execute it with no parameters:

oracle@ol6-112-rac1.localdomain:/home/oracle/tmp/opload
[RAC1]> hog

Oracle utilization test on Linux or AIX/POWER/lpar

USAGE
        hog [hog type] [bind 'cpu list'|nobind 'start end incr'] [test]

hog types:
        bc              hog is using bc to put load on CPU until we stop it
        bc2             bc loop to put load on CPU until set time, collect vmstat data
        bc3             bc loop to put load on CPU until set time, collect vmstat & pprof data (if pprof avail)
        plsql_loop      pl/sql loop running on the db to put load on CPU until set time, collect vmstat & pprof data
        lio             execute sql to generate logical IOs on the db to put load on CPU until set time, collect vmstat & pprof data
        disk_seqr       hogs with indexed reads to generate 'db file sequential read'
        disk_dirpr      hogs generating event 'direct path read'
        parse_hard      loop with hard parsing sql for set time, collect vmstat & pprof data
        parse_soft      loop with soft parsing sql for set time, collect vmstat & pprof data

        bind    bind hogs to specific processors, silently ignored when can't bind
        nobind  start specified number of hogs, let OS pick processors to run them on

        test    dry run

EXAMPLES
        dry run
                hog bc bind '31' test
        start 1 hog, bind it to cpu 31
                hog bc bind '31'
        start 2 hogs, bind them to cpu 30 and 31
                hog plsql_loop bind '30 31'
        start 1 hog, increment by 2 until we reach 6, report in between, let OS assign CPU-s
                hog bc nobind '1 6 2'

OUTPUT... only show those relevant to hog type picked
        hogs    number of hogs started
        app     lprstat, available physical processors in the shared pool (or vmstat's r)
        physc   lprstat, number of physical processors consumed (or vmstat's pc or id)
        lbusy%  lprstat, percentage of logical processor utilization (or vmstat's busy=us+sy)
        os_cpu% v$osstat, round((&BUSY_TIME_2-&BUSY_TIME_1)/100/&SLEEP,2)/&CPU_COUNT*100
        os_cpu  v$osstat, (&BUSY_TIME_2-&BUSY_TIME_1)/100/&SLEEP
        db_cpu  v$sys_time_model, stat_name = 'DB CPU'/&SLEEP
        backgr  v$sys_time_model, stat_name = 'background cpu time'/&SLEEP
        seqr_ms v$system_event, 'db file sequential read'
        M       effective core, ((&IDLE_TIME_2-&IDLE_TIME_1)/100 + (&BUSY_TIME_2-&BUSY_TIME_1)/100)/&SLEEP
        load    v$osstat, stat_name = 'LOAD'
        aas     v$active_session_history active sessions
        ash_cpu v$active_session_history with session_state = 'ON CPU'
        ash_prs v$active_session_history with session_state = 'ON CPU' and IN_HARD_PARSE
        ses_cpu v$mystat, 'CPU used by this session'
        tm_cpu  v$sess_time_model, stat_name in ('DB CPU')
        ses_prs v$mystat, 'parse time cpu'
        slio_k  v$mystat, thousands of 'session logical reads'
        tx_s    sum of # of transactions by all hogs per second during the measuring window
        res_ms  response time, time took to run 1 transaction
        time    pprof ACC_time/STP-STT, process (hog) active time on CPU per second
        avg_t   TIME/hogs, active time on CPU for each hog per second
        purr    pprof ACC_time/STP-STT, PURR based process (hog) active time per second
        avg_p   PURR/hogs, PURR based active time on CPU per hog per second

PURR   Stands for Processor Utilization of Resources Register; it is available per hardware thread context.
        PURR provides an actual count of physical processing time units that a hardware thread has used.
        The hardware increments PURR based on how each hardware thread is using the resources of the processor core.

The script was developed and tested on AIX and Linux with Oracle 11gR2; you may need to make minor adjustments for other OS/DB versions.

uname -vM
7 IBM,8286-42A
 
uname -a
Linux ol6-112-rac1.localdomain 2.6.39-200.24.1.el6uek.x86_64 #1 SMP Sat Jun 23 02:39:07 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production

While developing “hog” I benefited & borrowed ideas from these excellent sources:
Craig Shallahamer’s OP Load Generator
Karl Arao’s cputoolkit
Kyle Hailey’s Oracle CPU Time
Charles Hooper’s CPU Usage Monitoring – What are the Statistics and when are the Statistics Updated?

See examples in hog in action.txt. Please let me know if you have problems running it in your environment. Use at your own risk. Do not run it in a production environment… unless you know exactly what you are doing.

Next we will put an lpar to a test.

AIX CPU gap

We can gather the CPU used by an Oracle database from several sources. Within Oracle it can come from instrumentation and from sampling active sessions. At the OS level we can use tools provided by the operating system.

For an Oracle database running on AIX with POWER type processor CPU usage from instrumentation represents core capacity used. That is what shows up as ‘DB CPU’ in AWR reports. That is what we see in OEM with dark green color as ‘CPU’.

Another way to track CPU is from active session history (ASH) filtered on session state ‘ON CPU’. ASH captures sessions’ time running on logical CPUs. It should match the CPU numbers from instrumentation unless processes can’t get enough CPU resources. In OEM grid control ASH data is used to calculate the light green ‘CPU WAIT’ on the Average Active Session graph.
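One way to see the two views side by side is to compare the instrumented ‘DB CPU’ (core based) with the ASH ‘ON CPU’ seconds (thread based) per snapshot. A hypothetical sketch, single instance assumed:

 select sn.begin_interval_time,
        round( (tm.value - lag(tm.value) over (order by tm.snap_id)) / 1e6 ) db_cpu_sec,
        (select count(*)*10
           from dba_hist_active_sess_history a
          where a.snap_id       = sn.snap_id
            and a.session_state = 'ON CPU') ash_cpu_sec
   from dba_hist_sys_time_model tm
   join dba_hist_snapshot sn
     on sn.snap_id         = tm.snap_id
    and sn.dbid            = tm.dbid
    and sn.instance_number = tm.instance_number
  where tm.stat_name = 'DB CPU'
  order by sn.begin_interval_time;

On AIX with SMT, expect ash_cpu_sec to run well above db_cpu_sec when many threads are busy; that difference is the gap described below.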

So there are two ways to express the same CPU usage: the 1st is core capacity used, the 2nd is time spent on logical CPU.

For Oracle an lpar with 8 cores has 8 seconds of CPU capacity for each wall clock second. But AIX thinks the lpar has 32 (logical) CPUs with 32 seconds of CPU resource… Oracle’s CPU usage is underreported on the machine.

I think of the difference between the two numbers representing the same CPU usage as the “AIX CPU gap”.

power7_cpu_gap

 

Core and thread based CPU utilizations also differ. I call the difference between the two utilizations the “AIX Utilization Gap”.

power7_utlization_gap

 

I calculated the gaps for both POWER7 and POWER8 processors, see ‘utilization’ sheets in power7.xls and power8.xls.

Our production DB is running on POWER8. In the AWR report I posted in My database is slow, now what? the OS had 43% CPU utilization for a slow day. I felt something was off. Now I know we were running at around 88% of the lpar’s core capacity.

Per Understanding CPU on AIX Power SMT Systems “I’ve heard one person say that they always double the CPU numbers in AWR reports for AIX/SMT systems. It’s a stab in the dark but perhaps useful to remember.”

Instead of guessing we can look up the chart to get a better understanding of the CPU consumed by the machine or the database. It is good to know (it also helps with paying the bill if you are a contractor) when you are close to being maxed out on CPU.

Request for help… if you are a math wiz… is there a way to create a formula between the %core and %thread columns in my excel utilization sheet? Such a formula would help DBAs convert underreported OS CPU utilization to core based utilization.

Another request…if you are an Oracle or AIX wiz…do you (dis)agree?

Enough theory, let’s gather some evidence. That is coming next.

Simulation

Up until now my focus was on one core. In real life it is more likely we have several POWER processors assigned to an AIX virtual machine (logical partition, lpar). With dedicated configuration each virtual processor is backed by a physical core; one less variable to think about.

Let’s assume I have an lpar with 8 virtual processors mapped to POWER7 cores. Each core can run single threaded, SMT2, or SMT4 depending on the workload. I created a simulation in excel to show CPU utilization for both core capacity used and time spent on logical CPUs for the lpar. Check out the simulation sheet in power7.xls.

Each “1” in the blue area represents a maxed out thread (logical CPU) on the machine.
Green cells show CPU capacity from cores and threads:
– for core it is the sum of core capacity used
– for thread it is time spent on logical CPUs

As you “turn on” more threads you are doing the job of the OS scheduler. You will see both core capacity used and time spent on logical CPUs for the whole machine. It might help to understand how these numbers relate…and diverge as we put more load on the lpar.

We are already familiar with the scenario where 1 logical CPU is maxed out by 1 hog. Core capacity used is 0.625 while the hog process is running on the logical CPU the whole time. The thread/logical CPU makes the lpar’s core 7.8% busy; at the same time AIX thinks the lpar’s CPU utilization is 3.1%.

power7_thread1

 

The default AIX scheduling policy is to try to run all processes single threaded; that provides the most power, that is the fastest. So each hog is assigned to a different core. With 8 hogs active core utilization is already 62.5%, but the OS sees that as 25%.

power7_thread8

When load goes beyond the number of cores SMT2 kicks in; the cores split, and overall throughput (represented by core capacity used) increases… although with diminishing returns (we doubled the number of hogs but core capacity only went up from 5 to 7).

power7_thread16

 

As we dispatch more hogs some of the cores must switch to SMT4 to run them. With 32 hogs running all threads are used, and both core and thread utilization are at 100%. At this point core capacity used is 8 seconds and logical CPU capacity used is 32 seconds for each wall clock second.

power7_thread32

 

I created a similar excel sheet for POWER8: power8.xls.

What do you see? On POWER8 how much is the core based utilization when OS is telling us the machine is 50% utilized?
(hint…94.5%)

 

If you check the utilization sheets, you can guess what I will write about next.

The power of POWER7

I got the idea to summarize core, thread utilizations, and thread response times from IBM AIX POWER7 CPU Usage and Throughput…a great article!

Another great source is Local, Near & Far Memory part 3 – Scheduling processes to SMT & Virtual Processors, “With POWER7 we have off or SMT=2 and SMT=4. With SMT=4 we get four programs/processes (or four threads of execution) at the same time.  This is a technique that boosts performance when there are lots of processes and/or threads to run at the same time… as SMT increases the number of instructions executed by a single thread on the CPU-core goes down but then as we are running more threads the throughput goes up.”

The article goes on to explain the performance boost of the POWER7 processor when running in higher simultaneous multi-threading (SMT) modes. Using the relative power increase estimates I created this table:

smt  relative power  core used  thread  response time       logical CPU  CPU%
1    1.0             0.62       0.62    1.00                1            25   <- 62% core capacity used shows up as 25%
2    1.4             0.88       0.44    1.43 (=0.62/0.44)   2            50   <- 88% core capacity used shows up as 50%
4    1.6             1.00       0.25    2.50 (=0.62/0.25)   4            100

The power of POWER7 increases as more threads are used.
With 2 threads the estimated boost is 40% more capacity compared to 1 thread; 2 threads make the core 88% busy, each thread using 44% of the core.
With 4 threads we get 60% more power; the core is maxed out, each thread using 25% of the core.
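For reference, the table’s columns can be written as simple formulas (my reading of the numbers, taking IBM’s relative power estimates of 1, 1.4 and 1.6 as given):

\[
\text{core\_used}(s) = \frac{P(s)}{P(4)}, \qquad
\text{thread}(s) = \frac{\text{core\_used}(s)}{s}, \qquad
\text{response}(s) = \frac{\text{thread}(1)}{\text{thread}(s)}
\]

\[
s=2:\quad \text{core\_used} = \tfrac{1.4}{1.6} = 0.875, \quad
\text{thread} = \tfrac{0.875}{2} \approx 0.44, \quad
\text{response} = \tfrac{0.625}{0.4375} \approx 1.43
\]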

relative_performance_by_smt_mode

Thread response times are calculated from thread utilizations relative to the 1st thread, i.e. less utilization means longer response time. Response times increase as more threads are active. With 4 concurrent threads it takes 2.5 times longer to finish the same job as when the core runs 1 thread.

“logical CPU” column shows how much CPU resource is used by the hogs.
When hogs max out the logical CPUs, each hog is using 1 CPU second per wall clock second. AIX tools like vmstat, mpstat report CPU% at the logical CPU (thread) level.

Therefore we can express the amount of CPU used by hog(s) as
1) core capacity consumed
2) time spent on logical CPU

Unfortunately some tools still assume “2) time spent on logical CPU” when they are actually reporting figures from “1) core capacity consumed”… remember my low OS CPU utilization in the AWR report and in OEM grid control?