What happened to the batch?

During our slow week I was primarily concerned about the batch: it was running into the business day and it was several days behind. We thought it might be the reason for the slowness during business hours.

The main sql in the batch did not change: we had the same execution plan and a similar data set on 4/18 and 4/27. Yet on the morning of 4/27 it finished in 255 minutes instead of 180 minutes, roughly a 40% increase.

Please see the queries I used in this post in Whathappennedtothebatch.txt.

I checked execution statistics in DBA_HIST_SQLSTAT for the dominant sql in the batch. It had more and slower physical reads, hence the increase in IOWAIT_SEC. But the CPU used (CPU_SEC) remained about the same, and it was much smaller relative to the elapsed time.
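
A minimal sketch of the kind of query behind these numbers, pulling per-snapshot figures for one sql from DBA_HIST_SQLSTAT (the sql_id is a placeholder; the actual queries are in Whathappennedtothebatch.txt):

 -- per-snapshot elapsed, CPU and IO wait time for the batch sql (AWR stores times in microseconds)
 select sn.begin_interval_time,
        round(st.elapsed_time_delta/1e6) elapsed_sec,
        round(st.cpu_time_delta/1e6)     cpu_sec,
        round(st.iowait_delta/1e6)       iowait_sec,
        st.disk_reads_delta              phys_reads,
        st.executions_delta              execs
 from   dba_hist_sqlstat  st
 join   dba_hist_snapshot sn
        on  sn.snap_id = st.snap_id
        and sn.dbid = st.dbid
        and sn.instance_number = st.instance_number
 where  st.sql_id = '0abc123xyz9k9'      -- placeholder sql_id
 order  by sn.begin_interval_time;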

Further evidence of a disk problem is the increase in synchronous read latency. I list ‘db file sequential read’ times from DBA_HIST_SYSTEM_EVENT in mining_awr.xlsx, sheet “whole db”, column SEQR_ms. It is around 2.5 msec on the 18th; it fluctuates between 4 and 14 msec on the morning of the 27th during batch hours.
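
A sketch of how SEQR_ms can be derived (DBA_HIST_SYSTEM_EVENT is cumulative, so this takes deltas between snapshots; it assumes a single instance and no instance restart in the period):

 -- average 'db file sequential read' latency per snapshot, in ms
 select begin_interval_time,
        round( (time_waited_micro - lag(time_waited_micro) over (order by snap_id))
             / nullif(total_waits - lag(total_waits) over (order by snap_id), 0)
             / 1000, 2) seqr_ms
 from  (select sn.begin_interval_time, ev.snap_id, ev.total_waits, ev.time_waited_micro
        from   dba_hist_system_event ev
        join   dba_hist_snapshot sn
               on  sn.snap_id = ev.snap_id
               and sn.dbid = ev.dbid
               and sn.instance_number = ev.instance_number
        where  ev.event_name = 'db file sequential read')
 order  by begin_interval_time;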

At this point I was thinking the batch was slower because of more physical reads and slower disks.

But analyzing the ASH data showed a different picture: the sql was running on CPU far longer (85% of the time) than it was waiting on disk. This is not obvious from looking at the CPU used (CPU_SEC) in sqlstat. No surprise there, we should know better by now: the CPU numbers in sqlstat are core based, so they are on a different (smaller) scale.

Per ASH data the dominant sql was all about CPU!
This realization did not come easily. I researched disk IO, checked the volume of reads/writes, IOPS, single/multi block reads, etc.
I used Karl Arao’s scripts from https://karlarao.wordpress.com/scripts-resources/ to decide which day to use as the baseline. On 4/18 we had normal DB performance with similar or even bigger disk IO, which helped me avoid the “more disk IO made us slower” trap.

Load profile from awrdiff_0418_0427_dbsidxxx.html

[image: diff_load_profile]

We do have 29% more “Read IO requests” per second (IOPS), but that is well within the normal range for the SAN; 5000 read requests/sec is OK for this environment. On our slow day (4/27), with a similar disk IO profile, ‘db file sequential read’ times increased from 2.75 ms to 6.11 ms. This allowed me to eliminate higher IO load as the reason for the disk and DB slowness, which had been a major side track at the beginning of my analysis. What happened to the disks outside of the DB (on the lpar, on the shared SAN) is not known, so the possibility that they caused problems is still there; I just do not have data on it.
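
For reference, the diff above was presumably generated with the standard AWR Compare Periods script, something like this from sqlplus (it prompts for the two snapshot ranges and the report name):

 -- AWR Compare Periods (diff) report for the local instance
 @?/rdbms/admin/awrddrpt.sql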

The sql in the batch is mostly running on CPU, so I have a better candidate to explain the slowness.

 --4/18
 SESSION EVENT                                                            COUNT(*)*10          %
 ------- ---------------------------------------------------------------- ----------- ----------
 WAITING latch free                                                             29500         17
 ON CPU                                                                        143000         83

 --4/27
 SESSION EVENT                                                            COUNT(*)*10          %
 ------- ---------------------------------------------------------------- ----------- ----------
 WAITING db file sequential read                                                20000         15
 ON CPU                                                                        112630         85
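
A sketch of the kind of ASH query that produces counts like the above, against the AWR-persisted samples (one sample every 10 seconds, hence the *10; the sql_id and the time window are placeholders):

 -- time breakdown for one sql from AWR-persisted ASH over the batch window
 select session_state,
        event,
        count(*)*10                                    db_time_sec,
        round(100 * ratio_to_report(count(*)) over ()) pct
 from   dba_hist_active_sess_history
 where  sql_id = '0abc123xyz9k9'          -- placeholder sql_id
 and    sample_time >= :batch_start       -- e.g. start of the morning batch hour
 and    sample_time <  :batch_end
 group  by session_state, event
 order  by db_time_sec desc;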

Let’s crosscheck the CPU data from ASH with instrumentation. The counts above are from sampled ASH data for the sql in the batch running on logical CPUs for an hour. Let’s convert that to core-based usage (more fuzzy math) so we can compare it with the CPU used by the same sql in sqlstat (CPU_SEC in Whathappennedtothebatch.txt). Assuming we are mostly keeping all 4 threads of a core busy while running with SMT4… we have around 35000 (= 143000/4) core seconds for the hour. That is around 9000 core seconds (= 35000/4) per 15-minute snap for the sqls in the batch. I think the batch was the main consumer of DB CPU in the morning hours of 4/18.
Let’s check DB CPU usage for the whole DB from the time model: DBCPU_sec on the “whole db” sheet of mining_awr.xlsx; the data points there are from 15-minute snaps. We see around 10000 seconds at the time of the morning batch.
We have a fairly good match for CPU used between the instrumented, core-based data and the sampled ASH data.
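
A sketch for pulling DBCPU_sec from the time model history (values are cumulative microseconds, so again deltas between snapshots, assuming a single instance):

 -- DB CPU per 15-min snapshot, in seconds
 select sn.begin_interval_time,
        round( (tm.value - lag(tm.value) over (order by tm.snap_id)) / 1e6 ) dbcpu_sec
 from   dba_hist_sys_time_model tm
 join   dba_hist_snapshot sn
        on  sn.snap_id = tm.snap_id
        and sn.dbid = tm.dbid
        and sn.instance_number = tm.instance_number
 where  tm.stat_name = 'DB CPU'
 order  by sn.begin_interval_time;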

During batch hours the OS load jumped from 65 on 4/18 to around 90 on 4/27. Please check OS_LOAD in arewemaxedoncpu.jpg or in mining_awr.xlsx, sheet “arewemaxedoncpu”. The extra demand is coming from the DB: CPU_ORA_DEMAND correlates with the OS load. The DB had about 30 more CPU-heavy sessions during our slow batch day.
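
The OS load as seen by the DB is also kept in AWR; a sketch against DBA_HIST_OSSTAT (unlike most OSSTAT counters, LOAD is a point-in-time gauge, so no deltas are needed):

 -- OS load per snapshot
 select sn.begin_interval_time,
        round(os.value, 1) os_load
 from   dba_hist_osstat   os
 join   dba_hist_snapshot sn
        on  sn.snap_id = os.snap_id
        and sn.dbid = os.dbid
        and sn.instance_number = os.instance_number
 where  os.stat_name = 'LOAD'
 order  by sn.begin_interval_time;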

According to power8.xls, “utilization” sheet, with 65 processes we are running with SMT4. Adding 30 more switches half of the cores to SMT8. Relative thread response time goes from 2.11 to 4 when a core switches to the higher SMT level. With half of the cores remaining on SMT4, on average we can expect an increase of about 0.95 (= (4 - 2.11)/2) in relative response time. That is about 45% (0.45 = 0.95/2.11) slower compared to running all Oracle batch sessions on threads in SMT4 mode. You can grill me on this fuzzy math; I would rather just run tests to prove it.

The batch is CPU heavy, therefore changes in CPU behavior directly influence its speed. The higher load on the lpar forced many of the virtual processors to use SMT8 instead of SMT4. That change resulted in higher response times for processes executing on logical processors belonging to cores in SMT8 mode. That can explain why the batch finished 40% slower on 4/27.

BTW, in mining_awr.xlsx, sheet “whole db”, the data points for “app_tx/s” show sql throughput for the sql in the batch: it dropped from around 17 to 12 executions/sec. That ratio (17/12 ≈ 1.4) matches the roughly 40% longer run time of the batch.

Later that week we tinkered with DB parameters. We also disabled an sql profile related to the sql in the batch. Those changes completely altered the load profile of the sql, making it harder to diagnose the real root cause. Little did we know there was something else lurking under the hood.

By now the batch is gone from the environment. I could make up something similar and mass execute it (like the batch did); I may do that if you convince me the batch had something to do with the DB slowness. At this point it looks more like the victim than the root cause.

Request for help:
Is there a way to see what SMT level is used by the processors on the lpar? Currently I am inferring it from the OS load…
