In previous post I introduced “hog”, let’s put it to work. I tested an lpar with 10 virtual processors, POWER7, smt=4, 40 logical CPUs.
hog plsql_loop nobind '0 50 1'
Please see data collected in power7.xls.
During my test there was always some spare core capacity available in the shared pool, “app” never reached 0 while “physc” went up.
Core capacity used by the lpar was maxed out at around 10, it has 10 virtual processors backed by sufficient core capacity from the shared pool.
DB and session CPU consumptions “db_cpu”/”sess_cpu” are matching core capacity used “physc” of the machine, the database is responsible for the load on the machine.
“sess_cpu” is matching “purr” which is core used by all hog processes as it captured by ‘pprof’, therefore Oracle session CPU usage is expressed in core capacity.
At the same time hog process CPU consumption expressed as time spent on CPU in “time” column is matching the number of hogs. Each hog is designed to max out (constantly running on) the logical CPU/thread.
“os_cpu” is matching “time” from ‘pprof’, so we know OS CPU utilization is time based. OS is busy from hogs.
On OEM’s performance page database utilization is all green: the load is from CPU.
Core based DB CPU utilization is the dark green area, it is maxed out at around 10, the number of virtual processors assigned to the lpar.
The difference between time and core based CPU utilization is the ‘AIX CPU gap’. It is the light green part up until the 40 line. With 40 logical CPUs we have 40 CPU seconds the most, “time” column maxes out at 40 even when more hogs are running.
Within the database logical CPU usage is included in ASH data “ash_cpu”. “ash_cpu” is measuring demand for CPU, it may go beyond logical CPU consumed when demand is higher than the available logical CPUs. Therefore the light green part above the 40 line is representing real ‘CPU WAIT’ on this lpar.
We have a good match between theoretical and test data, pls compare “utilization” and “lpar test” sheets in power7.xls. Therefore it should not come as a surprise that AIX CPU and utilization gap graphs are looking similar to calculated graphs in “utilization” sheet.
The main difference is that the test lpar had 10 virtual processors, core based consumption levels at 10 (vs 8 virtual processors with my theoretical lpar). Also, I spawned more than 40 hogs to demonstrate the “CPU Wait” on AIX.
Why IBM moved away from the well known/accepted time based CPU utilization?
To have a more predictable relationship between throughput and CPU capacity used.
According to Understanding CPU utilization on AIX “The general intent is to provide a measure of CPU utilization wherein there is a linear relationship between the current throughput (e.g., transactions per second) and the CPU utilization being measured for that level of throughput.”
For example, we got 329 tx/s for 20% core utilization, 661 tx/s for around 40%, and around 1000 tx/s for 60%… the data has pretty good linear relationship between throughput and core based utilization.
Another aspect of POWER processor…as more threads are active it can produce more work (throughput goes up), but each thread (oracle session running on that thread, and the sql in that session) slows down. The 1st is good, but the 2nd can be problematic for those looking for predictable response time for each execution. See “tx_s” and “res_ms” graphs in “lpar test” sheet.
After this little detour into IBM POWER/AIX/SMT I will revisit my slow DB problem. Understanding better the hardware the DB is running on can help to explain some of the slowness.