CPU utilization on AIX is different

In previous posts I had trouble interpreting the CPU utilization numbers from the OS and the DB – it is time to get some answers.

My DB is running on an AIX lpar (logical partition). It is a virtual machine with dynamically allocated CPU resources. The OS (and the DB, which takes its view from the OS) thinks the machine has 128 CPUs all the time.

In reality it is running on 16 virtual processors mapped to physical cores. Each of those processors can split into at most 8 threads (on POWER8; at most 4 threads on POWER7), giving the DB the 128 logical CPUs.
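As a quick sanity check on that arithmetic, here is a minimal Python sketch. The function name is my own, made up for illustration; it is not an AIX or Oracle API, and it assumes every virtual processor runs at the same SMT level.

```python
def logical_cpus(virtual_processors: int, smt_threads: int) -> int:
    """Logical CPUs the OS sees: one per hardware thread of each virtual processor."""
    return virtual_processors * smt_threads


# 16 virtual processors at SMT8 (POWER8) versus SMT4 (POWER7)
print(logical_cpus(16, 8))  # 128
print(logical_cpus(16, 4))  # 64
```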

In addition, CPU capacity is allocated dynamically in tiny chunks: the virtual machine grabs as much as it needs when the load on the machine demands it. Getting more CPU also depends on how the lpar is configured: shared or dedicated, its entitlement, and capped or uncapped.

But the real issue here is that Oracle on AIX reports CPU utilization as PURR-adjusted core capacity used. On the other hand, the rest of the world (applications, monitoring tools, me included, etc.) still thinks of CPU used as the time spent on CPUs.

Core-based capacity numbers max out at 16. They can be even lower if the lpar is capped sooner, or if it is shared and nothing is left in the shared pool. But the OS still calculates available CPU capacity based on the 128 logical CPUs.

Imagine the growing difference between these two numbers as the DB starts using more logical CPUs. Going to the extreme, when we max out the CPU of the lpar, the core-based CPU capacity is 16 and the OS-based capacity is 128. This difference is described as the “unaccounted-for time” in the DB trace. I like to call it the “AIX CPU gap”, and I will write more about it later.
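To put rough numbers on that extreme case, here is a minimal Python sketch. It assumes a fully loaded lpar and simply counts the CPU seconds available in an interval under each view. It is a simplification for illustration, not how PURR accounting is actually implemented, and the function name and the 60-second interval are my own assumptions.

```python
def capacity_gap(virtual_processors: int, smt_threads: int, interval_s: float):
    """Return (core_based, os_based, gap) in CPU seconds for a maxed-out lpar.

    core_based: capacity counted per virtual processor (core)
    os_based:   capacity counted per logical CPU (hardware thread)
    """
    core_based = virtual_processors * interval_s
    os_based = virtual_processors * smt_threads * interval_s
    return core_based, os_based, os_based - core_based


# Hypothetical example: 16 virtual processors, SMT8, one-minute interval
core, os_view, gap = capacity_gap(16, 8, 60)
print(core, os_view, gap)  # 960 7680 6720
```

With these assumptions the OS-based view offers 8 times the capacity of the core-based view, which is the SMT factor.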


The case study a few posts ago is an example of a CPU-heavy load from hard parsing. Even when the DB was the main user of the machine, there was a big gap between the DB and the OS CPU utilization numbers. Real workloads rarely use CPU alone; they are some mix of CPU and waits. Even my workload was a mix of batch, normal day-to-day activity, and the sessions with hard parsing. The size of the gap is proportional to the CPU part of the workload.

Wow! A lot to digest. “Understanding CPU utilization on AIX” explains it better. Let’s just say that calculating/reporting/interpreting/comparing CPU utilization becomes a challenge in a virtualized environment running on AIX/POWER/SMT/lpar.

I am not the first to notice something is off with CPU utilization on AIX:

The last link by Jeremy Schneider is a great summary; it served as the starting point for my tests. I will tickle the CPU on these lpars so we can look into CPU utilization in more detail.

Let the fun begin!
