While researching CPU behavior on AIX/POWER/lpar I found Karl Arao’s article on cores vs threads, v2 vs x2. The article refers to the cputoolkit tool, available at https://karlarao.wordpress.com/scripts-resources/.
Soon I realized I needed to compare CPU usage from several sources; this approach allowed me to observe both time spent on CPU and core-based usage. I ended up creating my own script (hog) to generate load and capture CPU-related information.
I was wondering how hog compares to cputoolkit. I added the driver sql from cputoolkit to hog, so I was expecting similar results.
    -- lotslios, borrowed from Karl Arao's cputoolkit; I think he got it from a script by Tanel Poder
    -- calibrated where clause for around 0.3 sec run time on test lpar
    select /*+ cputoolkit ordered
               use_nl(b) use_nl(c) use_nl(d)
               full(a) full(b) full(c) full(d) */
           count(*)
      into a
      from sys.obj$ a, sys.obj$ b, sys.obj$ c, sys.obj$ d
     where a.owner# = b.owner#
       and b.owner# = c.owner#
       and c.owner# = d.owner#
       and rownum <= 3000000;
In cputoolkit the data is gathered by sampling v$sql every few seconds; the tool picks the last sample from each set. I was wondering whether all logical IOs (LIOs) are captured/reported this way.
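To see why last-sample picking might miss LIOs, here is a small simulation (illustrative only, not cputoolkit's actual code): a cumulative counter like buffer_gets keeps growing after the last sample is taken, so whatever accrues between the last sample and the end of the run goes unreported.

```python
def sampled_total(rate_per_sec, run_secs, sample_interval):
    # Value of a cumulative counter at the last sample point;
    # anything accrued after that sample is never seen.
    last_sample_time = (run_secs // sample_interval) * sample_interval
    return rate_per_sec * last_sample_time

# Assumed numbers for illustration: 180k LIOs/sec over a 50-second set,
# sampled every 3 seconds.
exact   = 180_000 * 50                     # 9,000,000 LIOs really done
sampled = sampled_total(180_000, 50, 3)    # 8,640,000 LIOs reported
print(exact - sampled)                     # 360,000 LIOs missed
```

The effect is small per set, but it biases per-execution figures like LIOS_EXEC downward.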
In hog I am calibrating the sessions to run for a set time and capturing the number of transactions (executions of the driver sql) and LIOs from v$mystat on exit. This approach guarantees exact counts and also makes the transition between sets smoother, since I do not have to kill Oracle sessions.
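The calibration loop can be sketched like this (a minimal sketch; run_driver and read_mystat are hypothetical stand-ins for executing the driver sql and querying v$mystat, not hog's actual code):

```python
import time

def hog_session(run_driver, read_mystat, run_secs):
    """Run the driver sql repeatedly until the deadline, then read exact
    session statistics on exit instead of sampling them mid-run."""
    deadline = time.monotonic() + run_secs
    executions = 0
    while time.monotonic() < deadline:
        run_driver()            # one full execution of the driver sql
        executions += 1
    lios = read_mystat()        # exact 'session logical reads' on exit
    return executions, lios

# toy stand-ins so the sketch runs without a database
execs, lios = hog_session(lambda: time.sleep(0.01), lambda: 12345, 0.1)
```

Because the counters are read once, on exit, there is no sampling gap to undercount LIOs.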
I opted to report data normalized by time. This way I can easily compare information from ASH, instrumentation, and the OS.
I picked an lpar with 10 virtual processors, smt=4, 40 logical CPUs. During the tests there was enough available core capacity in the shared pool to fully utilize all processors.
sh cputoolkit 1 1 50 $ORACLE_SID
Workload starts at 179526 LIOs/sec for the 1st session. I see an almost linear increase up until around 10 sessions (makes sense, we have 10 virtual processors). After that, additional sessions bring diminishing returns. Max workload is around 2.3 million LIOs/sec at 40 sessions. I calculated LIOs/sec by multiplying LIOS_ELAP by the number of sessions. It matches ‘Logical Reads Per Sec’ from loadprof.sql, which is recorded by cputoolkit during the test. I uploaded logs in cputoolkit_031717.zip
The 1st session can produce 179526 LIOs per second. By the time we run 40 concurrent sessions, per-session performance degrades to 57910 LIOs/sec.
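The arithmetic behind the throughput figures above is simply per-session rate times session count:

```python
# LIOS_ELAP figures from the test, per session
lios_per_session = {1: 179_526, 40: 57_910}

# total LIOs/sec = per-session rate * number of sessions
total = {n: rate * n for n, rate in lios_per_session.items()}
print(total)   # {1: 179526, 40: 2316400} -> ~2.3 million LIOs/sec at 40
```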
At the sql level, as we run more concurrent sessions each sql runs a bit longer; ELAP_EXEC is up from 0.35 to 0.92 seconds with 40 sessions. This means fewer executions by each session; EXEC/sec/CPU is down from 2.37 to 1.09 executions per second.
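1/ELAP_EXEC is the ceiling on the per-session execution rate, and the measured EXEC/sec/CPU figures sit at or below it (the gap at 1 session, 2.37 measured vs a 2.86 ceiling, presumably reflects overhead between executions):

```python
# per-session execution rate is bounded by 1 / ELAP_EXEC
for elap_exec in (0.35, 0.92):
    print(f"{elap_exec:.2f}s/exec -> at most {1 / elap_exec:.2f} exec/sec")
# 0.35s/exec -> at most 2.86 exec/sec  (measured: 2.37)
# 0.92s/exec -> at most 1.09 exec/sec  (measured: 1.09)
```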
You can see all data points in cputoolkit.xlsx
For some reason LIOs per execution (the LIOS_EXEC column in cputoolkit.xlsx) is slightly decreasing. This would mean our driver sql executes with fewer logical IOs… not likely. I suspect a measurement error from the periodic sampling of buffer_gets from v$sql. Data points from Karl also show decreasing LIOS_EXEC, resulting in a downward-sloping LIOs/Elap even when running fewer sessions than cores.
But the real issue on AIX/POWER/lpar is that cpu_time in v$sql does not represent time spent on CPU; it is core capacity used. Therefore CPU WAIT can’t be calculated by subtracting cpu_time from elapsed_time.
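A quick illustration with hypothetical numbers (the 0.35 core factor is an assumption, not a measured value): a session fully on CPU on an SMT4 thread is charged only a fraction of a core-second per second, so elapsed minus cpu_time manufactures wait time that never happened.

```python
elapsed     = 1.00   # wall-clock seconds the sql ran, entirely on CPU
core_factor = 0.35   # assumed core capacity an SMT4 thread receives
cpu_time    = elapsed * core_factor   # core-seconds, what v$sql reports
bogus_wait  = elapsed - cpu_time      # 0.65s of "CPU WAIT" that never happened
print(bogus_wait)
```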
hog lio nobind '0 50 1'
Workload starts at 179000 LIOs/sec for the 1st session. Similarly to cputoolkit, I see an almost linear increase up until around 10 sessions. After that there are diminishing returns from additional sessions. Max workload is 2.2 million LIOs/sec at 40 sessions. I calculated LIOs/sec from slio_k, which is thousands of ‘session logical reads’ from v$mystat. It matches the test results from cputoolkit.
The 1st session can deliver 179000 LIOs/sec, which degrades to 67650 LIOs/sec by the time we run 40 concurrent sessions.
Workload graphs beyond 20 sessions are jerky. I think it is the POWER processor switching between SMT levels to run more sessions; i.e., when a processor changes from SMT2 to SMT4 there is a slight degradation in performance until all threads are utilized. Check out the utilization sheet in power7.xlsx to see how that can happen.
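The SMT effect can be sketched with assumed per-core scaling factors (illustrative only, not measured on this lpar): total core throughput rises with SMT level, but each thread's share falls, so sessions already running slow down when the core picks up another thread.

```python
# Assumed total core throughput relative to SMT1 (illustration only):
smt_core_throughput = {1: 1.0, 2: 1.5, 4: 1.8}

# Per-thread share drops as more threads share the core
per_thread = {smt: total / smt for smt, total in smt_core_throughput.items()}
for smt, share in per_thread.items():
    print(f"SMT{smt}: {share:.2f} per thread")
# SMT1: 1.00, SMT2: 0.75, SMT4: 0.45
```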
At the sql level, as we run more concurrent sessions each sql runs a bit longer: res_ms (brown line) is up from 357 msec to 932 msec with 40 sessions (very similar to the 0.35 to 0.92 seconds seen with cputoolkit). This means fewer sql executions by each hog: hog throughput (tx_s/hog, blue line) is down from 2.8 to 1.07 executions per second with 40 concurrent hogs running.
With hog I collect enough information to calculate the data reported by cputoolkit. “time” from pprof is the sum of time spent on CPU by Oracle sessions. Assuming the sqls fully utilize the sessions, it is equivalent to CPU used by the driver sqls (see the hog sheet in cputoolkit.xlsx, time -> CPUSECS). ELAP_EXEC (brown) is the duration of the test divided by the number of sql executions. From these I was able to calculate CPU WAIT (gray line). Waits start to show up after all 40 logical CPUs are saturated; beyond that, elapsed time starts to diverge from CPU used.
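The CPU WAIT calculation described above can be sketched as follows (the numbers in the example are hypothetical, chosen only to show the shape of the formula):

```python
def cpu_wait_per_exec(elap_exec, pprof_cpu_secs, executions):
    """Elapsed time per execution minus time-on-CPU per execution,
    using pprof's time spent on CPU rather than core-based cpu_time."""
    return elap_exec - pprof_cpu_secs / executions

# e.g. 0.93s per execution while pprof shows 36s of CPU over 50 executions
wait = cpu_wait_per_exec(0.93, 36.0, 50)   # 0.93 - 0.72 = 0.21s of CPU WAIT
print(wait)
```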
Converting/substituting core-based CPU usage with time spent on CPU allowed me to calculate CPU WAIT on AIX/POWER/lpar. I obtained time spent on CPU from pprof. I could use ASH ‘ON CPU’, but ASH would continue to grow beyond 40 as we run more hogs. On the other hand, pprof CPU usage maxes out at 40, since 40 logical CPUs provide at most 40 seconds of CPU time each wall-clock second.
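The difference between the two measures can be shown in a couple of lines (assuming, for illustration, that every hog wants the CPU the whole time):

```python
logical_cpus = 40
for hogs in (10, 40, 60, 80):
    ash_on_cpu = hogs                     # ASH counts every runnable session
    pprof_cpu  = min(hogs, logical_cpus)  # real CPU seconds per wall second
    print(f"{hogs} hogs: ASH={ash_on_cpu}, pprof={pprof_cpu}")
# beyond 40 hogs ASH keeps growing while pprof stays pinned at 40
```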
LIOS_EXEC on the hog sheet is flat, while it is slightly decreasing with cputoolkit. I think we have some collection problem with cputoolkit; that might be the reason throughput growth is not perfectly linear even when the number of sessions is less than the number of cores.
With hog, up until around 10 sessions each process gets 0.63 core of resource. Yet response time/performance also degrades slightly even under 10 sessions. I wonder if some friction inside the DB or at the OS/lpar level kicks in. In theory I would expect (more/perfectly) linear growth, even with our logical-read-heavy sql, up until all processors are running single threaded. This calls for more tests; at this point I do not know the reason.
In the past I used a PL/SQL loop to drive up the load:
    declare
      a number := 1;
    begin
      for i in 1 .. 40000 loop
        a := ( a + i ) / 11;
      end loop;
    end;
With that load type I measured a more linear performance increase, with flatter response times until 10 hogs; check out the lpar_test sheet in cputoolkit.xlsx.
I started by testing cputoolkit on AIX/POWER/lpar. I noted problems with CPU WAIT because cpu_time is on a different scale… it is not time on AIX/POWER. Then I moved on to test with hog, which I updated to use the same sql to drive the load. It produced similar results for lpar throughput and sql response time. Hog collects both core capacity used and time spent on CPU by Oracle sessions. With that information I was able to calculate CPU WAIT on AIX/POWER/lpar.