There is always the next level.
I am still learning. What I wrote here mostly exists in bits & pieces on the Internet. Not much new… it was new for me. During my research I focused on CPU utilization on AIX/POWER/lpar. I felt the CPU used by the DB somehow was not accounted for in AWR reports, nor was reflected in CPU utilization at the OS level.
I built on findings by Marcin Przepiorowski in his article on Oracle on AIX – where’s my cpu time ?. I also benefited from Jeremy Schneider’s excellent summary Understanding CPU on AIX Power SMT Systems. Another great source was IBM AIX POWER7 CPU Usage and ThroughputCPU Usage and Throughput by ksun. Applying the information allowed me to explain low CPU utilization of the machine/lpar, and the low CPU usage numbers by the DB. It also led me to the root cause of slow DB performance.
Writing my experience down forced me to think it over. I reached a whole new level of understanding. I hope you will learn something by reading this. I also hope someone out there will point out mistakes in my posts, that will give me another push to dig deeper.
The DB is not operating in thin air. It is executing queries on behalf of the applications. It is responding to those requests by using resources from the OS/HW. Neither side is controlled by the DBA. Yet we are frequently called upon to fix slowness in the DB.
Rule #1: the DBA is guilty. The application is slow from the DB unless proven otherwise. The DB is using too much resource on the OS, DBA is guilty again. But how can a DBA know the app and the OS/HW? In today’s specialized world you can’t cover it all. So people supporting the app and the OS are your friends. They can help you find unusual patterns of activity affecting the DB. Do not expect them to find you, you need to reach out… you are lucky if they are approachable.
Management may not recognize this; they think pushing the DBA harder is the best way to solve the problem. This can quickly become a distraction: you may be asked to participate on fire calls and report progress every few hours. That time would be better spent on thinking, or consulting with the few people relevant to the problem. All hands on deck can be counter productive. Ultimately the client is telling you what to do: you may let them know it is not helping, but I see no point sweating this too much.
Engaging too many “experts” and feeding them with information is another sure way to waste some time. While until we do not know what hit us any help is welcome, we just need to balance it somehow.
As a DBA we are responsible to keep the DB humming. If something unusual is going on you need to recognize it. Do yourself a favor, know your DB. Know the numbers in your load profile. Know the type of workload the application is requesting. Understand the limitations of the HW the DB is running on. Basic stuff, like 1000 sql executions/sec, 10 hard parses/sec, 2msec average db file sequential reads (which can’t possibly come from physical disk), max 10000 IOPS for the SAN, max 11 seconds of CPU core capacity on lpar for the main DB (which is on different scale than the 128 CPU seconds the OS thinks the lpar has)…
So far I did not post about dead-ends/side tracks. Those are the necessary evil. No way to know they are dead-ends until you pursue them. You may conclude faster they are irrelevant (victims instead of root causes) by learning from others.
Some of the pitfalls we had:
– using only AWR reports to explain slowness…not everything is in there
– trying to shoestring the problem into one of the seemingly possible scenarios based on data in AWR reports
– picking the wrong day (with less load) for baseline to “prove” slowness was from increased workload
– picking the wrong day (with sudden jump in number of soft parses) for baseline to “prove” slowness was from increased soft parse
– in the heat of the battle making changes on “expert” advice wo/verifying or understanding
– disk IO/speed trap on machine with CPU starvation
– allowing yourself to be distracted by constant/useless conference calls
– not following through with application owners, explain reason DB can’t cope, work together on solution
At the end…
Our week-long DB slowness ended before we knew what hit us. At that time a lot of attention was devoted to batches. We thought they were the reason why day time activity was slower. The vendor stopped churning through the whole data set and started to process only the subset of accounts (those matter). The fastest sql is the one not running 🙂
Little did we know that the batches were the victims of hard parsing by other sessions. At around the same time when they fixed the batch, the sqls stopped coming. So we were back to normal. Couple of months later I tracked down the application responsible. By that time management moved on and was not interested to pursue it further.
The same application still puts out bursts of similar sqls few times a year. During my research I created a tool (lprof) to catch hard parsing as it happens. Couple of months ago the tool helped me identify the issue on the spot. It was “just” a 6 hour burst at business hours, it did not interfere with overnight batches, users did not really noticed it (there were no complains). I notified management and they acknowledged it. I am out of the environment by now but this blog may help DBAs the next time it happens again.
Thanks for reading.