Assessing a Busy System - Advanced SQL Engine

Database Administration

Product

Advanced SQL Engine

Teradata Database

Release Number

17.10

Published

July 2021

Language

English (United States)

Last Update

2021-07-27

dita:id

B035-1093

Product Category

Teradata Vantage™

CPU Saturation

Systems that frequently run at or near capacity are often assessed to determine the extent of resource exhaustion.

When CPU is the binding factor and the scarce resource on a system, it is useful to determine the system's highest level of saturation, that is, the point to which you can drive the system before it becomes so overloaded that it appears to be hanging.

A system that reaches this level of saturation should still be able to work through its backlog, but doing so may take extra time.

Without appropriate performance monitoring, users who believe the system is hanging may start to:

- Abort jobs, which can cause rollbacks, or
- Submit duplicate queries, which create more work on an already exhausted system.

ResUsage data provides most of the information needed to know that a system is 100% busy with respect to CPU or CPU+I/O wait, but other information may be needed to examine the extent of CPU saturation and system congestion. In other words, you can know that the system is running at capacity without knowing the extent of the overload.

When the system becomes so busy that logons are slow or hung, performance monitoring alone cannot determine whether the system is actually hung or simply overloaded; other tools are needed.

Suggested Monitoring Techniques

Because the goal is to drive the system as hard as possible without overloading it, the following checks can be used, in order, to assess how busy the system is when CPU usage is high:

- Check AWT utilization. If the number of AWTs in use is constantly at or near the maximum, then
- Check message flow control. If tasks appear to be in flow control, then
- Check the run queue (ResUsageSawt MailBoxDepth data). If the run queue keeps growing longer, the system is too busy and will slow down dramatically.

While it is not unusual to see a busy system with high AWT counts, the presence of flow control means that some tasks are currently being blocked from sending more work to busy AMPs.
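
The three checks above can be sketched as a small decision function. This is only an illustration, not Teradata tooling: the class, field names, return labels, and the 0.9 cutoff for "at or near maximum" are all assumptions, and the samples are expected to come from whatever source you already use for AWT counts, flow-control state, and ResUsageSawt MailBoxDepth values.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class AmpSample:
    """One observation interval; every field name here is hypothetical."""
    awt_in_use: int        # AWTs in use during the interval
    awt_max: int           # configured maximum number of AWTs
    in_flow_control: bool  # True if tasks appear to be in flow control
    mailbox_depth: int     # queue depth, e.g. taken from MailBoxDepth data


def assess_busy_level(samples: Sequence[AmpSample],
                      awt_threshold: float = 0.9) -> str:
    """Apply the three checks in order; stop at the first that does not hold.

    awt_threshold is an assumed cutoff for "at or near maximum".
    """
    if not samples:
        raise ValueError("need at least one sample")

    # Check 1: are AWTs constantly at or near the maximum?
    if not all(s.awt_in_use >= awt_threshold * s.awt_max for s in samples):
        return "busy"

    # Check 2: do any tasks appear to be in flow control?
    if not any(s.in_flow_control for s in samples):
        return "awt-saturated"

    # Check 3: is the queue depth strictly increasing across the samples?
    depths = [s.mailbox_depth for s in samples]
    growing = len(depths) > 1 and all(b > a for a, b in zip(depths, depths[1:]))
    return "overloaded" if growing else "flow-control"
```

Treating "growing longer and longer" as strictly increasing depth is a deliberate simplification; a real check would likely tolerate noise between intervals.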

Finding a Saturated Resource

Use the Resource Check Tools, located in the /usr/pde/bin directory, to check for saturated resources.

IF …	THEN …
mboxchk is not already running as a background task	run mboxchk to check the current response time.
the mboxchk log shows a slow response or timeout	run syscheck to get a report of the attributes that fall below the specified danger level.
no attribute is reported at the WARN level	check disk and AMP CPU usage.

For information about mboxchk and other PDE resource check tools, such as nodecheck and syscheck, see man pages or pdehelp.
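
The IF/THEN table above reads as a short chain of decisions, which can be sketched as follows. The function and parameter names are hypothetical, the function does not invoke any PDE tool, and the "read the mboxchk log" and "no saturation indicated" branches are assumptions about the cases the table leaves implicit.

```python
from typing import Optional, Sequence


def next_resource_check(mboxchk_in_background: bool,
                        mboxchk_slow_or_timeout: Optional[bool] = None,
                        syscheck_warnings: Optional[Sequence[str]] = None) -> str:
    """Return the next step suggested by the IF/THEN table.

    A value of None means that tool's result has not been looked at yet.
    """
    if mboxchk_slow_or_timeout is None:
        # Row 1: start mboxchk only if it is not already a background task.
        if not mboxchk_in_background:
            return "run mboxchk to check current response time"
        return "read the mboxchk log"  # assumed step, not in the table

    if not mboxchk_slow_or_timeout:
        return "no saturation indicated by mboxchk"  # assumed outcome

    # Row 2: a slow response or timeout leads to syscheck.
    if syscheck_warnings is None:
        return "run syscheck"

    # Row 3: nothing at the WARN level, so look at disk and AMP CPU usage.
    if len(syscheck_warnings) == 0:
        return "check disk and AMP CPU usage"
    return "investigate: " + ", ".join(syscheck_warnings)
```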

High I/O Wait

Teradata recommends configurations whose CPU-to-I/O bandwidth ratios match typical database workload demands, in order to avoid CPU starvation. If these guidelines are not followed, or if the customer workload is unusually I/O-heavy, CPU starvation may still occur, as reflected by high I/O wait on the system.

If ResUsage data shows I/O wait rising while CPU busy is falling, the system is unable to use the available CPU because I/O has become the limiting factor.
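
As a rough illustration of that pattern, the sketch below compares the earlier and later halves of a chronological series of (CPU busy %, I/O wait %) pairs. The function name and the half-versus-half comparison are assumptions made for the example; how the percentages are derived from ResUsage data is left to the reader.

```python
from statistics import mean
from typing import Sequence, Tuple


def io_is_limiter(samples: Sequence[Tuple[float, float]]) -> bool:
    """samples: chronological (cpu_busy_pct, io_wait_pct) pairs.

    Returns True when average I/O wait is higher and average CPU busy is
    lower in the later half of the series than in the earlier half.
    """
    if len(samples) < 2:
        return False  # a trend needs at least two observations

    half = len(samples) // 2
    early, late = samples[:half], samples[-half:]

    cpu_falling = mean(c for c, _ in late) < mean(c for c, _ in early)
    wio_rising = mean(w for _, w in late) > mean(w for _, w in early)
    return cpu_falling and wio_rising
```

For example, `io_is_limiter([(90, 5), (80, 15), (60, 35), (50, 45)])` returns `True`, because CPU busy drops while I/O wait climbs across the series.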

If the onset of high I/O wait is sudden:

- Determine whether the high I/O wait is due to disk I/O, waiting for AMPs on other nodes (because of skew or coexistence imbalance), low system demand, or BYNET I/O.
- CPU+WIO below 90% may suggest low system demand without a true I/O bottleneck. Look at node efficiency to determine whether the I/O wait is due to node waiting.
- Look at the actual disk I/O wait queue using sar -d, and examine:
  - avwait: average time, in milliseconds, that transfer requests wait idly on the queue for a response (in the FibreChannel driver queue or disk queue).
  - avserv: average time for a request to be serviced (includes seek, rotational latency, and data transfer times).

For more information on the sar utility, see the Linux operating system documentation.
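
The CPU+WIO rule of thumb in the list above amounts to a single comparison, shown here as a minimal sketch. The function name and return labels are hypothetical; the 90% cutoff is the figure quoted above, and the result is only a hint about where to look next, not a diagnosis.

```python
def classify_io_wait(cpu_busy_pct: float, io_wait_pct: float,
                     cutoff_pct: float = 90.0) -> str:
    """Classify one interval by its combined CPU busy and I/O wait percentage."""
    combined = cpu_busy_pct + io_wait_pct
    if combined < cutoff_pct:
        # The node is partly idle: demand may simply be low.
        return "possible low demand"
    # The node is fully occupied by CPU work plus I/O wait: check the disks.
    return "possible I/O bottleneck"
```

With 40% CPU busy and 30% I/O wait the combined figure is 70%, so the function returns `"possible low demand"`; with 55% and 40% it is 95%, which returns `"possible I/O bottleneck"`.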