Node Failure Recovery (MPP Only) - Teradata Software for Azure

Teradata Vantage™ on Azure (DIY) Installation and Administration Guide

Product
Teradata Vantage on Azure
Release Number
5.01
Published
July 2018
Language
English (United States)
Last Update
2018-07-18
dita:mapPath
kmk1523992471627.ditamap
dita:ditavalPath
TeradataAzure_PubCloud_5.01_5.01.01.ditaval
dita:id
B035-2810
lifecycle
previous
Product Category
Cloud

This node failure recovery process applies only to Teradata Database MPP systems.

Node failures may be related to hardware, software, operating system, network, or Azure platform issues. When a node fails, the failed node is automatically stopped/deallocated, and diagnostic information may be lost. A failed node can be recovered only after it is stopped/deallocated, because Azure platform restrictions otherwise prevent its NICs from being detached.

When a node fails, a replacement node (similar to a hot standby node) automatically spins up. The network-attached storage and NICs of the failed node are detached and reattached to the new VM, and the configuration is reinstated. The replacement node is based on a snapshot of a healthy operating system disk of the currently active (control) node, and it has the same private IPs and public IP as the node it replaces.

Node failures are handled differently when the VM has local storage. When such a node fails, the data on its local storage is lost. Although the node is replaced and comes back online, the AMPs on the recovered VM display as FATAL and offline. The other vprocs on the system are online and in the configuration. To fully restore a VM that has local storage, you must run Fallback Recovery and rebuild the AMPs. For assistance, contact Teradata Customer Support.

Node failure recovery takes longer than a typical TPA reset. A node can fail for many reasons, and the cause may be difficult to determine. However, if your node does not automatically recover after 10 to 15 minutes, first check the deployment logs in your Azure resource group. For additional assistance, contact Teradata Customer Support.
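
One way to check the deployment logs of a resource group from a script is sketched below in Python. It is an illustration and not part of the Teradata tooling. It assumes a current Azure CLI (`az`) is installed and logged in, and that `az deployment group list` returns JSON entries with `name` and `properties.provisioningState` fields. The function names and the resource group name are hypothetical.

```python
import json
import subprocess


def build_list_command(resource_group):
    """Return the Azure CLI arguments that list deployments in a resource group."""
    return [
        "az", "deployment", "group", "list",
        "--resource-group", resource_group,
        "--output", "json",
    ]


def unsuccessful_deployments(deployments):
    """Return (name, state) pairs for deployments whose state is not 'Succeeded'."""
    results = []
    for item in deployments:
        # Assumption: each entry carries its state under properties.provisioningState.
        state = item.get("properties", {}).get("provisioningState", "Unknown")
        if state != "Succeeded":
            results.append((item.get("name", "<unnamed>"), state))
    return results


def check_resource_group(resource_group):
    """Run the Azure CLI and return the deployments that did not succeed."""
    completed = subprocess.run(
        build_list_command(resource_group),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the CLI exits with an error
    )
    return unsuccessful_deployments(json.loads(completed.stdout))
```

Calling `check_resource_group("my-teradata-rg")`, where `my-teradata-rg` is a placeholder name, returns the deployments that are still running or have failed. These are a reasonable starting point before contacting Teradata Customer Support.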

Before a Node Failure Occurs

Before a node failure occurs, you can optionally set the VM to terminate, rather than stop, when a node failure occurs.

When a Single Node Failure Occurs

When a single node fails, do the following:
  1. Create an Azure Active Directory application and the service principal.
  2. Enable the node failure recovery feature.

When Multiple Node Failures Occur

When two or more nodes fail at the same time, all nodes can be replaced at the same time as long as one node remains running to act as the node failure recovery control node. However, the database stops if both BYNET relay nodes in the database are unavailable during node failure recovery. If one or more replacement VMs cannot be spun up, the database is stopped.

If two or more nodes fail at different times while node failure recovery is in progress, contact Teradata Customer Support.
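
The rules for simultaneous failures can be summarized as a small decision function. The Python sketch below is only a model of the conditions stated in this section, not the recovery implementation. The class, field names, and outcome strings are hypothetical. The first check assumes that recovery cannot run when no node remains to act as the control node.

```python
from dataclasses import dataclass


@dataclass
class FailureScenario:
    total_nodes: int             # nodes in the MPP system
    failed_nodes: int            # nodes that failed at the same time
    bynet_relays_available: int  # how many of the two BYNET relay nodes are available
    replacements_started: int    # replacement VMs that spun up successfully


def recovery_outcome(scenario):
    """Return the outcome this section describes for a simultaneous failure."""
    # At least one node must keep running to act as the recovery control node.
    if scenario.failed_nodes >= scenario.total_nodes:
        return "no control node: recovery cannot run"
    # The database stops if both BYNET relay nodes are unavailable during recovery.
    if scenario.bynet_relays_available == 0:
        return "database stops: both BYNET relay nodes unavailable"
    # The database is stopped if any replacement VM cannot be spun up.
    if scenario.replacements_started < scenario.failed_nodes:
        return "database stopped: a replacement VM could not be spun up"
    return "all failed nodes replaced"
```

For example, in a four-node system where two nodes fail, one BYNET relay node remains available, and both replacement VMs start, the function returns `"all failed nodes replaced"`.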

Replacing a Node When a Teradata Database System is Running on Fallback

In Teradata Software for Azure 5.01 and later, when the node failure recovery process fails to replace the downed node, the Teradata Database system keeps running on Fallback. To replace the downed node, log in to https://access.teradata.com and search for KCS009816.

If Using Mainframe Connectivity

If a node failure occurs, the node failure recovery process copies the entire configuration of a healthy node to the replacement node, including mainframe configurations.