This node failure recovery process applies only to Teradata Database MPP systems.
Node failures may be related to hardware, software, operating system, network, or Azure platform issues. When a node fails, the failed node is automatically stopped/deallocated, and diagnostic information may be lost. The failed node must be stopped/deallocated before it can be recovered, because Azure platform restrictions otherwise prevent its NICs from being detached.
When a node fails, a replacement node (similar to a hot standby node) automatically spins up. The network-attached storage and NICs are detached from the failed node and reattached to the new VM, and the configuration is reinstated. The replacement node is based on a snapshot of a healthy operating system disk of the currently active (control) node. The replacement node has the same private IPs and public IP as the replaced node.
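The replacement sequence described above can be pictured with a small Python model. This is only an illustrative sketch, not the recovery implementation: the Node class, its fields, and the replace_failed_node function are hypothetical names introduced here, and no Azure API is called.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical, simplified model of a database VM."""
    name: str
    private_ips: List[str]
    public_ip: str
    os_disk: str
    nics: List[str] = field(default_factory=list)
    data_disks: List[str] = field(default_factory=list)  # network-attached storage
    state: str = "running"


def replace_failed_node(failed: Node, control: Node) -> Node:
    """Model the documented steps for replacing one failed node."""
    # 1. The failed VM is stopped/deallocated so its NICs can be detached.
    failed.state = "deallocated"

    # 2. The replacement is built from a snapshot of the control node's
    #    healthy OS disk (the naming scheme here is an assumption).
    replacement = Node(
        name=failed.name + "-replacement",
        private_ips=[],
        public_ip="",
        os_disk="snapshot-of-" + control.os_disk,
    )

    # 3. Network-attached storage and NICs move from the failed node
    #    to the new VM.
    replacement.data_disks, failed.data_disks = failed.data_disks, []
    replacement.nics, failed.nics = failed.nics, []

    # 4. Because the NICs move over, the replacement keeps the same
    #    private IPs and public IP as the node it replaces.
    replacement.private_ips = list(failed.private_ips)
    replacement.public_ip = failed.public_ip
    return replacement
```

In this model the failed node ends up deallocated with no NICs or data disks, while the replacement carries the original addresses, which mirrors why clients do not need to be reconfigured after recovery.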
Node failures are handled differently when the VM has local storage. When such a node fails, the data on its local storage is lost. Although the node is replaced and comes back online, the AMPs on the recovered VM display as FATAL and offline. The other vprocs on the system are online and in the configuration. To fully restore a VM that has local storage, you must run Fallback Recovery and rebuild the AMPs. See Rebuilding AMPs after Failure and Running the Script to Rebuild AMPs. For assistance, contact Teradata Customer Support.
Node failure recovery takes longer than a typical TPA reset. A node can fail for many reasons, and the cause may be difficult to determine. If your node does not automatically recover after 10 to 15 minutes, first check the deployment logs in your Azure resource group. For additional assistance, contact Teradata Customer Support.
For troubleshooting issues with BYOL, see Licensing Issues If Nodes Fail.
Before a Node Failure Occurs
Before a node failure occurs, you can optionally configure the VM to terminate instead of stop when a node fails. See Configuring the VM State for Node Failure Recovery.
When a Single Node Failure Occurs
- Create an Azure Active Directory application and the service principal. See Creating an Azure Active Directory Application and the Service Principal.
- Enable the node failure recovery feature. See Enabling Node Failure Recovery.
When Multiple Node Failures Occur
When two or more nodes fail at the same time, all nodes can be replaced at the same time as long as one node remains running to act as the node failure recovery control node. However, if both BYNET relay nodes in the database are unavailable during node failure recovery, the database stops. If one or more replacement VMs cannot be spun up, the database is stopped.
If two or more nodes fail at different times while node failure recovery is in progress, contact Teradata Customer Support.
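The rules in this section can be summarized as a small decision function. The following Python sketch is an illustrative model only: the Outcome values, the multi_node_outcome function, and its parameters are hypothetical names, and the case where no node remains running is flagged separately because this section does not describe what happens then.

```python
from enum import Enum
from typing import Set


class Outcome(Enum):
    REPLACE_ALL = "all failed nodes are replaced at the same time"
    DATABASE_STOPPED = "the database is stopped"
    CONTACT_SUPPORT = "contact Teradata Customer Support"
    NO_CONTROL_NODE = "no running node is left to act as control node"


def multi_node_outcome(
    all_nodes: Set[str],
    failed: Set[str],
    bynet_relays: Set[str],
    replacements_ok: bool = True,
    failed_during_recovery: bool = False,
) -> Outcome:
    """Model the documented outcomes when two or more nodes fail."""
    # Nodes failing at different times while recovery is in progress
    # are a support case.
    if failed_during_recovery:
        return Outcome.CONTACT_SUPPORT

    # At least one node must remain running to act as the control node.
    # The resulting behavior is not documented here, so it is only flagged.
    if not (all_nodes - failed):
        return Outcome.NO_CONTROL_NODE

    # If both BYNET relay nodes are unavailable, the database stops.
    if bynet_relays and bynet_relays <= failed:
        return Outcome.DATABASE_STOPPED

    # If any replacement VM cannot be spun up, the database is stopped.
    if not replacements_ok:
        return Outcome.DATABASE_STOPPED

    return Outcome.REPLACE_ALL
```

For example, on a four-node system with relay nodes n1 and n2, calling multi_node_outcome with failed={"n3", "n4"} falls through every check and reports that all failed nodes are replaced, while failed={"n1", "n2"} reports that the database is stopped.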