On a Vantage multi-node system, NFR automatically replaces failed nodes with the same number of available hot standby nodes (HSNs). If no HSNs are available, or there are not enough of them, NFR automatically spins up one or more replacement nodes. For each replacement, it detaches the network-attached storage from the failed node, reattaches that storage to the new VM, migrates the secondary IPs from the failed node's NIC to the new VM's NIC, and reinstates the configuration. Each replacement node is deployed from a snapshot of the active (control) node.
NFR takes longer than a typical TPA reset. If your node does not automatically recover after 10 to 15 minutes, check the deployment logs in your Azure resource group.
For assistance, contact Teradata Services.
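If you prefer the Azure CLI to the portal, the deployment-log check described above might be sketched as follows. This is an illustration under assumptions, not a documented Teradata procedure: the resource group and deployment names are hypothetical placeholders, and it assumes the logs in question are the resource group's deployment records. By default the script only prints the commands; set DRY_RUN=0 to execute them after signing in with az login.

```shell
#!/usr/bin/env bash
# Sketch only: RESOURCE_GROUP and DEPLOYMENT_NAME are hypothetical placeholders.
RESOURCE_GROUP="${RESOURCE_GROUP:-my-vantage-rg}"
DEPLOYMENT_NAME="${DEPLOYMENT_NAME:-my-nfr-deployment}"

# Print the command when DRY_RUN=1 (the default); execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# List the deployments recorded in the resource group.
list_deployments() {
  run az deployment group list --resource-group "$1" --output table
}

# List the individual operations of one deployment to find the step that failed.
list_deployment_operations() {
  run az deployment operation group list --resource-group "$1" --name "$2" --output table
}

list_deployments "$RESOURCE_GROUP"
list_deployment_operations "$RESOURCE_GROUP" "$DEPLOYMENT_NAME"
```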
Node Recovery Process for Multi-Node Clique (MNC) Deployment:
For a Multi-Node Clique Teradata system on Azure, NFR does not activate automatically when a node fails. If the failure is caused by an issue on the cloud service provider's side, the provider usually restores the node without manual intervention. However, if restoration fails, the customer must recover the instance manually. Follow the procedure below to detect failed nodes and recover them by invoking NFR.
- Identify the node that is down using bam -i
- Run tdc-nodestart-mnc
- Reboot the node that was down (identified in the first step) and that the second step has now brought back up. You can do this from either the Azure portal or the command prompt of the node itself.
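The three steps above can be laid out as a short script. Treat it as a hedged checklist rather than a turnkey tool: it assumes bam -i and tdc-nodestart-mnc are run from a node that is still up, the resource group and VM names are hypothetical placeholders, and the Azure CLI call stands in for the portal reboot. In practice you would read the output of bam -i before deciding which VM to reboot, so the default mode (DRY_RUN=1) only prints the commands in order.

```shell
#!/usr/bin/env bash
# Sketch only: RESOURCE_GROUP and NODE_VM are hypothetical placeholders.
RESOURCE_GROUP="${RESOURCE_GROUP:-my-vantage-rg}"
NODE_VM="${NODE_VM:-my-failed-node-vm}"

# Print the command when DRY_RUN=1 (the default); execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

recover_mnc_node() {
  local rg="$1" vm="$2"
  # Step 1: identify the node that is down.
  run bam -i || return 1
  # Step 2: invoke NFR for the MNC deployment.
  run tdc-nodestart-mnc || return 1
  # Step 3: reboot the node that step 2 brought up (Azure CLI instead of the portal).
  run az vm restart --resource-group "$rg" --name "$vm"
}

recover_mnc_node "$RESOURCE_GROUP" "$NODE_VM"
```

To reboot from the node's own command prompt instead, a standard Linux reboot (for example, sudo reboot) would replace the az vm restart call.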
If NFR is triggered, logs are available in /var/log/NFR/tdc-nodestart.log.
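A quick way to inspect that log is to print its most recent lines. The helper below is a small sketch using only standard tools; the function name and the default of 50 lines are illustrative choices, not part of NFR.

```shell
#!/usr/bin/env bash
# Log path taken from the text above; override NFR_LOG to point elsewhere.
NFR_LOG="${NFR_LOG:-/var/log/NFR/tdc-nodestart.log}"

# Print the last N lines of the log (default 50), or a note if it cannot be read.
show_nfr_log() {
  local log="$1" lines="${2:-50}"
  if [ -r "$log" ]; then
    tail -n "$lines" "$log"
  else
    echo "No readable NFR log at $log"
  fi
}

show_nfr_log "$NFR_LOG" 50
```

If the helper reports that no log is readable, NFR may not have been triggered on that node, or the current user may lack permission to read the file.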
By default, NFR performs a "Stop + Start" on the failed instance. If this fails to recover the node, redeploy the node manually from the Azure portal.
After manual recovery, if the database state shows "PDE state: DOWN/TDMAINT" on the recovered node, run the command: psh -N <recovered node bynet ID> 'tdmaint -up nodestart'. For example, psh -N byn001-02 'tdmaint -up nodestart'.
Running the previous command changes the PDE state to "DOWN/HARDSTOP".
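To avoid quoting mistakes when substituting the BYNET ID, the command can be assembled by a small helper and reviewed before it is run. This is only a convenience sketch: the function name is hypothetical, and it prints the command rather than executing it.

```shell
#!/usr/bin/env bash
# Build the tdmaint command for a given BYNET node ID and print it for review.
tdmaint_up_cmd() {
  local node="$1"
  if [ -z "$node" ]; then
    echo "usage: tdmaint_up_cmd <recovered node bynet ID>" >&2
    return 1
  fi
  printf "psh -N %s 'tdmaint -up nodestart'\n" "$node"
}

# Example with the node ID used in the text; override NODE_ID for another node.
tdmaint_up_cmd "${NODE_ID:-byn001-02}"
```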
After the NFR command recovers the failed node, the node is visible in the BYNET but remains in the DOWN/HARDSTOP state. The HSN is added as the active TPA.