Failure detection and recovery (FDR) detects failures and attempts to recover from them automatically, which reduces cluster downtime.
FDR is turned on by default and cannot be disabled.
Alerts are sent to inform you about ongoing ML Engine FDR activities and cluster status. See Automatic Incident Creation Events.