Writing Load Scripts for Restartability and Availability
One of the main challenges in data warehouse design is recovering from a failure as quickly as possible. Recovery usually involves fixing the client or server systems, changing configuration parameters or system resources, restarting the interrupted jobs from their last checkpoints, and bringing the system back to normal without resorting to extensive manual effort or piecemeal recovery procedures.
Jobs are often also required to perform "catch up" so that transactions that accumulated during the "failure window" can be applied to the target systems as quickly as possible.
To this end, Teradata PT provides some unique features that allow you to speed up the recovery process without resorting to changing job scripts after a job failure. These features include:
To take advantage of the above features for restartability, job scripts should be designed and implemented according to some best practices. The best practices presented below address the reusability and manageability of job scripts, the flexibility to build and enhance them as data volumes grow and execution environments change, and restartability after job failures. These practices can also be regarded as standard guidelines for building data warehousing processes.
Restarting a Job from a Job Failure
Automatic Restart
An automatic restart means that a job can restart on its own without manual resubmission. With the default start-of-data and end-of-data checkpoints, a job can automatically restart itself when a retryable error, such as a database restart or a deadlock, occurs before, during, or after data loading. Consider the following when dealing with automatic restarts:
Manual Restart
If a job fails and terminates, you can manually restart it by resubmitting the same job with the original job-launching command. By default, all Teradata PT jobs are checkpoint-restartable using one of the two default checkpoints: the one taken before data loading or the one taken after data loading. When a job has multiple steps, a checkpoint is created for each successful step, so the job can restart from the failed step.
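The step-level restart behavior described above can be modeled with a small, purely conceptual Python sketch. It does not show how Teradata PT implements or stores its checkpoints. The function name `run_job`, the JSON checkpoint file, and the step names are all hypothetical and serve only to illustrate why resubmitting the same job resumes at the failed step.

```python
import json
import os


def run_job(steps, checkpoint_path):
    """Run named steps in order, skipping steps a previous run completed.

    Conceptual model only: the JSON file stands in for a job checkpoint
    and is not the Teradata PT checkpoint format.
    """
    completed = []
    if os.path.exists(checkpoint_path):
        # A checkpoint left by a failed run tells us where to resume.
        with open(checkpoint_path) as f:
            completed = json.load(f)

    executed = []
    for name, action in steps:
        if name in completed:
            continue  # step finished in an earlier run; do not repeat it
        action()  # raises on failure, leaving the checkpoint file in place
        completed.append(name)
        with open(checkpoint_path, "w") as f:
            json.dump(completed, f)  # record the successful step
        executed.append(name)

    if os.path.exists(checkpoint_path):
        os.remove(checkpoint_path)  # job finished; checkpoint no longer needed
    return executed
```

Calling `run_job` a second time with the same arguments after a failure plays the role of resubmitting the job with the original command: the steps recorded in the checkpoint are skipped and execution continues from the step that failed.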
Restarting a Job to Perform “Catch Up”
Here are the steps for switching the load protocol to perform “catch up”:
1. Terminate the current job with the TERMINATE command. This forces the job to take a checkpoint before it terminates.
2. Switch the load protocol by either changing the operator in the job variables file or by using another job variables file that has the new operator. The latter method is highly recommended because it prevents users from modifying existing job variables files.
3. Resubmit the same job with the same command options.
Note: Do not clean up the Teradata PT checkpoint files left from the previous run.
The steps above can be easily automated because performing "catch up" is very similar to restarting a job. In most "catch up" cases, you do not need to modify the original scripts. This is possible because Teradata PT provides a single script language, external job variables that isolate changes to one place, and a common checkpoint-restart protocol across operators.
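As one possible way to automate the catch-up steps, the following Python sketch builds the two command lines involved. It assumes the `twbcmd <job name> JOB TERMINATE` and `tbuild -f <script> -v <job variables file> -j <job name>` command forms, which should be checked against the Teradata PT release in use. The function name and the job, script, and file names are hypothetical.

```python
import shlex


def build_catch_up_commands(job_name, script_file, catch_up_vars_file):
    """Return the command lines for the catch-up steps, in order.

    Assumptions (verify against your Teradata PT release):
    - `twbcmd <job> JOB TERMINATE` terminates a running job
    - `tbuild -f <script> -v <job variables file> -j <job name>` submits a job
    """
    # Step 1: terminate the running job so it takes a checkpoint first.
    terminate = ["twbcmd", job_name, "JOB", "TERMINATE"]
    # Steps 2 and 3: resubmit the same script under the same job name, but
    # point -v at a separate job variables file that names the new operator.
    resubmit = ["tbuild", "-f", script_file, "-v", catch_up_vars_file,
                "-j", job_name]
    # Deliberately no checkpoint cleanup between the two commands:
    # the resubmitted job needs the checkpoint files from the previous run.
    return [terminate, resubmit]


if __name__ == "__main__":
    # Hypothetical job, script, and file names.
    for command in build_catch_up_commands(
            "daily_load", "load_job.tpt", "catchup_vars.txt"):
        print(shlex.join(command))
```

A real wrapper would pass each list to `subprocess.run` and wait for the terminated job to exit before resubmitting. Keeping the catch-up operator in its own job variables file means the script and the original variables file stay untouched, so only the `-v` argument differs from the original submission.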