Scenario 2: Loading in a Production Environment

Teradata Aster® Client Guide

Product: Aster Client
Release Number: 7.00
Published: May 2017
Language: English (United States)
Last Update: 2018-04-13
dita:id: B700-2005
Product Category: Software
Teradata Aster recommends that, before you bulk load data into your production system, you first verify your loading procedure on a test system. In this section, we’ll focus on the differences between a data loading job in a testing environment and one in a production system.

In a production deployment, we recommend that you follow the steps below to ensure you take advantage of the bulk loading and error logging tools in Aster Database:

  1. Define quote and escape characters that don’t conflict with the data.

    If your data contains double-quote characters, the easiest way to preserve them as literals is to choose a character that does not occur anywhere in the input data file and use it as both the quote character and the escape character. Once the double-quote character is no longer treated as the quote character, it is treated as a literal and preserved. For example, if your file contains no tab characters (either as data or as a column separator), you can preserve embedded double-quote characters by specifying

    -q '\t' -e '\t'

    on the ncluster_loader command line.
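    The check in step 1 can be scripted. The sketch below is an illustration, not part of the Aster toolset, and the candidate list is an assumption; it returns the first candidate character that occurs nowhere in the input file, so the character is safe to pass as both -q and -e:

```python
def pick_safe_char(path, candidates=("\t", "\x01", "|")):
    """Return the first candidate character that occurs nowhere in the
    input file, making it safe to use as both the quote (-q) and escape
    (-e) character; return None if every candidate appears in the data."""
    with open(path, encoding="utf-8") as f:
        data = f.read()
    for ch in candidates:
        if ch not in data:
            return ch
    return None
```

    If the file happens to be tab-delimited, the tab candidate is rejected automatically because it appears in the data as the column separator.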

  2. Avoid extra spaces in your data, because they can be significant.

    Quotes can be used on any portion of a data field, typically around special characters. For example, with the default CSV mode, this is the usual way to handle commas within a string:

    1,2,"red,blue,green"

    But putting quotation marks around each comma is equivalent:

    1,2,red","blue","green

    This can introduce problems when working with varchar columns, because many people put a space between the comma and the quote, and that space is considered significant. For example:

    1,2, "red,blue,green"

    In this example, the third column will be loaded with a space before the r character.
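    The quoting behavior described above can be modeled in a few lines of code. The sketch below is a simplified model for illustration, not Aster's actual parser (it ignores escape characters and doubled quotes); it treats a quote character as toggling quoted mode wherever it appears in a field:

```python
def split_csv_line(line, sep=",", quote='"'):
    """Split one line using toggle-anywhere quoting: a quote character
    flips quoted mode at any position in a field (simplified model of
    the behavior described above; not Aster's actual parser)."""
    fields, buf, in_quotes = [], [], False
    for ch in line:
        if ch == quote:
            in_quotes = not in_quotes      # quote toggles mode, is never emitted
        elif ch == sep and not in_quotes:
            fields.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
    fields.append("".join(buf))
    return fields

print(split_csv_line('1,2,"red,blue,green"'))    # ['1', '2', 'red,blue,green']
print(split_csv_line('1,2,red","blue","green'))  # ['1', '2', 'red,blue,green']
print(split_csv_line('1,2, "red,blue,green"'))   # ['1', '2', ' red,blue,green']
```

    The last call shows the problem: the space before the opening quote is outside any quoted region, so it survives as the first character of the third field.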

  3. Consider performance when loading autopartitioned tables.

    Beginning in Aster Database version 5.0, loading logically partitioned tables incurs much lower overhead than in previous releases. The improvement is most noticeable when loading data into a table with many child partitions.

  4. Run ANALYZE after you load each reasonably sized batch.

    When bulk loading data, it’s important to run ANALYZE regularly. In ncluster_loader, use the --auto-analyze (or -z) flag to do this. With this flag present, ncluster_loader will run ANALYZE on the target table(s) after it loads the data. This flag also sets the hint bits on the table(s).

    While running, ANALYZE requires an amount of disk space on the cluster proportional to the amount of new data in the table being analyzed. For this reason, Teradata Aster recommends that you run ANALYZE after every load, rather than waiting for multiple loads to finish. Waiting too long can result in a large amount of the cluster’s disk space being consumed while ANALYZE runs. Daily child tables offer a good example: when you load to daily child tables, run ANALYZE on each child table after you load the day’s data into that table.
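    For daily child tables, the per-table ANALYZE recommendation can be folded into the load loop. The sketch below only assembles the command lines; the <parent>_YYYYMMDD naming scheme and the file names are assumptions for illustration. Each invocation carries --auto-analyze so ANALYZE runs on that child table right after its data lands:

```python
def daily_load_plan(parent, days):
    """Build one ncluster_loader invocation per day, loading each day's
    file into that day's child table with --auto-analyze so ANALYZE runs
    immediately after each load rather than once after all of them."""
    plan = []
    for day in days:
        child = f"{parent}_{day}"  # assumed child-table naming scheme
        plan.append(["ncluster_loader", child, f"{day}.csv", "--auto-analyze"])
    return plan
```

    Connection options are omitted; a real job would append host, database, and table-definition arguments before running each command.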

  5. Stay informed about cluster activity.

    As with any other production system component, it is important to stay informed about any irregular activity in the cluster, and in particular about what is happening in the data loading path.

    If you want to make sure that large loads with too many malformed rows do not complete successfully, set an error logging limit for your load operation (--el-enabled --el-limit = '100000'). Choose a limit small enough that most loads complete successfully, while loads with an unexpectedly high proportion of malformed rows are forced to fail. The exact value depends on the size of the files being loaded and the expected malformed row ratio.

    You should use an alert system to monitor the size of your error logging tables.
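    One way to pick the --el-limit value is to work backward from the expected row count and malformed ratio. The helper below is an illustrative rule of thumb, not an Aster utility, and the slack multiplier is an assumption to tune per data source:

```python
def error_limit(expected_rows, expected_bad_ratio, slack=5.0):
    """Return an --el-limit value that tolerates the usual malformed-row
    count (expected_rows * expected_bad_ratio) with some slack, while
    still failing loads whose malformed counts are far above normal."""
    return max(1, int(expected_rows * expected_bad_ratio * slack))
```

    For a 1,000,000-row file with a typical 0.1% malformed rate, this yields a limit of 5000: normal loads stay well under it, while a genuinely broken feed trips it and fails the load.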

  6. Keep an eye on the volume of data in the load error logging tables.

    The system does not perform automatic truncation of the error logging tables! The error logging tables are regular tables in Aster Database. The same operations that an administrator would apply to regular tables thus need to be applied to error logging tables. Tasks include the appropriate setup of child table hierarchies for efficient data processing, regular VACUUM operations to free up disk space occupied by dead rows, and the monitoring of table growth over time.
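    A periodic size check on the error logging tables can feed the alerting mentioned in the previous step. The sketch below assumes you have already collected row counts (for example, with a row-count query per table); the threshold is an illustrative assumption, not an Aster recommendation:

```python
def tables_needing_attention(row_counts, limit=10_000_000):
    """Given {error_table_name: row_count}, return the tables whose size
    suggests it is time to VACUUM, archive, or truncate them
    (the default threshold is illustrative only)."""
    return sorted(t for t, n in row_counts.items() if n > limit)
```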

  7. Use an error logging file for faster re-try of rows that failed.

    Have ncluster_loader create a file containing the row data that failed to load. To do this, pass the argument --el-errfile <my_error_file.txt> to the tool. When the load completes, you can inspect the contents of the error logging file, fix issues you find there, and reload from that file.
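    The fix-and-reload cycle can be scripted. The sketch below is an illustration: the caller-supplied fix() callback and the file handling are assumptions, and it simply rewrites the failed rows into a new file to reload from:

```python
def prepare_retry_file(errfile, retry_file, fix):
    """Apply a caller-supplied fix() to each failed row captured via
    --el-errfile and write the repaired rows to retry_file for reload.
    fix() returns the corrected line, or None to drop the row."""
    kept = 0
    with open(errfile, encoding="utf-8") as src, \
         open(retry_file, "w", encoding="utf-8") as dst:
        for line in src:
            fixed = fix(line.rstrip("\n"))
            if fixed is not None:
                dst.write(fixed + "\n")
                kept += 1
    return kept
```

    For example, fix=lambda row: row.replace(', "', ',"') would strip the significant space discussed in step 2 before the retry.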

  8. Watch your load error statistics.

    To monitor error rates, check the system statistics tables, nc_all_errorlogging_stats and nc_user_errorlogging_stats. These tables show you the malformed and well-formed row counts for each load operation. Records from the statistics tables can be correlated with the malformed rows in the load error logging tables. Both the error logging and statistics tables contain an error logging label, which can be used to identify a particular loading operation.
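    A monitoring job can turn those statistics into alerts. The sketch below assumes each statistics record has been fetched into a dict holding the label and the two row counts; the key names are assumptions for illustration, not the actual column names:

```python
def flag_bad_loads(stats_rows, threshold=0.01):
    """Return the error logging labels of load operations whose malformed
    fraction exceeds the threshold; stats_rows mirrors records from the
    nc_*_errorlogging_stats tables (dict key names assumed)."""
    flagged = []
    for row in stats_rows:
        total = row["malformed"] + row["wellformed"]
        if total and row["malformed"] / total > threshold:
            flagged.append(row["label"])
    return flagged
```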

  9. Trace malformed input data to the data source.

    A production system must provide the means to find the source of a problem when an operation fails. When loading data, you can use the error information about malformed rows (the SQL error code and error message, sqlerrcode and errmessage respectively, in the load error logging table) to diagnose problems in the input data. To capture this information, invoke your data loading operation with error logging enabled (--el-enabled). As described above, this information can then be correlated with the error logging statistics tables to find the source of the malformed data. To make that correlation easier, apply a custom error logging label to each load operation (--el-enabled --el-label = 'my_data_source').
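    Once loads run with --el-enabled and per-source labels, tracing malformed data back to its source reduces to grouping the logged errors. The sketch below assumes the error rows have been fetched into dicts carrying the label and sqlerrcode columns; the key names and sample values are illustrative assumptions:

```python
from collections import Counter

def top_error_sources(error_rows, n=3):
    """Count load errors by (error logging label, sqlerrcode) so the
    noisiest data-source / error-code combinations surface first;
    error_rows mirrors the load error logging table (dict keys assumed)."""
    counts = Counter((r["label"], r["sqlerrcode"]) for r in error_rows)
    return counts.most_common(n)
```

    The top entries point directly at which labeled feed is producing which class of malformed rows, which is usually enough to start the conversation with the upstream data owner.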