Considerations for Defining Tables

Teradata® VantageCloud Lake

Deployment: VantageCloud
Edition: Lake
Product: Teradata Vantage
Published: January 2023
Last updated: 2024-12-11

A table acquires data attributes when you define its columns in a CREATE TABLE or ALTER TABLE statement with at least a name and a data type phrase.

Data attributes control the internal representation of stored data and determine how that data is presented to a user. You can use the FORMAT phrase to change the external representation returned by a SELECT query.

The table definition directly affects the performance of applications that access that table. Proper planning helps you define a table correctly at creation time. The following sections list issues to consider when planning your tables.

Compression (Algorithmic)

Specify the COMPRESS USING udfname option of the CREATE TABLE or ALTER TABLE statement to compress character data by table column and minimize storage. This option allows you to choose your own compression and decompression algorithms. Algorithmic compression works best on seldom-used data. See Algorithmic Compression.
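As a sketch, assuming the Teradata-supplied UDF pair TD_SYSFNLIB.TransUnicodeToUTF8 and TD_SYSFNLIB.TransUTF8ToUnicode is available on your system, algorithmic compression on a Unicode column might be specified as follows (table and column names are illustrative):

CREATE TABLE Notes (
  NoteId   INTEGER,
  NoteText VARCHAR(1000) CHARACTER SET UNICODE
    COMPRESS USING TD_SYSFNLIB.TransUnicodeToUTF8
    DECOMPRESS USING TD_SYSFNLIB.TransUTF8ToUnicode
);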
Compression (Block-level)

Specify the BLOCKCOMPRESSION = block_compression_option of the CREATE TABLE or ALTER TABLE statement to compress the data blocks of primary and fallback tables to minimize storage. Software-based block-level compression works best on large volumes of seldom-accessed tables. See Software-Based Block-Level Compression.

For Object File System tables, BLOCKCOMPRESSION has a default value that you cannot change.
Compression (Multivalue)

Specify the COMPRESS phrase to compress up to 255 specific values or NULLs to zero space. (When COMPRESS is specified, nulls in nullable columns are automatically compressed.)

Use ALTER TABLE to immediately add, change, or delete compression on a column, a column partition, or a table. See Multivalue Compression.
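For example, a sketch of multivalue compression on a column with a small set of frequently occurring values (the table and the value list are illustrative):

CREATE TABLE Customer (
  CustId INTEGER,
  City   VARCHAR(30) COMPRESS ('London', 'Paris', 'New York')
);

Rows whose City is NULL or one of the listed values store no column data beyond presence bits in the row header.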

Default value of a column

When you create or alter a table, you can specify a default value for a column by using a DEFAULT phrase. Specifying a default value allows the system to supply that value for the column when none is provided. If you do not specify a default value, the default is null.
Default or NULL for aggregation results

During aggregation, an unspecified value is handled differently depending on whether it is represented by a:
  • NULL, which is ignored in the calculation.
  • Default value, which is included in the calculation.

For example, assume you want the average salary of employees in Department 300 as follows:

SELECT DeptNo, AVG(Salary)
FROM Employee
WHERE DeptNo = 300
GROUP BY DeptNo;
If a salary is not known, the result differs depending on how the column is defined. If an unspecified value in the Salary column is defined with the following:
  • A DEFAULT value, then the result includes the default value as the actual value. This may be far from the average result if all values were known.
  • A NULL, then the result is the average of the salaries that are known. NULL is ignored. This may be closer to the actual average.
You can do the following to further control how unknowns are handled during aggregation:
  • Exclude the substituted default from the calculation by converting it to NULL with the NULLIF function; for example:

    NULLIF(Salary, defaultvalue)

    NULLIF returns NULL wherever Salary equals defaultvalue, so defaultvalue is ignored in computing the average.

  • Change an unknown represented by a NULL into the default value with the COALESCE function; for example:

    COALESCE(Salary, defaultvalue)

    where defaultvalue is the value to be used in computing the average.
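Combining this with the earlier query, a sketch that excludes substituted defaults from the average (the default value of 0 is illustrative):

SELECT DeptNo, AVG(NULLIF(Salary, 0))
FROM Employee
WHERE DeptNo = 300
GROUP BY DeptNo;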

Defining consistent data types and column names

Columns for the same data in different tables must have the same type and name for ease of identification. For example, a column named EmpNo in one table must be named EmpNo in another table. Consistency of data across tables is critical. Considerations include:
  • Changing data in one table.
    If you change data in one table, data in other tables may be affected. For example, updating a department number in the Department table also affects the Employee table, which contains a DeptNo column. To maintain data integrity, you can use:
    • A macro, a stored procedure, a trigger, or an application program to update all the tables affected by a change.
    • RI constraints. See Referential Integrity Constraints.
      Referential integrity is supported only on the Block File System on the primary cluster, not on the Object File System.
  • Joining the columns of one table with the columns of another table. Make sure of the following:
    • Join columns are of the same data type and size.
    • Data is consistent in both tables. For example, to join two tables on the Employee Name column, the notation must be identical in both tables (for example, last name first initial, such as “Smith H”).
    • All row-level security-protected tables referenced in a join contain the same constraints.
  • Using tables to store large data objects. Create a table with LOB type columns. This includes:
    • Character Large Object (CLOB): A CLOB column can store character data, such as simple text, HTML, or XML documents.
    • Binary Large Object (BLOB): A BLOB column can store binary objects, such as graphics, video clips, files, and documents.
IDENTITY column

IDENTITY is an optional column attribute that generates a number for every row inserted into the table. An identity column does not have to be the first column in the table or defined as an index. The IDENTITY attribute may be specified such that the generated number is unique.

Identity columns are supported on both the Block File System and the Object File System. However, you can run queries related to an identity column table only on the primary cluster.
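A sketch of an identity column definition (names and numbering parameters are illustrative):

CREATE TABLE Orders (
  OrderId   INTEGER GENERATED ALWAYS AS IDENTITY
            (START WITH 1 INCREMENT BY 1 NO CYCLE),
  OrderDate DATE
);

GENERATED ALWAYS makes the system supply the number for every inserted row; GENERATED BY DEFAULT supplies it only when the INSERT omits a value.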
Joins involving NULLs

If a column used in a join contains either NULLs or generic values, the results may be misleading. For example, assume you want to join the Employee and Department tables to get a listing of employees and their workplaces:
SELECT Name, Loc
FROM Employee, Department
WHERE Employee.DeptNo = Department.DeptNo;

Undesired results may occur. The nulls in each table prevent matches. Employees with an unknown department number (Employee.DeptNo is NULL) and departments without employees are not listed.

To list an employee with an unknown department number and a department with no employees, use a different default value to represent unknowns in the DeptNo column, as follows:

  1. In the Employee table, add a mock “unknown” employee who has a DeptNo equal to the value of the default used for the Department table.
  2. To the Department table, add a mock “unknown” department that has a DeptNo equal to the value of the default used for the Employee table.
In the example query, a full outer join can be used instead of the inner join to return information for nonmatching rows when DeptNo is NULL. (Also, as noted previously, you can use the COALESCE function to change a NULL into a value.) Using outer joins to get nonmatching rows may be more efficient than maintaining mock rows.
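For example, the earlier query rewritten with a full outer join (a sketch using the same illustrative tables):

SELECT Name, Loc
FROM Employee FULL OUTER JOIN Department
  ON Employee.DeptNo = Department.DeptNo;

Employees with no matching department and departments with no employees appear in the result, with NULLs in the columns from the nonmatching side.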

For further discussion on the implications of using NULL and for information on how to design efficient joins, see Join Indexes.

NULL and the GROUP BY clause

If you use a column defined as NULL for grouping aggregate results, the results may be confusing. For example, assume you submit this query to find the average salary for each department in the organization:
SELECT DeptNo, AVG(Salary) 
FROM Employee
GROUP BY DeptNo;
The results can differ, depending on the definition of the DeptNo column:
  • If DeptNo is allowed to be NULL, and two employees have not yet been assigned a department number, then the result lists the computed average for those two employees under a NULL department number. This may be confusing.
  • If DeptNo is defined with DEFAULT and the specified constant is meaningful (such as Unknown), this result is more valid.
You can convert a default value into NULL (with NULLIF) and a NULL into a default value (with COALESCE), as explained previously for aggregation results.
Nullability of a column

Specifying a value when inserting a new data row may not be possible. For example, a new employee may not immediately have a job title. You can explicitly define the nullability of a column with a NULL or NOT NULL phrase.

If no value is given for a column when a row is inserted, and the nullability of the column is:
  • Defined as NOT NULL and a DEFAULT value is not specified, then the INSERT statement returns an error.
  • Not defined, then a NULL is supplied automatically.
Teradata recommends specifying NOT NULL for a column with a default value (if a default value is appropriate) unless there is a valid reason for allowing the column to be nullable. This saves space and may allow more efficient query plans. If the column does need to be nullable, be sure to determine, understand, and document the meaning of NULL for this column, because null has multiple interpretations.
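A sketch of both styles (names are illustrative; whether a default is appropriate depends on the column's meaning):

CREATE TABLE Employee (
  EmpNo    INTEGER NOT NULL,
  JobTitle VARCHAR(30) DEFAULT 'Unassigned' NOT NULL,
  DeptNo   INTEGER  -- nullable; here NULL means "not yet assigned"
);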

If an application program requests data from a column (without NULL indicators) that allows nulls and a NULL is found, a substitute value that is compatible with the data type of the column is returned to the application instead of a null. The substitute value (a zero, blank, or zero-length element string) may be misleading, because NULL is an unknown.

Do not use the phrase NOT IN on nullable columns; use NOT EXISTS instead. NOT IN makes the database work much harder.

Any comparison to NULL results in an UNKNOWN value. NOT EXISTS, however, treats UNKNOWN values as FALSE, as shown in EXPLAIN output.

For example:

1 IN (1,2,NULL) -> 1=1 OR 1=2 OR 1=NULL -> TRUE OR FALSE OR UNKNOWN -> TRUE

3 NOT IN (1,2,NULL) -> 3<>1 AND 3<>2 AND 3<>NULL -> TRUE AND TRUE AND UNKNOWN -> UNKNOWN

See Behavior of Nulls for NOT IN.
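As a sketch using the illustrative Employee and Department tables, a NOT IN predicate on a nullable column can be rewritten with NOT EXISTS:

-- Returns no rows whenever the subquery produces a NULL DeptNo:
SELECT Name
FROM Employee
WHERE DeptNo NOT IN (SELECT DeptNo FROM Department);

-- Same intent; UNKNOWN comparisons are treated as FALSE:
SELECT Name
FROM Employee e
WHERE NOT EXISTS (
  SELECT 1
  FROM Department d
  WHERE d.DeptNo = e.DeptNo
);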

Other considerations

Consider the following:
  • Whether to have a primary index. (See NoPI table, below.)
  • The columns to use in the primary index and whether the PI is unique or not.
  • If the table has a primary index, whether to have one or more levels of row partitioning.
  • If there is no primary index, whether to have column partitioning and, optionally, one or more levels of row partitioning.
  • What secondary indexes to define, if any, and whether the secondary indexes are unique.
  • What constraints are needed on each of the columns.
  • Row header compression for column partitions that have COLUMN format can reduce storage costs and enhance system performance. However, wide column partitions do not benefit from autocompression, so ROW format (no row header compression) may be better.
NoPI table

Use a NoPI table when you want to column-partition the table, when a primary index does not add any value, when you do not yet know what primary index to specify, or when you need a staging table and the restrictions on NoPI tables are acceptable.

To create a staging table, use a NoPI table, which is a table without a PI. Using a NoPI table for staging improves performance when loading data from FastLoad and TPump jobs because the table is not hashed by a PI, so the rows do not have to be sorted into a specified order or redistributed to a specific AMP.

NoPI tables are useful for applications that load data into a staging table that must go through transformation or standardization. The data must be converted before being stored in another table. You can then apply the data from the staging table to a target table through an INSERT ... SELECT, UPDATE-FROM or MERGE INTO statement.
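A sketch of the staging pattern (table and column names are illustrative):

CREATE TABLE Sales_Staging (
  SaleDate DATE,
  StoreId  INTEGER,
  Amount   DECIMAL(10,2)
) NO PRIMARY INDEX;

-- After transformation, apply the staged rows to the target table:
INSERT INTO Sales
SELECT SaleDate, StoreId, Amount
FROM Sales_Staging;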

Explicitly specify a PI or NoPI instead of relying on the default for the PI. The default may not be appropriate and may cause skewing of the data across the AMPs.

See CREATE TABLE and CREATE TABLE AS.