16.10 - Usage Notes - Teradata Database

Teradata Database International Character Set Support

prodname
Teradata Database
vrm_release
16.10
created_date
June 2017
category
Configuration
User Guide
featnum
B035-1125-161K
  • You may have existing processes in place to cleanse Unicode data before loading it into Teradata Database, such as access modules and UDFs. To take full advantage of Unicode Pass Through, you must change or eliminate prior methods of cleansing Unicode data before loading. See the Teradata Parallel Transporter (TPT) documentation for details on configuring how TPT uses access modules. For details about configuring access modules to allow pass through characters, see Teradata Tools and Utilities Access Module Reference, in particular the sections regarding automatic character conversions.
  • PTCs have default character properties; therefore, case sensitivity, object name support, collation, and inequality comparisons are not supported.
  • Unicode delimited character literals support PTCs.

    For example: SELECT feed FROM twitter WHERE feed LIKE U&'%#+01F602%' UESCAPE '#';

  • PTCs in SMP require two 16-bit UTF-16 code units, or 4 bytes. Therefore, VARCHAR(2) or CHAR(2) are the minimum sizes required to store a PTC.
  • CHARACTER(4) is the minimum size required in order to import a PTC using a USING clause.
  • PTCs have no impact on compression. The TransUnicodeToUTF8 compression function will compress surrogate pairs to a single 4 byte UTF8 encoded value. TransUTF8ToUnicode will decompress the value to surrogate pairs.
  • The SDF language used by the tdlocaledef utility to define the system level localization will only support PTCs in the BMP.

Truncation

In sessions where Unicode Pass Through is enabled, truncation of a PTC occurs when the destination buffer does not have 4 bytes to contain the entire surrogate pair. If only 2 bytes remain in a fixed CHAR string, then the string is padded with a SPACE (U+0020) in Teradata mode. In ANSI mode, the truncation mode error is returned.

Format Strings

The length of an ‘X’ format string is in UTF-16 code units. So FORMAT ‘X(2)’ is the minimum size format string required in order to display a PTC.

Export Width

The representation of supplementary code points is four bytes long in the UTF8 and UTF16 character sets. This might suggest that the export width for the session must be four bytes per character to correctly process these characters. However, the representation internally is UTF-16 as two code units. Export width will see this as two ‘characters’, and so unless a site-defined export width is specified, export width will reserve four bytes for two characters exported to UTF16 and six bytes when exported to UTF8. Therefore there is no export width issue for supplementary code points unless a site-tailored export width is employed. For sessions with tailored export width that might cause truncation of strings with supplementary code points, truncation was already possible unless the repertoire is known. If the repertoire of characters is being managed, pass through sessions are not an issue.

Collation

For Unicode data, PTCs sort at the position of the error character but in Unicode code point order. Characters off the BMP, such as those represented as surrogate pairs, would collate based on their UTF-16 code point order. This is true even for CHARSET_COLL collation for UTF8 sessions.

Unicode Pass Through does not provide full support for PTCs. With respect to collation, pass through surrogate pairs sort as if they are composed of two characters. In the current system, for all ASCII, EBCDIC, JIS_COLL, CHARSET_COLL, and MULTINATIONAL collations, code points in the range 0xD800 to 0xDFFF sort before U+FFFF. For example, under ASCII case-specific collation, a pass through surrogate pair 0xD800 0xDC00 is treated as two Unicode UTF-16 characters, U+D800 and U+DC00 and will be sorted before U+FFFF.

Comparisons

  • Equality and the LIKE operators work correctly with PTCs.
  • Inequality predicates may not work correctly with PTCs.

String Functions

PTCs are not affected by most string functions, but the following are exceptions:
  • Indexing and counting functions operate on UTF-16 code units.
  • Substring can split surrogate pairs resulting in U+FFFD.
  • UNICODE_TO_UNICODE_NF(K)C and UNICODE_TO_UNICODE_NF(K)D TRANSLATE functions support the normalization of PTCs producing a normalized result string.
  • The TRANSLATE_CHK function used with Unicode to Unicode translations will not stop processing when a PTC is encountered and will only stop when a U+FFFD is encountered.
  • CASE translations may not work correctly.

Stored Procedures

In a session where UPT is disabled, you cannot create a stored procedure that references a PTC inside the stored procedure body. Such a procedure can only be created within a Pass Through session.

When calling a stored procedure that is defined with a PTC parameter or references a PTC in the stored procedure body, the following applies:
  • In a session where UPT is disabled, you cannot call a stored procedure that is defined with a PTC parameter.
  • In a Pass Through session, you can successfully call a stored procedure that is defined with a PTC parameter and pass a PTC literal to it as an argument.
  • In all sessions (UPT enabled or disabled), you can call a stored procedure that references a PTC in the body of the the stored procedure.
The following table summarizes the usage for stored procedures with PTCs.
Stored Procedure Creation or Execution Current Session Type Expected Result
Creating a stored procedure that references a PTC in the procedure body Pass Through session Successful
UPT is disabled Error
Calling a stored procedure that is defined with a PTC parameter Pass Through session Successful
UPT is disabled Error
Calling a stored procedure that references a PTC in the procedure body Pass Through session Successful
UPT is disabled Successful
Calling a stored procedure that is defined with a PTC parameter and also references a PTC in the procedure body Pass Through session Successful
UPT is disabled Error