Usage Notes | UNICODE PASS THROUGH | VantageCloud Lake - UNICODE PASS THROUGH Usage Notes - Teradata Vantage

Teradata® VantageCloud Lake

Deployment
VantageCloud
Edition
Lake
Product
Teradata Vantage
Published
January 2023
ft:locale
en-US
ft:lastEdition
2024-12-11
dita:mapPath
phg1621910019905.ditamap
dita:ditavalPath
pny1626732985837.ditaval
dita:id
phg1621910019905
  • You may have existing processes in place to cleanse Unicode data before loading it into VantageCloud Lake, such as access modules and UDFs. To take full advantage of Unicode Pass Through, you must change or eliminate prior methods of cleansing Unicode data before loading. See the Teradata Parallel Transporter (TPT) documentation for details on configuring how TPT uses access modules. For details about configuring access modules to allow pass through characters, see Teradata® Tools and Utilities Access Module Reference, B035-2425, especially the sections regarding automatic character conversions.
  • PTCs have default character properties; therefore, case sensitivity, object name support, collation, and inequality comparisons are not supported.
  • Unicode delimited character literals support PTCs.

    For example: SELECT feed FROM twitter WHERE feed LIKE U&'%#+01F602%' UESCAPE '#';

  • PTCs in SMP require two 16-bit UTF-16 code units, or 4 bytes. Therefore, VARCHAR(2) or CHAR(2) are the minimum sizes required to store a PTC.
  • CHARACTER(4) is the minimum size required to import a PTC using a USING clause.
  • PTCs do not affect compression. The TransUnicodeToUTF8 compression function compresses surrogate pairs to a single 4-byte UTF8 encoded value. TransUTF8ToUnicode decompresses the value to surrogate pairs.

Truncation

In sessions where Unicode Pass Through is enabled, truncation of a PTC occurs when the destination buffer does not have 4 bytes to contain the entire surrogate pair. If only 2 bytes remain in a fixed CHAR string, then the string is padded with a SPACE (U+0020) in Teradata mode. In ANSI mode, the truncation mode error is returned.

Format Strings

The length of an ‘X’ format string is in UTF-16 code units. So, FORMAT ‘X(2)’ is the minimum size format string required to display a PTC.

Export Width

The representation of supplementary code points is four bytes long in the UTF8 and UTF16 character sets. This may suggest that the export width for the session must be four bytes per character to correctly process these characters. However, the representation internally is UTF-16 as two code units. Export width sees this as two characters, so unless a site-defined export width is specified, export width reserves four bytes for two characters exported to UTF16 and six bytes when exported to UTF8.

Therefore, there is no export width issue for supplementary code points unless you use a site-tailored export width. For sessions with tailored export width that may cause truncation of strings with supplementary code points, truncation was already possible unless the repertoire is known. If the repertoire of characters is being managed, pass-through sessions are not an issue.

Collation

For Unicode data, PTCs sort at the position of the error character but in Unicode code point order. Characters off the BMP, such as those represented as surrogate pairs, collate based on their UTF-16 code point order. This is true even for CHARSET_COLL collation for UTF8 sessions.

UNICODE PASS THROUGH does not provide full support for PTCs. With respect to collation, pass through surrogate pairs sort as if composed of two characters. In the current system, for all ASCII, EBCDIC, JIS_COLL, CHARSET_COLL, and MULTINATIONAL collations, code points in the range 0xD800 to 0xDFFF sort before U+FFFF. For example, under ASCII case-specific collation, a pass through surrogate pair 0xD800 0xDC00 is treated as two Unicode UTF-16 characters, U+D800 and U+DC00 and are sorted before U+FFFF.

Comparisons

  • Equality and the LIKE operators work correctly with PTCs.
  • Inequality predicates may not work correctly with PTCs.

String Functions

PTCs are not affected by most string functions, but the following are exceptions:
  • Indexing and counting functions operate on UTF-16 code units.
  • Substring can split surrogate pairs resulting in U+FFFD.
  • UNICODE_TO_UNICODE_NF(K)C and UNICODE_TO_UNICODE_NF(K)D TRANSLATE functions support the normalization of PTCs producing a normalized result string.
  • The TRANSLATE_CHK function used with Unicode-to-Unicode translations stops processing only when encountering a U+FFFD.
  • CASE translations may not work correctly.

Stored Procedures

In a session where UPT is disabled, you cannot create a stored procedure that references a PTC inside the stored procedure body. Such a procedure can only be created within a Pass Through session.

When calling a stored procedure that is defined with a PTC parameter or references a PTC in the stored procedure body, the following applies:
  • In a session where UPT is disabled, you cannot call a stored procedure that is defined with a PTC parameter.
  • In a Pass Through session, you can successfully call a stored procedure that is defined with a PTC parameter and pass a PTC literal to it as an argument.
  • In all sessions (UPT enabled or disabled), you can call a stored procedure that references a PTC in the body of the the stored procedure.
The following table summarizes the usage for stored procedures with PTCs.
Stored Procedure Creation or Execution Current Session Type Expected Result
Creating a stored procedure that references a PTC in the procedure body Pass Through session Successful
UPT is disabled Error
Calling a stored procedure that is defined with a PTC parameter Pass Through session Successful
UPT is disabled Error
Calling a stored procedure that references a PTC in the procedure body Pass Through session Successful
UPT is disabled Successful
Calling a stored procedure that is defined with a PTC parameter and also references a PTC in the procedure body Pass Through session Successful
UPT is disabled Error