UTF8 Client Character Set Support

Lake - Database Reference

Deployment: VantageCloud
Edition: Lake
Product: Teradata VantageCloud Lake
Published: February 2025
Last edition: 2025-11-21

The Teradata Universal Coded Character Set Transformation Format (UTF8) client character set supports UTF8, a standard way of encoding Unicode character data that is optimized for backward compatibility with ASCII. This character set is usable for all languages. In Teradata UTF8, a character can consist of one to three bytes.
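As a rough illustration of the one-to-three-byte range, the following sketch counts the encoded bytes of a few sample characters. It uses Python's built-in `utf-8` codec, which is assumed here to produce the same bytes as Teradata UTF8 for these code points; the helper name `utf8_length` and the sample characters are chosen for illustration only.

```python
# Sample characters chosen for illustration; Python's standard "utf-8" codec
# is assumed to match Teradata UTF8 for these code points.
samples = ["A", "\u03f1", "\u3000"]  # U+0041, U+03F1, U+3000


def utf8_length(ch: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of one character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")
```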

The UTF8 client character set is permanently enabled for use in Database Engine 20.

Value of Byte in UTF8 String | Meaning
Less than 0x80 | Byte represents the same character as in standard ASCII.
Greater than or equal to 0x80 | Byte is part of a multibyte sequence and is not a standard ASCII character.
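The 0x80 threshold can be checked with a single comparison per byte. The sketch below is illustrative only: `is_ascii_byte` is a hypothetical helper, and the sample string is encoded with Python's standard `utf-8` codec, which is assumed to match Teradata UTF8 for these characters.

```python
def is_ascii_byte(b: int) -> bool:
    """True if the byte value is below 0x80, i.e. a standard ASCII character."""
    return b < 0x80


# "A" followed by U+03F1 encodes to the bytes 0x41, 0xCF, 0xB1.
data = "A\u03f1".encode("utf-8")
flags = [is_ascii_byte(b) for b in data]
print(flags)  # [True, False, False]
```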

UTF8 Multibyte Sequences

To determine the length of a byte sequence in UTF8, examine the first byte.

First Byte | Byte Sequence Length
High order bit is zero. | One byte.

This leaves seven bits to encode information.

A character whose Unicode value can be represented in seven bits is represented as a single byte containing that value. For example, Unicode value 0x0041 is transformed to UTF8 byte 0x41.

Three high order bits are 110. | Two bytes.

The second byte has the two high order bits set to 10. There are five free bits in the first byte and six free bits in the second byte. This allows eleven bits to represent a numeric value.

A character that has a Unicode value that can be represented in eleven bits, and cannot be represented by a shorter UTF8 sequence, is represented as two bytes, where the free bits contain the Unicode value. For example, Unicode value 0x03F1 is transformed to UTF8 byte sequence 0xCF 0xB1.

Four high order bits are 1110. | Three bytes.

The second and third bytes have the two high order bits set to 10. There are four free bits in the first byte and six free bits in each of the second and third bytes. This allows sixteen bits to represent a numeric value.

A character that has a Unicode value that can be represented in sixteen bits, and cannot be represented by a shorter UTF8 sequence, is represented as three bytes, where the free bits contain the Unicode value. For example, Unicode value 0x3000 is transformed to UTF8 byte sequence 0xE3 0x80 0x80.
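The first-byte rules and the bit packing described above can be sketched in a few lines of Python. This is a minimal illustration, not Teradata's implementation: the function names `sequence_length` and `encode_code_point` are hypothetical, only values up to 0xFFFF are handled (matching the three-byte limit stated above), and no check is made for values that are not valid characters.

```python
def sequence_length(first_byte: int) -> int:
    """Return the sequence length (1 to 3) implied by the first byte."""
    if (first_byte & 0x80) == 0x00:  # 0xxxxxxx: high order bit is zero
        return 1
    if (first_byte & 0xE0) == 0xC0:  # 110xxxxx: three high order bits are 110
        return 2
    if (first_byte & 0xF0) == 0xE0:  # 1110xxxx: four high order bits are 1110
        return 3
    raise ValueError(f"0x{first_byte:02X} does not start a 1- to 3-byte sequence")


def encode_code_point(cp: int) -> bytes:
    """Pack a Unicode value (0x0000 to 0xFFFF) into the shortest sequence."""
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("only values that fit in sixteen bits are handled")
    if cp < 0x80:  # fits in seven bits: one byte
        return bytes([cp])
    if cp < 0x800:  # fits in eleven bits: five free bits + six free bits
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    # fits in sixteen bits: four + six + six free bits
    return bytes([
        0xE0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


for value in (0x0041, 0x03F1, 0x3000):
    encoded = encode_code_point(value)
    print(f"0x{value:04X} -> {encoded.hex(' ')} ({sequence_length(encoded[0])} byte(s))")
```

Run against the three examples in the text, the sketch reproduces the byte sequences 0x41, 0xCF 0xB1, and 0xE3 0x80 0x80.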