UTF8 Multibyte Sequences - Advanced SQL Engine

International Character Set Support

Product

Advanced SQL Engine

Teradata Database

Release Number

17.05

17.00

Published

June 2020

Language

English (United States)

Last Update

2021-01-23

dita:id

B035-1125

Product Category

Teradata Vantage™

To determine the length of a byte sequence in UTF8, examine the high-order bits of its first byte.

IF the...	THEN the sequence is...
high-order bit is 0	one byte long. This leaves seven bits to encode information. If a character has a Unicode value that can be represented in seven bits, it is represented as a single byte containing the Unicode value. For example, Unicode value 0x0041 is transformed to UTF8 byte 0x41.
three high-order bits are 110	a two-byte sequence. The second byte has its two high-order bits set to 10. There are five free bits in the first byte and six free bits in the second byte. This allows eleven bits to represent a numeric value. If a character has a Unicode value that can be represented in eleven bits, and cannot be represented by a shorter UTF8 sequence, it is represented as two bytes, where the free bits contain the Unicode value. For example, Unicode value 0x03F1 is transformed to UTF8 byte sequence 0xCF 0xB1.
four high-order bits are 1110	a three-byte sequence. The second and third bytes have their two high-order bits set to 10. There are four free bits in the first byte and six free bits in each of the second and third bytes. This allows sixteen bits to represent a numeric value. If a character has a Unicode value that can be represented in sixteen bits, and cannot be represented by a shorter UTF8 sequence, it is represented as three bytes, where the free bits contain the Unicode value. For example, Unicode value 0x3000 is transformed to UTF8 byte sequence 0xE3 0x80 0x80.
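
The first-byte rules in the table can be sketched in Python as follows. This is an illustration only, not Teradata code: the function names utf8_sequence_length and decode_code_point are hypothetical, and the sketch covers only the one-, two-, and three-byte forms that the table describes, treating any other first byte as an error.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the sequence length (1, 2, or 3) implied by the first byte.

    Hypothetical helper: only the three forms listed in the table are handled.
    """
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("first_byte must be in the range 0..255")
    if (first_byte & 0x80) == 0x00:   # 0xxxxxxx: high-order bit is zero
        return 1
    if (first_byte & 0xE0) == 0xC0:   # 110xxxxx: three high-order bits are 110
        return 2
    if (first_byte & 0xF0) == 0xE0:   # 1110xxxx: four high-order bits are 1110
        return 3
    # Continuation bytes (10xxxxxx) and longer forms are outside the table.
    raise ValueError(f"0x{first_byte:02X} does not start a 1- to 3-byte sequence")


def decode_code_point(seq: bytes) -> int:
    """Reassemble the Unicode value from the free bits of one sequence.

    Assumption: seq holds exactly one sequence. Overlong encodings are
    not rejected in this sketch.
    """
    length = utf8_sequence_length(seq[0])
    if len(seq) != length:
        raise ValueError("sequence length does not match its first byte")
    free_bits_mask = {1: 0x7F, 2: 0x1F, 3: 0x0F}[length]
    value = seq[0] & free_bits_mask
    for byte in seq[1:]:
        if (byte & 0xC0) != 0x80:     # trailing bytes must start with 10
            raise ValueError("trailing byte does not start with bits 10")
        value = (value << 6) | (byte & 0x3F)   # append six free bits
    return value


print(utf8_sequence_length(0xCF))                        # 2
print(hex(decode_code_point(bytes([0xE3, 0x80, 0x80])))) # 0x3000
```

The masks 0x80, 0xE0, and 0xF0 select the one, three, and four high-order bits named in the table, so each comparison tests exactly one row.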

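
Going in the other direction, the table implies how a Unicode value is packed into the free bits of the shortest sequence that can hold it. The following Python sketch illustrates this under stated assumptions: encode_code_point is a hypothetical name, only values up to 0xFFFF (the three-byte limit given in the table) are accepted, and surrogate values are not specially checked.

```python
def encode_code_point(code_point: int) -> bytes:
    """Encode a Unicode value as the shortest 1-, 2-, or 3-byte sequence.

    Hypothetical helper for illustration; values above 0xFFFF are rejected
    because the table stops at three-byte sequences.
    """
    if code_point < 0:
        raise ValueError("code_point must be non-negative")
    if code_point <= 0x7F:
        # Fits in seven bits: the byte is the Unicode value itself.
        return bytes([code_point])
    if code_point <= 0x7FF:
        # Fits in eleven bits: 110 + 5 free bits, then 10 + 6 free bits.
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point <= 0xFFFF:
        # Fits in sixteen bits: 1110 + 4 free bits, then two 10 + 6 bit bytes.
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    raise ValueError("value needs more than sixteen bits")


print(encode_code_point(0x0041).hex())  # 41
print(encode_code_point(0x03F1).hex())  # cfb1
print(encode_code_point(0x3000).hex())  # e38080
```

Testing the ranges in ascending order is what enforces the "cannot be represented by a shorter UTF8 sequence" condition: a value reaches the two-byte branch only after it has failed the seven-bit test, and the three-byte branch only after it has failed the eleven-bit test.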