Glossary - Teradata Database

International Character Set Support

Product: Teradata Database
Release Number: 15.10
Language: English (United States)
Last Update: 2018-09-25
dita:id: B035-1125
lifecycle: previous
Product Category: Teradata® Database

A

AMP

Access Module Process

ANSI

American National Standards Institute

AWS

Administration Workstation

B

BTEQ

Basic Teradata Query

BYNET

Banyan Network

C

canonical form

An encoding of data where the meaning of the data can be determined unambiguously.

The practical implication is that a canonical character set can support heterogeneous clients while a noncanonical character set cannot.

A canonical character set is stored in one form only (“canonically”) in the database and can be retrieved by any client that supports those characters in its repertoire.

KANJI1 is the only noncanonical server character set supported by the Teradata Database. KANJI1 is noncanonical because the meaning of the stored data depends on the character set of the client that entered the data; therefore, the data cannot always be shared with a client that uses a different form-of-use.
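The client dependence described above can be illustrated outside the database. The following Python sketch is an illustration only: it uses Python's standard `shift_jis` and `latin-1` codecs as stand-ins for two client character sets and does not reproduce the Teradata KANJI1 implementation.

```python
# The same stored byte means different things to different clients.
stored = b"\xb1"  # a single byte, as one client might have written it

as_shift_jis = stored.decode("shift_jis")  # halfwidth Katakana A (U+FF71)
as_latin1 = stored.decode("latin-1")       # plus-minus sign (U+00B1)

print(repr(as_shift_jis), repr(as_latin1))
```

Because the byte alone does not identify the character, a reader must know which client character set wrote it, which is the situation a canonical encoding avoids.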

character repertoire

Synonym: Repertoire.

A defined set of characters available for use on either the client system or the Teradata Database.

For example, ASCII and EBCDIC share essentially the same character repertoire.

CHARACTER SET clause

An optional clause in the data definition for a character column that defines how the data for that column is to be stored on the server.

If you omit the CHARACTER SET clause, the default server character set for the column is the one specified in the DEFAULT CHARACTER SET clause of the CREATE USER statement for the user accessing the data in the table.

The available server character sets are:

  • LATIN
  • UNICODE
  • KANJISJIS
  • KANJI1
  • GRAPHIC

    CJK

    Chinese, Japanese, Korean

    code page

    A coded character set specific to a particular language or computer or both.

    collation

    Process of ordering character strings according to a collation sequence.

    collation sequence

    A well-defined ordering of characters.

    Compatibility Zone

    Unicode zone that spans the area between U+FE00 and U+FFEF, inclusive, that contains characters that can be mapped to other characters defined by the Unicode standard but which require specific Unicode values to maintain compatibility with legacy character standards.

    This zone contains halfwidth and fullwidth variants of standardized Japanese characters, including Hankaku Katakana and fullwidth ASCII.

    Receivers of compatibility zone characters are free to replace those characters with corresponding “regular” Unicode characters.
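One way to see this replacement in practice is Unicode compatibility normalization. The sketch below assumes Python's standard `unicodedata` module and NFKC normalization as the replacement mechanism; the glossary does not say which mechanism, if any, the Teradata Database uses.

```python
import unicodedata

halfwidth_katakana = "\uff71"           # Hankaku Katakana A, inside U+FE00..U+FFEF
fullwidth_ascii = "\uff21\uff22\uff23"  # fullwidth A, B, C

# NFKC folds compatibility characters to their "regular" counterparts.
regular_katakana = unicodedata.normalize("NFKC", halfwidth_katakana)
regular_ascii = unicodedata.normalize("NFKC", fullwidth_ascii)

print(regular_katakana, regular_ascii)
```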

    conversion

    Synonym: Translation.

    Process of translating one form-of-use to another, for example, translating IBM Kanji that uses SO/SI encoding to the equivalent EUC form.

    cs0, cs1, cs2, cs3

    Four code sets (codeset 0, 1, 2, and 3) used in EUC encoding.
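As a rough illustration for Japanese EUC, the sketch below uses Python's standard `euc_jp` codec. The assignment of code sets shown in the comments (cs0 for ASCII, cs1 for JIS X 0208, cs2 for halfwidth Katakana introduced by the single-shift byte 0x8E) reflects the common EUC-JP convention and is not stated in this glossary; cs3 is omitted.

```python
# One character from each of the first three EUC-JP code sets.
cs0 = "A".encode("euc_jp")        # code set 0: ASCII, one byte
cs1 = "\u6f22".encode("euc_jp")   # code set 1: a Kanji, two bytes
cs2 = "\uff71".encode("euc_jp")   # code set 2: halfwidth Katakana, 0x8E prefix

for name, encoded in (("cs0", cs0), ("cs1", cs1), ("cs2", cs2)):
    print(name, encoded.hex(), len(encoded))
```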

    E

    E2I

    External-to-Internal

    Applies to character conversions between the external client representation and the internal server representation of character strings.

    endianness

    The byte ordering convention of data that is represented with multiple bytes.

    The ordering method is either big endian or little endian. For example, the big endian method indicates the number 256 as the sequence 0x01 0x00. The little endian method indicates the number 256 as 0x00 0x01.
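The example above can be reproduced with a short Python sketch. It assumes a two-byte (16-bit) representation of the number, which is what the byte sequences in the text imply.

```python
value = 256

# Serialize the same integer with each byte-ordering convention.
big = value.to_bytes(2, byteorder="big")        # most significant byte first
little = value.to_bytes(2, byteorder="little")  # least significant byte first

print(big.hex(" "), little.hex(" "))

# Reading the bytes back requires knowing which convention was used.
restored = int.from_bytes(little, byteorder="little")
```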

    EUC

    Extended UNIX Code

    This document generally refers to Japanese EUC.

    Note that EUC is a family of definitions. For example, Chinese, Japanese, and Korean EUC definitions are different from one another.

    G

    Gaiji

    Japanese term referring to characters not defined in the standard. They can be user-defined characters.

    H

    Hankaku

    Literally “half square”.

    Standard orthography for Japanese has each character occupy roughly a square. European orthography tends to have characters that are narrower than their height. Placing two “half square” characters per square allows both orthographies to coexist.

    Many encodings include both Hankaku and Zenkaku (“full square”) versions of Katakana and other characters independently.

    Hiragana

    Japanese cursive alphabet used to write Japanese words that have no Kanji representation.

    I

    I2E

    Internal-to-External

    Applies to character conversions between the internal server representation and the external client representation of character strings.

    IBM Kanji

    IBM mainframe character sets used for Kanji.

    Their definitions include single-byte Latin, single-byte digits, Katakana, double-byte Kanji, and other common Japanese characters.

    ideograph

    For this document, the definition refers to Japanese Katakana, Kanji, and Hiragana characters.

    ISO 8859

    A set of standards for eight-bit character sets.

    Code points 20-7F are essentially identical to ASCII across all 8859 definitions, while code points 00-1F and 80-9F are not defined.

    ISO 8859-1, the Latin1 standard, and ISO 8859-15, the Latin9 standard, are designed for Western Europe and include various diacritical characters.
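The shared ASCII range and the difference between Latin1 and Latin9 can be checked with Python's standard `iso8859-1` and `iso8859-15` codecs. The sketch is illustrative; the choice of byte 0xA4 as the differing code point is an example picked here, not one taken from this glossary.

```python
# Printable ASCII range decodes identically in both standards.
ascii_range = bytes(range(0x20, 0x7F))
same_as_ascii = (
    ascii_range.decode("iso8859-1")
    == ascii_range.decode("iso8859-15")
    == ascii_range.decode("ascii")
)

# The same upper-range byte can map to different characters.
latin1_char = b"\xa4".decode("iso8859-1")   # currency sign
latin9_char = b"\xa4".decode("iso8859-15")  # euro sign

print(same_as_ascii, latin1_char, latin9_char)
```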

    ISO 10646

    Character set defined to encode virtually every language.

    The Unicode standard is kept in sync with ISO 10646.

    J

    JIS

    Japanese Industrial Standard

    JIS X 0201

    JIS 8-bit character codes used to represent Romaji, Katakana, and control characters.

    JIS X 0208

    JIS 16-bit character code used to represent the combined Kanji, Katakana, Hiragana, Romaji, Graphic, Russian, and Greek alphabets.

    JIS X 0212

    JIS 16-bit character code used to represent supplemental Kanji, Latin diacriticals, Greek diacriticals, Cyrillic characters, and other miscellaneous alphabetic characters and symbols.

    K

    Kanji

    Japanese ideographic writing system based on the Chinese writing system.

    Katakana

    Japanese block-style phonetic symbols used to write foreign words in Japanese.

    Many platforms have these two Katakana encodings:

  • 8-bit Hankaku
  • 16-bit Zenkaku

    L

    Latin1 Repertoire

    Character set repertoire containing all characters defined by the ISO 8859-1 Latin1 standard.

    Latin9 Repertoire

    Character set repertoire containing all characters defined by the ISO 8859-15 Latin9 standard.

    N

    nonspacing character

    Character with a display width of 0 that is positioned with reference to a preceding base character.
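For illustration, the Unicode combining acute accent (U+0301) is a nonspacing character. The sketch below assumes Python's standard `unicodedata` module and treats the Unicode general category `Mn` (Mark, nonspacing) as the marker for such characters.

```python
import unicodedata

# A base letter followed by a nonspacing combining acute accent.
base_plus_mark = "e\u0301"

# General categories: 'Ll' for the lowercase letter, 'Mn' for the nonspacing mark.
categories = [unicodedata.category(ch) for ch in base_plus_mark]

# A nonzero combining class shows the mark attaches to the preceding character.
combining_class = unicodedata.combining("\u0301")

print(len(base_plus_mark), categories, combining_class)
```

The string holds two code points even though it displays as a single accented letter.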

    P

    pad character

    Character used to indicate a space.

    Different character encodings use different pad characters.
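As a hedged illustration, the sketch below encodes the space character in three encodings using Python's standard codecs. It assumes that the pad character is the space in each case and uses `cp037` as a representative EBCDIC code page; neither assumption comes from this glossary.

```python
# The space character has a different byte value in each encoding.
ascii_pad = " ".encode("ascii")      # 0x20
ebcdic_pad = " ".encode("cp037")     # 0x40 in this EBCDIC code page
utf16_pad = " ".encode("utf-16-be")  # 0x00 0x20

print(ascii_pad.hex(), ebcdic_pad.hex(), utf16_pad.hex())
```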

    PDE

    Parallel Database Extensions

    R

    Romaji

    Informal name applied by Japanese to the Roman character set.

    S

    Script Area

    Unicode zone (U+0000 - U+2000) in which characters for script alphabets like Latin, Greek, Cyrillic, and Russian are defined.

    server character set

    The Teradata Database storage definition for a character set. Also called server data type, or storage type.

    Defined by a character representation (repertoire) and its hexadecimal representation or Unicode code point (form-of-use).

    The following server character sets are defined for the Teradata Database:

  • GRAPHIC
  • KANJI1
  • KANJISJIS
  • LATIN
  • UNICODE

    Shift-JIS

    Microsoft-defined encoding scheme for Japanese characters that incorporates both the JIS X 0201 and JIS X 0208 standards by folding JIS X 0208 into the undefined columns of JIS X 0201.
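The folding can be observed with Python's standard `shift_jis` codec, used here as a stand-in for the encoding described above. Single-byte JIS X 0201 characters keep their one-byte values, while a JIS X 0208 Kanji takes two bytes whose lead byte falls in a range that JIS X 0201 leaves unused.

```python
latin = "A".encode("shift_jis")         # JIS X 0201 Romaji, one byte
hankaku = "\uff71".encode("shift_jis")  # JIS X 0201 halfwidth Katakana, one byte
kanji = "\u6f22".encode("shift_jis")    # JIS X 0208 Kanji, two bytes

print(latin.hex(), hankaku.hex(), kanji.hex())
```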

    SO/SI

    Shift-Out/Shift-In

    An encoding scheme designed to permit both single- and multibyte characters within the same string by demarcating the multibyte characters with reserved characters.

    Typically, the SO character indicates that all characters up to the next SI character are multibyte, and the SI character indicates that characters following are single-byte up to the next SO character (if any).
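The demarcation rule can be sketched as a small parser. The function name `split_so_si` is hypothetical, and the control values 0x0E for SO and 0x0F for SI are the conventional ones, assumed here rather than taken from this glossary.

```python
SO, SI = 0x0E, 0x0F  # assumed conventional Shift-Out / Shift-In values

def split_so_si(data: bytes):
    """Split data into ('single' | 'multi', bytes) runs, dropping SO and SI."""
    runs = []
    mode = "single"  # strings start in single-byte mode
    current = bytearray()
    for byte in data:
        if byte in (SO, SI):
            # A shift character closes the current run and switches mode.
            if current:
                runs.append((mode, bytes(current)))
                current = bytearray()
            mode = "multi" if byte == SO else "single"
        else:
            current.append(byte)
    if current:
        runs.append((mode, bytes(current)))
    return runs

sample = b"AB" + bytes([SO]) + b"\x45\x66\x45\x67" + bytes([SI]) + b"C"
runs = split_so_si(sample)
print(runs)
```

The sketch only separates the runs; decoding each run still requires the single-byte and multibyte code pages of the client character set.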

    T

    Teradata LATIN

    Teradata-defined character repertoire that combines the ISO 8859 Latin1 and Latin9 standards.

    terminal symbols

    SQL symbols defined in seven-bit ASCII including Latin letters, digits, numeric operators, punctuation marks, and so on.

    two-level collation

    Collation that orders character strings according to a two-level comparison. Characters in the strings to be compared are first partitioned into equivalence classes that have the same collating value. The relative ordering of classes and characters within a class is significant. The characters within each class are ordered using criteria defined for the collation sequence, and compared.

    The MULTINATIONAL Norwegian standard collation sequence is an example of a two-level collation.
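One common reading of a two-level comparison is sketched below: strings are compared first by the class of each character, and ties are broken by the rank of each character within its class. The `WEIGHTS` table and the `two_level_key` function are hypothetical and do not reproduce the MULTINATIONAL collation.

```python
# Hypothetical table: character -> (class value, rank within the class)
WEIGHTS = {"a": (1, 0), "A": (1, 1), "b": (2, 0), "B": (2, 1)}

def two_level_key(text):
    """Build a sort key: class values first, within-class ranks second."""
    primary = tuple(WEIGHTS[ch][0] for ch in text)
    secondary = tuple(WEIGHTS[ch][1] for ch in text)
    return (primary, secondary)

words = ["Ab", "b", "ab", "B", "aB"]
ordered = sorted(words, key=two_level_key)
print(ordered)
```

Strings that differ only by within-class rank, such as "ab" and "Ab", stay adjacent in the result, while a difference in class decides the order first.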

    U

    U+

    Prefix for Unicode code points, which are expressed in the format U+xxxx, where xxxx represents a four-digit hexadecimal number.
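A minimal Python sketch of the notation follows. The helper name `u_plus` is hypothetical.

```python
def u_plus(ch: str) -> str:
    """Format a single character's code point in U+xxxx notation."""
    return f"U+{ord(ch):04X}"

latin_a = u_plus("A")
katakana_a = u_plus("\u30a2")
print(latin_a, katakana_a)
```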

    UCS

    Universal Coded Character Set, specified by International Standard ISO/IEC 10646.

    UCS-2

    Encoding of the Unicode repertoire that uses two bytes per character.

    UTF

    Universal Coded Character Set Transformation Format

    UTF8

    A Teradata Database predefined client character set that supports the UTF-8 standard character encoding.

    UTF-8

    Mixed single- and multibyte encoding of the Unicode repertoire that preserves seven-bit ASCII characters.
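Both properties can be seen with Python's built-in `utf-8` codec. The sample characters are chosen here for illustration.

```python
# Seven-bit ASCII text is byte-for-byte identical in UTF-8.
ascii_bytes = "SELECT".encode("utf-8")

# Other characters take more than one byte each.
sample = "A\u00e9\u6f22"  # Latin A, e with acute accent, a Kanji
lengths = [len(ch.encode("utf-8")) for ch in sample]

print(ascii_bytes, lengths)
```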

    UTF16

    A Teradata Database predefined client character set that supports the UTF-16 standard character encoding.

    UTF-16

    Encoding of the Unicode repertoire based upon 16-bit units. UTF-16 is backward compatible with UCS-2.
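The relationship to UCS-2 can be illustrated with Python's built-in `utf-16-be` codec: a character that fits in two bytes is encoded as a single 16-bit unit, while a character beyond U+FFFF takes two 16-bit units (a surrogate pair). The sample characters are chosen here for illustration.

```python
# A character in the two-byte range: one 16-bit unit, same bytes as UCS-2.
two_byte = "\u30a2".encode("utf-16-be")

# A character beyond U+FFFF: two 16-bit units (a surrogate pair).
surrogate_pair = "\U0001F600".encode("utf-16-be")

print(two_byte.hex(" "), surrogate_pair.hex(" "))
```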

    Unicode

    Standard that represents text by specifying a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

    Unicode is defined by the Unicode Consortium and synchronized with the International Organization for Standardization ISO/IEC 10646 character set standard.

    The UNICODE server character set supports the Unicode 6.0 standard.

    Unicode letter

    Characters defined as letters by the Unicode standard, including letters in the script and compatibility zone areas.

    Z

    Zenkaku

    Literally “full square”.

    Standard orthography for Japanese has each character occupy roughly a square. European orthography tends to have characters that are narrower than their height. Placing two “half square” characters per square allows both orthographies to coexist.

    Many encodings include both Hankaku (“half square”) and Zenkaku versions of Katakana and other characters independently.