Japanese Character Sort Order Considerations - Analytics Database - Teradata Vantage

SQL Data Manipulation Language

Analytics Database
Teradata Vantage
June 2022
English (United States)

If character strings are to be sorted NOT CASESPECIFIC, only lowercase simple letters, a through z in the Latin alphabet, are converted to uppercase before a comparison or sorting operation is done. NOT CASESPECIFIC is the default in Teradata session mode. See SET SESSION COLLATION in Teradata Vantage™ - SQL Data Definition Language Syntax and Examples, B035-1144.

Case differences between these letters are ignored in NOT CASESPECIFIC collation.

Any non-Latin single-byte character, any multibyte character, and any byte indicating a transition between single-byte characters and multibyte characters are excluded from this uppercase conversion.
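
The folding rule above can be modeled in a few lines of Python. This is a minimal sketch of the rule as stated here, not Vantage's implementation: the helper names `fold_not_casespecific` and `equal_not_casespecific` are hypothetical, and the model works on decoded text rather than on the bytes of a particular server character set.

```python
import string

# Translation table that uppercases only the 26 simple Latin letters a-z.
# Every other character (Greek, Cyrillic, kana, kanji, FULLWIDTH letters)
# is left untouched, mirroring the rule described in the text.
_FOLD_A_TO_Z = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)


def fold_not_casespecific(text: str) -> str:
    """Hypothetical helper: fold a-z to A-Z and leave all other characters as-is."""
    return text.translate(_FOLD_A_TO_Z)


def equal_not_casespecific(left: str, right: str) -> bool:
    """Compare two strings after the limited a-z folding."""
    return fold_not_casespecific(left) == fold_not_casespecific(right)


if __name__ == "__main__":
    # Simple Latin letters fold, so these two compare equal.
    print(equal_not_casespecific("tokyo", "TOKYO"))
    # Greek small alpha (U+03B1) and capital alpha (U+0391) are not folded.
    print(equal_not_casespecific("\u03b1", "\u0391"))
    # FULLWIDTH a, b, c (U+FF41..U+FF43) are multibyte and stay unchanged.
    print(fold_not_casespecific("abc\uff41\uff42\uff43"))
```

In this model, only the first comparison succeeds, because the folding touches nothing outside a through z.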

If the character strings are to be sorted CASESPECIFIC, which is the default in ANSI session mode, then the case of the characters is critical for collation.

For Kanji1 character data in CASESPECIFIC mode, the letters in any alphabet that uses casing, such as Latin, Greek, or Cyrillic, are considered to be matched only if the letters are identical with the same case.

The system does not treat FULLWIDTH and HALFWIDTH characters as matching, in either CASESPECIFIC or NOT CASESPECIFIC mode.
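
To see why width variants never match, it helps to note that they are distinct code points. The sketch below uses Python's standard `unicodedata` module to show this. The helper names are hypothetical, and the NFKC normalization is used only to demonstrate that two characters are width variants of the same letter; the source does not say that Vantage performs any such normalization.

```python
import unicodedata


def width_class(ch: str) -> str:
    """Return the Unicode East Asian Width property of one character."""
    return unicodedata.east_asian_width(ch)


def same_letter_other_width(a: str, b: str) -> bool:
    """True if a and b are different code points that agree after NFKC."""
    nfkc_a = unicodedata.normalize("NFKC", a)
    nfkc_b = unicodedata.normalize("NFKC", b)
    return a != b and nfkc_a == nfkc_b


if __name__ == "__main__":
    pairs = [
        ("A", "\uff21"),       # Latin capital A and FULLWIDTH Latin capital A
        ("\uff71", "\u30a2"),  # HALFWIDTH katakana A and regular katakana A
    ]
    for left, right in pairs:
        # A plain comparison sees two different characters in each pair.
        print(left == right, width_class(left), width_class(right),
              same_letter_other_width(left, right))
```

Because each pair consists of two different code points, neither a binary comparison nor the a through z case folding can make them equal.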

Vantage uses one of four server character sets to support Japanese characters:
  • Unicode
  • KanjiSJIS
  • Kanji1 [Deprecated]
  • Graphic
KANJI1 support is deprecated. KANJI1 is not allowed as a default character set. The system changes the KANJI1 default character set to the UNICODE character set. Creation of new KANJI1 objects is highly restricted. Although many KANJI1 queries and applications may continue to operate, sites using KANJI1 should convert to another character set as soon as possible.
For Japanese character sites, you can set your session collation as follows:
  • For character data stored on the Vantage platform as KanjiSJIS or Unicode, the best way to sort in session character set order is to use the CHARSET_COLL collation.

    For this data, CHARSET_COLL provides the collation that is closest to sorting on the client.

  • The JIS_COLL collation is also adequate, and it produces the same ordering regardless of the session character set.
  • The CHARSET_COLL and JIS_COLL collation sequences are not designed to support Kanji1 character data.

    For Kanji1 character data, the ASCII collation provides a collation similar to that of the client, assuming the session character set is KanjiSJIS_0S, KanjiEUC_0U, or something very similar. However, ASCII collation does not sort like the client if the data is stored on the Vantage platform using the ShiftJIS or Unicode server character sets.

  • Users under the KATAKANAEBCDIC, KANJIEBCDIC5026_0I, or KANJIEBCDIC5035_0I character sets who want to store character data on the Vantage platform as Kanji1, and who want to collate in the session character set should install either KATAKANAEBCDIC, KANJIEBCDIC5026_0I, or KANJIEBCDIC5035_0I, respectively, at start-up time, and use MULTINATIONAL collation.

    Each character set requires a different definition for its MULTINATIONAL collation to collate properly.
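
The guidance in this list can be summarized as a small lookup. The following is a sketch only: the function names are hypothetical, the mapping encodes nothing beyond the recommendations above, and it returns `None` for combinations this section does not cover (for example, the Graphic server character set). The generated statement uses the SET SESSION COLLATION request referenced earlier in this section.

```python
from typing import Optional

# Session character sets that the text names for each Kanji1 case.
_ASCII_SESSIONS = {"KANJISJIS_0S", "KANJIEUC_0U"}
_EBCDIC_SESSIONS = {"KATAKANAEBCDIC", "KANJIEBCDIC5026_0I", "KANJIEBCDIC5035_0I"}


def suggested_collation(server_charset: str, session_charset: str) -> Optional[str]:
    """Hypothetical helper: map the guidance in the list to a collation name."""
    server = server_charset.upper()
    session = session_charset.upper()
    if server in {"KANJISJIS", "UNICODE"}:
        # CHARSET_COLL is described as closest to client sorting;
        # JIS_COLL is the session-independent alternative.
        return "CHARSET_COLL"
    if server == "KANJI1":
        if session in _ASCII_SESSIONS:
            return "ASCII"
        if session in _EBCDIC_SESSIONS:
            # Requires the matching MULTINATIONAL definition to be installed.
            return "MULTINATIONAL"
    return None  # not covered by this section


def set_collation_statement(collation: str) -> str:
    """Build the SQL request that selects the collation for the session."""
    return f"SET SESSION COLLATION {collation};"


if __name__ == "__main__":
    choice = suggested_collation("Kanji1", "KanjiEBCDIC5035_0I")
    if choice is not None:
        print(set_collation_statement(choice))
```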

Vantage handles character data collation for the different Japanese server character sets as follows:
  • Under the KanjiEUC character set, the single-shift 3 (ss3) byte 0x8F is converted to 0xFF. As a result, user-defined KanjiEUC code set 3 characters are not ordered properly with respect to the other KanjiEUC code sets. The other KanjiEUC code sets are ordered properly; that is, their ordering is the same as the binary ordering on the client system.

    ASCII collation collates Kanji1 data in binary order, but handles case matching of single-byte or HALFWIDTH Latin characters. See International Character Set Support. This matches collation on the client for KanjiSJIS_0S sessions, is close for KanjiEUC_0U, and is only reasonably close for double-byte data for KanjiEBCDIC sessions. KanjiEUC_0U puts code set 3 after code set 1 rather than before code set 1 as KanjiEBCDIC does.

  • For Kanji1 data, characters identified as multibyte characters remain in the client encoding and are collated based on their binary values. This explains why ASCII collation works for double-byte characters in KanjiEBCDIC sessions.

    Multibyte Kanji1 characters do not remain in the client encoding for KanjiEUC_0U sessions.
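
The effect of the ss3 conversion on ordering can be illustrated with plain byte comparisons. This is a sketch under stated assumptions, not Vantage's implementation: the function names are hypothetical, the sample byte sequences are illustrative EUC-style values for each code set, and the replacement relies on 0x8F appearing only as the ss3 prefix in well-formed input.

```python
from typing import Callable, Dict, List


def client_key(euc: bytes) -> bytes:
    """Binary ordering as seen on the client: compare the bytes unchanged."""
    return euc


def remapped_key(euc: bytes) -> bytes:
    """Model of the conversion described above: ss3 (0x8F) becomes 0xFF."""
    return euc.replace(b"\x8f", b"\xff")


def order(samples: Dict[str, bytes], key: Callable[[bytes], bytes]) -> List[str]:
    """Return the sample labels sorted by the given byte-level sort key."""
    return sorted(samples, key=lambda label: key(samples[label]))


# One illustrative byte sequence per code set.
SAMPLES = {
    "code set 0": b"A",             # single-byte Latin
    "code set 1": b"\xb0\xa1",      # two-byte character
    "code set 2": b"\x8e\xb1",      # ss2 prefix plus one byte
    "code set 3": b"\x8f\xb0\xa1",  # ss3 prefix plus two bytes
}

if __name__ == "__main__":
    print(order(SAMPLES, client_key))
    print(order(SAMPLES, remapped_key))
```

In the client ordering, code set 3 sorts before code set 1 because 0x8F is lower than any code set 1 lead byte. After the conversion, the 0xFF prefix moves code set 3 behind code set 1, while the relative order of code sets 0, 1, and 2 is unchanged.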

For details, see International Character Set Support, B035-1125 and Teradata Vantage™ - Data Types and Literals, B035-1143.