Directives

Each statement contains a directive or is associated with a directive that identifies the purpose of the statement. A directive is analogous to a command, but different terminology is used to avoid confusion with true TDP commands.

The following directives are supported:

Table 3: Directives
Directive	Function
CHAR	Defines the syntactic characters.
CHARSET	Explicitly begins a character set definition and optionally specifies the encoding scheme.
END	Ends processing of records in the file.
MONOCASE	Defines characters that have both lower and upper case.
NUMERICS	Defines the numeric characters.
SANITIZE	Defines valid characters for TDP messages sent using operating system facilities.
UNICODE	Defines the syntactic characters and characters that have both lower and upper case.

A file describes one or more character sets, although only one description is used by each SET USERCS command.

When multiple descriptions are present, each begins with a CHARSET directive and ends with the next CHARSET directive, the END directive, or the last record in the file.

The CHAR, MONOCASE, NUMERICS, SANITIZE, and UNICODE directives can appear in any order within a description. If a CHAR, MONOCASE, NUMERICS, SANITIZE, or UNICODE directive appears before a CHARSET directive, then a character set description is implicitly begun -- in effect, a CHARSET directive with no operands is assumed.
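The grouping rules above can be sketched in Python. This is only an illustration, not TDP's implementation: the function name split_descriptions is hypothetical, records are assumed to be whitespace-delimited strings, and blank records are assumed to be skipped.

```python
def split_descriptions(records):
    """Group file records into character set descriptions.

    Each description is a list of records that starts with a CHARSET
    record. An operand-less CHARSET is synthesized when another
    directive appears before any CHARSET.
    """
    descriptions = []
    current = None
    for record in records:
        words = record.split()
        if not words:
            continue  # assumption: blank records are skipped
        keyword = words[0].upper()
        if keyword == "END":
            break  # remaining records are not read
        if keyword == "CHARSET":
            current = [record]  # a new description begins here
            descriptions.append(current)
            continue
        if current is None:
            current = ["CHARSET"]  # implicit CHARSET with no operands
            descriptions.append(current)
        current.append(record)
    return descriptions
```

For instance, a file whose first record is a CHAR directive yields a first description that begins with the synthesized CHARSET record, and any records after END are dropped.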

The following sections provide information and syntax diagrams for each directive. Refer to Appendix D: “How to Read the Syntax Diagrams,” for additional information on syntax diagrams.

CHAR

The CHAR directive defines the syntactic characters of importance to TDP.

Syntax

Usage Notes

The length of each value is determined by the encoding scheme for the character set. For the characters of interest to TDP, the length is always two except for UTF16 encoding, for which the length is four. The CHAR directive can be specified more than once for each character set.

If the same character is defined more than once for a character set (either on a single CHAR directive, on multiple CHAR directives, or on a CHAR and a UNICODE directive), the last value is used. All four characters must be defined to form a complete character set description.

If no CHARSET directive precedes CHAR, then a character set description is implicitly begun -- in effect, a CHARSET directive with no operands is assumed.
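A minimal Python sketch of these rules follows. It assumes the four keywords used in the example for this directive (SPACE, COMMA, APOSTROP, DBLQUOTE) are the complete set and that operands are whitespace-delimited keyword/value pairs; the function name parse_char and its utf16 parameter are hypothetical.

```python
REQUIRED = ("SPACE", "COMMA", "APOSTROP", "DBLQUOTE")  # assumed keyword set

def parse_char(statements, utf16=False):
    """Collect CHAR definitions; the last value for a character wins."""
    width = 4 if utf16 else 2  # hex digits per value
    chars = {}
    for statement in statements:
        words = statement.split()
        if not words or words[0].upper() != "CHAR" or len(words) % 2 != 1:
            raise ValueError(f"malformed CHAR directive: {statement!r}")
        for name, value in zip(words[1::2], words[2::2]):
            name = name.upper()
            if name not in REQUIRED:
                raise ValueError(f"unknown character keyword: {name}")
            if len(value) != width:
                raise ValueError(f"{name} needs {width} hexadecimal digits")
            int(value, 16)  # raises ValueError if the value is not hexadecimal
            chars[name] = value.upper()
    missing = [name for name in REQUIRED if name not in chars]
    if missing:
        raise ValueError(f"incomplete description, missing: {missing}")
    return chars
```

Because later values overwrite earlier ones, a description may split the four characters across several CHAR directives or redefine one of them.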

Example

Define the relevant syntactic characters for IBM Code Page 833.

CHAR SPACE 40 COMMA 6B APOSTROP 7D DBLQUOTE 7F

CHARSET

The CHARSET directive explicitly begins a character set definition and optionally specifies the encoding scheme.

Syntax

Usage Notes

NAME identifies the character set to which the description applies. The name might include a standard suffix that defines the encoding scheme. The standard suffix consists of an underscore, a number not relevant to CLIv2, the encoding character (A, E, I, R, S, T, or U), and an optional character not relevant to CLIv2. Each suffix corresponds to an ENCODING operand value:

E - EBCDIC

I - IBMSOSI

A - ASCII

R - BIGFIVE

S - SJIS

T - EUC-CN or EUC-KR

U - EUC-JP
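The suffix convention can be illustrated with a short Python sketch. It assumes that the number consists of decimal digits and that the optional trailing character is a single letter or digit; the function name and the sample names in the usage note are hypothetical.

```python
import re

# Encoding character -> possible ENCODING operand values (from the list above).
SUFFIX_ENCODINGS = {
    "E": ("EBCDIC",),
    "I": ("IBMSOSI",),
    "A": ("ASCII",),
    "R": ("BIGFIVE",),
    "S": ("SJIS",),
    "T": ("EUC-CN", "EUC-KR"),  # the suffix alone does not distinguish these
    "U": ("EUC-JP",),
}

# Underscore, number, encoding character, optional character, end of name.
SUFFIX = re.compile(r"_\d+([AEIRSTU])[A-Z0-9]?$")

def encodings_from_name(name):
    """Return the encodings implied by a standard suffix, or () if none."""
    match = SUFFIX.search(name.upper())
    return SUFFIX_ENCODINGS[match.group(1)] if match else ()
```

With these assumptions, a hypothetical name such as EXAMPLE_1II implies IBMSOSI, while a name without a standard suffix returns an empty tuple and would need an explicit ENCODING operand.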

ENCODING optionally identifies the encoding scheme for the character set. If omitted, the character set name must contain a standard suffix that indicates the encoding. If such a suffix exists, then the encoding cannot be overridden using this operand. The following encoding schemes are available in TDP.

ENCODING	Meaning	Characteristics
EBCDIC	Extended Binary-Coded-Decimal Interchange Code	Single-byte (EBCDIC) codepoints: X'00' through X'FF'
IBMSOSI	IBM Shift-out/Shift-in	Single-byte (EBCDIC) codepoints: X'00' through X'FF' Double-byte (EBCDIC) codepoints: Shift-out (X'0E') through Shift-in (X'0F')
ASCII	American Standard Code for Information Interchange	Single-byte (ASCII) codepoints: X'00' through X'FF'
BIGFIVE	Big Five Plus	Single-byte (ASCII) codepoints: X'00' through X'80', and X'FF' Double-byte (ASCII) codepoints: X'81' through X'FE'
EUC-CN	Extended Unix Code - China	Single-byte (ASCII) codepoints: X'00' through X'7F' Double-byte (ASCII) codepoints: X'80' through X'FF'
EUC-JP	Extended Unix Code - Japan	Single-byte (ASCII) codepoints: X'00' through X'8D' X'90' through X'FF' Double-byte (ASCII) codepoints: Single-shift1 (X'8E') Triple-byte (ASCII) codepoints: Single-shift2 (X'8F')
EUC-KR	Extended Unix Code - Korea	Single-byte (ASCII) codepoints: X'00' through X'7F' Double-byte (ASCII) codepoints: X'80' through X'FF'
SJIS	Shift-JIS (Japanese Industrial Standard)	Single-byte (ASCII) codepoints: X'00' through X'80' X'A0' through X'DF' X'FD' through X'FF' Double-byte (ASCII) codepoints: X'81' through X'9F' X'E0' through X'FC'
UHC	Unified Hangul Code	Single-byte (ASCII) codepoints: X'00' through X'80', and X'FF' Double-byte (ASCII) codepoints: X'81' through X'FE'
UTF8	UCS (Universal Character Set) Transformation Format 8-bit	Single-byte (Unicode®) codepoints: X'00' through X'7F' Double-byte (Unicode®) codepoints: X'C0' through X'DF' (Most) triple-byte (Unicode®) codepoints: X'E0' through X'FE' Most four-byte codepoints (X'F0' through X'F4') are not supported by Teradata Database.
UTF16	UCS (Universal Character Set) Transformation Format 16-bit	Double-byte (Unicode®) codepoints: X'0000' through X'D7FF' X'E000' through X'FFFF' Surrogates (four-byte codepoints that begin or end with the two-byte codepoints X'D800' through X'DBFF') are not supported by Teradata Database.

When the NAME operand is specified and the name does not match the character set name specified on the SET USERCS command, this directive and all directives up to the next CHARSET directive are ignored. When the NAME operand is not specified, this directive is always used; consequently, any subsequent CHARSET directives in the file are never processed.
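The selection rule can be sketched in Python as follows. The sketch assumes descriptions have already been reduced to (name, directives) pairs in file order, with a name of None for a CHARSET that has no NAME operand, and it assumes names are compared without regard to case; the function name is hypothetical.

```python
def select_description(descriptions, usercs_name):
    """Return the directives of the first usable description, or None.

    descriptions: list of (name, directives) pairs in file order, where
    name is None for a CHARSET directive without a NAME operand.
    """
    wanted = usercs_name.upper()  # assumption: case-insensitive comparison
    for name, directives in descriptions:
        if name is None or name.upper() == wanted:
            return directives  # an unnamed description always matches
    return None
```

Note that an unnamed description placed ahead of named ones shadows them, which is why an operand-less CHARSET is normally the only description in its file.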

While all codepoints are reflected to and from Teradata Database, for character sets that allow mixtures of single and multi-byte characters, only the single-byte characters are meaningful in TDP command syntax.

Example

Begin definition for IBM Code Page 833, the single-byte component for IBM CCSID 933.

CHARSET NAME KOREAN_EBCDIC933 ENCODING IBMSOSI

END

The END directive ends processing of records in the file.

Syntax

Usage Notes

Any remaining records in the file are not read.

Example

END

MONOCASE

The MONOCASE directive optionally defines characters that have both lower and upper case. If this information is not supplied, then no monocasing is performed.

Syntax

Usage Notes

The actual monocase information is contained on statements that immediately follow the MONOCASE directive. Each such statement has the following syntax:

target_codepoint1<-target_codepoint2>: data_codepoint ...

where:

Syntax Element	Function
target_codepoint1	Specifies the first character defined on this statement.
target_codepoint2	Optionally specifies the last character defined on this statement.
data_codepoint	Defines the upper case equivalent for the associated target_codepoint character.

A codepoint is the hexadecimal representation of a character. The number of characters needed to specify a codepoint is dependent on the encoding scheme for the character set. With the current TDP support, the length is always two except for UTF16 encoding, for which the length is four.

If the second target codepoint is specified, then one data codepoint is required for each character in the range between the two target codepoints. If the second target codepoint is omitted, then any number of data codepoints can be specified, each associated with the codepoint one greater than the previous one.

All statements after the MONOCASE directive that contain a colon are associated with the MONOCASE directive. Lack of a colon indicates that the statement is a new directive and ends that MONOCASE directive.

The only codepoints that need be specified are those for which upper case equivalents exist.

The MONOCASE directive can be specified only once for each character set.

The order of data codepoints among different statements is not significant.

If the same character is defined more than once for a character set (either on a MONOCASE directive, or on a MONOCASE and a UNICODE directive), the last value is used.

If no CHARSET directive precedes MONOCASE, then a character set description is implicitly begun -- in effect, a CHARSET directive with no operands is assumed.
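The statement syntax and the range rules in these notes can be sketched in Python. This is an illustration for single-byte codepoints only; the function names are hypothetical, and codepoints without an entry are assumed to be left unchanged.

```python
def parse_range_statement(statement, width=2):
    """Expand 'target1<-target2>: data ...' into {target: data} hex strings."""
    targets, colon, data = statement.partition(":")
    if not colon:
        raise ValueError("no colon: the record is a new directive")
    first, dash, last = targets.strip().partition("-")
    values = data.split()
    start = int(first, 16)
    if dash and len(values) != int(last, 16) - start + 1:
        raise ValueError("one data codepoint is required per target codepoint")
    # Each data codepoint maps to the target one greater than the previous.
    return {format(start + i, f"0{width}X"): v.upper() for i, v in enumerate(values)}

def build_monocase_table(statements):
    table = {}
    for statement in statements:
        table.update(parse_range_statement(statement))  # last value wins
    return table

def to_upper(data, table):
    """Replace each byte that has an upper case equivalent."""
    return bytes(int(table.get(f"{b:02X}", f"{b:02X}"), 16) for b in data)

table = build_monocase_table([
    "81-89: C1 C2 C3 C4 C5 C6 C7 C8 C9",
    "91-99: D1 D2 D3 D4 D5 D6 D7 D8 D9",
    "A2-A9: E2 E3 E4 E5 E6 E7 E8 E9",
])
upper = to_upper(bytes.fromhex("8191A240"), table)  # bytes C1 D1 E2 40
```

The byte X'40' has no entry in the table, so it passes through unchanged.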

Example

Define the monocase information for IBM Code Page 833, the single-byte component for IBM CCSID 933.

MONOCASE

81-89: C1 C2 C3 C4 C5 C6 C7 C8 C9

91-99: D1 D2 D3 D4 D5 D6 D7 D8 D9

A2-A9: E2 E3 E4 E5 E6 E7 E8 E9

NUMERICS

The NUMERICS directive defines codepoints for the ten numeric characters, zero through nine.

Syntax

Each ‘xxn’ specifies a codepoint for one of the ten numeric characters. The first codepoint is for the number zero, and each subsequent codepoint is for the next ascending number, up to the number nine.

Usage Notes

The NUMERICS directive can be specified only once for each character set.

If the numerics are defined both by a NUMERICS and a UNICODE directive, the last is used.

If no CHARSET directive precedes NUMERICS, then a character set description is implicitly begun -- in effect, a CHARSET directive with no operands is assumed.
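As an illustration, the positional rule can be sketched in Python. The statement layout used here, the keyword followed by ten whitespace-delimited codepoints on one record, is an assumption based on the description of the operands; the function name is hypothetical. The sample values X'F0' through X'F9' are the EBCDIC digits that also appear in the UNICODE example for IBM Code Page 833.

```python
def parse_numerics(statement):
    """Map each of the ten codepoints to its digit, '0' through '9'."""
    words = statement.split()
    if not words or words[0].upper() != "NUMERICS":
        raise ValueError("not a NUMERICS directive")
    codepoints = [word.upper() for word in words[1:]]
    if len(codepoints) != 10:
        raise ValueError("exactly ten codepoints are required, zero through nine")
    # The position in the operand list determines the digit.
    return {codepoint: str(digit) for digit, codepoint in enumerate(codepoints)}

# Assumed single-record layout; values are the EBCDIC digits.
digits = parse_numerics("NUMERICS F0 F1 F2 F3 F4 F5 F6 F7 F8 F9")
```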

SANITIZE

The SANITIZE directive optionally defines valid characters for TDP messages sent using operating system facilities. Since all such facilities support only EBCDIC, the sanitizing process ensures that unsupported or non-EBCDIC characters are replaced by an acceptable character; by TDP convention, this is the Hyphen character (hexadecimal 60). If this information is not supplied, then a default is chosen based on the encoding scheme.

Syntax

Usage Notes

The actual sanitize information is contained on statements that immediately follow the SANITIZE directive. Each such statement has the following syntax:

target_codepoint1<-target_codepoint2>: data_codepoint ...

where:

Syntax Element	Function
target_codepoint1	Specifies the first character defined on this statement.
target_codepoint2	Optionally specifies the last character defined on this statement.
data_codepoint	Defines the replacement character for the associated target_codepoint character.

A codepoint is the hexadecimal representation of a character. The number of characters needed to specify a codepoint is dependent on the encoding scheme for the character set. For the characters of interest to TDP, the length is always two except for UTF16 encoding, for which the length is four.

All statements after the SANITIZE directive that contain a colon are associated with the SANITIZE directive. Lack of a colon indicates that the statement is a new directive and ends that SANITIZE directive.

The SANITIZE directive can be specified only once for each character set.

The order of data codepoints among different statements is not significant. If the same character is defined more than once for a character set, the last value is used.

If no CHARSET directive precedes SANITIZE, then a character set description is implicitly begun -- in effect, a CHARSET directive with no operands is assumed.
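The effect of sanitize statements on a message can be sketched in Python for single-byte codepoints. The sketch assumes that codepoints not named on any statement are left unchanged, which the notes above do not state; the function name is hypothetical.

```python
def build_sanitize_table(statements):
    """Build a 256-entry byte translation table from sanitize statements."""
    table = bytearray(range(256))  # assumption: unlisted codepoints are unchanged
    for statement in statements:
        targets, _, data = statement.partition(":")
        start = int(targets.split("-")[0], 16)
        for offset, value in enumerate(data.split()):
            table[start + offset] = int(value, 16)  # last value wins
    return bytes(table)

table = build_sanitize_table([
    "0E-0F: 4C 6E",
    "42-49: 60 60 60 60 60 60 60 60",
    "E0: 60",
])
message = bytes.fromhex("C10E42E0F0")
clean = message.translate(table)  # bytes C1 4C 60 60 F0
```

A full translation table makes sanitizing a single bytes.translate call, regardless of how many statements contributed to it.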

Example

Provide the sanitize information for IBM Code Page 833, the single-byte component for IBM CCSID 933. The valid characters that do not correspond to standard EBCDIC are converted to Hyphens.

SANITIZE

0E-0F: 4C 6E

42-49: 60 60 60 60 60 60 60 60

52-59: 60 60 60 60 60 60 60 60

62-69: 60 60 60 60 60 60 60 60

72-78: 60 60 60 60 60 60 60

8A-8F: 60 60 60 60 60 60

9A-9F: 60 60 60 60 60 60

AA-AF: 60 60 60 60 60 60

B2: 60

BA-BC: 60 60 60

E0: 60

UNICODE

The UNICODE directive defines the syntactic characters and the characters that have both lower and upper case. It can provide the same information as the CHAR, MONOCASE, and NUMERICS directives when the character set has the characteristics described below. Since UNICODE is required to add a user-defined character set to CLIv2, TDP also supports it to potentially simplify use of user-defined character sets.

The relevant syntactic characters in the character set are those that have the Unicode® codepoints of 0020 (Space), 0022 (Quotation Mark), 0025 (Percent), 0027 (Apostrophe), 002C (Comma), 002E (Period), 002F (Slash), 0030 through 0039 (Numerics Zero through Nine), 003A (Colon), 005B (Left Bracket), and 005D (Right Bracket).

The relevant monocase characters are those that have the Unicode® codepoints of 0061 through 007A (lower case) and 0041 through 005A (upper case). Codepoints beyond those relevant to CHAR, MONOCASE, and NUMERICS are ignored. If these are not the characteristics of the character set, then CHAR, MONOCASE, and NUMERICS must be used instead of UNICODE.

Syntax

Usage Notes

The actual information is contained on statements that immediately follow the UNICODE directive. Each such statement has the following syntax:

target_codepoint1<-target_codepoint2>: data_codepoint ...

where:

Syntax Element	Function
target_codepoint1	Specifies the first character in the user-defined character set that is defined on this statement.
target_codepoint2	Optionally specifies the last character defined on this statement.
data_codepoint	Defines the equivalent character in Unicode®.

A codepoint is the hexadecimal representation of a character. The number of characters needed to specify a target codepoint is dependent on the encoding scheme for the character set. For the characters of interest to TDP, the length is always two except for UTF16 encoding, for which the length is four. The length of a data codepoint is always four.

All statements after the UNICODE directive that contain a colon are associated with the UNICODE directive. Lack of a colon indicates that the statement is a new directive and ends that UNICODE directive.

The order of data codepoints among different statements is not significant.

The UNICODE directive can be specified only once for each character set.

If the same character is defined for the same purpose more than once for a character set (using a CHAR, MONOCASE, NUMERICS, or UNICODE directive), the last value is used.

If no CHARSET directive precedes UNICODE, then a character set description is implicitly begun -- in effect, a CHARSET directive with no operands is assumed.
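How the UNICODE statements can stand in for CHAR, NUMERICS, and MONOCASE can be sketched in Python. The sketch handles single-byte target codepoints only and derives just the four characters named in the CHAR example plus the numerics and monocase pairs; the function names and the keyword-to-codepoint table are illustrative assumptions, not TDP internals.

```python
# Unicode codepoints for the four characters named in the CHAR example.
SYNTACTIC = {"SPACE": 0x0020, "DBLQUOTE": 0x0022, "APOSTROP": 0x0027, "COMMA": 0x002C}

def parse_unicode(statements):
    """Return {target codepoint: Unicode codepoint} as integers."""
    mapping = {}
    for statement in statements:
        targets, _, data = statement.partition(":")
        start = int(targets.split("-")[0], 16)
        for offset, value in enumerate(data.split()):
            mapping[start + offset] = int(value, 16)  # last value wins
    return mapping

def derive_from_unicode(mapping):
    """Derive CHAR, NUMERICS, and MONOCASE information from the mapping."""
    by_unicode = {unicode_cp: target for target, unicode_cp in mapping.items()}
    chars = {name: by_unicode[cp] for name, cp in SYNTACTIC.items() if cp in by_unicode}
    numerics = [by_unicode.get(0x0030 + digit) for digit in range(10)]
    monocase = {
        by_unicode[0x0061 + i]: by_unicode[0x0041 + i]
        for i in range(26)
        if 0x0061 + i in by_unicode and 0x0041 + i in by_unicode
    }
    return chars, numerics, monocase
```

Applied to the statements for IBM Code Page 833, this yields X'40', X'6B', X'7D', and X'7F' for Space, Comma, Apostrophe, and Quotation Mark, which agrees with the CHAR example for the same code page.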

Example

Define the Unicode® equivalents for IBM Code Page 833, the single-byte component for IBM CCSID 933.

UNICODE

40-47: 0020 001A 115F 1100 1101 1115 1102 11AC

48-4F: 11AD 1103 00A2 002E 003C 0028 002B 007C

50-57: 0026 001A 1104 1105 11B0 11B1 11B2 11B3

58-5F: 11B4 11B5 0021 0024 002A 0029 003B 00AC

60-67: 002D 002F 11B6 1106 1107 1108 1121 1109

68-6F: 110A 110B 00A6 002C 0025 005F 003E 003F

70-77: 005B 001A 110C 110D 110E 110F 1110 1111

78-7F: 1112 0060 003A 0023 0040 0027 003D 0022

80-87: 005D 0061 0062 0063 0064 0065 0066 0067

88-8F: 0068 0069 1161 1162 1163 1164 1165 1166

90-97: 001A 006A 006B 006C 006D 006E 006F 0070

98-9F: 0071 0072 1167 1168 1169 116A 116B 116C

A0-A7: 00AF 007E 0073 0074 0075 0076 0077 0078

A8-AF: 0079 007A 116D 116E 116F 1170 1171 1172

B0-B7: 005E 001A 005C 001A 001A 001A 001A 001A

B8-BF: 001A 001A 1173 1174 1175 001A 001A 001A

C0-C7: 007B 0041 0042 0043 0044 0045 0046 0047

C8-CF: 0048 0049 001A 001A 001A 001A 001A 001A

D0-D7: 007D 004A 004B 004C 004D 004E 004F 0050

D8-DF: 0051 0052 001A 001A 001A 001A 001A 001A

E0-E7: 20A9 001A 0053 0054 0055 0056 0057 0058

E8-EF: 0059 005A 001A 001A 001A 001A 001A 001A

F0-F7: 0030 0031 0032 0033 0034 0035 0036 0037

F8-FF: 0038 0039 001A 001A 001A 001A 001A 001A