Directives
Each statement contains a directive or is associated with a directive that identifies the purpose of the statement. A directive is analogous to a command but different terminology is used to prevent confusion with true TDP commands.
The following directives are supported:
Directive | Function
CHAR | Defines the syntactic characters.
CHARSET | Explicitly begins a definition and possibly the encoding scheme.
END | Ends processing of records in the file.
MONOCASE | Defines characters that have both lower and upper case.
NUMERICS | Defines the numeric characters.
SANITIZE | Defines valid characters for TDP messages sent using operating system facilities.
UNICODE | Defines the syntactic characters and characters that have both lower and upper case.
A file describes one or more character sets, although only one description is used by each SET USERCS command.
When multiple descriptions are present, each begins with a CHARSET directive and ends with the next CHARSET directive, the END directive, or the last record in the file.
The CHAR, MONOCASE, NUMERICS, SANITIZE, and UNICODE directives can appear in any order within a description. If a CHAR, MONOCASE, NUMERICS, SANITIZE, or UNICODE directive appears before a CHARSET directive, then a character set description is implicitly begun -- in effect, a CHARSET directive with no operands is assumed.
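The grouping rules above can be illustrated with a short Python sketch. It is not TDP code: the function name split_descriptions is hypothetical, records are assumed to be whitespace-delimited text lines, and blank records are assumed to be skipped.

```python
def split_descriptions(records):
    """Group file records into character set descriptions.

    Each description is a list of records. A CHARSET record starts a new
    description. Any other record seen before the first CHARSET starts an
    implicit description, as if a CHARSET with no operands had been read.
    """
    descriptions = []
    current = None
    for record in records:
        words = record.split()
        if not words:
            continue  # assumption: blank records are skipped
        if words[0] == "END":
            break  # END stops processing; remaining records are not read
        if words[0] == "CHARSET":
            current = [record]
            descriptions.append(current)
            continue
        if current is None:
            current = ["CHARSET"]  # implicit CHARSET with no operands
            descriptions.append(current)
        current.append(record)
    return descriptions
```

For example, a file whose first record is a NUMERICS directive yields an implicit first description, and records after END are never seen.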
The following sections provide information and syntax diagrams for each directive. Refer to Appendix D: “How to Read the Syntax Diagrams,” for additional information on syntax diagrams.
CHAR
The CHAR directive defines the syntactic characters of importance to TDP.
Syntax
Usage Notes
The length of each value is determined by the encoding scheme for the character set. For the characters of interest to TDP, the length is always two except for UTF16 encoding, for which the length is four. The CHAR directive can be specified more than once for each character set.
If the same character is defined more than once for a character set (either on a single CHAR directive, on multiple CHAR directives, or on a CHAR and a UNICODE directive), the last value is used. All four characters must be defined to form a complete character set description.
If no CHARSET directive precedes CHAR, then a character set description is implicitly begun -- in effect, a CHARSET directive with no operands is assumed.
Example
Define the relevant syntactic characters for IBM Code Page 833.
CHAR SPACE 40 COMMA 6B APOSTROP 7D DBLQUOTE 7F
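As an illustration of the "last value is used" and "all four characters" rules, the following Python sketch merges CHAR records into a dictionary. The helper names apply_char and missing_chars are hypothetical, and the four keywords are taken from the example above; the syntax diagram may permit other forms.

```python
# Keywords taken from the example above (assumed to be the complete set).
REQUIRED = ("SPACE", "COMMA", "APOSTROP", "DBLQUOTE")

def apply_char(record, chars):
    """Merge one CHAR record into chars (keyword -> hexadecimal codepoint)."""
    words = record.split()
    if not words or words[0] != "CHAR":
        raise ValueError("not a CHAR record")
    operands = words[1:]
    if len(operands) % 2:
        raise ValueError("each keyword needs a value")
    for keyword, value in zip(operands[0::2], operands[1::2]):
        if keyword not in REQUIRED:
            raise ValueError(f"unknown keyword {keyword}")
        int(value, 16)  # raises ValueError if the value is not hexadecimal
        chars[keyword] = value.upper()  # a later definition replaces an earlier one
    return chars

def missing_chars(chars):
    """Return the keywords still undefined; empty when the set is complete."""
    return [keyword for keyword in REQUIRED if keyword not in chars]
```

Calling apply_char once per CHAR record, in file order, leaves the last value for each character in place; missing_chars then reports whether the description is complete.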
CHARSET
The CHARSET directive explicitly begins a definition and possibly the encoding scheme.
Syntax
Usage Notes
NAME identifies the character set to which the description applies. The name might include a standard suffix that defines the encoding scheme. The standard suffix consists of an underscore, a number not relevant to CLIv2, the encoding character (A, E, I, R, S, T, or U), and an optional character not relevant to CLIv2. Each suffix corresponds to an ENCODING operand value:
ENCODING optionally identifies the encoding scheme for the character set. If omitted, the character set name must contain a standard suffix that indicates the encoding. If such a suffix exists, then the encoding cannot be overridden using this operand. The following encoding schemes are available in TDP.
ENCODING | Meaning | Characteristics
EBCDIC | Extended Binary-Coded-Decimal Interchange Code | X'00' through X'FF'
IBMSOSI | IBM Shift-out/Shift-in | X'00' through X'FF'; Shift-out (X'0E') through Shift-in (X'0F')
ASCII | American Standard Code for Information Interchange | X'00' through X'FF'
BIGFIVE | Big Five Plus | X'00' through X'80', and X'FF'; X'81' through X'FE'
EUC-CN | Extended Unix Code - China | X'00' through X'7F'; X'80' through X'FF'
EUC-JP | Extended Unix Code - Japan | X'00' through X'8D'; X'90' through X'FF'; Single-shift1 (X'8E'); Single-shift2 (X'8F')
EUC-KR | Extended Unix Code - Korea | X'00' through X'7F'; X'80' through X'FF'
SJIS | Shift-JIS (Japanese Industrial Standard) | X'00' through X'80'; X'A0' through X'DF'; X'FD' through X'FF'; X'81' through X'9F'; X'E0' through X'FC'
UHC | Unified Hangul Code | X'00' through X'80', and X'FF'; X'81' through X'FE'
UTF8 | UCS (Universal Character Set) Transformation Format 8-bit | X'00' through X'7F'; X'C0' through X'DF'; X'E0' through X'FE'. Most four-byte codepoints (X'F0' through X'F4') are not supported by Teradata Database.
UTF16 | UCS (Universal Character Set) Transformation Format 16-bit | X'0000' through X'D7FF'; X'E000' through X'FFFF'. Surrogates (four-byte codepoints that begin or end with the two-byte codepoints X'D800' through X'DBFF') are not supported by Teradata Database.
When the NAME operand is specified and the name does not match the character set name specified on the SET USERCS command, this directive and all directives up to the next CHARSET directive are ignored. When the NAME operand is not specified, this directive is always used, so any subsequent CHARSET directives in the file are never processed.
While all codepoints are reflected to and from Teradata Database, for character sets that allow mixtures of single and multi-byte characters, only the single-byte characters are meaningful in TDP command syntax.
Example
Begin definition for IBM Code Page 833, the single-byte component for IBM CCSID 933.
CHARSET NAME KOREAN_EBCDIC933 ENCODING IBMSOSI
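The NAME matching rule in the usage notes can be sketched in Python as follows. The helper names charset_operands and select_charset are hypothetical, operands are assumed to be keyword/value pairs as in the example above, and the exact (case-sensitive) name comparison is an assumption.

```python
def charset_operands(record):
    """Return the operands of a CHARSET record as a dict of keyword -> value."""
    words = record.split()
    if not words or words[0] != "CHARSET":
        raise ValueError("not a CHARSET record")
    operands = words[1:]
    if len(operands) % 2:
        raise ValueError("each operand keyword needs a value")
    return dict(zip(operands[0::2], operands[1::2]))

def select_charset(charset_records, usercs_name):
    """Return the index of the first CHARSET record that applies, or None.

    A record applies when it has no NAME operand, or when its NAME equals
    the name given on SET USERCS (assumption: exact string comparison).
    """
    for index, record in enumerate(charset_records):
        name = charset_operands(record).get("NAME")
        if name is None or name == usercs_name:
            return index
    return None
```

Because a CHARSET directive without NAME always applies, any description that follows it in the file can never be selected, which the sketch reproduces by returning the first applicable index.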
END
The END directive ends processing of records in the file.
Syntax
Usage Notes
Any remaining records in the file are not read.
Example
END
MONOCASE
The MONOCASE directive optionally defines characters that have both lower and upper case. If this information is not supplied, then no monocasing is performed.
Syntax
Usage Notes
The actual monocase information is contained on statements that immediately follow the MONOCASE directive. Each such statement has the following syntax:
target_codepoint1<-target_codepoint2>: data_codepoint ...
where:
Syntax Element | Function
target_codepoint1 | Specifies the first character defined on this statement.
target_codepoint2 | Optionally specifies the last character defined on this statement.
data_codepoint | Defines the upper case equivalent for the associated target_codepoint character.
A codepoint is the hexadecimal representation of a character. The number of characters needed to specify a codepoint is dependent on the encoding scheme for the character set. With the current TDP support, the length is always two except for UTF16 encoding, for which the length is four.
If the second target codepoint is specified, then one data codepoint is required for each character in the range defined by the two target codepoints, inclusive. If the second target codepoint is omitted, then any number of data codepoints can be specified, each associated with the codepoint one greater than the previous one.
All statements after the MONOCASE directive that contain a colon are associated with the MONOCASE directive. Lack of a colon indicates that the statement is a new directive and ends that MONOCASE directive.
The only codepoints that need be specified are those for which upper case equivalents exist.
The MONOCASE directive can be specified only once for each character set.
The order of data codepoints among different statements is not significant.
If the same character is defined more than once for a character set (either on a MONOCASE directive, or on a MONOCASE and a UNICODE directive), the last value is used.
If no CHARSET directive precedes MONOCASE, then a character set description is implicitly begun -- in effect, a CHARSET directive with no operands is assumed.
Example
Define the monocase information for IBM Code Page 833, the single-byte component for IBM CCSID 933.
MONOCASE
81-89: C1 C2 C3 C4 C5 C6 C7 C8 C9
91-99: D1 D2 D3 D4 D5 D6 D7 D8 D9
A2-A9: E2 E3 E4 E5 E6 E7 E8 E9
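The statement syntax and the example above can be exercised with a minimal Python sketch. It assumes two-digit (single-byte) codepoints; the names parse_mapping, build_upper_table, and monocase are hypothetical and do not describe how TDP itself is implemented.

```python
def parse_mapping(statement, width=2):
    """Parse 'target1[-target2]: data ...' into {target codepoint: data codepoint}."""
    targets, colon, data = statement.partition(":")
    if not colon:
        raise ValueError("no colon: the statement is a new directive")
    first, dash, last = targets.strip().partition("-")
    values = data.split()
    if not values:
        raise ValueError("at least one data codepoint is required")
    for token in [first] + ([last] if dash else []) + values:
        if len(token) != width:
            raise ValueError(f"codepoint {token!r} must be {width} hex digits")
    start = int(first, 16)
    if dash and len(values) != int(last, 16) - start + 1:
        raise ValueError("one data codepoint is required per character in the range")
    # Each data codepoint belongs to the target one greater than the previous.
    return {start + offset: int(value, 16) for offset, value in enumerate(values)}

MONOCASE_833 = [
    "81-89: C1 C2 C3 C4 C5 C6 C7 C8 C9",
    "91-99: D1 D2 D3 D4 D5 D6 D7 D8 D9",
    "A2-A9: E2 E3 E4 E5 E6 E7 E8 E9",
]

def build_upper_table(statements):
    """Combine statements in order; a later definition replaces an earlier one."""
    table = {}
    for statement in statements:
        table.update(parse_mapping(statement))
    return table

def monocase(data, table):
    """Replace each byte that has an upper case equivalent; keep the rest."""
    return bytes(table.get(byte, byte) for byte in data)
```

With the table built from MONOCASE_833, the EBCDIC bytes 88 85 93 93 96 ("hello") become C8 C5 D3 D3 D6 ("HELLO"), while bytes without an entry, such as 40 (Space), pass through unchanged.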
NUMERICS
The NUMERICS directive defines codepoints for the ten numeric characters, zero through nine.
Syntax
Each ‘xxn’ specifies a codepoint for one of the ten numeric characters. The first codepoint is for the number zero, and each subsequent codepoint is for the next ascending number, up to the number nine.
Usage Notes
The NUMERICS directive can be specified only once for each character set.
If the numerics are defined both by a NUMERICS and a UNICODE directive, the last is used.
If no CHARSET directive precedes NUMERICS, then a character set description is implicitly begun -- in effect, a CHARSET directive with no operands is assumed.
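The positional rule for NUMERICS can be shown with a small Python sketch. It assumes that the ten codepoints follow the NUMERICS keyword on one record and that codepoints are single bytes; the sample values F0 through F9 are the codepoints that the UNICODE example for IBM Code Page 833 maps to 0030 through 0039. The function names are hypothetical.

```python
def parse_numerics(record):
    """Map each codepoint on a NUMERICS record to its digit character.

    Assumption: all ten codepoints appear on the same record as the keyword.
    """
    words = record.split()
    if not words or words[0] != "NUMERICS":
        raise ValueError("not a NUMERICS record")
    codepoints = words[1:]
    if len(codepoints) != 10:
        raise ValueError("exactly ten codepoints are required, zero through nine")
    # The first codepoint is zero; each subsequent one is the next number.
    return {int(codepoint, 16): str(digit) for digit, codepoint in enumerate(codepoints)}

def decode_number(data, digits):
    """Convert a byte string of numeric characters to an integer."""
    return int("".join(digits[byte] for byte in data))

DIGITS_833 = parse_numerics("NUMERICS F0 F1 F2 F3 F4 F5 F6 F7 F8 F9")
```

With this table, the bytes F1 F0 F2 decode to the integer 102.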
SANITIZE
The SANITIZE directive optionally defines valid characters for TDP messages sent using operating system facilities. Since all such facilities support only EBCDIC, the sanitizing process ensures that unsupported or non-EBCDIC characters are replaced by an acceptable character (the Hyphen character (hexadecimal 60) is the TDP convention). If this information is not supplied, then a default is chosen based on the encoding scheme.
Syntax
Usage Notes
The actual sanitize information is contained on statements that immediately follow the SANITIZE directive. Each such statement has the following syntax:
target_codepoint1<-target_codepoint2>: data_codepoint ...
where:
Syntax Element | Function
target_codepoint1 | Specifies the first character defined on this statement.
target_codepoint2 | Optionally specifies the last character defined on this statement.
data_codepoint | Defines the replacement character for the associated target_codepoint character.
A codepoint is the hexadecimal representation of a character. The number of characters needed to specify a codepoint is dependent on the encoding scheme for the character set. For the characters of interest to TDP, the length is always two except for UTF16 encoding, for which the length is four.
If the second target codepoint is specified, then one data codepoint is required for each character in the range defined by the two target codepoints, inclusive. If the second target codepoint is omitted, then any number of data codepoints can be specified, each associated with the codepoint one greater than the previous one.
All statements after the SANITIZE directive that contain a colon are associated with the SANITIZE directive. Lack of a colon indicates that the statement is a new directive and ends that SANITIZE directive.
The SANITIZE directive can be specified only once for each character set.
The order of data codepoints among different statements is not significant. If the same character is defined more than once for a character set, the last value is used.
If no CHARSET directive precedes SANITIZE, then a character set description is implicitly begun -- in effect, a CHARSET directive with no operands is assumed.
Example
Provide the sanitize information for IBM Code Page 833, the single-byte component for IBM CCSID 933. The valid characters that do not correspond to standard EBCDIC are converted to Hyphens.
SANITIZE
0E-0F: 4C 6E
42-49: 60 60 60 60 60 60 60 60
52-59: 60 60 60 60 60 60 60 60
62-69: 60 60 60 60 60 60 60 60
72-78: 60 60 60 60 60 60 60
8A-8F: 60 60 60 60 60 60
9A-9F: 60 60 60 60 60 60
AA-AF: 60 60 60 60 60 60
B2: 60
BA-BC: 60 60 60
E0: 60
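The effect of sanitizing a message can be sketched in Python with a 256-entry translation table. This is an illustration only: the names are hypothetical, only a few entries from the example above are loaded, and leaving unlisted bytes unchanged is an assumption.

```python
HYPHEN = 0x60  # the TDP convention for replaced characters

def build_sanitize_table(replacements):
    """Build a 256-byte table; bytes not listed map to themselves (assumption)."""
    table = bytearray(range(256))
    for target, replacement in replacements.items():
        table[target] = replacement
    return bytes(table)

# A few entries from the example: 0E-0F, 42-49, and 8A-8F.
REPLACEMENTS = {0x0E: 0x4C, 0x0F: 0x6E}
REPLACEMENTS.update({codepoint: HYPHEN for codepoint in range(0x42, 0x4A)})
REPLACEMENTS.update({codepoint: HYPHEN for codepoint in range(0x8A, 0x90)})

SANITIZE_TABLE = build_sanitize_table(REPLACEMENTS)

def sanitize(message, table=SANITIZE_TABLE):
    """Return the message with each byte replaced according to the table."""
    return message.translate(table)
```

For instance, the bytes C1 0E 42 C2 0F 8A become C1 4C 60 C2 6E 60: Shift-out and Shift-in are replaced by 4C and 6E, and the two non-EBCDIC characters become Hyphens.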
UNICODE
The UNICODE directive defines the syntactic characters and the characters that have both lower and upper case. Depending on the character set, it can be used to provide the same information as the CHAR, MONOCASE, and NUMERICS directives. Because UNICODE is required to add a user-defined character set to CLIv2, TDP also supports it, which can simplify the use of user-defined character sets.
The relevant syntactic characters in the character set are those that have the Unicode® codepoints 0020 (Space), 0022 (Quotation Mark), 0025 (Percent), 0027 (Apostrophe), 002C (Comma), 002E (Period), 002F (Slash), 0030 through 0039 (Numerics Zero through Nine), 003A (Colon), 005B (Left Bracket), and 005D (Right Bracket). The relevant monocase characters are those that have the Unicode® codepoints 0061 through 007A (lower case) and 0041 through 005A (upper case). Codepoints beyond those relevant to CHAR, MONOCASE, and NUMERICS are ignored.
If the character set does not have these characteristics, then CHAR, MONOCASE, and NUMERICS must be used instead of UNICODE.
Syntax
Usage Notes
The actual information is contained on statements that immediately follow the UNICODE directive. Each such statement has the following syntax:
target_codepoint1<-target_codepoint2>: data_codepoint ...
where:
Syntax Element | Function
target_codepoint1 | Specifies the first character in the user-defined character set that is defined on this statement.
target_codepoint2 | Optionally specifies the last character defined on this statement.
data_codepoint | Defines the equivalent character in Unicode®.
A codepoint is the hexadecimal representation of a character. The number of characters needed to specify a target codepoint is dependent on the encoding scheme for the character set. For the characters of interest to TDP, the length is always two except for UTF16 encoding, for which the length is four. The length of a data codepoint is always four.
If the second target codepoint is specified, then one data codepoint is required for each character in the range defined by the two target codepoints, inclusive. If the second target codepoint is omitted, then any number of data codepoints can be specified, each associated with the codepoint one greater than the previous one.
All statements after the UNICODE directive that contain a colon are associated with the UNICODE directive. Lack of a colon indicates that the statement is a new directive and ends that UNICODE directive.
The order of data codepoints among different statements is not significant.
The UNICODE directive can be specified only once for each character set.
If the same character is defined for the same purpose more than once for a character set (using a CHAR, MONOCASE, NUMERICS, or UNICODE directive), the last value is used.
If no CHARSET directive precedes UNICODE, then a character set description is implicitly begun -- in effect, a CHARSET directive with no operands is assumed.
Example
Define the Unicode® equivalents for IBM Code Page 833, the single-byte component for IBM CCSID 933.
UNICODE
40-47: 0020 001A 115F 1100 1101 1115 1102 11AC
48-4F: 11AD 1103 00A2 002E 003C 0028 002B 007C
50-57: 0026 001A 1104 1105 11B0 11B1 11B2 11B3
58-5F: 11B4 11B5 0021 0024 002A 0029 003B 00AC
60-67: 002D 002F 11B6 1106 1107 1108 1121 1109
68-6F: 110A 110B 00A6 002C 0025 005F 003E 003F
70-77: 005B 001A 110C 110D 110E 110F 1110 1111
78-7F: 1112 0060 003A 0023 0040 0027 003D 0022
80-87: 005D 0061 0062 0063 0064 0065 0066 0067
88-8F: 0068 0069 1161 1162 1163 1164 1165 1166
90-97: 001A 006A 006B 006C 006D 006E 006F 0070
98-9F: 0071 0072 1167 1168 1169 116A 116B 116C
A0-A7: 00AF 007E 0073 0074 0075 0076 0077 0078
A8-AF: 0079 007A 116D 116E 116F 1170 1171 1172
B0-B7: 005E 001A 005C 001A 001A 001A 001A 001A
B8-BF: 001A 001A 1173 1174 1175 001A 001A 001A
C0-C7: 007B 0041 0042 0043 0044 0045 0046 0047
C8-CF: 0048 0049 001A 001A 001A 001A 001A 001A
D0-D7: 007D 004A 004B 004C 004D 004E 004F 0050
D8-DF: 0051 0052 001A 001A 001A 001A 001A 001A
E0-E7: 20A9 001A 0053 0054 0055 0056 0057 0058
E8-EF: 0059 005A 001A 001A 001A 001A 001A 001A
F0-F7: 0030 0031 0032 0033 0034 0035 0036 0037
F8-FF: 0038 0039 001A 001A 001A 001A 001A 001A
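To show how a UNICODE description can stand in for CHAR, NUMERICS, and MONOCASE, the following Python sketch inverts a few rows of the example above and looks up the relevant Unicode® codepoints. The function names are hypothetical, and pairing 0020, 0022, 0027, and 002C with the CHAR keywords SPACE, DBLQUOTE, APOSTROP, and COMMA is an assumption based on the character names.

```python
# A subset of the rows from the example above.
ROWS = """\
40-47: 0020 001A 115F 1100 1101 1115 1102 11AC
68-6F: 110A 110B 00A6 002C 0025 005F 003E 003F
78-7F: 1112 0060 003A 0023 0040 0027 003D 0022
80-87: 005D 0061 0062 0063 0064 0065 0066 0067
C0-C7: 007B 0041 0042 0043 0044 0045 0046 0047
F0-F7: 0030 0031 0032 0033 0034 0035 0036 0037
F8-FF: 0038 0039 001A 001A 001A 001A 001A 001A
"""

def unicode_to_local(rows):
    """Invert UNICODE statements: Unicode codepoint -> user-defined codepoint."""
    inverse = {}
    for line in rows.splitlines():
        targets, _, data = line.partition(":")
        start = int(targets.split("-")[0], 16)
        for offset, value in enumerate(data.split()):
            inverse[int(value, 16)] = start + offset
    return inverse

def derive(inverse):
    """Return (chars, numerics, monocase) equivalents found in the mapping."""
    names = (("SPACE", 0x20), ("DBLQUOTE", 0x22), ("APOSTROP", 0x27), ("COMMA", 0x2C))
    chars = {name: inverse[code] for name, code in names if code in inverse}
    # Raises KeyError if any of the ten digits is not mapped.
    numerics = [inverse[0x30 + digit] for digit in range(10)]
    monocase = {
        inverse[lower]: inverse[lower - 0x20]  # upper case is 0x20 below lower case
        for lower in range(0x61, 0x7B)
        if lower in inverse and lower - 0x20 in inverse
    }
    return chars, numerics, monocase

CHARS, NUMERICS, MONOCASE = derive(unicode_to_local(ROWS))
```

For these rows, the derived values agree with the CHAR and MONOCASE examples for IBM Code Page 833: Space is 40, Comma is 6B, Apostrophe is 7D, Quotation Mark is 7F, the numerics are F0 through F9, and 81 through 87 map to C1 through C7.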