You can specify tagging rules with either the TaggingRules syntax element or a Rules table.
Rules for Rule Operations Table
- The operand opn (where n is 1, 2, or 3) can be any of the following:
|opn||Rules for opn|
|String literal||Enclose the string literal in double quotation marks (for example, "Start countdown"). If the string literal contains double quotation marks, precede each double quotation mark with two backslashes (for example, "\\"Start countdown\\""). Do not use the empty string (""). If an operation has only string literal operands, matches are case-insensitive and overlapping matches are not considered.|
|Java regular expression (regex"exp")||An operation with one or more Java regular expression operands uses fuzzy matching. Fuzzy matching evaluates the original text input; that is, matching is case-sensitive and the text is not tokenized.|
|[superdist operation only] List of string literals or Java regular expressions||For details, see the description of the superdist operation in the following table.|
- The operands lower and upper are nonnegative integers.
You can omit either lower or upper, but not both. For example, all of the following are valid syntax for the contain operation:
contain (col, op1, lower, upper)
contain (col, op1, lower,)
contain (col, op1,, upper)
If x is the number of times that op1 appears in col, then the preceding operations have the following meanings, respectively:
lower <= x <= upper
lower <= x
x <= upper
The meanings of lower, x, and upper depend on the operation.
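The operand and range rules above can be sketched in Python. This is only an illustration under stated assumptions: the helper names `string_literal`, `contain_rule`, and `in_range` are hypothetical and are not part of the function's syntax; they just build rule text and mirror the range semantics described above.

```python
from typing import Optional


def string_literal(text: str) -> str:
    """Format text as a double-quoted string-literal operand (hypothetical helper)."""
    if text == "":
        raise ValueError("the empty string is not a valid operand")
    # Precede each embedded double quotation mark with two backslashes.
    return '"' + text.replace('"', '\\\\"') + '"'


def contain_rule(col: str, op1: str,
                 lower: Optional[int] = None,
                 upper: Optional[int] = None) -> str:
    """Build a contain operation; either bound may be omitted, but not both."""
    if lower is None and upper is None:
        raise ValueError("specify lower, upper, or both")
    for bound in (lower, upper):
        if bound is not None and bound < 0:
            raise ValueError("bounds must be nonnegative integers")
    if upper is None:
        return f"contain ({col}, {op1}, {lower},)"
    if lower is None:
        return f"contain ({col}, {op1},, {upper})"
    return f"contain ({col}, {op1}, {lower}, {upper})"


def in_range(x: int, lower: Optional[int] = None,
             upper: Optional[int] = None) -> bool:
    """Mirror the three meanings: lower <= x <= upper, lower <= x, x <= upper."""
    return (lower is None or lower <= x) and (upper is None or x <= upper)
```

For example, `contain_rule("col", string_literal("Start countdown"), lower=1)` produces the second form shown above, with the upper bound omitted.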
This table summarizes the operations that a rule can use. For simplicity, the table shows only the syntax that specifies both lower and upper.
|equal (col, op1)||Returns 'true' if the text in column col and the value of op1 are equal; 'false' otherwise.|
|contain (col, op1, lower, upper)||Returns 'true' if, in column col, the number of times that the value of op1 appears is in the range [lower, upper]; 'false' otherwise.|
|dist (col, op1, op2, lower, upper)||Returns 'true' if, in column col, the distance between the values of op1 and op2 (that is, the number of words between them) is in the range [lower, upper]; 'false' otherwise.
The distance computation depends on the InputLanguage and UseTokenizer syntax elements.
By default, InputLanguage is 'en' (English) and UseTokenizer is 'false', and words are delimited by whitespace characters.
If InputLanguage is 'zh_cn' (Simplified Chinese) or 'zh_tw' (Traditional Chinese) and UseTokenizer is 'true', then the function performs word segmentation before computing the distance between words.|
|superdist (col, op1, op2, con1, op3, con2)||Returns 'true' if, in column col, the values of op1, op2, and op3 satisfy the context rules con1 and con2; 'false' otherwise.
The rules con1 and con2 specify the context for inclusion and exclusion, as the following table shows.
The distance computation depends on the InputLanguage and UseTokenizer syntax elements (for details, see the description of the dist operation).
A paragraph ends with either "\n" or "\r\n". A sentence ends with either period (.), question mark (?), or exclamation mark (!). The function fragments the input into paragraphs or sentences and then checks the context rule on each piece of text. If one piece satisfies the rule, then the function tags the whole input.
opn (where n is 1, 2, or 3) can be a list of words. Enclose the list in double quotation marks and separate the words with semicolons. For example: "good;bad;neutral"
If opn is a Java regular expression, then exp can be a list. Separate the items with semicolons. For example: regex"invest[\w]*;volatil[\w]*;risk"
When a list appears in an inclusion context, the rule is satisfied if at least one item appears in the context. When a list appears in an exclusion context, the rule is satisfied if no item appears in the context.
The operand-context pairs after op1 are optional; that is, the following are valid syntax:
superdist(col, op1, op2, con1, op3, con2)
superdist(col, op1, op2, con1,,)
superdist(col, op1,,, op3, con2)
superdist(col, op1,,,,)
The final syntax in the preceding list returns 'true' if op1 appears in col.|
|dict (col, "[schema/]dictionary", lower, upper)||Returns 'true' if, in column col, the number of dictionary items (lines in the dictionary file) that appear is in the range [lower, upper]; 'false' otherwise.
This operation requires that the dictionary file [schema.]dictionary is installed on ML Engine. The dictionary name, dictionary, is case-sensitive. If the dictionary is in the public schema, you can omit the schema name, schema.|
|operation1 and operation2||Returns 'true' if both operation1 and operation2 return 'true'; 'false' otherwise.|
|operation1 or operation2||Returns 'true' if operation1, operation2, or both return 'true'; 'false' otherwise.|
|not operation||Returns 'true' if operation returns 'false'; 'false' if operation returns 'true'.|
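To make the table concrete, here is a minimal Python sketch of how contain, dist, the boolean connectives, and the sentence or paragraph fragmentation described for superdist might be evaluated. It is not the function's implementation. It assumes single-word string-literal operands, the default whitespace word delimiting, and that one qualifying pair of occurrences is enough for dist; it does not model the con1 and con2 context rules. All names (`contain`, `dist`, `fragments`, `any_fragment`, `and_`, `or_`, `not_`) are hypothetical.

```python
import re
from typing import Callable, List, Optional

Rule = Callable[[str], bool]  # hypothetical type: text in, match decision out


def _words(text: str) -> List[str]:
    # Words are delimited by whitespace; lowercasing models the
    # case-insensitive matching of string-literal operands.
    return text.lower().split()


def _in_range(x: int, lower: Optional[int], upper: Optional[int]) -> bool:
    return (lower is None or lower <= x) and (upper is None or x <= upper)


def contain(op1: str, lower: Optional[int] = None,
            upper: Optional[int] = None) -> Rule:
    # True if the number of occurrences of op1 is in [lower, upper].
    return lambda text: _in_range(_words(text).count(op1.lower()), lower, upper)


def dist(op1: str, op2: str, lower: Optional[int] = None,
         upper: Optional[int] = None) -> Rule:
    def rule(text: str) -> bool:
        words = _words(text)
        first = [i for i, w in enumerate(words) if w == op1.lower()]
        second = [i for i, w in enumerate(words) if w == op2.lower()]
        # Distance = number of words between the two operands.
        # Assumption: a single qualifying pair of occurrences is enough.
        return any(_in_range(abs(i - j) - 1, lower, upper)
                   for i in first for j in second if i != j)
    return rule


def fragments(text: str, unit: str) -> List[str]:
    # Paragraphs end with "\n" or "\r\n"; sentences end with ".", "?", or "!".
    pattern = {"paragraph": r"\r?\n", "sentence": r"[.?!]"}[unit]
    return [p.strip() for p in re.split(pattern, text) if p.strip()]


def any_fragment(rule: Rule, unit: str) -> Rule:
    # If one piece satisfies the rule, the whole input is tagged.
    return lambda text: any(rule(piece) for piece in fragments(text, unit))


def and_(a: Rule, b: Rule) -> Rule:
    return lambda text: a(text) and b(text)


def or_(a: Rule, b: Rule) -> Rule:
    return lambda text: a(text) or b(text)


def not_(a: Rule) -> Rule:
    return lambda text: not a(text)
```

With the input "Start the final countdown now", `dist("start", "countdown", 0, 2)` holds because two words separate the operands, while `dist("start", "countdown", 0, 1)` does not. Because tokens are split on whitespace only, punctuation stays attached to words unless the text is first fragmented into sentences.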