Defining Tagging Rules - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
8.10
1.1
Published
October 2019
Language
English (United States)
Last Update
2019-12-31
Product Category
Teradata Vantage™

You can specify tagging rules with either the TaggingRules syntax element or a Rules table.

Rules for Rule Operations Table

  • The operand opn (where n is 1, 2, or 3) can be any of the following:
    opn: Rules for opn

    String literal: Enclose the string literal in double quotation marks (for example, "Start countdown").

    If the string literal contains double quotation marks, precede each double quotation mark with two backslashes (for example, "\\"Start countdown\\"").

    Do not use the empty string ("").

    If an operation has only string literal operands, matches are case-insensitive and overlapping matches are not considered.

    Java regular expression (regex"exp"): An operation with one or more Java regular expression operands uses fuzzy matching. Fuzzy matching evaluates the original text input; that is, matching is case-sensitive and the text is not tokenized.

    [superdist operation only] List of string literals or Java regular expressions: For details, see the description of the superdist operation in the following table.
  • The operands lower and upper are nonnegative integers.

    You can omit either lower or upper, but not both. For example, all of the following are valid syntax for the contain operation:

    contain (col, op1, lower, upper)
    contain (col, op1, lower,)
    contain (col, op1,, upper)

    If x is the number of times that op1 appears in col, then the preceding operations have the following meanings, respectively:

    lower <= x <= upper

    lower <= x

    x <= upper

    The meanings of lower, x, and upper depend on the operation.
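The three forms above can be read as a range check in which either bound is optional. The following Python sketch illustrates that reading only; the helper name in_range and the use of None for an omitted bound are assumptions made for this example and are not part of ML Engine.

```python
from typing import Optional


def in_range(x: int, lower: Optional[int] = None, upper: Optional[int] = None) -> bool:
    """Return True if x satisfies every bound that was supplied.

    None stands for an omitted bound (an assumption of this sketch).
    """
    if lower is None and upper is None:
        # You can omit either lower or upper, but not both.
        raise ValueError("specify lower, upper, or both")
    if lower is not None and x < lower:
        return False
    if upper is not None and x > upper:
        return False
    return True


print(in_range(2, 1, 3))         # lower <= x <= upper
print(in_range(5, lower=1))      # lower <= x
print(in_range(5, upper=3))      # x <= upper
```

The first two calls print True and the third prints False, because 5 exceeds the upper bound of 3.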

Rule Operations

This table summarizes the operations that a rule can use. For simplicity, the table shows only the syntax that specifies both lower and upper.

Syntax: Description
equal (col, op1): Returns 'true' if the text in column col and the value of op1 are equal; 'false' otherwise.
contain (col, op1, lower, upper): Returns 'true' if, in column col, the number of times that the value of op1 appears is in the range [lower, upper]; 'false' otherwise.
dist (col, op1, op2, lower, upper): Returns 'true' if, in column col, the distance between the values of op1 and op2 (that is, the number of words between them) is in the range [lower, upper]; 'false' otherwise.

The distance computation depends on the InputLanguage and UseTokenizer syntax elements.

By default, InputLanguage is 'en' (English) and UseTokenizer is 'false', and words are delimited by whitespace characters.

If InputLanguage is 'zh_cn' (Simplified Chinese) or 'zh_tw' (Traditional Chinese) and UseTokenizer is 'true', then the function performs word segmentation before computing the distance between words.
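As an illustration of the default case (words delimited by whitespace), the following Python sketch counts the words between two single-word operands. It is not the ML Engine implementation. The function names are hypothetical, and the sketch assumes string literal operands (hence case-insensitive comparison) and assumes that one qualifying pair of occurrences is enough to return 'true'; the source does not state how multiple occurrences are handled.

```python
from typing import List


def word_distances(text: str, op1: str, op2: str) -> List[int]:
    """Number of words between each occurrence of op1 and each occurrence of op2."""
    words = text.lower().split()  # whitespace-delimited words
    first = [i for i, w in enumerate(words) if w == op1.lower()]
    second = [i for i, w in enumerate(words) if w == op2.lower()]
    # Adjacent words have distance 0; one word in between gives distance 1.
    return [abs(i - j) - 1 for i in first for j in second if i != j]


def dist(text: str, op1: str, op2: str, lower: int, upper: int) -> bool:
    # Assumption: a single pair within [lower, upper] satisfies the operation.
    return any(lower <= d <= upper for d in word_distances(text, op1, op2))


print(dist("start the final countdown now", "start", "countdown", 0, 2))  # True
```

In the example, two words ("the" and "final") lie between "start" and "countdown", so the distance is 2.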

superdist (col, op1, op2, con1, op3, con2): Returns 'true' if, in column col, the values of op1, op2, and op3 satisfy the context rules con1 and con2; 'false' otherwise.

The rules con1 and con2 specify the context for inclusion and exclusion, respectively. Each rule takes one of the following values:

nwn
    con1 meaning: op2 appears n or fewer words before or after op1.
    con2 meaning: op3 does not appear n or fewer words before or after op1.
nrn
    con1 meaning: op2 appears n or fewer words after op1.
    con2 meaning: op3 does not appear n or fewer words after op1.
para
    con1 meaning: op2 appears in the same paragraph as op1.
    con2 meaning: op3 does not appear in the same paragraph as op1.
sent
    con1 meaning: op2 appears in the same sentence as op1.
    con2 meaning: op3 does not appear in the same sentence as op1.

The distance computation depends on the InputLanguage and UseTokenizer syntax elements (for details, see the description of the dist operation).

A paragraph ends with either "\n" or "\r\n". A sentence ends with a period (.), question mark (?), or exclamation mark (!). The function fragments the input into paragraphs or sentences and then checks the context rule on each piece of text. If one piece satisfies the rule, then the function tags the whole input.

opn (where n is 1, 2, or 3) can be a list of words. Enclose the list in double quotation marks and separate the words with semicolons. For example: "good;bad;neutral"

If opn is a Java regular expression, then exp can be a list. Separate the items with semicolons. For example: regex"invest[\w]*;volatil[\w]*;risk"

When a list appears in an inclusion context, the rule is satisfied if at least one item appears in the context. When a list appears in an exclusion context, the rule is satisfied if no item appears in the context.

The operand-context pairs after op1 are optional; that is, the following are valid syntax:

superdist(col, op1, op2, con1,,)
superdist(col, op1,,, op3, con2)
superdist(col, op1, op2, con1, op3, con2)
superdist(col, op1,,,,)

The final syntax in the preceding list returns 'true' if op1 appears in col.
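The para and sent contexts, together with semicolon-separated lists, can be sketched in Python as follows. This is an illustration under stated assumptions, not the ML Engine implementation: the function names are hypothetical, matching is done by case-insensitive substring search, one context value is applied to both the inclusion and the exclusion operand, and the nwn and nrn contexts and regular expression operands are not covered.

```python
import re
from typing import List


def split_list(operand: str) -> List[str]:
    """Split a semicolon-separated operand such as "good;bad;neutral"."""
    return [item for item in operand.split(";") if item]


def fragments(text: str, context: str) -> List[str]:
    """Fragment the input into paragraphs ("para") or sentences ("sent")."""
    if context == "para":
        pieces = re.split(r"\r?\n", text)   # paragraph ends with \n or \r\n
    elif context == "sent":
        pieces = re.split(r"[.?!]", text)   # sentence ends with . ? or !
    else:
        raise ValueError("this sketch supports only 'para' and 'sent'")
    return [p.strip() for p in pieces if p.strip()]


def superdist_sketch(text: str, op1: str, op2: str = "", op3: str = "",
                     context: str = "sent") -> bool:
    include = [w.lower() for w in split_list(op2)]
    exclude = [w.lower() for w in split_list(op3)]
    for piece in fragments(text, context):
        low = piece.lower()
        if op1.lower() not in low:
            continue
        # Inclusion: at least one item appears. Exclusion: no item appears.
        included = not include or any(w in low for w in include)
        excluded = any(w in low for w in exclude)
        if included and not excluded:
            return True  # one satisfying piece tags the whole input
    return False


text = "The stock is good. The bond is bad and risky."
print(superdist_sketch(text, "stock", "good", "risky", "sent"))  # True
print(superdist_sketch(text, "stock", "good", "risky", "para"))  # False
```

With the sent context, "risky" is in a different sentence from "stock", so the exclusion rule is satisfied. With the para context, both words fall in the same paragraph, so the rule fails.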

dict (col, "[schema/]dictionary", lower, upper): Returns 'true' if, in column col, the number of dictionary items (lines in the dictionary file) that appear is in the range [lower, upper]; 'false' otherwise.

This operation requires that the dictionary file [schema.]dictionary is installed on ML Engine. The dictionary name, dictionary, is case-sensitive. If the dictionary is in the public schema, you can omit the schema name, schema.

operation1 and operation2: Returns 'true' if both operation1 and operation2 return 'true'; 'false' otherwise.
operation1 or operation2: Returns 'true' if operation1, operation2, or both return 'true'; 'false' otherwise.
not operation: Returns 'true' if operation returns 'false'; 'false' if operation returns 'true'.
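To show how operations combine, the following Python sketch models contain for a string literal operand and joins two contain checks with and and not. The names contain and launch_rule are hypothetical, and counting occurrences as substrings is an assumption of this sketch rather than documented ML Engine behavior.

```python
from typing import Optional


def contain(text: str, op1: str,
            lower: Optional[int] = None, upper: Optional[int] = None) -> bool:
    # str.count counts non-overlapping occurrences; lowercasing both sides
    # makes the match case-insensitive, as for string literal operands.
    x = text.lower().count(op1.lower())
    return (lower is None or x >= lower) and (upper is None or x <= upper)


def launch_rule(text: str) -> bool:
    # Shape of the rule: operation1 and not operation2
    return contain(text, "countdown", lower=1) and not contain(text, "abort", lower=1)


print(launch_rule("Start countdown. Countdown started."))  # True
print(launch_rule("Abort the countdown"))                   # False
```

The second input contains "countdown" but also "abort", so the not operation returns 'false' and the combined rule fails.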