Defining Tagging Rules - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product

Teradata Vantage

Release Number

8.10

1.1

Published

October 2019

Language

English (United States)

Last Update

2019-12-31

You can specify tagging rules with either the TaggingRules syntax element or a Rules table.

Rules for Rule Operations Table

The operand opn (where n is 1, 2, or 3) can be any of the following:

opn	Rules for opn
String literal	Enclose string literal in double quotation marks (for example, "Start countdown"). If string literal contains double quotation marks, precede each double quotation mark with two backslashes (for example, "\\"Start countdown\\""). Do not use the empty string (""). If an operation has only string literal operands, matches are case-insensitive and do not consider overlapping.
Java regular expression (regex"exp")	An operation with one or more Java regular expression operands uses fuzzy matching. Fuzzy matching evaluates original text input; that is, matching is case-sensitive and text is not tokenized.
[superdist operation only] List of string literals or Java regular expressions	For details, see description of superdist operation in following table.
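
As an illustration of the quoting rule for string literals, the following Python sketch builds an operand from raw text. The helper name quote_literal is hypothetical and is not part of ML Engine; it only mirrors the escaping rule stated in the table above.

```python
def quote_literal(text: str) -> str:
    """Wrap text as a string-literal operand (hypothetical helper)."""
    if text == "":
        # The rules above disallow the empty string as an operand.
        raise ValueError("the empty string is not a valid operand")
    # Precede each embedded double quotation mark with two backslashes.
    escaped = text.replace('"', '\\\\"')
    return '"' + escaped + '"'


print(quote_literal('Start countdown'))
print(quote_literal('"Start countdown"'))
```

The second call produces the same escaped form as the example in the table.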

The operands lower and upper are nonnegative integers.
You can omit either lower or upper, but not both. For example, all of the following are valid syntax for the contain operation:
```
contain (col, op1, lower, upper)
contain (col, op1, lower,)
contain (col, op1,, upper)
```
If x is the number of times that op1 appears in col, then the preceding operations have the following meanings, respectively:

lower <= x <= upper

lower <= x

x <= upper

The meanings of lower, x, and upper depend on the operation.
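
The three forms of the range check can be sketched in Python as follows. This is a minimal illustration, not ML Engine code: the function name in_range is hypothetical, and an omitted bound is modeled as None.

```python
from typing import Optional


def in_range(x: int, lower: Optional[int] = None,
             upper: Optional[int] = None) -> bool:
    """Check x against an optional lower and upper bound (both inclusive)."""
    if lower is None and upper is None:
        # The rules allow omitting either bound, but not both.
        raise ValueError("specify lower, upper, or both")
    for bound in (lower, upper):
        if bound is not None and bound < 0:
            raise ValueError("lower and upper must be nonnegative integers")
    if lower is not None and x < lower:
        return False
    if upper is not None and x > upper:
        return False
    return True


print(in_range(3, 2, 5))          # lower <= x <= upper
print(in_range(3, lower=4))       # lower <= x
print(in_range(0, upper=2))       # x <= upper
```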

Rule Operations

This table summarizes the operations that a rule can use. For simplicity, the table shows only the syntax that specifies both lower and upper.

Syntax

Description

equal (col, op1)

Returns 'true' if the text in column col and the value of op1 are equal; 'false' otherwise.

contain (col, op1, lower, upper)

Returns 'true' if, in column col, the number of times that the value of op1 appears is in the range [lower, upper]; 'false' otherwise.
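
For a string-literal operand, the contain check can be approximated in Python as shown below. This is a sketch under stated assumptions: the function name contain is reused only for readability, substring counting stands in for whatever tokenization the function applies internally, and an omitted bound is modeled as None.

```python
from typing import Optional


def contain(text: str, literal: str,
            lower: Optional[int] = None, upper: Optional[int] = None) -> bool:
    """Approximate contain(col, op1, lower, upper) for a string literal."""
    if lower is None and upper is None:
        raise ValueError("specify lower, upper, or both")
    # String-literal matching is case-insensitive; str.count counts
    # non-overlapping occurrences, matching the "no overlapping" rule.
    x = text.lower().count(literal.lower())
    return (lower is None or lower <= x) and (upper is None or x <= upper)


print(contain("Start countdown, then START COUNTDOWN again",
              "start countdown", lower=2))
print(contain("aaaa", "aa", 2, 2))
```

In the second call, "aa" is counted twice rather than three times because overlapping matches are not considered.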

dist (col, op1, op2, lower, upper)

Returns 'true' if, in column col, the distance between the values of op1 and op2 (that is, the number of words between them) is in the range [lower, upper]; 'false' otherwise.

The distance computation depends on the InputLanguage and UseTokenizer syntax elements.

By default, InputLanguage is 'en' (English) and UseTokenizer is 'false', and words are delimited by whitespace characters.

If InputLanguage is 'zh_cn' (Simplified Chinese) or 'zh_tw' (Traditional Chinese) and UseTokenizer is 'true', then the function performs word segmentation before computing the distance between words.
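
The default distance computation (English input, no tokenizer, whitespace-delimited words) can be sketched in Python as follows. The sketch makes several assumptions that the text above does not confirm: operands are single words, the order of op1 and op2 does not matter, and the rule is satisfied if any pair of occurrences falls in the range. Punctuation attached to a word is not stripped.

```python
from typing import List, Optional


def word_distances(text: str, op1: str, op2: str) -> List[int]:
    """Number of words between each occurrence of op1 and each of op2."""
    tokens = text.lower().split()  # words delimited by whitespace
    pos1 = [i for i, t in enumerate(tokens) if t == op1.lower()]
    pos2 = [i for i, t in enumerate(tokens) if t == op2.lower()]
    return [abs(i - j) - 1 for i in pos1 for j in pos2 if i != j]


def dist(text: str, op1: str, op2: str,
         lower: Optional[int] = None, upper: Optional[int] = None) -> bool:
    """Approximate dist(col, op1, op2, lower, upper) for single words."""
    if lower is None and upper is None:
        raise ValueError("specify lower, upper, or both")
    return any(
        (lower is None or lower <= d) and (upper is None or d <= upper)
        for d in word_distances(text, op1, op2)
    )


text = "the quick brown fox jumps over the lazy dog"
print(word_distances(text, "quick", "fox"))
print(dist(text, "quick", "fox", 0, 1))
```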

superdist (col, op1, op2, con1, op3, con2)

Returns 'true' if, in column col, the values of op1, op2, and op3 satisfy the context rules con1 and con2; 'false' otherwise.

The rules con1 and con2 specify the context for inclusion and exclusion, as the following table shows.

con1 or con2 Value	con1 Meaning	con2 Meaning
nwn	op2 appears n or fewer words before or after op1.	op3 does not appear n or fewer words before or after op1.
nrn	op2 appears n or fewer words after op1.	op3 does not appear n or fewer words after op1.
para	op2 appears in the same paragraph as op1.	op3 does not appear in the same paragraph as op1.
sent	op2 appears in the same sentence as op1.	op3 does not appear in the same sentence as op1.

The distance computation depends on the InputLanguage and UseTokenizer syntax elements (for details, see the description of the dist operation).

A paragraph ends with either "\n" or "\r\n". A sentence ends with a period (.), question mark (?), or exclamation mark (!). The function splits the input into paragraphs or sentences and then checks the context rule on each piece of text. If at least one piece satisfies the rule, then the function tags the whole input.

opn (where n is 1, 2, or 3) can be a list of words. Enclose the list in double quotation marks and separate the words with semicolons. For example: "good;bad;neutral"

If opn is a Java regular expression, then exp can be a list. Separate the items with semicolons. For example: regex"invest[\w]*;volatil[\w]*;risk"

When a list appears in an inclusion context, the rule is satisfied if at least one item appears in the context. When a list appears in an exclusion context, the rule is satisfied if no item appears in the context.

The operand-context pairs after op1 are optional; that is, the following are valid syntax:

superdist(col, op1, op2, con1,,)

superdist(col, op1,,, op3, con2)

superdist(col, op1, op2, con1, op3, con2)

superdist(col, op1,,,,)

The final syntax in the preceding list returns 'true' if op1 appears in col.
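
The para and sent contexts, together with semicolon-separated operand lists, can be sketched in Python as follows. This is an illustration under stated assumptions, not the function's implementation: only string-literal operands and the para and sent contexts are modeled, matching is by case-insensitive substring, op1 is a single literal, and omitted operands are modeled as None.

```python
import re
from typing import Optional


def has_any(piece: str, items: str) -> bool:
    """True if at least one item of a semicolon-separated list is in piece."""
    piece = piece.lower()
    return any(item.lower() in piece for item in items.split(";"))


def superdist(text: str, op1: str,
              op2: Optional[str] = None, con1: Optional[str] = None,
              op3: Optional[str] = None, con2: Optional[str] = None) -> bool:
    """Approximate superdist for the 'para' and 'sent' contexts only."""
    for para in re.split(r"\r?\n", text):       # paragraphs end at \n or \r\n
        for sent in re.split(r"[.?!]", para):   # sentences end at . ? or !
            if op1.lower() not in sent.lower():
                continue
            scope = {"para": para, "sent": sent}
            # Inclusion: at least one op2 item appears in the con1 context.
            included = op2 is None or has_any(scope[con1], op2)
            # Exclusion: no op3 item appears in the con2 context.
            excluded = op3 is not None and has_any(scope[con2], op3)
            if included and not excluded:
                return True  # one satisfying piece tags the whole input
    return False


doc = "Rates rose sharply. The outlook is good.\nInvestors saw bad news."
print(superdist(doc, "outlook", "good;neutral", "sent"))
print(superdist(doc, "outlook", "good", "sent", "rates", "para"))
```

The second call fails the exclusion rule because "Rates" appears in the same paragraph as "outlook", even though it is in a different sentence.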

dict (col, "[schema/]dictionary", lower, upper)

Returns 'true' if, in column col, the number of dictionary items (lines in the dictionary file) that appear is in the range [lower, upper]; 'false' otherwise.

This operation requires that the dictionary file [schema.]dictionary is installed on ML Engine. The dictionary name, dictionary, is case-sensitive. If the dictionary is in the public schema, you can omit the schema name, schema.

operation1 and operation2

Returns 'true' if both operation1 and operation2 return 'true'; 'false' otherwise.

operation1 or operation2

Returns 'true' if operation1, operation2, or both return 'true'; 'false' otherwise.

not operation

Returns 'true' if operation returns 'false'; 'false' if operation returns 'true'.