Defining Tagging Rules - Aster Analytics

Teradata Aster Analytics Foundation User Guide

Product
Aster Analytics
Release Number
6.21
Published
November 2016
Language
English (United States)
Last Update
2018-04-14
dita:id
B700-1021
lifecycle
previous
Product Category
Software

You can specify tagging rules with either the Rules argument or a rules table.

The following table explains the operations that a rule can use. In the table:

  • The operand opn (where n is 1, 2, or 3) can be any of the following:
    • A string literal

      You must enclose the string literal in double quotation marks (for example, "Start countdown").

      If the string literal contains double quotation marks, then you must precede each double quotation mark with two backslashes (for example, "\\"Start countdown\\"").

      The empty string ("") is not allowed.

      If an operation has only string literal operands, matching is case-insensitive and overlapping matches are not counted.

    • A Java regular expression (regex"exp")

      An operation with one or more Java regular expression operands uses fuzzy matching. Fuzzy matching evaluates the original text input; that is, matching is case-sensitive and the text is not tokenized.

    • In the superdist operation only, a list of string literals or Java regular expressions

      For details, see the description of the superdist operation in the following table.

  • The operands lower and upper are nonnegative integers.

    You can omit either lower or upper, but not both. For example, all of the following are valid syntax for the contain operation:

    contain (col, op1, lower, upper)
    contain (col, op1, lower,)
    contain (col, op1,, upper)

    If x is the number of times that op1 appears in col, then the preceding operations have the following meanings, respectively:

    lower <= x <= upper

    lower <= x

    x <= upper

    The meanings of lower, x, and upper depend on the operation.

    For simplicity, the following table shows only the syntax that specifies both lower and upper.
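
The quoting and range conventions above can be sketched in Python. This is an illustration only, not the function's implementation: the helper names quote_literal, in_range, and contain are hypothetical, None stands in for an omitted lower or upper, and plain substring counting is assumed in place of whatever tokenized matching the function actually performs.

```python
def quote_literal(text):
    """Wrap text as a rule string literal, escaping embedded double quotes."""
    if text == "":
        raise ValueError("the empty string is not allowed as an operand")
    # Each embedded double quotation mark is preceded by two backslashes.
    return '"' + text.replace('"', '\\\\"') + '"'


def in_range(x, lower=None, upper=None):
    """Check lower <= x <= upper, where one bound (not both) may be omitted."""
    if lower is None and upper is None:
        raise ValueError("omit either lower or upper, but not both")
    if lower is not None and x < lower:
        return False
    if upper is not None and x > upper:
        return False
    return True


def contain(text, op1, lower=None, upper=None):
    """Hypothetical stand-in for the contain operation on one text value."""
    # Case-insensitive; str.count does not count overlapping occurrences.
    x = text.lower().count(op1.lower())
    return in_range(x, lower, upper)


print(quote_literal('"Start countdown"'))
print(contain("go go go stop", "go", 2, 3))     # lower <= x <= upper
print(contain("go go go stop", "go", 2))        # lower <= x
print(contain("go go go stop", "go", None, 2))  # x <= upper
```

Passing None for exactly one bound mirrors the syntax forms that leave lower or upper empty; omitting both is rejected, as the text requires.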

Rule Operations
Syntax Description
equal (col, op1)
Returns 'true' if the text in column col and the value of op1 are equal; 'false' otherwise.
contain (col, op1, lower, upper)
Returns 'true' if, in column col, the number of times that the value of op1 appears is in the range [lower, upper]; 'false' otherwise.
dist (col, op1, op2, lower, upper)
Returns 'true' if, in column col, the distance between the values of op1 and op2 (that is, the number of words between them) is in the range [lower, upper]; 'false' otherwise.

The distance computation depends on the Language and UseTokenizer arguments.

By default, Language is 'en' (English) and UseTokenizer is 'false', and words are delimited by whitespace characters.

If Language is 'zh_cn' (Simplified Chinese) or 'zh_tw' (Traditional Chinese) and UseTokenizer is 'true', then the function performs word segmentation before computing the distance between words.
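
As a rough illustration of the default case (whitespace-delimited words), the distance check might look like the following Python sketch. The function name word_distance_in_range is hypothetical, and several details are assumptions rather than documented behavior: operands are single words, matching is case-insensitive, punctuation attached to a word is not stripped, and the check succeeds if any pair of occurrences falls in the range.

```python
def word_distance_in_range(text, op1, op2, lower=None, upper=None):
    """Return True if some occurrence of op1 and op2 has a word gap in range."""
    tokens = text.lower().split()  # words delimited by whitespace
    first = [i for i, tok in enumerate(tokens) if tok == op1.lower()]
    second = [j for j, tok in enumerate(tokens) if tok == op2.lower()]
    for i in first:
        for j in second:
            if i == j:
                continue
            between = abs(i - j) - 1  # number of words between the two
            if lower is not None and between < lower:
                continue
            if upper is not None and between > upper:
                continue
            return True
    return False


text = "Start the final countdown now"
# "the" and "final" lie between "start" and "countdown": distance 2.
print(word_distance_in_range(text, "start", "countdown", 0, 2))
print(word_distance_in_range(text, "start", "countdown", 3, None))
```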

superdist (col, op1, op2, con1, op3, con2)
Returns 'true' if, in column col, the values of op1, op2, and op3 satisfy the context rules con1 and con2; 'false' otherwise.

The rule con1 specifies the context for inclusion. The possible values of con1 and their meanings are:

nwn: op2 appears within n words before or after op1.

nrn: op2 appears within n words after op1.

para: op2 appears in the same paragraph as op1.

sent: op2 appears in the same sentence as op1.

The rule con2 specifies the context for exclusion. The possible values of con2 and their meanings are:

nwn: op3 does not appear within n words before or after op1.

nrn: op3 does not appear within n words after op1.

para: op3 does not appear in the same paragraph as op1.

sent: op3 does not appear in the same sentence as op1.

The distance computation depends on the Language and UseTokenizer arguments (for details, see the description of the dist operation).

A paragraph ends with either "\n" or "\r\n". A sentence ends with a period (.), question mark (?), or exclamation mark (!). The function splits the input into paragraphs or sentences and then checks the context rule on each piece of text. If any piece satisfies the rule, then the function tags the whole input.

opn (where n is 1, 2, or 3) can be a list of words. Enclose the list in double quotation marks and separate the words with semicolons. For example: "good;bad;neutral"

If opn is a Java regular expression, then exp can be a list. Separate the items with semicolons. For example: regex"invest[\w]*;volatil[\w]*;risk"

When a list appears in an inclusion context, the rule is satisfied if at least one item appears in the context. When a list appears in an exclusion context, the rule is satisfied if no item appears in the context.

The operand-context pairs after op1 are optional; that is, the following are valid syntax:

superdist(col, op1, op2, con1,,)
superdist(col, op1,,, op3, con2)
superdist(col, op1, op2, con1, op3, con2)
superdist(col, op1,,,,)

The final syntax in the preceding list returns 'true' if op1 appears in col.
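
The para and sent contexts, together with the list semantics described above, can be approximated by the following Python sketch. It is a simplified model under stated assumptions, not the function's algorithm: the names parse_list, fragments, and tag_by_context are hypothetical, one context is applied to both inclusion and exclusion, operands are string literals matched as case-insensitive substrings, and the word-distance contexts (nwn, nrn) are not covered.

```python
import re


def parse_list(operand):
    """Split a semicolon-separated operand such as "good;bad;neutral"."""
    return [item.lower() for item in operand.split(";") if item]


def fragments(text, context):
    """Split text into paragraphs ("para") or sentences ("sent")."""
    if context == "para":
        pieces = re.split(r"\r?\n", text)   # paragraph ends with \n or \r\n
    elif context == "sent":
        pieces = re.split(r"[.?!]", text)   # sentence ends with . ? or !
    else:
        raise ValueError("this sketch supports only 'para' and 'sent'")
    return [p.strip().lower() for p in pieces if p.strip()]


def tag_by_context(text, op1, include="", exclude="", context="sent"):
    """Return True if any single piece of text satisfies the context rules."""
    wanted = parse_list(include)
    unwanted = parse_list(exclude)
    for piece in fragments(text, context):
        if op1.lower() not in piece:
            continue
        # Inclusion: at least one listed item must appear in the piece.
        if wanted and not any(item in piece for item in wanted):
            continue
        # Exclusion: no listed item may appear in the piece.
        if any(item in piece for item in unwanted):
            continue
        return True  # one satisfying piece tags the whole input
    return False


text = "The market is good. Risk is high today.\nBonds look bad."
print(tag_by_context(text, "market", "good;great", "risk", "sent"))
print(tag_by_context(text, "market", "good;great", "risk", "para"))
```

In this sample, the sentence context succeeds because the first sentence contains "market" and "good" but not "risk", while the paragraph context fails because "risk" appears in the same paragraph as "market".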

dict (col, "[schema/]dictionary", lower, upper)
Returns 'true' if, in column col, the number of dictionary items (lines in the dictionary file) that appear is in the range [lower, upper]; 'false' otherwise.
This operation requires that the dictionary file [schema.]dictionary is installed on your Aster Database cluster. The dictionary name, dictionary, is case-sensitive. If the dictionary is in the public schema, then you can omit the schema name, schema.
operation1 and operation2
Returns 'true' if both operation1 and operation2 return 'true'; 'false' otherwise.
operation1 or operation2
Returns 'true' if operation1, operation2, or both return 'true'; 'false' otherwise.
not operation
Returns 'true' if operation returns 'false'; 'false' if operation returns 'true'.
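
The way and, or, and not combine the results of other operations can be modeled with small Python predicates. This is only a sketch of the boolean logic: the names equal, and_, or_, and not_ are hypothetical Python helpers, each row is assumed to be a dictionary that maps column names to text, and the case-insensitive comparison in equal is an assumption based on the string literal matching rules.

```python
def equal(col, op1):
    """Predicate: the text in column col equals op1 (case-insensitive here)."""
    return lambda row: row[col].lower() == op1.lower()


def and_(operation1, operation2):
    """True only if both operations return True."""
    return lambda row: operation1(row) and operation2(row)


def or_(operation1, operation2):
    """True if one or both operations return True."""
    return lambda row: operation1(row) or operation2(row)


def not_(operation):
    """True if the operation returns False, and False if it returns True."""
    return lambda row: not operation(row)


# (status is "open" or "pending") and not (owner is "nobody")
rule = and_(
    or_(equal("status", "open"), equal("status", "pending")),
    not_(equal("owner", "nobody")),
)

print(rule({"status": "Open", "owner": "alice"}))
print(rule({"status": "closed", "owner": "alice"}))
```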