NGrams Arguments - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage
Release Number: 8.00, 1.0
Published: May 2019
Language: English (United States)
Last Update: 2019-11-22
dita:id: B700-4003
TextColumn
Specify the name of the column that contains the input text. This column must have a SQL string data type.
Delimiter
[Optional] Specify, with a regular expression, the character or string that separates words in the input text.
Default: "\\s+" (all whitespace characters—space, tab, newline, carriage return and others)
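The effect of a regular-expression delimiter can be sketched in Python. This is only an illustration of splitting on the default pattern, not the function's actual implementation; the helper name split_words is hypothetical.

```python
import re

def split_words(text, delimiter=r"\s+"):
    """Split text into words on a regular-expression delimiter (hypothetical helper)."""
    # Drop empty strings produced by leading or trailing delimiters.
    return [word for word in re.split(delimiter, text) if word]

print(split_words("the quick\tbrown\n fox"))  # ['the', 'quick', 'brown', 'fox']
```

A run of mixed whitespace (here a newline followed by a space) counts as a single separator because the pattern matches one or more whitespace characters.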
Grams
Specify the length, in words, of each n-gram (that is, the value of n). You can specify a single integer or a value_range, which has the syntax integer1-integer2, where integer1 <= integer2. The values of n, integer1, and integer2 must be positive.
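As a rough illustration of the value rules, the following Python sketch expands one Grams value into the lengths it denotes. It assumes that a value_range covers every integer from integer1 through integer2; the helper name expand_grams is hypothetical and is not part of the function.

```python
def expand_grams(spec):
    """Expand a Grams-style value such as "2" or "4-6" into a list of n values."""
    if "-" in spec:
        low_text, high_text = spec.split("-", 1)
        low, high = int(low_text), int(high_text)
    else:
        low = high = int(spec)
    # Both bounds must be positive, and integer1 <= integer2.
    if low < 1 or low > high:
        raise ValueError(f"invalid Grams value: {spec!r}")
    return list(range(low, high + 1))

print(expand_grams("4-6"))  # [4, 5, 6]
print(expand_grams("2"))    # [2]
```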
OverLapping
[Optional] Specify whether the function allows overlapping n-grams.
Default: 'true'. With this value, each word in each sentence starts an n-gram if enough words follow it in the same sentence to form a whole n-gram of the specified size. For information on sentences, see the Reset argument description.
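The difference between the two settings can be sketched in Python. The overlapping case follows the description above. The non-overlapping case is an assumption (each n-gram starts where the previous one ended), because this section describes only the 'true' behavior. The helper name ngrams is hypothetical.

```python
def ngrams(words, n, overlapping=True):
    """Build n-grams from the words of one sentence (illustrative sketch)."""
    # Overlapping: every word may start an n-gram.
    # Non-overlapping (assumed): skip ahead n words after each n-gram.
    step = 1 if overlapping else n
    return [" ".join(words[i:i + n]) for i in range(0, len(words) - n + 1, step)]

words = ["a", "b", "c", "d", "e"]
print(ngrams(words, 2, overlapping=True))   # ['a b', 'b c', 'c d', 'd e']
print(ngrams(words, 2, overlapping=False))  # ['a b', 'c d']
```

In both cases a trailing word that cannot complete a whole n-gram (here "e" in the non-overlapping case) produces no output.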
ToLowerCase
[Optional] Specify whether the function converts all letters in the input text to lowercase.
Default: 'true'
Reset
[Optional] Specify, with a regular expression, the character or string that ends a sentence. At the end of a sentence, the function discards any partial n-grams and searches for the next n-gram at the beginning of the next sentence. An n-gram cannot span sentences.
The function applies the Reset argument before the Punctuation argument; that is, it splits the input into sentences before removing punctuation characters.
Default: '[.,?!]'
Punctuation
[Optional] Specify, with a regular expression, the punctuation characters for the function to remove before evaluating the input text.
The function applies the Reset argument before the Punctuation argument; that is, it splits the input into sentences before removing punctuation characters.
Default: '[`~#^&*()-]'
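The order of operations (Reset first, then Punctuation) can be sketched in Python using the two default patterns. This is an approximation under stated assumptions, not the function's implementation: punctuation characters are replaced with nothing, words are split on whitespace, and text is lowercased as with the ToLowerCase default. The helper name sentence_ngrams is hypothetical.

```python
import re

RESET = r"[.,?!]"              # default Reset pattern
PUNCTUATION = r"[`~#^&*()-]"   # default Punctuation pattern

def sentence_ngrams(text, n, reset=RESET, punctuation=PUNCTUATION):
    """Split into sentences, strip punctuation, then build overlapping n-grams."""
    grams = []
    for sentence in re.split(reset, text):            # 1. Reset: split into sentences
        cleaned = re.sub(punctuation, "", sentence)   # 2. Punctuation: remove characters
        words = cleaned.lower().split()
        # N-grams are built per sentence, so none can span a sentence boundary.
        grams.extend(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return grams

print(sentence_ngrams("Hello world (again)! How are you today?", 2))
# ['hello world', 'world again', 'how are', 'are you', 'you today']
```

No "again how" bigram appears, because the exclamation mark ends the first sentence before the parentheses are removed.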
TotalGramCount
[Optional] Specify whether the function returns the total number of n-grams in the document (that is, in the row) for each length n specified in the Grams argument. If you specify 'true', the TotalCountColumn argument determines the name of the output table column that contains these totals.
The total number of n-grams is not necessarily the number of unique n-grams.
Default: 'false'
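The distinction between the total number of n-grams and the number of unique n-grams can be illustrated with a short Python sketch. The input list is made-up sample data and the helper name count_ngrams is hypothetical.

```python
from collections import Counter

def count_ngrams(grams):
    """Return (per-n-gram frequency, total n-gram count) for one document."""
    frequency = Counter(grams)
    total = sum(frequency.values())  # counts repeats, unlike len(frequency)
    return frequency, total

bigrams = ["to be", "be or", "or not", "not to", "to be"]
frequency, total = count_ngrams(bigrams)
print(total)               # 5
print(len(frequency))      # 4
print(frequency["to be"])  # 2
```

Here the document has five bigrams in total but only four unique ones, because "to be" occurs twice.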
TotalCountColumn
[Optional] Specify the name of the output table column that contains the total n-gram counts. This column appears only if the value of the TotalGramCount argument is 'true'.
Default: 'totalcnt'
Accumulate
[Optional] Specify the names of the input table columns to copy to the output table for each n-gram. These columns cannot have the same names as those specified by the arguments NGramColumn, NumGramsColumn, and TotalCountColumn.
Default: All input columns for each n-gram
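The naming restriction can be expressed as a simple check, sketched here in Python. The comparison is case-insensitive, which is an assumption about how column names are matched. The helper name check_accumulate and the sample column names paraid and paratopic are hypothetical. The default output column names are those documented for NGramColumn, NumGramsColumn, and TotalCountColumn.

```python
def check_accumulate(accumulate, ngram_column="ngram",
                     num_grams_column="n", total_count_column="totalcnt"):
    """Return the Accumulate names unchanged, or raise if any clashes with an output column."""
    reserved = {name.lower() for name in (ngram_column, num_grams_column, total_count_column)}
    # Assumption: column names are compared without regard to case.
    clashes = [name for name in accumulate if name.lower() in reserved]
    if clashes:
        raise ValueError("Accumulate columns clash with output columns: " + ", ".join(clashes))
    return list(accumulate)

print(check_accumulate(["paraid", "paratopic"]))  # ['paraid', 'paratopic']
```

Renaming the output column (for example, passing a different ngram_column) is one way to resolve a clash when an input column is already called ngram.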
NGramColumn
[Optional] Specify the name of the output table column that is to contain the created n-grams.
Default: 'ngram'
NumGramsColumn
[Optional] Specify the name of the output table column that is to contain the length of each n-gram (in words).
Default: 'n'
FrequencyColumn
[Optional] Specify the name of the output table column that is to contain the count of each unique n-gram (that is, the number of times that each unique n-gram appears in the document).
Default: 'frequency'