TextParser Syntax Elements - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
9.02
9.01
2.0
1.3
Published
February 2022
Language
English (United States)
Last Update
2022-02-10
dita:id
B700-4003
TextColumn
Specify the name of the input column with contents to tokenize.
ConvertToLowerCase
[Optional] Specify whether to convert input text to lowercase.
The function ignores this syntax element if the StemTokens syntax element has the value 'true'.
Default: 'true'
StemTokens
[Optional] Specify whether to stem the tokens, that is, whether to apply the Porter2 stemming algorithm to each token to reduce it to its root form. Before stemming, the function converts the input text to lowercase and applies the RemoveStopWords syntax element.
Default: 'false'
Delimiter
[Optional] Specify a regular expression that represents the word delimiter.
The function uses only the specified characters as delimiters. For example, if you specify Delimiter ('-'), the function uses only the hyphen character as a delimiter. To use the hyphen and the default delimiters, specify Delimiter ('[- \t\f\r\n]+').
Default: '[ \t\f\r\n]+'
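The effect of the three delimiter settings described above can be sketched in Python. This is only an illustration using Python's re module, not the function's own implementation; the helper name tokenize and the sample string are hypothetical, and dropping empty tokens is an assumption.

```python
import re

# Default delimiter from the reference text: runs of whitespace characters.
DEFAULT_DELIMITER = r"[ \t\f\r\n]+"

def tokenize(text, delimiter=DEFAULT_DELIMITER):
    """Split text on a delimiter regex; empty tokens are dropped (assumption)."""
    return [tok for tok in re.split(delimiter, text) if tok]

sample = "state-of-the-art text\tparser"

# Default: only whitespace separates tokens, so hyphens stay inside tokens.
default_tokens = tokenize(sample)

# Delimiter ('-'): only the hyphen separates tokens, so whitespace stays inside.
hyphen_tokens = tokenize(sample, r"-")

# Delimiter ('[- \t\f\r\n]+'): hyphen plus the default delimiters.
combined_tokens = tokenize(sample, r"[- \t\f\r\n]+")

print(default_tokens)
print(hyphen_tokens)
print(combined_tokens)
```

With the hyphen-only delimiter, the last token still contains a space and a tab, which is why the combined character class is needed when both kinds of separator should apply.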
OutputTotalWords
[Optional] Specify whether to output a column that contains the total number of words in the input document.
Default: 'false'
Punctuation
[Optional] Specify a regular expression that represents the punctuation characters to remove from the input text. With StemTokens ('true'), the recommended value is '[\\\[.,?\!:;~()\\\]]+'.
Default: '[.,!?]'
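As a rough illustration of the default Punctuation value, the following Python sketch removes every character matched by the pattern. It assumes that matched characters are deleted outright rather than replaced by a space; the helper name strip_punctuation and the sample sentence are hypothetical.

```python
import re

# Default punctuation pattern from the reference text.
DEFAULT_PUNCTUATION = r"[.,!?]"

def strip_punctuation(text, punctuation=DEFAULT_PUNCTUATION):
    """Delete every character matched by the punctuation regex (assumption)."""
    return re.sub(punctuation, "", text)

cleaned = strip_punctuation("Hello, world! Ready?")
print(cleaned)

# Characters outside the class, such as ';' or ':', are left untouched
# by the default pattern.
untouched = strip_punctuation("a; b: c")
print(untouched)
```

The second call shows why a broader character class may be preferable when the text contains punctuation beyond the four default characters.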
Accumulate
[Optional] Specify the names of the input columns to copy to the output table.
An accumulate_column cannot have the same name as token_column or total_column.
Default: All input columns
TokenColName
[Optional] Specify the name of the output column that contains the tokens.
Default: 'token'
FrequencyColName
[Optional] Specify the name of the output column that contains the frequency of each token.
The function ignores this syntax element if the OutputByWord syntax element has the value 'false'.
Default: 'frequency'
TotalColName
[Optional] Specify the name of the output column that contains the total number of words in the input document.
Default: 'total_count'
RemoveStopWords
[Optional] Specify whether to remove stop words from the input text before parsing.
Default: 'false'
PositionColName
[Optional] Specify the name of the output column that contains the position of a word within a document.
Default: 'location'
ListPositions
[Optional] Specify whether to output the position of a word in list form.
The function ignores this syntax element if the OutputByWord syntax element has the value 'false'.
Default: 'false' (The function outputs a row for each occurrence of the word.)
OutputByWord
[Optional] Specify whether to output each token of each input document in its own row in the output table. If you specify 'false', then the function outputs each tokenized input document in one row of the output table.
Default: 'true'
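The difference between one row per occurrence and positions in list form can be sketched as follows. This is a hypothetical Python model of the output rows, not the function's actual output format: zero-based positions, comma-separated position lists, and the tuple layout (token, frequency, position) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def rows_per_occurrence(tokens):
    """Model of ListPositions ('false'): one row for each occurrence of a word."""
    counts = Counter(tokens)
    # Positions are zero-based here (assumption).
    return [(tok, counts[tok], pos) for pos, tok in enumerate(tokens)]

def rows_with_position_lists(tokens):
    """Model of ListPositions ('true'): one row per word, positions as a list."""
    positions = defaultdict(list)
    for pos, tok in enumerate(tokens):
        positions[tok].append(pos)
    # The comma-separated list format is an assumption.
    return [(tok, len(p), ",".join(str(i) for i in p)) for tok, p in positions.items()]

tokens = ["the", "cat", "saw", "the", "dog"]
occurrence_rows = rows_per_occurrence(tokens)
list_rows = rows_with_position_lists(tokens)

for row in occurrence_rows:
    print(row)
for row in list_rows:
    print(row)
```

In this model, a word that appears twice produces two rows in the first form and a single row with two positions in the second.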
StemExceptions
[Optional] Specify the location of the file that contains the stemming exceptions. A stemming exception is a word followed by its stemmed form. The word and its stemmed form are separated by white space. Each stemming exception is on its own line in the file. For example:
bias bias 
news news 
goods goods 
lying lie 
ugly ugli 
sky sky 
early earli
With these exceptions, the function stems 'lying', 'ugly', and 'early' to 'lie', 'ugli', and 'earli', respectively, and leaves the other words unchanged.
Default: No stemming exceptions
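The file format described above (one word and its stemmed form per line, separated by white space) can be read with a few lines of Python. This is a minimal sketch for checking a stemming exceptions file before use; the helper name parse_stem_exceptions is hypothetical, and skipping malformed lines is an assumption rather than documented behavior.

```python
def parse_stem_exceptions(lines):
    """Map each word to its stemmed form; lines without exactly two fields are skipped."""
    exceptions = {}
    for line in lines:
        parts = line.split()  # splits on any run of white space
        if len(parts) == 2:
            word, stem = parts
            exceptions[word] = stem
    return exceptions

# The example file contents from the reference text, one entry per line.
sample_file = [
    "bias bias ",
    "news news ",
    "goods goods ",
    "lying lie ",
    "ugly ugli ",
    "sky sky ",
    "early earli",
]

exceptions = parse_stem_exceptions(sample_file)

# Entries whose stemmed form differs from the word itself.
changed = {word: stem for word, stem in exceptions.items() if word != stem}
print(changed)
```

Entries that map a word to itself, such as 'news', protect that word from being altered by the stemmer.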
StopWordsList
[Optional, disallowed with StopWordsTable.] Specify the location of the file that contains the stop words (words to ignore when parsing text). Each stop word is on its own line in the file. For example:
a 
an 
the 
and 
this 
with 
but 
will
Alternatively, you can specify StopWordsTable.
Default: No stop words
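The stop word file format (one stop word per line) and its effect on a token list can be sketched in Python. The helper names are hypothetical, and the sketch assumes that tokens have already been converted to lowercase and that matching is exact.

```python
def load_stop_words(lines):
    """Build a set of stop words from lines, ignoring blank lines and surrounding spaces."""
    return {line.strip() for line in lines if line.strip()}

def remove_stop_words(tokens, stop_words):
    """Keep only tokens that are not stop words (exact match, an assumption)."""
    return [tok for tok in tokens if tok not in stop_words]

# The example file contents from the reference text, one stop word per line.
sample_file = ["a ", "an ", "the ", "and ", "this ", "with ", "but ", "will"]

stop_words = load_stop_words(sample_file)
kept = remove_stop_words(["this", "parser", "will", "split", "the", "text"], stop_words)
print(kept)
```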