Text_Parser Arguments - Aster Analytics

Teradata Aster® Analytics Foundation User Guide Update 2

Product

Aster Analytics

Release Number

7.00.02

Published

September 2017

Language

English (United States)

Last Update

2018-04-17

dita:id

B700-1022

Product Category

Software

TextColumn

Specifies the name of the input column whose contents are to be tokenized.

ToLowerCase

[Optional] Specifies whether to convert input text to lowercase. Default: 'true'.

The function ignores this argument if the Stemming argument has the value 'true'.

Stemming

[Optional] Specifies whether to stem the tokens, that is, whether to apply the Porter2 stemming algorithm to each token to reduce it to its root form. Before stemming, the function converts the input text to lowercase and applies the RemoveStopWords argument. Default: 'false'.

Delimiter

[Optional] Specifies a regular expression that represents the word delimiter. Default: '[\t\f\r\n]+'.
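
To illustrate how a delimiter regular expression breaks text into tokens, here is a minimal Python sketch. It is not the Text_Parser implementation: the function name `split_on_delimiter` is hypothetical, Python's `re` module stands in for whatever regex engine the function uses, and dropping empty tokens is an assumption.

```python
import re

def split_on_delimiter(text, delimiter=r"[\t\f\r\n]+"):
    """Split text on runs of characters that match the delimiter regex."""
    # Assumption: empty strings produced by leading or trailing
    # delimiters are discarded rather than emitted as tokens.
    return [tok for tok in re.split(delimiter, text) if tok]

tokens = split_on_delimiter("alpha\tbeta\r\ngamma\n")
print(tokens)
```

Because the trailing `+` matches a whole run of delimiter characters, the `\r\n` pair between `beta` and `gamma` yields one split, not two.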

TotalWordsNum

[Optional] Specifies whether to output a column that contains the total number of words in the input document. Default: 'false'.

Punctuation

[Optional] Specifies a regular expression that represents the punctuation characters to remove from the input text. When the Stemming argument has the value 'true', the recommended value is '[\\\[.,?\!:;~()\\\]]+'.

Default: '[.,!?]'.
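
The effect of the default punctuation pattern can be sketched in Python as follows. This is an illustration only: the helper name `strip_punctuation` is hypothetical, and the sketch assumes that matched characters are deleted outright rather than replaced with a space, which the text above does not specify.

```python
import re

def strip_punctuation(text, punctuation=r"[.,!?]"):
    """Remove every character that matches the punctuation regex."""
    # Assumption: matches are deleted, not replaced with whitespace.
    return re.sub(punctuation, "", text)

cleaned = strip_punctuation("Wait, what?! Yes.")
print(cleaned)
```

Characters outside the pattern, such as colons or parentheses, pass through unchanged with the default value.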

Accumulate

[Optional] Specifies the names of the input columns to copy to the output table. Default: All input columns.

No column specified in Accumulate can have the same name as the output column specified by TokenColumn or TotalColumn.

TokenColumn

[Optional] Specifies the name of the output column that contains the tokens. Default: 'token'.

FrequencyColumn

[Optional] Specifies the name of the output column that contains the frequency of each token. Default: 'frequency'.

The function ignores this argument if the OutputByWord argument has the value 'false'.

TotalColumn

[Optional] Specifies the name of the output column that contains the total number of words in the input document. Default: 'total_count'.

RemoveStopWords

[Optional] Specifies whether to remove stop words from the input text before parsing. Default: 'false'.

PositionColumn

[Optional] Specifies the name of the output column that contains the position of a word within a document. Default: 'position'.

ListPositions

[Optional] Specifies whether to output the position of a word in list form. Default: 'false' (the function outputs a row for each occurrence of the word).

The function ignores this argument if the OutputByWord argument has the value 'false'.

OutputByWord

[Optional] Specifies whether to output each token of each input document in its own row in the output table. Default: 'true'. If you specify 'false', then the function outputs each tokenized input document in one row of the output table.
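
The two per-word output shapes described for ListPositions can be pictured with a short Python sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the tuple layout (document id, token, frequency, position) is not the documented output schema, and positions are taken to be zero-based. The one-row-per-document shape produced by OutputByWord ('false') is not modeled.

```python
from collections import Counter

def rows_by_occurrence(doc_id, tokens):
    """ListPositions 'false': one row for each occurrence of each token."""
    freq = Counter(tokens)
    # Assumption: a position is a zero-based offset into the token list.
    return [(doc_id, tok, freq[tok], pos) for pos, tok in enumerate(tokens)]

def rows_with_position_lists(doc_id, tokens):
    """ListPositions 'true': one row per distinct token, positions as a list."""
    positions = {}
    for pos, tok in enumerate(tokens):
        positions.setdefault(tok, []).append(pos)
    return [(doc_id, tok, len(plist), plist) for tok, plist in positions.items()]

tokens = ["to", "be", "or", "not", "to", "be"]
per_occurrence = rows_by_occurrence(1, tokens)
per_token = rows_with_position_lists(1, tokens)
print(per_occurrence[0])
print(per_token[0])
```

In the first shape a token that occurs twice appears in two rows with the same frequency; in the second it appears once with both positions collected together.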

StemmingExceptions

[Optional] Specifies the location of the file that contains the stemming exceptions. A stemming exception is a word followed by its stemmed form. The word and its stemmed form are separated by white space. Each stemming exception is on its own line in the file. For example:

bias bias 
news news 
goods goods 
lying lie 
ugly ugli 
sky sky 
early earli

The words 'lying', 'ugly', and 'early' are to become 'lie', 'ugli', and 'earli', respectively. The other words are not to change.

Default: No stemming exceptions.
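
The file format described above (one word and its stemmed form per line, separated by white space) can be read with a few lines of Python. This is a sketch for checking such a file before installing it, not part of Text_Parser; the function name `parse_stemming_exceptions` is hypothetical, and skipping malformed lines is an assumption.

```python
def parse_stemming_exceptions(lines):
    """Build a word -> stemmed-form mapping from 'word stem' lines."""
    exceptions = {}
    for line in lines:
        parts = line.split()  # splits on any run of white space
        if len(parts) != 2:
            continue  # Assumption: blank or malformed lines are skipped.
        word, stem = parts
        exceptions[word] = stem
    return exceptions

sample = ["bias bias ", "news news ", "goods goods ", "lying lie ",
          "ugly ugli ", "sky sky ", "early earli"]
exceptions = parse_stemming_exceptions(sample)
print(exceptions["lying"])
```

An entry whose word and stemmed form are identical, such as `news news`, keeps the word unchanged.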

StopWords

[Optional] Specifies the location of the file that contains the stop words (words to ignore when parsing text). Each stop word is on its own line in the file. For example:

a 
an 
the 
and 
this 
with 
but 
will

Default: No stop words.
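
As an illustration of how a stop-word file of this shape is applied, the following Python sketch loads one word per line and filters a token list. The function names are hypothetical, and the exact-match, case-sensitive comparison is an assumption rather than documented Text_Parser behavior.

```python
def load_stop_words(lines):
    """Collect one stop word per line, ignoring surrounding white space."""
    return {line.strip() for line in lines if line.strip()}

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not stop words."""
    # Assumption: comparison is an exact, case-sensitive match.
    return [tok for tok in tokens if tok not in stop_words]

stop_words = load_stop_words(["a ", "an ", "the ", "and ", "this ",
                              "with ", "but ", "will"])
kept = remove_stop_words(["this", "parser", "will", "split", "the", "text"],
                         stop_words)
print(kept)
```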