Optional Syntax Elements for TD_WordEmbeddings

Database Analytic Functions

Deployment

VantageCloud

VantageCore

Edition

Enterprise

IntelliFlex

VMware

Product

Analytics Database

Release Number

17.20

Published

June 2022

Language

English (United States)

Last Update

2024-04-06

Product Category

Teradata Vantage™

SecondaryColumn

Name of the secondary input table column that contains the text. Applies only to the token2token-similarity and doc2doc-similarity operations.

Accumulate

List of input table columns to copy to the output. Not applicable to the token-embedding operation.

Operation

Operation to be performed on the data. Options are:

token-embedding: Emits a vector for each token in the column. Each token in the specified text column is mapped to a vector of real numbers that represents the semantic meaning of that token. For example, the word "dog" might be represented by the vector [0.1, 0.2, 0.3, 0.4, 0.5], where each number represents a different aspect of the meaning of the word.
doc-embedding: Vectorizes each token in the document and combines the token vectors into a single document vector. For example, the document "The dog ran across the street" might be represented by the vector [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], where each number represents a different aspect of the meaning of the document.
token2token-similarity: Computes a numeric similarity score between two tokens based on their word embeddings. The closer the embeddings of the two tokens are in the multi-dimensional space, the higher the score, indicating greater semantic similarity. For example, the similarity between the words "dog" and "cat" would be higher than the similarity between the words "dog" and "table".
doc2doc-similarity: Computes a numeric similarity score between two documents. It compares the embeddings of the two entire documents, which are created as in the doc-embedding operation. The score reflects how similar or related the two documents are in content. For example, the doc2doc-similarity between the documents "The dog ran across the street" and "The cat sat on the mat" would be higher than the doc2doc-similarity between the documents "The dog ran across the street" and "The apple fell from the tree".

Default value: token-embedding
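
The four operations can be pictured outside the database with a minimal Python sketch. It is not the TD_WordEmbeddings implementation: the toy model, the use of the mean to combine token vectors, the cosine similarity measure, and all function names are assumptions made for illustration only.

```python
import math

# Hypothetical toy model: token -> vector. A real model table holds
# pretrained vectors with many more dimensions.
MODEL = {
    "dog":   [0.9, 0.8, 0.1],
    "cat":   [0.8, 0.9, 0.2],
    "table": [0.1, 0.2, 0.9],
    "ran":   [0.5, 0.1, 0.3],
}

def token_embedding(text):
    """Return (token, vector) pairs for every token found in the model."""
    return [(tok, MODEL[tok]) for tok in text.split() if tok in MODEL]

def doc_embedding(text):
    """Combine token vectors into one document vector (assumed here: the mean)."""
    vectors = [vec for _, vec in token_embedding(text)]
    if not vectors:
        return None  # no known tokens, so no embedding can be built
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def token2token_similarity(token_a, token_b):
    """Similarity of two tokens, both assumed to be present in the model."""
    return cosine(MODEL[token_a], MODEL[token_b])

def doc2doc_similarity(doc_a, doc_b):
    """Similarity of two documents via their combined embeddings."""
    return cosine(doc_embedding(doc_a), doc_embedding(doc_b))

if __name__ == "__main__":
    # "dog" should score closer to "cat" than to "table" in this toy model.
    print(token2token_similarity("dog", "cat") > token2token_similarity("dog", "table"))
    print(doc2doc_similarity("dog ran", "cat ran") > doc2doc_similarity("dog ran", "table"))
```

In this sketch a token missing from the model is simply skipped; how the function itself treats unknown tokens is not stated here.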

RemoveStopWords

Specifies whether to remove stop words from the input table text before any operation is performed. English stop words include "the", "and", "in", "of", "to", "is", "it", "on", "at", and so on. Applicable to all operations except token2token-similarity. Default is False.

ConvertToLowerCase

Specifies whether to convert the input table text to lowercase before any operation is performed. Default is True.

StemTokens

Specifies whether to convert each word in the input table text to its root word, such as converting "going" to "go". Default is False.
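
The three preprocessing options can be pictured as steps applied to the text before any operation runs. The Python sketch below is illustrative only: the stop word list is a small sample, the suffix-stripping stemmer is a deliberately naive stand-in (the stemming algorithm the function uses is not stated here), and the `preprocess` helper, its parameter names, and the order of the steps are hypothetical.

```python
# Sample stop words taken from the RemoveStopWords description; not a full list.
STOP_WORDS = {"the", "and", "in", "of", "to", "is", "it", "on", "at"}

def naive_stem(token):
    """Very rough stemmer for illustration: strips a trailing 'ing' or 's'."""
    if token.endswith("ing") and len(token) > 4:
        return token[:-3]
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token

def preprocess(text, remove_stop_words=False,
               convert_to_lower_case=True, stem_tokens=False):
    """Apply the optional steps using the documented defaults."""
    if convert_to_lower_case:
        text = text.lower()
    tokens = text.split()
    if remove_stop_words:
        # Assumption: stop word matching ignores case.
        tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    if stem_tokens:
        tokens = [naive_stem(t) for t in tokens]
    return tokens

if __name__ == "__main__":
    print(preprocess("The Dog is going"))
    print(preprocess("The Dog is going", remove_stop_words=True, stem_tokens=True))
```

With the defaults only lowercasing is applied, which mirrors the documented default values of the three options.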