A word embedding represents a word/token in multidimensional space such that words/tokens with similar meanings have similar embeddings. Each word/token is mapped to a vector of real numbers. The Analytics Database function TD_WordEmbeddings produces an embedding vector for each piece of text and can compute the similarity between texts. The available operations are token-embedding, doc-embedding, token2token-similarity, and doc2doc-similarity.
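Similarity between embedding vectors is commonly measured as cosine similarity. The sketch below illustrates the idea behind the similarity operations; the vectors are made-up toy values, not output from any trained model:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors (closer to 1.0 = more similar)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional embeddings; real pretrained models typically use 50-300 dimensions.
embeddings = {
    "king":  np.array([0.50,  0.68, -0.59, -0.10]),
    "queen": np.array([0.38,  0.79, -0.54, -0.26]),
    "apple": np.array([0.52, -0.83,  0.23,  0.46]),
}

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high: related words
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # low: unrelated words
```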
The ModelTable contains pretrained words/tokens and their corresponding vector mappings in multidimensional space. You can use pre-defined vectors from Word Vectors, or train your own using packages such as GloVe or Word2Vec. Note that the ModelTable expects the vectors in GloVe format (one word/token-vector pair per row). To convert a text-format Word2Vec file, delete its first row, which contains the vocabulary size and the number of dimensions.
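For example, the conversion amounts to dropping the Word2Vec header line. A minimal sketch, assuming a text-format Word2Vec file and placeholder file names:

```python
# Strip the Word2Vec header line so the file matches GloVe format.
# "model.word2vec.txt" and "model.glove.txt" are hypothetical paths; adjust to your own.
with open("model.word2vec.txt", "r", encoding="utf-8") as src, \
     open("model.glove.txt", "w", encoding="utf-8") as dst:
    next(src)  # skip the header row: "<vocabulary_size> <dimensions>"
    for line in src:
        dst.write(line)  # each remaining row is already "token v1 v2 ... vN"
```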
- This function supports CHARACTER SET LATIN.
- This function does not support CHARACTER SET UNICODE.