TFIDF (ML Engine) - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product
Teradata Vantage
Release Number
8.10
1.1
Published
October 2019
Language
English (United States)
Last Update
2019-12-31
dita:id
B700-4003
lifecycle
previous
Product Category
Teradata Vantage™

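TF-IDF stands for "term frequency-inverse document frequency," a technique for evaluating the importance of a specific term in a specific document in a document set. Term frequency (tf) is the number of times that the term appears in the document. Inverse document frequency (idf) measures how rare the term is across the document set: it decreases as the number of documents that contain the term increases, and is commonly computed as the logarithm of the total number of documents divided by the number of documents that contain the term. The TF-IDF score for a term is tf * idf. A term with a high TF-IDF score appears often in the specific document but in few other documents, so it is especially relevant to that document.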
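The definition above can be illustrated with a minimal Python sketch. It assumes a raw-count tf and idf = log(N / df), where N is the number of documents and df is the number of documents that contain the term; the exact normalization that the TFIDF function applies may differ. The function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: dict mapping doc id -> list of terms.

    Returns a dict mapping (doc_id, term) -> TF-IDF score.
    Assumes raw-count tf and idf = log(N / df); not the ML Engine implementation.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))
    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)  # raw count of each term in this document
        for term, count in tf.items():
            idf = math.log(n_docs / df[term])
            scores[(doc_id, term)] = count * idf
    return scores

docs = {
    1: ["teradata", "vantage", "analytics"],
    2: ["vantage", "analytics", "analytics"],
    3: ["analytics", "engine"],
}
scores = tf_idf(docs)
```

In this sketch, "analytics" appears in every document, so its idf is log(1) = 0 and its score is 0 everywhere, while "teradata" appears in only one document and receives the highest score there.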

The TFIDF function can do either of the following:

  • Take any document set and output the inverse document frequency (IDF) and term frequency-inverse document frequency (TF-IDF) scores for each term.
  • Use the output of a previous run of the TFIDF function on a training document set to predict TF-IDF scores of an input (test) document set.

You can use the TF-IDF scores as input for many document clustering and classification algorithms, including:
  • Cosine similarity
  • Latent Dirichlet allocation
  • K-means clustering
  • K-nearest neighbors
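
As an example of the first item in the list, cosine similarity between two documents can be computed directly from their TF-IDF scores. The following is a minimal sketch, assuming each document is held as a Python dict that maps a term to its TF-IDF score; the function name `cosine_similarity` and the sample scores are hypothetical and do not come from any ML Engine function.

```python
import math

def cosine_similarity(a, b):
    """a, b: sparse vectors as dicts mapping term -> TF-IDF score."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(score * b[term] for term, score in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction; treat as dissimilar
    return dot / (norm_a * norm_b)

doc_a = {"teradata": 1.1, "vantage": 0.4}
doc_b = {"vantage": 0.4, "engine": 1.1}
sim = cosine_similarity(doc_a, doc_b)
```

The two sample documents share only the low-scoring term "vantage", so their similarity is small; a document compared with itself yields a similarity of 1.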

You can use the TF-IDF scores derived from a training document set to create a model in a classification function (for example, SVMSparse (ML Engine)). You can then predict TF-IDF scores for a test document set and use those scores as input to a classification prediction function (for example, SVMSparsePredict_MLE (ML Engine)).
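
The training and prediction workflow can be sketched in Python as follows. This is an illustration only: it assumes idf = log(N / df) learned from the training documents and a raw-count tf for the test document, and it skips test terms that never appeared in the training set. How the TFIDF function treats such unseen terms is not stated here, so that choice is an assumption. The names `fit_idf` and `score_with_idf` are hypothetical.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn an IDF value per term from training documents (lists of terms)."""
    n_docs = len(train_docs)
    df = Counter()
    for terms in train_docs:
        df.update(set(terms))  # count each term once per document
    return {term: math.log(n_docs / count) for term, count in df.items()}

def score_with_idf(doc_terms, idf):
    """Score one new document using IDF values learned earlier.

    Terms absent from the training set are skipped (an assumption of this sketch).
    """
    tf = Counter(doc_terms)
    return {term: count * idf[term] for term, count in tf.items() if term in idf}

train = [["spam", "offer"], ["offer", "meeting"], ["meeting", "agenda"], ["offer"]]
idf = fit_idf(train)
test_scores = score_with_idf(["spam", "spam", "unseen"], idf)
```

Reusing the training IDF values keeps the test scores on the same scale as the scores the classifier was trained on, which is the reason for running the function in two stages.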

The TFIDF function represents each document as an N-dimensional vector, where N is the number of distinct terms in the document set. Each entry in the document vector is the TF-IDF score of a term. Because a single document typically contains only a small fraction of the terms in the document set, the document vector is usually very sparse.
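
The sparsity of the document vector can be seen in a small Python sketch. It assumes a hypothetical eight-term vocabulary and a document with scores for two of those terms; the helper `to_dense` is illustrative and is not part of ML Engine.

```python
def to_dense(sparse_scores, vocabulary):
    """Expand a sparse {term: score} mapping into a dense list of length N."""
    index = {term: i for i, term in enumerate(vocabulary)}
    vector = [0.0] * len(vocabulary)  # every term not in the document scores 0
    for term, score in sparse_scores.items():
        vector[index[term]] = score
    return vector

vocabulary = ["agenda", "analytics", "engine", "meeting",
              "offer", "spam", "teradata", "vantage"]
doc = {"teradata": 1.1, "vantage": 0.4}
dense = to_dense(doc, vocabulary)
nonzero = sum(1 for v in dense if v != 0.0)
```

Only two of the eight entries are nonzero here. With a realistic vocabulary of many thousands of terms the proportion is far smaller, which is why storing only the (term, score) pairs is usually preferable to materializing the full vector.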