TFIDF Syntax Elements - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage
Release Number: 8.10, 1.1
Published: October 2019
Language: English (United States)
Last Update: 2019-12-31
Formula
[Optional] Specify the formula for calculating the term frequency (tf) of term t in document d:
Option Description
'normal' (Default) Normalized frequency:

tf(t,d) = f(t,d) / sum {f(w,d) : w ∈ d}

This value is the raw frequency (rf) of t divided by the total number of terms in the document.

'bool' Boolean frequency:

tf(t,d) = 1 if t occurs in d; otherwise, tf(t,d) = 0.

'log' Logarithmically-scaled frequency:

tf(t,d) = log(f(t,d) + 1)

where f(t,d) is the number of times t occurs in d (that is, the raw frequency, rf).

'augment' Augmented frequency, which prevents bias towards longer documents:

tf(t,d) = 0.5 + (0.5 × f(t,d) / max {f(w,d) : w ∈ d})

The ratio in this formula is rf divided by the maximum raw frequency of any term in the document.
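
The four options can be compared on a small document with the following Python sketch. It is an illustration of the formulas above, not the TFIDF function's implementation. The name term_frequency and the sample document are hypothetical, and the 'log' option is assumed here to use the natural logarithm, because the base is not stated.

```python
import math
from collections import Counter


def term_frequency(term, doc_terms, formula="normal"):
    """Return tf(term, doc) for one of the four Formula options.

    Hypothetical helper for illustration only.
    """
    if not doc_terms:
        raise ValueError("document must contain at least one term")
    counts = Counter(doc_terms)
    rf = counts[term]  # raw frequency f(t,d); 0 if the term is absent

    if formula == "normal":
        # rf divided by the total number of terms in the document
        return rf / len(doc_terms)
    if formula == "bool":
        return 1 if rf > 0 else 0
    if formula == "log":
        # Assumption: natural logarithm
        return math.log(rf + 1)
    if formula == "augment":
        # 0.5 plus half of rf over the largest raw frequency in the document
        return 0.5 + 0.5 * rf / max(counts.values())
    raise ValueError(f"unknown formula: {formula!r}")


# Hypothetical four-term document in which "data" occurs twice
doc = ["data", "lake", "data", "warehouse"]

for option in ("normal", "bool", "log", "augment"):
    print(option, term_frequency("data", doc, option))
```

For the term "data" in this document, 'normal' yields 2/4 = 0.5 and 'augment' yields 0.5 + 0.5 × 2/2 = 1.0. The options put tf on different scales, so scores computed with different Formula values are not directly comparable.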

When using the output of a previous run of the TFIDF function on a training document set to predict TFIDF scores on an input document set, use the same Formula value for the input document set that you used for the training document set.