The TF_IDF function always requires the output of the TF function as input. The TF function, in turn, takes the document set as input. The other TF_IDF input tables depend on your reason for running the function:
- If you are running TF_IDF to output the IDF and TF-IDF values for each term in the document set, then TF_IDF also requires the input table doccount and accepts the optional input table docperterm.
- If you are running TF_IDF to predict TF-IDF values, then TF_IDF also requires the input table idf. The idf table is the output of an earlier call to TF_IDF whose inputs were the TF output for the training document set, the doccount table, and, optionally, the docperterm table.
If you omit the docperterm table, the function creates it by processing the entire document set, which can require a large amount of memory. If there is not enough memory to process the entire document set, then the docperterm table is required.
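To make the role of the two auxiliary tables concrete, the following minimal sketch derives a doccount value and docperterm rows from a small document set. The sample rows and the helper names `build_doccount` and `build_docperterm` are hypothetical and are not part of the TF_IDF function. The sketch only illustrates why building docperterm means touching every row of the document set.

```python
from collections import defaultdict

# Hypothetical document-set rows in the TF input layout: (docid, term, count).
document_set = [
    (1, "apple", 2),
    (1, "banana", 1),
    (2, "apple", 1),
    (2, "cherry", 3),
    (3, "banana", 4),
]

def build_doccount(rows):
    """Return the number of distinct documents (the doccount 'count' value)."""
    return len({docid for docid, _term, _count in rows})

def build_docperterm(rows):
    """Return {term: number of distinct documents that contain the term}."""
    docs_by_term = defaultdict(set)
    for docid, term, _count in rows:
        # Every row must be visited, and a set of docids is held per term.
        docs_by_term[term].add(docid)
    return {term: len(docids) for term, docids in docs_by_term.items()}

doccount = build_doccount(document_set)
docperterm = build_docperterm(document_set)
```

Holding one set of document identifiers per term is what makes this step memory-hungry for a large document set, which is consistent with the advice to supply a precomputed docperterm table when memory is limited.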
TF Input Table (Document Set) Schema
Column Name | Data Type | Description
docid | Any | Document identifier.
term | VARCHAR | Term.
count | INTEGER | Number of times that term appears in the document.
TF Output and TF_IDF Input Table Schema
Column Name | Data Type | Description
docid | Any | Document identifier.
term | VARCHAR | Term.
tf | DOUBLE PRECISION | Term frequency.
count | INTEGER | Number of times that term appears in the document.
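As an illustration of how rows in this layout relate to the document set, the sketch below turns (docid, term, count) rows into (docid, term, tf, count) rows. It assumes the common normalized definition, tf = count divided by the total number of term occurrences in the document. That formula and the helper name `term_frequency` are assumptions for illustration, and the TF function may compute the value differently.

```python
from collections import defaultdict

def term_frequency(rows):
    """Turn (docid, term, count) rows into (docid, term, tf, count) rows.

    Assumption: tf = count / total term occurrences in the same document.
    """
    totals = defaultdict(int)
    for docid, _term, count in rows:
        totals[docid] += count
    return [
        (docid, term, count / totals[docid], count)
        for docid, term, count in rows
    ]

# Hypothetical document-set rows: (docid, term, count).
rows = [(1, "apple", 3), (1, "banana", 1), (2, "apple", 2)]
tf_rows = term_frequency(rows)
```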
TF_IDF doccount Table Schema
Column Name | Data Type | Description
count | BIGINT | Number of documents in the document set.
TF_IDF docperterm Table Schema
Column Name | Data Type | Description
term | VARCHAR | Term.
count | BIGINT | Number of documents that contain term.
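The sketch below shows one way the three inputs described above could combine into IDF and TF-IDF values. It assumes the textbook definition idf = ln(doccount / docperterm count) and tf_idf = tf * idf. The formula, the helper name `tf_idf`, the sample data, and the layout of the result rows are all assumptions for illustration, not the documented behavior or output schema of TF_IDF.

```python
import math

def tf_idf(tf_rows, doccount, docperterm):
    """Return (docid, term, tf, idf, tf_idf) rows.

    Assumption: idf = ln(doccount / number of documents containing the term).
    """
    out = []
    for docid, term, tf, _count in tf_rows:
        idf = math.log(doccount / docperterm[term])
        out.append((docid, term, tf, idf, tf * idf))
    return out

# Hypothetical inputs in the layouts of the schemas above.
tf_rows = [(1, "apple", 0.75, 3), (1, "banana", 0.25, 1), (2, "apple", 1.0, 2)]
doccount = 2                               # single value from doccount
docperterm = {"apple": 2, "banana": 1}     # term -> count from docperterm
result = tf_idf(tf_rows, doccount, docperterm)
```

Under this definition a term that appears in every document, such as "apple" here, gets an IDF of zero. In the prediction case, the per-term IDF values would come from the idf table rather than being recomputed from doccount and docperterm.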