TF_IDF Syntax - Aster Analytics

Teradata Aster® Analytics Foundation User Guide, Update 2

Product: Aster Analytics
Release Number: 7.00.02
Published: September 2017
Language: English (United States)
Last Update: 2018-04-17
dita:id: B700-1022
lifecycle: previous
Product Category: Software
TF_IDF version 2.3, TF version 1.1
SELECT * FROM TF_IDF (
  ON TF (
    ON { table | view | (query) } PARTITION BY docid 
      [ Formula ({ 'normal' | 'bool' | 'log' | 'augment' }) ]
  ) AS tf PARTITION BY term 
  [ ON (SELECT COUNT (DISTINCT docid)
    FROM doccount_table) AS doccount DIMENSION ]
  [ ON (SELECT term, COUNT (DISTINCT docid)
    FROM docperterm_table 
    GROUP BY term) AS docperterm PARTITION BY term ]
  [ ON (SELECT DISTINCT (term) AS term, idf
    FROM tf_idf_output_table) AS idf PARTITION BY term ]
);

Large Document Sets

For large document sets, the docperterm_table is required.

For training, the syntax for large document sets is:

SELECT * FROM TF_IDF (
  ON TF (
    ON { table | view | (query) } PARTITION BY docid 
      [ Formula ({ 'normal' | 'bool' | 'log' | 'augment' }) ]
  ) AS tf PARTITION BY term 
  ON (SELECT COUNT (DISTINCT docid) FROM doccount_table 
  ) AS doccount DIMENSION
  ON (SELECT term, COUNT (DISTINCT docid)
    FROM docperterm_table 
    GROUP BY term) AS docperterm PARTITION BY term 
) ORDER BY docid;
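
As an illustration, the training syntax above might be filled in as follows. This is only a sketch: the table name tfidf_train and its layout (one row per docid and term) are hypothetical, and the 'normal' formula is chosen arbitrarily from the four listed options.

```sql
-- Hypothetical input table: tfidf_train (docid, term).
-- The same table is assumed to serve as doccount_table and docperterm_table.
SELECT * FROM TF_IDF (
  ON TF (
    ON tfidf_train PARTITION BY docid
      Formula ('normal')
  ) AS tf PARTITION BY term
  -- Total number of distinct documents, sent to every partition.
  ON (SELECT COUNT (DISTINCT docid) FROM tfidf_train
  ) AS doccount DIMENSION
  -- Number of distinct documents that contain each term.
  ON (SELECT term, COUNT (DISTINCT docid)
    FROM tfidf_train
    GROUP BY term) AS docperterm PARTITION BY term
) ORDER BY docid;
```

Using one table for all three inputs keeps the document count and the per-term counts consistent with the data passed to TF.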

For prediction, the syntax for large document sets is:

SELECT * FROM TF_IDF (
  ON TF (
    ON { table | view | (query) } PARTITION BY docid 
      [ Formula ({ 'normal' | 'bool' | 'log' | 'augment' }) ]
  ) AS tf PARTITION BY term 
  [ ON (SELECT term, COUNT (DISTINCT docid)
    FROM docperterm_table 
    GROUP BY term) AS docperterm PARTITION BY term ]
  [ ON (SELECT DISTINCT (term) AS term, idf
    FROM tf_idf_output_table) AS idf PARTITION BY term ]
) ORDER BY docid;
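
A possible prediction call, again only as a sketch: tfidf_test (new documents) and tfidf_output (assumed to hold the saved output of an earlier training call, including its term and idf columns) are hypothetical names. Only the optional idf input is supplied here, and the optional docperterm input is omitted.

```sql
-- Hypothetical tables:
--   tfidf_test   (docid, term)  new documents to score
--   tfidf_output                saved output of a training call
SELECT * FROM TF_IDF (
  ON TF (
    ON tfidf_test PARTITION BY docid
      Formula ('normal')
  ) AS tf PARTITION BY term
  -- Reuse the idf values computed during training.
  ON (SELECT DISTINCT (term) AS term, idf
    FROM tfidf_output) AS idf PARTITION BY term
) ORDER BY docid;
```

Choosing the same Formula value for training and prediction is an assumption of this sketch, made so that the tf values are computed the same way in both calls.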

Small Document Sets

This syntax is acceptable for small document sets:

SELECT * FROM TF_IDF (
  ON TF (
    ON input_table PARTITION BY docid 
  ) AS tf PARTITION BY term 
  ON (SELECT COUNT (DISTINCT docid) FROM input_table 
  ) AS doccount DIMENSION
) ORDER BY docid;
Where input_table is:
{ table | view | (query) }
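
For example, with a hypothetical table named tfidf_small (docid, term) as input_table, the small-document-set syntax could be written as in the sketch below. The Formula argument is left out, as in the syntax above.

```sql
-- Hypothetical input table: tfidf_small (docid, term).
-- Both references to input_table point at the same table,
-- so the document count matches the data passed to TF.
SELECT * FROM TF_IDF (
  ON TF (
    ON tfidf_small PARTITION BY docid
  ) AS tf PARTITION BY term
  ON (SELECT COUNT (DISTINCT docid) FROM tfidf_small
  ) AS doccount DIMENSION
) ORDER BY docid;
```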