TF_IDF version 2.3, TF version 1.1
SELECT * FROM TF_IDF (
  ON TF (
    ON { table | view | (query) } PARTITION BY docid
    [ Formula ({ 'normal' | 'bool' | 'log' | 'augment' }) ]
  ) AS tf PARTITION BY term
  [ ON (SELECT COUNT (DISTINCT docid) FROM doccount_table)
    AS doccount DIMENSION ]
  [ ON (SELECT term, COUNT (DISTINCT docid) FROM docperterm_table GROUP BY term)
    AS docperterm PARTITION BY term ]
  [ ON (SELECT DISTINCT (term) AS term, idf FROM tf_idf_output_table)
    AS idf PARTITION BY term ]
);
Large Document Sets
For large document sets, the docperterm_table is required.
For training, the syntax for large document sets is:
SELECT * FROM TF_IDF (
  ON TF (
    ON { table | view | (query) } PARTITION BY docid
    [ Formula ({ 'normal' | 'bool' | 'log' | 'augment' }) ]
  ) AS tf PARTITION BY term
  ON (SELECT COUNT (DISTINCT docid) FROM doccount_table)
    AS doccount DIMENSION
  ON (SELECT term, COUNT (DISTINCT docid) FROM docperterm_table GROUP BY term)
    AS docperterm PARTITION BY term
) ORDER BY docid;
For prediction, the syntax for large document sets is:
SELECT * FROM TF_IDF (
  ON TF (
    ON { table | view | (query) } PARTITION BY docid
    [ Formula ({ 'normal' | 'bool' | 'log' | 'augment' }) ]
  ) AS tf PARTITION BY term
  [ ON (SELECT term, COUNT (DISTINCT docid) FROM docperterm_table GROUP BY term)
    AS docperterm PARTITION BY term ]
  [ ON (SELECT DISTINCT (term) AS term, idf FROM tf_idf_output_table)
    AS idf PARTITION BY term ]
) ORDER BY docid;
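To make the doccount and docperterm inputs concrete, the following minimal sketch runs the two subqueries from the syntax above against a tiny in-memory table. It is illustrative only: SQLite stands in for the database, the table name tokens, the sample rows, and the helper idf_inputs are hypothetical, and the TF_IDF function itself is not invoked.

```python
import sqlite3

# Hypothetical tokenized input: one row per (docid, term) occurrence.
ROWS = [
    (1, "apple"), (1, "banana"), (1, "apple"),
    (2, "apple"), (2, "cherry"),
    (3, "banana"),
]


def idf_inputs(rows):
    """Return (doccount, docperterm) as the two subqueries would compute them."""
    conn = sqlite3.connect(":memory:")  # SQLite is a stand-in, not the real database
    conn.execute("CREATE TABLE tokens (docid INTEGER, term TEXT)")
    conn.executemany("INSERT INTO tokens VALUES (?, ?)", rows)

    # Same query shape as the doccount input: total number of distinct documents.
    doccount = conn.execute(
        "SELECT COUNT(DISTINCT docid) FROM tokens"
    ).fetchone()[0]

    # Same query shape as the docperterm input: distinct documents per term.
    docperterm = dict(conn.execute(
        "SELECT term, COUNT(DISTINCT docid) FROM tokens GROUP BY term"
    ))

    conn.close()
    return doccount, docperterm


doccount, docperterm = idf_inputs(ROWS)
print(doccount)
print(docperterm)
```

The doccount query yields a single value and the docperterm query yields one row per term, which matches how the syntax declares them: doccount as a DIMENSION input and docperterm as an input partitioned by term.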
Small Document Sets
This syntax is acceptable for small document sets:
SELECT * FROM TF_IDF (
  ON TF (
    ON input_table PARTITION BY docid
  ) AS tf PARTITION BY term
  ON (SELECT COUNT (DISTINCT docid) FROM input_table)
    AS doccount DIMENSION
) ORDER BY docid;
Where input_table is:
{ table | view | (query) }