SQL Engine Analytic Functions - Analytics Database - Teradata Vantage

Teradata Vantage™ - Analytics Database Release Summary - 17.20 What's New

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Analytics Database, Teradata Vantage
Release Number: 17.20
Published: June 2022
Language: English (United States)
Last Update: 2024-01-30
Product Category: Teradata Vantage
Function Name Description
TD_TextParser The function performs the following operations:
  • Splits the text in the specified column
  • Removes punctuation from the text and converts the text to lowercase
  • Removes stop words from the text and converts the remaining words to their root forms
  • Creates a row for each word in the output table
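
The four operations above can be sketched in plain Python. This is an illustration of the idea only, not the TD_TextParser implementation: the stop-word list, the suffix rules, and the names naive_stem and parse_text are hypothetical, and a real stemmer is considerably more careful.

```python
import string

# Hypothetical stop-word list and suffix rules, chosen only for this illustration.
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in"}
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def parse_text(doc_id, text):
    """Return one (doc_id, token) row per remaining word."""
    table = str.maketrans("", "", string.punctuation)
    rows = []
    for raw in text.split():                     # split the text into words
        word = raw.translate(table).lower()      # drop punctuation, lowercase
        if not word or word in STOP_WORDS:       # drop stop words
            continue
        rows.append((doc_id, naive_stem(word)))  # root form, one row per word
    return rows

rows = parse_text(1, "The cats are chasing the ball.")
```
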
TD_OrdinalEncodingFit and TD_OrdinalEncodingTransform functions The TD_OrdinalEncodingFit function identifies distinct categorical values from the input table or a user-defined list and returns the distinct categorical values along with the ordinal value for each category.

The TD_OrdinalEncodingTransform function maps the categorical value to a specified ordinal value using the TD_OrdinalEncodingFit output.
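
The fit/transform split can be illustrated with a small Python sketch. The function names, the sorted default ordering, and the -1 value for unseen categories are assumptions made for this example and do not describe the Teradata functions' actual behavior.

```python
def ordinal_fit(values, categories=None):
    """Map each distinct category to an ordinal value.

    Without a user-defined list, distinct values are sorted (an assumed ordering).
    """
    if categories is None:
        categories = sorted(set(values))
    return {category: index for index, category in enumerate(categories)}

def ordinal_transform(values, mapping, default=-1):
    """Replace each value with its ordinal; unseen values get `default` (assumed)."""
    return [mapping.get(value, default) for value in values]

sizes = ["small", "large", "medium", "small"]
mapping = ordinal_fit(sizes, categories=["small", "medium", "large"])
encoded = ordinal_transform(sizes + ["huge"], mapping)
```

Keeping the mapping as a separate fit output means the same encoding can be reused on new data.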

TD_NonLinearCombineFit and TD_NonLinearCombineTransform functions The TD_NonLinearCombineFit function returns the target columns and a specified formula that uses a non-linear combination of existing features.

TD_NonLinearCombineTransform generates the values of the new feature using the specified formula from the TD_NonLinearCombineFit function output.

TD_ANOVA Analysis of variance (ANOVA) is a statistical test that analyzes the difference between the means of more than two groups.

The null hypothesis (H0) of ANOVA is that there is no difference among group means. However, if any one of the group means is significantly different from the overall mean, then the null hypothesis is rejected.

You can use one-way ANOVA when you have data on an independent variable with at least three levels and a dependent variable.

For example, assume that your independent variable is insect spray type, and you have data on spray types A, B, C, D, E, and F. You can use one-way ANOVA to determine whether the dependent variable, insect count, differs based on the spray type used.
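
As a rough illustration of the test statistic behind one-way ANOVA, the following Python sketch computes the F statistic from between-group and within-group variation. The insect counts are made-up values for three hypothetical sprays, and the function name is not part of any Teradata API. Converting F to a p-value requires the F distribution and is omitted here.

```python
def one_way_anova_f(groups):
    """Return (F statistic, between-group df, within-group df)."""
    all_values = [x for group in groups for x in group]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    means = [sum(group) / len(group) for group in groups]
    # Variation of the group means around the overall mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Made-up insect counts for three spray types.
spray_a, spray_b, spray_c = [10, 12, 11], [15, 17, 16], [3, 5, 4]
f_stat, df_between, df_within = one_way_anova_f([spray_a, spray_b, spray_c])
```

A large F value relative to the F distribution with these degrees of freedom is evidence against the null hypothesis of equal group means.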

TD_NaiveBayesTextClassifierTrainer The function calculates the conditional probabilities for token-category pairs, the prior probabilities, and the missing token probabilities for all categories. The trainer function trains the model with the probability values, and the predict function uses the values to classify documents into categories.
TD_RegressionEvaluator The function computes metrics to evaluate and compare multiple models and summarizes how close predictions are to their expected values.
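
The release summary does not list the metrics that TD_RegressionEvaluator computes. As an assumption for illustration, the sketch below shows three commonly used measures of how close predictions are to expected values: mean absolute error, mean squared error, and R².

```python
def regression_metrics(actual, predicted):
    """Compute MAE, MSE, and R2 for paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_actual = sum(actual) / n
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - sum(e * e for e in errors) / ss_total
    return {"MAE": mae, "MSE": mse, "R2": r2}

# Made-up values for illustration.
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
```
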
TD_ClassificationEvaluator The function computes the confusion matrix, precision, recall, and F1-score based on the observed labels (true labels) and the predicted labels.

The function works for multi-class scenarios as well. In any case, the primary output table contains class-level metrics, whereas the secondary output table contains metrics that are applicable across classes.
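
The split between class-level metrics and metrics that apply across classes can be sketched as follows. This is a plain-Python illustration with made-up labels. The function name and the choice of accuracy as the cross-class metric are assumptions, not the function's documented output.

```python
from collections import Counter

def evaluate_classification(observed, predicted):
    """Return (confusion counts, per-class metrics, cross-class metrics)."""
    labels = sorted(set(observed) | set(predicted))
    confusion = Counter(zip(observed, predicted))  # (observed, predicted) -> count
    per_class = {}
    for label in labels:
        tp = confusion[(label, label)]
        fp = sum(confusion[(o, label)] for o in labels if o != label)
        fn = sum(confusion[(label, p)] for p in labels if p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(confusion[(l, l)] for l in labels) / len(observed)
    return confusion, per_class, {"accuracy": accuracy}

observed = ["cat", "cat", "dog", "dog", "bird", "bird"]
predicted = ["cat", "dog", "dog", "dog", "bird", "cat"]
confusion, per_class, overall = evaluate_classification(observed, predicted)
```
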

TD_GetFutileColumns The function returns the futile column names if any of the following conditions is met:
  • If all values in the columns are unique
  • If all the values in the columns are the same
  • If the count of distinct values in the columns divided by the count of the total number of rows in the input table is greater than or equal to the threshold value
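
The three conditions can be expressed as a short check per column. The sketch below is an illustration only. The function name, the list-of-dictionaries input, and the 0.95 default threshold are assumptions for this example.

```python
def futile_columns(rows, columns, threshold=0.95):
    """Return the columns that meet any of the three conditions above."""
    total = len(rows)
    futile = []
    for column in columns:
        distinct = len({row[column] for row in rows})
        all_unique = distinct == total
        all_same = distinct == 1
        too_many_distinct = distinct / total >= threshold
        if all_unique or all_same or too_many_distinct:
            futile.append(column)
    return futile

# Made-up rows: "id" is unique per row, "country" is constant, "color" varies.
rows = [
    {"id": 1, "country": "US", "color": "red"},
    {"id": 2, "country": "US", "color": "blue"},
    {"id": 3, "country": "US", "color": "red"},
    {"id": 4, "country": "US", "color": "blue"},
]
result = futile_columns(rows, ["id", "country", "color"])
```
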
TD_KMeans The K-means algorithm groups a set of observations into k clusters in which each observation belongs to the cluster with the nearest mean (the cluster center, or centroid). The algorithm minimizes its objective function, that is, the total squared Euclidean distance of all data points from the center of their cluster, as follows:
  1. Specify or randomly select k initial cluster centroids.
  2. Assign each data point to the cluster that has the closest centroid.
  3. Recalculate the positions of the k centroids.
  4. Repeat steps 2 and 3 until the centroids no longer move.

The algorithm doesn't necessarily find the optimal configuration as it depends significantly on the initial randomly selected cluster centers. You can run the function multiple times to reduce the effect of this limitation.

Also, this function returns the within-cluster-squared-sum, which you can use to determine an optimal number of clusters using the Elbow method.
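
The four steps can be sketched in plain Python as shown below. This is a minimal illustration of the general algorithm, not the TD_KMeans implementation. The caller supplies the initial centroids so that the run is repeatable, and the function name and the max_iter limit are assumptions.

```python
import math

def kmeans(points, initial_centroids, max_iter=100):
    """Run the four steps above; returns (centroids, clusters, wcss)."""
    centroids = [tuple(c) for c in initial_centroids]  # step 1
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster with the closest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Step 3: recompute each centroid as the mean of its members.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]  # keep the centroid of an empty cluster
            for i, members in enumerate(clusters)
        ]
        # Step 4: stop once the centroids no longer move.
        if updated == centroids:
            break
        centroids = updated
    # Within-cluster sum of squared distances for the last assignment.
    wcss = sum(math.dist(p, centroids[i]) ** 2
               for i, members in enumerate(clusters) for p in members)
    return centroids, clusters, wcss

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters, wcss = kmeans(points, [(1.0, 1.0), (9.0, 8.0)])
```

Running the sketch for several values of k and plotting the returned sum of squares against k is the usual way to look for the elbow.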

TD_KMeansPredict The function uses the cluster centroids in the TD_KMeans function output to assign the input data points to the cluster centroids.
TD_Silhouette The Silhouette function refers to a method of interpretation and validation of consistency within clusters of data. The function determines how appropriately data is clustered and determines the separation distance between the resulting clusters.

The silhouette value determines the similarity of an object to its cluster (cohesion) compared to other clusters (separation). The silhouette plot displays a measure of how close each point in one cluster is to the points in the neighboring clusters and thus provides a way to assess parameters like the number of clusters.
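
For a single point, the silhouette value is commonly defined as (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster (cohesion) and b is the mean distance to the points of the nearest other cluster (separation). The sketch below applies that standard definition to made-up one-dimensional data. It assumes at least two clusters and is not the TD_Silhouette implementation.

```python
import math

def mean_distance(point, others):
    return sum(math.dist(point, o) for o in others) / len(others)

def silhouette_values(points, labels):
    """Return one silhouette value per point (requires at least two clusters)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    values = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            values.append(0.0)  # common convention for single-point clusters
            continue
        # Cohesion: the distance to the point itself is 0, so divide by len - 1.
        a = sum(math.dist(point, o) for o in own) / (len(own) - 1)
        # Separation: mean distance to the nearest other cluster.
        b = min(mean_distance(point, members)
                for other, members in clusters.items() if other != label)
        values.append((b - a) / max(a, b))
    return values

points = [(0.0,), (1.0,), (10.0,), (11.0,)]
values = silhouette_values(points, [0, 0, 1, 1])
```

Values close to 1 indicate points that sit well inside their cluster; values near 0 or below suggest overlapping or misassigned clusters.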

TD_SentimentExtractor The function uses a dictionary model to extract the sentiment (positive, negative, or neutral) of each input document or sentence.
TD_ROC The Receiver Operating Characteristic (ROC) function accepts a set of prediction-actual pairs for a binary classification model and calculates the following values for a range of discrimination thresholds:
  • True-positive rate (TPR)
  • False-positive rate (FPR)
  • The area under the ROC curve (AUC)
  • Gini coefficient

A receiver operating characteristic (ROC) curve shows the performance of a binary classification model as its discrimination threshold varies. For a range of thresholds, the curve plots the true-positive rate against the false-positive rate.
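
The listed values can be computed from prediction-actual pairs as sketched below. The data is made up, the function names are hypothetical, the AUC uses the trapezoidal rule, and the Gini coefficient uses the standard relationship Gini = 2 × AUC - 1. Whether TD_ROC uses the same threshold grid is not stated in this summary.

```python
def roc_points(actual, scores, thresholds):
    """Return sorted (FPR, TPR) pairs, one per threshold; actual labels are 0 or 1."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for a, s in zip(actual, scores) if s >= t and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= t and a == 0)
        points.append((fp / negatives, tp / positives))
    return sorted(points)

def auc_trapezoid(points):
    """Area under the curve through the sorted (FPR, TPR) points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

actual = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
# One threshold per distinct score, plus one above every score for the (0, 0) point.
thresholds = sorted(set(scores)) + [float("inf")]
points = roc_points(actual, scores, thresholds)
auc = auc_trapezoid(points)
gini = 2 * auc - 1
```
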

TD_VectorDistance The function accepts a table of target vectors and a table of reference vectors and returns a table that contains the distance between target-reference pairs.
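
The target-reference pairing can be illustrated as follows. The summary does not say which distance measures are supported, so Euclidean and Manhattan distance are used here as assumed examples, and the dictionary-based input is a simplification of the two tables.

```python
import math

def euclidean(a, b):
    return math.dist(a, b)

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def pairwise_distances(targets, references, metric):
    """Return one (target_id, reference_id, distance) row per pair."""
    return [(t_id, r_id, metric(t, r))
            for t_id, t in targets.items()
            for r_id, r in references.items()]

targets = {1: (0.0, 0.0), 2: (3.0, 4.0)}
references = {"a": (3.0, 4.0)}
euclidean_rows = pairwise_distances(targets, references, euclidean)
manhattan_rows = pairwise_distances(targets, references, manhattan)
```
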
TD_RandomProjectionMinComponents The function calculates the minimum number of components required for applying random projection to the given dataset for the specified epsilon (distortion) parameter value.

The function estimates the minimum value of the NumComponents argument in the TD_RandomProjectionFit function for a given dataset. The function uses the Johnson-Lindenstrauss lemma to calculate the value.
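
One commonly cited form of the Johnson-Lindenstrauss bound is k ≥ 4 ln(n) / (ε²/2 - ε³/3), where n is the number of rows and ε is the distortion. The sketch below evaluates that bound. Whether TD_RandomProjectionMinComponents uses exactly this form and rounding is an assumption, and the function name is hypothetical.

```python
import math

def jl_min_components(n_samples, eps):
    """Smallest integer k satisfying k >= 4 ln(n) / (eps^2 / 2 - eps^3 / 3)."""
    if not 0 < eps < 1:
        raise ValueError("eps must be between 0 and 1 (exclusive)")
    denominator = eps ** 2 / 2 - eps ** 3 / 3
    return math.ceil(4 * math.log(n_samples) / denominator)

k_tight = jl_min_components(1000, 0.1)   # small distortion, many components
k_loose = jl_min_components(1000, 0.5)   # larger distortion, fewer components
```

The bound depends on the number of rows and the distortion only, not on the original number of columns.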

TD_RandomProjectionFit The function returns a random projection matrix based on the specified arguments.

The function returns the required parameters for transforming the input data into lower-dimensional data.

The TD_RandomProjectionTransform function uses the TD_RandomProjectionFit output to reduce the dimensionality of the input data.

TD_RandomProjectionTransform The function converts the high-dimensional input data to a lower-dimensional space using the TD_RandomProjectionFit function output.
TD_ColumnTransformer The function transforms the input table columns in a single operation. You only need to provide the FIT tables to the function, and the function runs all transformations that you require in a single operation.
TD_GLM The function is a generalized linear model (GLM) that performs regression and classification analysis on data sets, where the response follows an exponential family distribution. The function supports the following models:
  • Regression (Gaussian family). The loss function is squared error.
  • Binary Classification (Binomial family). The loss function is logistic and implements logistic regression. The response values must be 0 or 1.
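
The two loss functions named above have the following standard per-observation forms. This sketch only evaluates the losses for a given prediction. It does not show how TD_GLM fits the model, and the function names are hypothetical.

```python
import math

def squared_error(y, prediction):
    """Gaussian family: squared difference between response and prediction."""
    return (y - prediction) ** 2

def logistic_loss(y, linear_score):
    """Binomial family: negative log-likelihood for a 0/1 response."""
    probability = 1 / (1 + math.exp(-linear_score))  # sigmoid of the linear predictor
    return -(y * math.log(probability) + (1 - y) * math.log(1 - probability))

gaussian_loss = squared_error(3.0, 2.5)
binomial_loss = logistic_loss(1, 0.0)   # a score of 0 gives probability 0.5
```
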
TD_GLMPredict The function predicts target values (regression) and class labels (classification) for test data using a GLM model trained by the TD_GLM function.
TD_DecisionForest The function is an ensemble algorithm used for classification and regression predictive modeling problems. It is an extension of bootstrap aggregation (bagging) of decision trees.