Dimensional Input Use Case: Machine Learning - Aster Analytics

Teradata Aster® Analytics Foundation User Guide, Update 2
Aster Analytics
September 2017
English (United States)

Machine learning is a common use case for a SQL-MapReduce function with multiple inputs. In machine learning, you create or choose a model that takes a data set and predicts an outcome. The model is composed of mathematical and statistical algorithms created through observations of patterns found within a given data set. You typically test the model to determine its accuracy, fine-tuning it until its predictions fall within the desired margin of error.

For example, suppose that you have 10 million emails to "bucket" by subject. You use a function to generate an algorithm that parses the emails and places each one in the appropriate bucket. The function might create the subject buckets by using statistical analysis to determine where clusters of data appear, and then put the emails into the buckets based on frequency of occurrence of certain words, word proximity, or grammatical analysis.
Functions that can generate predictive models include Naive Bayes, k-nearest neighbor, decision trees, and logistic regression. The model generated by such a function is usually in JSON format. The model can be stored in a file system, but it is more commonly stored in a database.

To test the accuracy of your model, have a human classify a subset of the emails—for example, 1,000 emails—using the desired criteria. (The subset is called the sample data set and the human-generated result is called the known outcome.) You apply the model to the sample data set and compare the results to the known outcome. Now you know the reliability of the model (its margin of error) and can fine-tune it. After fine-tuning the model, you can use it to analyze new sets of emails, with a known margin of error.
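If both the model's predictions for the sample data set and the known outcome are stored in database tables, the comparison itself can be done with a simple query. The following sketch assumes two hypothetical tables, predictions (email_id, predicted_bucket) and known_outcome (email_id, actual_bucket); the table and column names are illustrative, not part of any Aster schema:

```sql
-- Hypothetical tables:
--   predictions(email_id, predicted_bucket)   -- model output on the sample set
--   known_outcome(email_id, actual_bucket)    -- human-classified result
SELECT CAST(SUM(CASE WHEN p.predicted_bucket = k.actual_bucket
                     THEN 1
                     ELSE 0
                END) AS FLOAT) / COUNT(*) AS accuracy
FROM predictions p
JOIN known_outcome k
  ON p.email_id = k.email_id;
```

The result is the fraction of the 1,000 sample emails the model classified correctly; one minus this value is the model's observed margin of error on the sample.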

For machine learning, the inputs to the SQL-MapReduce function are a predictive model (usually in JSON format) and one or more data sets to be analyzed using the model. The model must be applied to each row of input from the new data set; therefore, it is a dimensional input. The other inputs are partitioned inputs.
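In Aster SQL-MapReduce syntax, a dimensional input is declared with the DIMENSION clause on its ON specification, while partitioned inputs use PARTITION BY. A minimal sketch of such an invocation follows; the function name predict_bucket and the table names emails and email_model are hypothetical placeholders, not actual Aster Analytics functions:

```sql
-- Hypothetical scoring function with two inputs:
--   emails      : partitioned input (the data set to classify)
--   email_model : dimensional input (the predictive model, applied to every row)
SELECT *
FROM predict_bucket (
    ON emails PARTITION BY email_id   -- rows distributed across workers
    ON email_model DIMENSION          -- full model copied to every worker
);
```

Because the DIMENSION input is replicated to each worker in its entirety, every partition of the email data can be scored against the complete model.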