Predictive Model Markup Language (PMML) is an XML standard being developed by the Data Mining Group, a vendor-led consortium established in 1998 to develop data-mining standards. Teradata (at that time NCR) co-developed the initial PMML specification along with Angoss, Magnify, SPSS and The National Center for Data Mining at the University of Illinois at Chicago.
PMML enables the definition and subsequent sharing of predictive models between applications. It represents and describes data mining and statistical models, as well as some of the operations required for cleaning and transforming data prior to modeling. PMML aims to provide enough infrastructure for an application to be able to produce a model (the PMML producer) and another application to consume it (the PMML consumer) simply by reading the PMML data file. This means that a model developed in a desktop data-mining tool can be deployed or scored against an entire data warehouse.
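The producer and consumer roles described above can be illustrated with a minimal sketch. It assumes a simple linear regression model and the PMML 4.4 namespace; the function names `produce_pmml` and `consume_pmml` and the field name `target` are hypothetical, chosen only for illustration, and are not part of any Teradata product.

```python
import xml.etree.ElementTree as ET

# Assumption: the PMML 4.4 namespace; other schema versions use a different URI.
NS = "http://www.dmg.org/PMML-4_4"


def _tag(name):
    """Qualify an element name with the PMML namespace."""
    return f"{{{NS}}}{name}"


def produce_pmml(intercept, coefficients):
    """Producer: serialize a linear regression model as a PMML string."""
    ET.register_namespace("", NS)
    root = ET.Element(_tag("PMML"), {"version": "4.4"})
    ET.SubElement(root, _tag("Header"), {"description": "illustrative model"})

    # Data dictionary: every field the model knows about, with its type.
    dictionary = ET.SubElement(root, _tag("DataDictionary"))
    for name in list(coefficients) + ["target"]:
        ET.SubElement(dictionary, _tag("DataField"),
                      {"name": name, "optype": "continuous", "dataType": "double"})

    model = ET.SubElement(root, _tag("RegressionModel"),
                          {"functionName": "regression"})

    # Mining schema: how this particular model uses each field.
    # "predicted" follows the wording of this document; later schema
    # versions also offer "target" for the same purpose.
    schema = ET.SubElement(model, _tag("MiningSchema"))
    for name in coefficients:
        ET.SubElement(schema, _tag("MiningField"),
                      {"name": name, "usageType": "active"})
    ET.SubElement(schema, _tag("MiningField"),
                  {"name": "target", "usageType": "predicted"})

    # Model parameters: intercept plus one coefficient per input.
    table = ET.SubElement(model, _tag("RegressionTable"),
                          {"intercept": str(intercept)})
    for name, coefficient in coefficients.items():
        ET.SubElement(table, _tag("NumericPredictor"),
                      {"name": name, "coefficient": str(coefficient)})
    return ET.tostring(root, encoding="unicode")


def consume_pmml(pmml_text, row):
    """Consumer: score one row using only what the PMML document contains."""
    root = ET.fromstring(pmml_text)
    table = root.find(f".//{_tag('RegressionTable')}")
    score = float(table.get("intercept"))
    for predictor in table.findall(_tag("NumericPredictor")):
        score += float(predictor.get("coefficient")) * float(row[predictor.get("name")])
    return score


if __name__ == "__main__":
    document = produce_pmml(1.5, {"age": 2.0, "income": 0.5})
    print(consume_pmml(document, {"age": 10, "income": 4}))  # 1.5 + 20 + 2 = 23.5
```

The consumer never sees the producer's code, only the document. In a warehouse deployment the same arithmetic would typically be translated into SQL rather than evaluated row by row in Python.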
The following table lists the major constructs of PMML-compliant XML documents.
|Construct|Description|
|---|---|
|Data Dictionary|Defines the data presented to the model and specifies each data attribute's type and value range.|
|Mining Schema|Defines attribute information specific to a certain model. It specifies an attribute's usage type: active or independent (an input of the model), predicted or dependent (an output of the model), or supplementary (descriptive information that is ignored by the model).|
|Transformation Dictionary|Contains simple algorithm-specific data transformations such as normalization (map values to numbers), discretization (map continuous values to discrete values), value mapping (map discrete values to discrete values) and aggregation (simple averages and counts).|
|Models|Identifies model parameters for regression models, cluster models, decision tree models, neural networks, Bayesian models, association rules and sequence models.|
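To make the constructs in the table concrete, the following sketch embeds a small hand-written PMML document and reads each construct back. The document, its field names (`income`, `region`, `spend`, `income_norm`) and the `summarize` helper are all illustrative assumptions, as is the choice of the PMML 4.4 namespace; a document produced by a real tool would be considerably larger.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, written by hand for illustration only.
PMML_DOC = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header description="illustrative only"/>
  <DataDictionary numberOfFields="3">
    <DataField name="income" optype="continuous" dataType="double"/>
    <DataField name="region" optype="categorical" dataType="string"/>
    <DataField name="spend" optype="continuous" dataType="double"/>
  </DataDictionary>
  <TransformationDictionary>
    <DerivedField name="income_norm" optype="continuous" dataType="double">
      <NormContinuous field="income">
        <LinearNorm orig="0" norm="0"/>
        <LinearNorm orig="100000" norm="1"/>
      </NormContinuous>
    </DerivedField>
  </TransformationDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="income" usageType="active"/>
      <MiningField name="region" usageType="supplementary"/>
      <MiningField name="spend" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="10.0">
      <NumericPredictor name="income" coefficient="0.001"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

# Prefix map used in the path expressions below (assumed schema version).
NS = {"p": "http://www.dmg.org/PMML-4_4"}


def summarize(pmml_text):
    """Return (data fields, derived fields, field usage) from a PMML string."""
    root = ET.fromstring(pmml_text)
    # Data Dictionary: field name -> data type.
    fields = {f.get("name"): f.get("dataType")
              for f in root.findall("p:DataDictionary/p:DataField", NS)}
    # Transformation Dictionary: names of derived (transformed) fields.
    derived = [d.get("name")
               for d in root.findall("p:TransformationDictionary/p:DerivedField", NS)]
    # Mining Schema: field name -> usage type (defaults to "active").
    usage = {m.get("name"): m.get("usageType", "active")
             for m in root.findall(".//p:MiningSchema/p:MiningField", NS)}
    return fields, derived, usage


if __name__ == "__main__":
    fields, derived, usage = summarize(PMML_DOC)
    print(fields)
    print(derived)
    print(usage)
```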
Each PMML construct supports a mechanism for extending the content of a model. Liberal use of such “extensions” requires that vendors who produce PMML-based models collaborate closely with vendors who wish to consume that PMML. Refer to the Teradata Warehouse Miner Release Definition document, B035-2494, for details about the products and product versions supported for PMML consumption in Teradata ADS Generator and Teradata Warehouse Miner.
Although PMML is a great step forward, it has limitations beyond its reliance on extensions, most notably that it does not fully encapsulate the process of cleaning, transforming and aggregating data. Teradata recognized this limitation early on: if the PMML document could not represent the analytic variables that were input to the analytic tools, it would be nearly impossible to consume PMML for scoring predictive models. This is because deploying (scoring) a predictive model requires the same variables upon which the model was built. For this reason, the PMML Scoring analysis is included in both Teradata ADS Generator and Teradata Warehouse Miner.
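The dependency on the model's original variables can be checked before scoring. The sketch below is an assumption about how such a check might look, not a description of how the PMML Scoring analysis works: the helper `missing_inputs`, the sample document and the column names are hypothetical, and the PMML 4.4 namespace is assumed.

```python
import xml.etree.ElementTree as ET

# Prefix map for path expressions (assumed schema version).
NS = {"p": "http://www.dmg.org/PMML-4_4"}

# Hypothetical model fragment with two inputs and one output.
SAMPLE = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="avg_balance" usageType="active"/>
      <MiningField name="txn_count" usageType="active"/>
      <MiningField name="churn_score" usageType="predicted"/>
    </MiningSchema>
  </RegressionModel>
</PMML>"""


def missing_inputs(pmml_text, available_columns):
    """Return the model's active fields that the scoring table lacks."""
    root = ET.fromstring(pmml_text)
    available = set(available_columns)
    required = [
        field.get("name")
        for field in root.findall(".//p:MiningSchema/p:MiningField", NS)
        # A MiningField without a usageType is treated as an input.
        if field.get("usageType", "active") == "active"
    ]
    return [name for name in required if name not in available]


if __name__ == "__main__":
    # The scoring table has the raw customer key but not the aggregated
    # transaction count the model was built on.
    print(missing_inputs(SAMPLE, ["customer_id", "avg_balance"]))
```

A non-empty result means the scoring table must first be extended with the missing analytic variables, built the same way they were built for training.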