Use case: You want to score a dataset with an existing model file trained externally from Vantage.
Scoring is among the most common analytics operations, and it is a natural fit for the scale-out processing of VantageCloud Lake.
- In a generic scenario, a model is trained on a training dataset. The model can be derived by applying any sequence of case-specific analytical steps and ML/AI techniques to the problem at hand, and it may be saved in one of the popular model storage formats, such as a Python binary pickle file, Predictive Model Markup Language (PMML), or Open Neural Network Exchange (ONNX).
- In production environments, a model might be re-derived at periodic intervals if it is retrained on new data.
To illustrate this use case, consider a scenario where a financial institution wishes to predict the propensity of its customers to apply for a new credit card. The probability is predicted with a random forest model that has been previously pickled and saved in a binary file.
More specifically, to predict the customer propensity on the basis of customer data and past transactions, assume the following:
- The input test data for scoring are in a table "dataSco" that resides in the Analytics Database on the Primary Cluster.
- The pickled model to use for scoring is initially in the file "model.out" stored on the client.
For the present example, we assume the model has been created with the Python library scikit-learn v.1.1.3. The same or a compatible version of scikit-learn must be installed in the target user environment for the specific model to be used.
- The scoring algorithm is in a Python script "scoring.py" stored on the client.
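A scoring script of this kind typically unpickles the model once and then scores rows read from standard input, emitting one prediction per row. The sketch below is a minimal, model-agnostic version; the delimiter, the column layout (a row identifier followed by numeric features), and the use of the positive-class probability are assumptions for illustration, and the real "scoring.py" would match the actual table schema.

```python
import pickle
import sys


def score_rows(model, lines, delimiter=","):
    """Score delimited input rows with a model exposing predict_proba().

    Yields "row_id,probability" strings; assumes the first field of each
    row is an identifier and the remaining fields are numeric features.
    """
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        row_id, features = fields[0], [float(v) for v in fields[1:]]
        # Probability of the positive class (index 1), as in a binary
        # propensity model.
        prob = model.predict_proba([features])[0][1]
        yield f"{row_id}{delimiter}{prob:.6f}"


def main():
    # Load the pickled model shipped alongside the script, then stream
    # the input rows from stdin and print one scored row per input row.
    with open("model.out", "rb") as f:
        model = pickle.load(f)
    for out in score_rows(model, sys.stdin):
        sys.stdout.write(out + "\n")
```

In the actual script, main() would be invoked under an `if __name__ == "__main__":` guard so that the file runs as a standalone program inside the database-side Python environment.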
Prerequisite steps:
- Connect from a client to a target VantageCloud Lake system where the scoring task will be performed.
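The connection is typically established with teradataml's create_context() function; the host name and credentials below are placeholders to be replaced with your own system's values.

```python
from teradataml import create_context

# Placeholder connection parameters -- substitute the host name and
# credentials of your target VantageCloud Lake system.
con = create_context(host="<host>", username="<username>", password="<password>")
```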
- Import the modules needed on the client for the current use case.
from teradataml import create_context, DataFrame
from teradatasqlalchemy.types import VARCHAR
- For convenience and to avoid repetition, store in a variable the path where the script and model files are kept on the client.
path_to_files = '/Users/JaneDoe/OpenAFexamples/scripts/'
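The variable can then be combined with the individual file names, for example with os.path.join; the path below is a stand-in for the path_to_files value defined above.

```python
import os

# Stand-in for the client-side directory holding the files.
path_to_files = '/Users/JaneDoe/scripts/'

# Full client-side paths to the scoring script and the model file.
script_file = os.path.join(path_to_files, 'scoring.py')
model_file = os.path.join(path_to_files, 'model.out')
```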
- Create a teradataml DataFrame from the test data table.
scoringData = DataFrame.from_table("dataSco")