| |
Methods defined here:
- __init__(self, train=None, test=None, k=None, response_column=None, id_column=None, distance_features=None, voting_weight=0.0, customized_distance=None, force_mapreduce=False, parblock_size=None, partition_key=None, accumulate=None, output_prob=False, train_sequence_column=None, test_sequence_column=None, test_block_size=None, output_responses=None)
- DESCRIPTION:
The KNN function uses training data objects to map test data objects
to categories. The function is optimized for both small and large
training sets. The function supports user-defined distance metrics
and distance-weighted voting.
PARAMETERS:
train:
Required Argument.
Specifies the name of the teradataml DataFrame that contains the
training data. Each row represents a classified data object.
test:
Required Argument.
Specifies the name of the teradataml DataFrame that contains the test
data to be classified by the KNN algorithm. Each row represents a
test data object.
k:
Required Argument.
Specifies the number of nearest neighbors to use for classifying the
test data.
Types: int
response_column:
Required Argument.
Specifies the name of the training teradataml DataFrame column that
contains the class label or classification of the classified data
objects.
Types: str
id_column:
Required Argument.
Specifies the name of the testing teradataml DataFrame column that
uniquely identifies a data object.
Types: str
distance_features:
Required Argument.
Specifies the names of the training teradataml DataFrame columns that
the function uses to compute the distance between a test object and
the training objects. The test teradataml DataFrame must also have
these columns.
Types: str OR list of Strings (str)
voting_weight:
Optional Argument.
Specifies the voting weight of the distance between a test object and
the training objects. The voting_weight must be a nonnegative
integer. The function calculates distance-weighted voting, w, with this
equation:
w = 1/POWER(distance, voting_weight)
where distance is the distance between the test object and the training object.
Default Value: 0.0
Types: float
customized_distance:
Optional Argument.
This argument is currently not supported.
force_mapreduce:
Optional Argument.
Specifies whether to partition the training data. The default, False,
causes the KNN function to load all training data into memory and use
only the row function. If you specify True, the KNN function
partitions the training data and uses the map-and-reduce function.
Default Value: False
Types: bool
parblock_size:
Optional Argument.
Specifies the partition block size to use when force_mapreduce is
True. The recommended value depends on the training data size and
the number of vworkers.
For example, if your training data size is 10 billion and you have 10 vworkers,
the recommended parblock_size is 1/n billion, where n is an integer that
corresponds to your vworker node's memory. Omitting this argument or
specifying an inappropriate parblock_size can degrade
performance.
Types: int
partition_key:
Optional Argument.
Specifies the name of the training teradataml DataFrame column used to
partition the data in the parallel model. The default value is the first
column of distance_features.
Note: "partition_key" argument support is only available when teradataml
is connected to Vantage 1.0 Maintenance Update 2 version or later.
Types: str
accumulate:
Optional Argument.
Specifies the names of test teradataml DataFrame columns to copy to
the output teradataml DataFrame.
Note: "accumulate" argument support is only available when teradataml
is connected to Vantage 1.1 or later.
Types: str OR list of Strings (str)
output_prob:
Optional Argument.
Specifies whether to display output probability for the predicted
category.
Note: "output_prob" argument support is only available when teradataml
is connected to Vantage 1.1 or later.
Default Value: False
Types: bool
train_sequence_column:
Optional Argument, Required if 'partition_key' is specified.
Specifies the list of column(s) that uniquely identifies each row of
the input argument "train". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: str OR list of Strings (str)
test_sequence_column:
Optional Argument, Required if 'partition_key' is specified.
Specifies the list of column(s) that uniquely identifies each row of
the input argument "test". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: str OR list of Strings (str)
test_block_size:
Optional when "force_mapreduce" is 'True', disallowed otherwise.
Specifies the partition block size of the testing data to use when
"force_mapreduce" is set to 'True'. If you omit this argument, the
function estimates the value automatically. Specifying an inappropriate
"test_block_size" can degrade performance.
Note:
"test_block_size" argument support is only available when teradataml is connected to Vantage 1.3.
Types: int
output_responses:
Optional when "output_prob" is 'True', disallowed otherwise.
Specifies the responses for which to output probabilities. If you set "output_prob" to 'True' and omit
"output_responses", the function adds the column prob to the output teradataml DataFrame.
If you set "output_prob" to 'True' and specify "output_responses", the function adds the specified
response columns to the output teradataml DataFrame.
Note:
"output_responses" argument support is only available when teradataml is connected to Vantage 1.3.
Types: str OR list of Strings (str)
RETURNS:
Instance of KNN.
Output teradataml DataFrames can be accessed using attribute
references, such as KNNObj.<attribute_name>.
Output teradataml DataFrame attribute names are:
1. output_table
2. output
RAISES:
TeradataMlException
EXAMPLES:
# Load the data to run the example
load_example_data("knn", ["computers_train1_clustered","computers_test1"])
# Create teradataml DataFrame objects.
# The "computers_train1_clustered" and "computers_test1" remote tables
# contain five attributes of personal computers: price, speed, hard disk
# size, RAM, and screen size.
computers_train1_clustered = DataFrame.from_table("computers_train1_clustered")
computers_test1 = DataFrame.from_table("computers_test1")
# Example 1 - Map the test computer data to their respective categories
knn_out = KNN(train = computers_train1_clustered,
test = computers_test1,
k = 50,
response_column = "computer_category",
id_column = "id",
distance_features = ["price","speed","hd","ram","screen"],
voting_weight = 1.0
)
# Print the result DataFrame
print(knn_out)
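The distance-weighted voting described under "voting_weight" can be illustrated outside Vantage with a small plain-Python sketch. This is not the KNN function's implementation: the helper name knn_vote, the Euclidean distance, the clamping of zero distances, and the definition of probability as the winning class's share of the total vote weight are all assumptions made for illustration. Only the weight formula w = 1/POWER(distance, voting_weight) comes from the description above.

```python
from collections import defaultdict
from math import sqrt


def knn_vote(train_rows, test_point, k, voting_weight=0.0):
    """Classify one test point; return (predicted_label, probability).

    train_rows is a list of (feature_tuple, label) pairs.
    Hypothetical helper: Euclidean distance is an assumption here.
    """
    # Distance from the test point to every training row.
    scored = []
    for features, label in train_rows:
        dist = sqrt(sum((a - b) ** 2 for a, b in zip(features, test_point)))
        scored.append((dist, label))

    # Keep the k nearest neighbors.
    nearest = sorted(scored, key=lambda pair: pair[0])[:k]

    # Each neighbor votes with weight w = 1 / POWER(distance, voting_weight).
    votes = defaultdict(float)
    for dist, label in nearest:
        if voting_weight == 0:
            weight = 1.0  # unweighted majority vote
        else:
            # Assumption: clamp tiny distances to avoid division by zero.
            weight = 1.0 / max(dist, 1e-12) ** voting_weight
        votes[label] += weight

    best = max(votes, key=votes.get)
    # Assumption: probability is the winner's share of the total weight.
    return best, votes[best] / sum(votes.values())


# One close "premium" neighbor and two farther "budget" neighbors.
train = [
    ((1.0,), "premium"),
    ((4.0,), "budget"),
    ((-4.0,), "budget"),
    ((10.0,), "budget"),
]

unweighted = knn_vote(train, (0.0,), k=3, voting_weight=0.0)
weighted = knn_vote(train, (0.0,), k=3, voting_weight=1.0)
print(unweighted[0], weighted[0])
```

With voting_weight = 0.0 every neighbor counts equally, so the two "budget" neighbors outvote the single "premium" one. With voting_weight = 1.0 the closer "premium" neighbor carries weight 1.0 against 0.25 + 0.25 for "budget", so the prediction flips.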
- __repr__(self)
- Returns the string representation for a KNN class instance.
- get_build_time(self)
- Returns the build time of the algorithm in seconds.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- get_prediction_type(self)
- Returns the prediction type of the algorithm.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- get_target_column(self)
- Returns the target column of the algorithm.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- show_query(self)
- Returns the underlying SQL query.
When the model object is created using retrieve_model(), None is returned.
|