Methods defined here:
- __init__(self, data=None, topic_num=None, docid_column=None, word_column=None, alpha=0.1, eta=0.1, count_column=None, maxiter=50, convergence_delta=0.0001, seed=None, out_topicnum='all', out_topicwordnum='none', initmodeltaskcount=None, data_sequence_column=None)
- DESCRIPTION:
The LDA function uses training data and parameters to build a
topic model, using an unsupervised method to estimate the correlation
between the topics and words according to the topic number and other
parameters. Optionally, the function generates the topic distributions
for each training document. The function uses an iterative algorithm;
therefore, applying it to large data sets with a large number of
topics can be time-consuming.
PARAMETERS:
data:
Required Argument.
Specifies the teradataml DataFrame that contains the training
documents.
topic_num:
Required Argument.
Specifies the number of topics for all the documents in the
teradataml DataFrame 'data', an int value in the range [2, 1000].
Types: int
docid_column:
Required Argument.
Specifies the name of the input column that contains the document
identifiers.
Types: str OR list of Strings (str)
word_column:
Required Argument.
Specifies the name of the input column that contains the words (one
word in each row).
Types: str OR list of Strings (str)
alpha:
Optional Argument.
Specifies a hyperparameter of the model, the smoothing parameter
of the prior for the topic distribution over documents. As alpha
decreases, fewer topics are associated with each document.
Default Value: 0.1
Types: float
eta:
Optional Argument.
Specifies a hyperparameter of the model, the smoothing parameter
of the prior for the word distribution over topics. As eta
decreases, fewer words are associated with each topic.
Default Value: 0.1
Types: float
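The effect of alpha and eta described above can be illustrated with a minimal sketch in plain Python. It assumes that both hyperparameters act as the concentration of a symmetric Dirichlet prior, as in standard LDA; the source does not confirm how this function applies them internally. The helper names sample_dirichlet and mean_largest_weight are hypothetical and are not part of teradataml.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet prior (illustration only)."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    if total == 0.0:
        # Guard against numeric underflow; fall back to a uniform distribution.
        return [1.0 / size] * size
    return [d / total for d in draws]

def mean_largest_weight(concentration, size=5, trials=2000, seed=0):
    """Average weight of the dominant topic across many sampled documents."""
    rng = random.Random(seed)
    largest = (max(sample_dirichlet(concentration, size, rng)) for _ in range(trials))
    return sum(largest) / trials

# Assumption: alpha plays the role of the Dirichlet concentration parameter.
sparse = mean_largest_weight(0.1)   # small alpha: one topic tends to dominate
dense = mean_largest_weight(10.0)   # large alpha: weight spreads across topics
print(f"alpha=0.1  mean largest topic weight: {sparse:.2f}")
print(f"alpha=10.0 mean largest topic weight: {dense:.2f}")
```

A larger mean for the dominant topic means fewer topics carry meaningful weight in each document, which matches the behavior described for small alpha. The same reasoning applies to eta and the words within a topic.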
count_column:
Optional Argument.
Specifies the name of the input column that contains the count
of the corresponding word in the row, a NUMERIC value.
Types: str OR list of Strings (str)
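The input layout implied by docid_column, word_column, and count_column (one word in each row, with an optional count) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper to_token_rows is hypothetical and not part of teradataml, tokenization is a naive lowercase whitespace split, and the column names doc_id, token, and frequency are taken from the example in the EXAMPLES section.

```python
from collections import Counter

def to_token_rows(docs):
    """Turn {doc_id: text} into (doc_id, token, frequency) rows.

    Hypothetical helper for illustration only; not part of teradataml.
    """
    rows = []
    for doc_id, text in sorted(docs.items()):
        # Assumption: lowercase whitespace tokenization.
        counts = Counter(text.lower().split())
        for token, frequency in sorted(counts.items()):
            rows.append((doc_id, token, frequency))
    return rows

docs = {1: "brake failure brake noise", 2: "engine noise"}
for row in to_token_rows(docs):
    print(row)
```

Each printed tuple corresponds to one input row: the document identifier, one word, and the number of times that word occurs in the document.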
maxiter:
Optional Argument.
Specifies the maximum number of iterations to perform if the
model does not converge, a positive int value.
Default Value: 50
Types: int
convergence_delta:
Optional Argument.
Specifies the convergence delta of log perplexity, a NUMERIC
value in the range [0.0,1.0].
Default Value: 1.0E-4
Types: float
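How maxiter and convergence_delta interact can be sketched as a simple stopping rule. The sketch assumes that training stops once the absolute change in log perplexity between two consecutive iterations falls below convergence_delta, or after maxiter iterations, whichever comes first. The exact criterion used by the function is not stated here, and the helper iterations_until_stop is hypothetical.

```python
def iterations_until_stop(log_perplexities, maxiter=50, convergence_delta=1e-4):
    """Count the iterations performed before the stopping rule triggers.

    Assumption (not confirmed by the source): convergence means the absolute
    change in log perplexity between consecutive iterations is below
    convergence_delta.
    """
    previous = None
    iterations = 0
    for value in log_perplexities:
        if iterations >= maxiter:
            break  # iteration budget exhausted
        iterations += 1
        if previous is not None and abs(previous - value) < convergence_delta:
            break  # change is small enough: treat as converged
        previous = value
    return iterations

# Made-up log perplexity values, one per iteration, for illustration only.
trace = [7.0, 6.5, 6.2, 6.19995, 6.19994]
print(iterations_until_stop(trace))             # stops on convergence
print(iterations_until_stop(trace, maxiter=2))  # stops on the iteration cap
```

A looser convergence_delta or a smaller maxiter shortens training at the cost of a less refined model, which is the trade-off the example in the EXAMPLES section makes with maxiter = 30 and convergence_delta = 1e-3.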
seed:
Optional Argument.
Specifies the seed with which to initialize the model, an int value.
Given the same seed, cluster configuration, and data, the
function generates the same model. By default, the function
initializes the model randomly.
Types: int
out_topicnum:
Optional Argument.
Specifies the number of top-weighted topics and their weights to
include in the output teradataml DataFrame for each training
document. The value of out_topicnum must be either a positive int
or "all", which specifies all topics and their weights.
Default Value: "all"
Types: str
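The selection that out_topicnum describes can be sketched as a top-N filter over one document's topic weights. This is an illustration of the documented semantics only, not the function's implementation; the helper top_topics and the sample weights are hypothetical.

```python
def top_topics(topic_weights, out_topicnum="all"):
    """Return (topic_id, weight) pairs, highest weight first.

    Hypothetical helper for illustration only; not part of teradataml.
    """
    ranked = sorted(topic_weights.items(), key=lambda item: item[1], reverse=True)
    if out_topicnum == "all":
        return ranked
    count = int(out_topicnum)  # accepts 2 as well as "2"
    if count <= 0:
        raise ValueError("out_topicnum must be a positive int or 'all'")
    return ranked[:count]

# Made-up topic weights for a single document.
weights = {0: 0.1, 1: 0.6, 2: 0.3}
print(top_topics(weights, "2"))
print(top_topics(weights))
```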
out_topicwordnum:
Optional Argument.
Specifies the number of top topic words and their topic identifiers
to include in the output teradataml DataFrame for each training
document. The value of out_topicwordnum must be a positive int,
"all", or "none". The value "all" specifies all topic words and
their topic identifiers. The value "none" specifies no topic words
or topic identifiers.
Default Value: "none"
Types: str
initmodeltaskcount:
Optional Argument.
Specifies the number of vWorkers used to generate the initialized
model. By default, the function uses all available vWorkers to
initialize the model.
Note: This argument is available only when teradataml is connected to
Vantage 1.1.1 or later.
Types: int
data_sequence_column:
Optional Argument.
Specifies the list of column(s) that uniquely identifies each
row of the input argument "data". The argument is used to ensure
deterministic results for functions which produce results that
vary from run to run.
Types: str OR list of Strings (str)
RETURNS:
Instance of LDA.
Output teradataml DataFrames can be accessed using attribute
references, such as LDAObj.<attribute_name>.
Output teradataml DataFrame attribute names are:
1. model_table
2. doc_distribution_data
3. output
RAISES:
TeradataMlException
EXAMPLES:
# Load example data.
load_example_data("LDA", "complaints_traintoken")
# Create teradataml DataFrame objects.
# The training table is a log of vehicle complaints. The 'category'
# column indicates whether the car has been in a crash.
complaints_traintoken = DataFrame.from_table("complaints_traintoken")
# Example 1 - Function uses training data and parameters to build a topic model.
LDA_out = LDA(data = complaints_traintoken,
              topic_num = 5,
              docid_column = "doc_id",
              word_column = "token",
              count_column = "frequency",
              maxiter = 30,
              convergence_delta = 1e-3,
              seed = 2
              )
# Print the result teradataml DataFrame
print(LDA_out)
- __repr__(self)
- Returns the string representation of an LDA class instance.
- get_build_time(self)
- Returns the build time of the algorithm in seconds.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- get_prediction_type(self)
- Returns the prediction type of the algorithm.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- get_target_column(self)
- Returns the target column of the algorithm.
When the model object is created using retrieve_model(), the value returned is
the one saved in the Model Catalog.
- show_query(self)
- Returns the underlying SQL query.
When the model object is created using retrieve_model(), None is returned.