Methods defined here:
- __init__(self, data=None, time_data=None, count_rownumber=None, time_column=None, value_columns=None, time_interval=None, interpolation_type=None, aggregation_type=None, time_datatype=None, value_datatype=None, start_time=None, end_time=None, values_before_first=None, values_after_last=None, duplicate_rows_count=None, accumulate=None, data_sequence_column=None, time_data_sequence_column=None, count_rownumber_sequence_column=None, data_partition_column=None, count_rownumber_partition_column=None, data_order_column=None, time_data_order_column=None, count_rownumber_order_column=None)
- DESCRIPTION:
The Interpolator function calculates missing values in a time series,
using either interpolation or aggregation. Interpolation estimates
missing values between known values. Aggregation combines known
values to produce an aggregate value.
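To make the distinction concrete, the following is a minimal pure-Python sketch, not the teradataml implementation. The helper names linear_fill and window_mean and the sample numbers are hypothetical. The first helper estimates a missing value between two known points; the second combines known values into a single aggregate.

```python
def linear_fill(t0, v0, t1, v1, t):
    """Estimate the value at time t on the line through (t0, v0) and (t1, v1)."""
    if t1 == t0:
        raise ValueError("t0 and t1 must differ")
    fraction = (t - t0) / (t1 - t0)
    return v0 + fraction * (v1 - v0)


def window_mean(values):
    """Combine known values into a single aggregate (here: the mean)."""
    return sum(values) / len(values)


# Known points at t=0 (value 10.0) and t=4 (value 18.0); t=1 is missing.
print(linear_fill(0, 10.0, 4, 18.0, 1))  # interpolation: 12.0
print(window_mean([10.0, 18.0]))         # aggregation: 14.0
```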
PARAMETERS:
data:
Required Argument.
Specifies the teradataml DataFrame that contains the input data.
data_partition_column:
Required Argument.
Specifies Partition By columns for data.
Values to this argument can be provided as a list, if multiple
columns are used for partition.
Types: str OR list of Strings (str)
data_order_column:
Required Argument.
Specifies Order By columns for data.
Values to this argument can be provided as a list, if multiple
columns are used for ordering.
Types: str OR list of Strings (str)
time_data:
Optional Argument.
Specifies the teradataml DataFrame that contains the time points.
If you specify time_data then the function calculates an interpolated
value for each time point.
Note:
If you omit time_data, you must specify the time_interval
argument.
time_data_order_column:
Optional Argument.
Specifies Order By columns for time_data.
Values to this argument can be provided as a list, if multiple
columns are used for ordering.
Types: str OR list of Strings (str)
count_rownumber:
Optional Argument.
Specifies the teradataml DataFrame that contains the proportion
of time points.
Note:
It is used only with interpolation_type
"loess[(weights({constant | tricube}), degree({0 | 1 | 2}), span(m))]",
where m is between (x+1)/n and 1.
count_rownumber_partition_column:
Optional Argument.
Specifies Partition By columns for count_rownumber.
Values to this argument can be provided as a list, if multiple
columns are used for partition.
Types: str OR list of Strings (str)
count_rownumber_order_column:
Optional Argument.
Specifies Order By columns for count_rownumber.
Values to this argument can be provided as a list, if multiple
columns are used for ordering.
Types: str OR list of Strings (str)
time_column:
Required Argument.
Specifies the name of the input teradataml DataFrame data column that
contains the time points of the time series whose missing values are
to be calculated.
Types: str
value_columns:
Required Argument.
Specifies the names of input teradataml DataFrame data columns to
interpolate to the output teradataml DataFrame.
Types: str OR list of Strings (str)
time_interval:
Optional Argument. Required when time_data is not provided.
Specifies the length of time, in seconds, between calculated values.
If you specify time_interval then the function calculates an
interpolated value for a time point only if the value is missing
in the original time series; otherwise, the function copies the original value.
Note:
1. If you specify aggregation_type, the function ignores time_data or
time_interval and calculates the aggregated value for each point in the
time series.
2. Specify exactly one of time_data or time_interval.
Types: int or float
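The copy-or-interpolate behavior described for time_interval can be sketched in plain Python. This is an illustration only: the helper fill_on_grid is hypothetical, it assumes the grid starts at the first known time point, and it uses linear interpolation for the missing points.

```python
def fill_on_grid(series, interval):
    """series: dict mapping time (in seconds) -> known value.

    Returns (time, value) pairs on a regular grid from the first to the last
    known time. Known points are copied; missing points are linearly
    interpolated between their nearest known neighbors.
    """
    times = sorted(series)
    result = []
    t = times[0]
    while t <= times[-1]:
        if t in series:
            result.append((t, series[t]))  # copy the original value
        else:
            left = max(x for x in times if x < t)   # nearest known point before t
            right = min(x for x in times if x > t)  # nearest known point after t
            frac = (t - left) / (right - left)
            result.append((t, series[left] + frac * (series[right] - series[left])))
        t += interval
    return result


# Daily grid (86400 seconds); the point at 172800 is missing and gets estimated.
known = {0: 1.0, 86400: 2.0, 259200: 4.0}
print(fill_on_grid(known, 86400))
# [(0, 1.0), (86400, 2.0), (172800, 3.0), (259200, 4.0)]
```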
interpolation_type:
Optional Argument.
Specifies interpolation types for the columns that value_columns
specifies. If you specify interpolation_type, then it must be the
same size as value_columns. That is, if value_columns specifies n
columns, then interpolation_type must specify n interpolation types.
For i in [1, n], value_column_i has interpolation_type_i. However,
interpolation_type_i can be empty;
for example:
value_columns (c1, c2, c3)
interpolation_type ("linear", ,"constant")
An empty interpolation_type has the default value.
The possible values of interpolation_type are as follows.
* "linear" (default): The value for each missing time point is
determined using linear interpolation between the two nearest points.
* "constant": The value for each missing time point is set
to the nearest value.
You must use this option if value_column has SQL data type CHARACTER,
CHARACTER(n), or VARCHAR.
* "spline[(type(cubic))]": The value for each missing time point is
determined by fitting a cubic spline to the nearest three points.
* "median[(window(n))]": The value for each missing time point is set
to the median value of the nearest n time points.
n must be greater than or equal to 2.
The default value of n is 5.
* "loess[(weights({constant | tricube}), degree({0 | 1 | 2}),
span(m))]": The value for each missing time point is calculated using a
low-degree polynomial based on a set of nearest neighbors.
* weights:
* constant: All time points are equally weighted.
* tricube: Time points closer to the missing data point are more heavily
weighted than those farther away.
The default value is constant.
* degree: Degree of the polynomial.
The default value is 1.
* span(m): m takes one of two forms:
* An integer greater than 1, which specifies the number of
neighboring points to use in each fit.
* A proportion of the time points to use in each fit.
In this case you must provide count_rownumber, and m must be between
(x+1)/n and 1, where x is the specified degree and n is the number of
rows in the partition.
The default value of m is 5.
Note:
1. Specify only one of interpolation_type or aggregation_type.
2. If you omit both syntax elements, the function uses interpolation_type
with its default value, 'linear'.
3. For SQL data types CHARACTER, CHARACTER(n), and VARCHAR, you cannot use
aggregation_type. You must use interpolation_type, and interpolation_type
must be 'constant'.
4. In interpolation_type syntax, brackets do not indicate optional
elements - you must include them.
Types: str OR list of strs
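Because the bracketed syntax is easy to mistype, a small helper that builds the strings can be useful. The sketch below is not part of teradataml; the names median_spec, loess_spec, and check_lengths are hypothetical, and treating the span bounds as inclusive is an assumption.

```python
def median_spec(window=5):
    """Build a 'median[(window(n))]' string; n must be at least 2."""
    if window < 2:
        raise ValueError("window must be >= 2")
    return f"median[(window({window}))]"


def loess_spec(weights="constant", degree=1, span=5, rows_in_partition=None):
    """Build a 'loess[(weights(...),degree(...),span(m))]' string."""
    if weights not in ("constant", "tricube"):
        raise ValueError("weights must be 'constant' or 'tricube'")
    if degree not in (0, 1, 2):
        raise ValueError("degree must be 0, 1, or 2")
    if not (isinstance(span, int) and span > 1):
        # Proportional span: needs the partition row count (n) to check
        # the bound (x+1)/n <= m <= 1, where x is the degree.
        if rows_in_partition is None:
            raise ValueError("a proportional span needs rows_in_partition")
        lower = (degree + 1) / rows_in_partition
        if not (lower <= span <= 1):
            raise ValueError(f"span must be between {lower} and 1")
    return f"loess[(weights({weights}),degree({degree}),span({span}))]"


def check_lengths(value_columns, interpolation_types):
    """interpolation_type must list one entry per value column."""
    if len(value_columns) != len(interpolation_types):
        raise ValueError("interpolation_type must match value_columns in size")


print(median_spec(4))               # median[(window(4))]
print(loess_spec("constant", 2, 4)) # loess[(weights(constant),degree(2),span(4))]
```

The resulting strings can be passed as the interpolation_type argument, as in the examples at the end of this section.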
aggregation_type:
Optional Argument.
Specifies the aggregation types of the columns that value_columns
specifies. If you specify aggregation_type, then it must be the same
size as value_columns. That is, if value_columns specifies n columns,
then aggregation_type must specify n aggregation types. For i in [1,
n], value_column_i has aggregation_type_i. However, aggregation_type_i
can be empty;
for example:
value_columns (c1, c2, c3)
aggregation_type (min, ,max)
An empty aggregation_type has the default value.
The syntax of aggregation_type is:
{ min | max | mean | mode | sum } [(window(n))]
The function calculates the aggregate value as the minimum, maximum,
mean, mode, or sum within a sliding window of length n. n must be
greater than or equal to 2.
The default value of n is 5.
The default aggregation method is min.
The Interpolator function can calculate the aggregates of values of
these SQL data types:
* int
* BIGINT
* SMALLINT
* float
* DECIMAL(n,n)
* DECIMAL
* NUMERIC
* NUMERIC(n,n)
Note:
1. Specify only one of aggregation_type or interpolation_type.
2. If you omit both syntax elements, the function uses interpolation_type
with its default value, 'linear'.
3. Aggregation calculations ignore the values in time_interval or in the
time_data. The function calculates the aggregated value for each value
in the time series.
4. In aggregation_type syntax, brackets do not indicate optional
elements - you must include them.
Types: str OR list of strs
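The sliding-window aggregation can be illustrated with a short pure-Python sketch. This is not the function's actual implementation: the helper sliding_aggregate is hypothetical, and it assumes a trailing window (the current value and the values before it), which the description above does not specify.

```python
from statistics import mean, mode

# Aggregation methods named in the aggregation_type syntax.
AGGREGATES = {"min": min, "max": max, "mean": mean, "mode": mode, "sum": sum}


def sliding_aggregate(values, method="min", window=5):
    """Aggregate each value with up to window-1 preceding values."""
    if window < 2:
        raise ValueError("window must be >= 2")
    func = AGGREGATES[method]
    out = []
    for i in range(len(values)):
        start = max(0, i - window + 1)  # window is shorter at the start of the series
        out.append(func(values[start:i + 1]))
    return out


prices = [5, 3, 8, 6]
print(sliding_aggregate(prices, "min", window=2))  # [5, 3, 3, 6]
print(sliding_aggregate(prices, "sum", window=2))  # [5, 8, 11, 14]
```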
time_datatype:
Optional Argument.
Specifies the data type of the output column that corresponds to the
input teradataml DataFrame data column that time_column specifies.
If you omit this argument, then the function infers the data type of
time_column from the input teradataml DataFrame data and uses the inferred
data type for the corresponding output teradataml DataFrame column.
If you specify this argument, then the function can transform the input
data to the specified output data type only if both the input column
data type and the specified output column data type are in this list:
* int
* BIGINT
* SMALLINT
* float
* DECIMAL(n,n)
* DECIMAL
* NUMERIC
* NUMERIC(n,n)
Types: str
value_datatype:
Optional Argument.
Specifies the data types of the output columns that correspond to
the input teradataml DataFrame data columns that value_columns specifies.
If you omit this argument, then the function infers the data type of
each value_column from the input teradataml DataFrame data and uses the
inferred data type for the corresponding output teradataml DataFrame
column.
If you specify value_datatype, then it must be the same size as
value_columns. That is, if value_columns specifies n columns, then
value_datatype must specify n data types. For i in [1, n], value_column_i
has value_type_i. However, value_type_i can be empty;
for example:
value_columns (c1, c2, c3)
value_datatype (int, ,VARCHAR)
If you specify this argument, then the function can transform the
input data to the specified output data type only if both the input
column data type and the specified output column data type are
in this list:
* int
* BIGINT
* SMALLINT
* float
* DECIMAL(n,n)
* DECIMAL
* NUMERIC
* NUMERIC(n,n)
Types: str OR list of strs
start_time:
Optional Argument.
Specifies the start time for the time series.
The default value is the start time of the time series in input
teradataml DataFrame.
Types: str
end_time:
Optional Argument.
Specifies the end time for the time series.
The default value is the end time of the time series in input
teradataml DataFrame.
Types: str
values_before_first:
Optional Argument.
Specifies the values to use if start_time is before the start time of
the time series in input teradataml DataFrame. Each of these values
must have the same data type as its corresponding value_column. Values
of data type VARCHAR are case-insensitive.
If value_columns specifies n columns, then values_before_first must
specify n values. For i in [1, n], value_column_i has the value
before_first_value_i. However, before_first_value_i can be empty;
for example:
value_columns (c1, c2, c3)
values_before_first (1, ,"abc")
If before_first_value_i is empty, then value_column_i has the value NULL.
If you do not specify values_before_first, then value_column_i has the
value NULL for i in [1, n].
Types: str OR list of strs
values_after_last:
Optional Argument.
Specifies the values to use if end_time is after the end time of the
time series in input teradataml DataFrame. Each of these values must
have the same data type as its corresponding value_column. Values of
data type VARCHAR are case-insensitive.
If value_columns specifies n columns, then values_after_last must
specify n values. For i in [1, n], value_column_i has the value
after_last_value_i. However, after_last_value_i can be empty;
for example:
value_columns (c1, c2, c3)
values_after_last (1, ,"abc")
If after_last_value_i is empty, then value_column_i has the value NULL.
If you do not specify values_after_last, then value_column_i has the
value NULL for i in [1, n].
Types: str OR list of strs
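The effect of values_before_first and values_after_last can be sketched as padding the series out to start_time and end_time. The helper pad_series below is hypothetical and only illustrates the idea for a single value column; it assumes the padded time points fall on a regular grid, and it uses None to stand in for NULL.

```python
def pad_series(points, start_time, end_time, interval,
               before_first=None, after_last=None):
    """points: list of (time, value) pairs sorted by time.

    Time points before the first known time get before_first; time points
    after the last known time get after_last. None stands in for NULL.
    """
    first_time, last_time = points[0][0], points[-1][0]
    padded = []
    t = start_time
    while t < first_time:
        padded.append((t, before_first))
        t += interval
    padded.extend(p for p in points if start_time <= p[0] <= end_time)
    t = last_time + interval
    while t <= end_time:
        padded.append((t, after_last))
        t += interval
    return padded


# before_first is given, after_last is omitted and therefore None (NULL).
print(pad_series([(10, 1.5), (20, 2.5)], 0, 30, 10, before_first=0.0))
# [(0, 0.0), (10, 1.5), (20, 2.5), (30, None)]
```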
duplicate_rows_count:
Optional Argument.
Specifies the number of rows to duplicate across split boundaries if
you use the SeriesSplitter function.
If you specify only value1, then the function duplicates value1 rows
from the previous partition and value1 rows from the next partition.
If you specify both value1 and value2, then the function duplicates value1
rows from the previous partition and value2 rows from the next partition.
Each argument value must be a non-negative int. Both value1 and value2 must
exceed the number of time points that the function needs for every
specified interpolation or aggregation method. For aggregation, the
number of time points required is determined by the value of n in window(n)
specified by aggregation_type.
The interpolation methods and the number of time points that the function
needs for them are:
* "linear": 1
* "constant": 1
* "spline": 2
* "median [(window(n))]": n/2
* "loess [(weights ({constant | tricube}), degree ({0 | 1 | 2}), span(m))]":
* m > 1: m-1
* m < 1: (m * n)-1
where n is total number of data rows, found in column n of the
count_rownumber DataFrame.
Types: int OR list of ints
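The list above translates directly into a small lookup. The sketch below is an illustration only: the helper points_needed is hypothetical, and rounding fractional results up to a whole number of rows is an assumption the description does not state.

```python
import math


def points_needed(method, window=None, span=None, total_rows=None):
    """Number of time points an interpolation method needs, per the list above."""
    if method in ("linear", "constant"):
        return 1
    if method == "spline":
        return 2
    if method == "median":
        return math.ceil(window / 2)  # n/2, rounded up (assumption)
    if method == "loess":
        if span > 1:
            return int(span) - 1      # m > 1: m-1
        # m < 1: (m * n)-1, with n taken from the count_rownumber DataFrame
        return math.ceil(span * total_rows) - 1
    raise ValueError(f"unknown method: {method}")


needed = max(
    points_needed("linear"),
    points_needed("median", window=4),
    points_needed("loess", span=0.5, total_rows=20),
)
print(needed)  # 9
```

Since value1 and value2 must exceed this number, a duplicate_rows_count of at least needed + 1 would satisfy the condition for these three methods.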
accumulate:
Optional Argument.
Specifies the names of input teradataml DataFrame columns (other than those
specified by time_column and value_columns) to copy to the output
teradataml DataFrame.
By default, the function copies to the output teradataml DataFrame only
the columns specified by time_column and value_columns.
Types: str OR list of Strings (str)
data_sequence_column:
Optional Argument.
Specifies the list of column(s) that uniquely identifies each row of
the input argument "data". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: str OR list of Strings (str)
time_data_sequence_column:
Optional Argument.
Specifies the list of column(s) that uniquely identifies each row of
the input argument "time_data". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: str OR list of Strings (str)
count_rownumber_sequence_column:
Optional Argument.
Specifies the list of column(s) that uniquely identifies each row of
the input argument "count_rownumber". The argument is used to ensure
deterministic results for functions which produce results that vary
from run to run.
Types: str OR list of Strings (str)
RETURNS:
Instance of Interpolator.
Output teradataml DataFrames can be accessed using attribute
references, such as InterpolatorObj.<attribute_name>.
Output teradataml DataFrame attribute name is:
result
RAISES:
TeradataMlException
EXAMPLES:
# Load the data to run the example.
load_example_data("Interpolator", ["ibm_stock1", "time_table1"])
# Create teradataml DataFrame.
ibm_stock1 = DataFrame.from_table("ibm_stock1")
time_table1 = DataFrame.from_table("time_table1")
# Example 1 : Running Interpolator function with aggregation_type min.
interpolator_out1 = Interpolator(data=ibm_stock1,
data_partition_column='id',
data_order_column='period',
time_data=time_table1,
time_data_order_column='period',
time_column='period',
value_columns='stockprice',
accumulate='id',
aggregation_type='min[(window(2))]',
values_before_first='0',
values_after_last='0',
data_sequence_column='period'
)
# Print the result DataFrame.
print(interpolator_out1.result)
# Example 2 : Running Interpolator function with constant interpolation.
interpolator_out2 = Interpolator(data=ibm_stock1,
data_partition_column='id',
data_order_column='period',
time_column='period',
value_columns='stockprice',
accumulate='id',
time_interval=86400.0,
interpolation_type='constant',
values_before_first='0',
values_after_last='0'
)
# Print the result DataFrame.
print(interpolator_out2.result)
# Example 3 : Running Interpolator function with linear interpolation.
interpolator_out3 = Interpolator(data=ibm_stock1,
data_partition_column='id',
data_order_column='period',
time_column='period',
value_columns='stockprice',
accumulate='id',
time_interval=86400.0,
interpolation_type='linear',
values_before_first='0',
values_after_last='0'
)
# Print the result DataFrame.
print(interpolator_out3.result)
# Example 4 : Running Interpolator function with median interpolation.
interpolator_out4 = Interpolator(data=ibm_stock1,
data_partition_column='id',
data_order_column='period',
time_column='period',
value_columns='stockprice',
accumulate='id',
time_interval=86400.0,
interpolation_type='median[(window(4))]',
values_before_first='0',
values_after_last='0'
)
# Print the result DataFrame.
print(interpolator_out4.result)
# Example 5 : Running Interpolator function with spline interpolation.
interpolator_out5 = Interpolator(data=ibm_stock1,
data_partition_column='id',
data_order_column='period',
time_column='period',
value_columns='stockprice',
accumulate='id',
time_interval=86400.0,
interpolation_type='spline[(type(cubic))]',
values_before_first='0',
values_after_last='0'
)
# Print the result DataFrame.
print(interpolator_out5.result)
# Example 6 : Running Interpolator function with loess interpolation.
interpolator_out6 = Interpolator(data=ibm_stock1,
data_partition_column='id',
data_order_column='period',
time_column='period',
value_columns='stockprice',
accumulate='id',
time_interval=86400.0,
interpolation_type='loess[(weights(constant),degree(2),span(4))]',
values_before_first='0',
values_after_last='0'
)
# Print the result DataFrame.
print(interpolator_out6.result)
- __repr__(self)
- Returns the string representation for an Interpolator class instance.
- get_build_time(self)
- Function to return the build time of the algorithm in seconds.
When model object is created using retrieve_model(), then the value returned is
as saved in the Model Catalog.
- get_prediction_type(self)
- Function to return the Prediction type of the algorithm.
When model object is created using retrieve_model(), then the value returned is
as saved in the Model Catalog.
- get_target_column(self)
- Function to return the Target Column of the algorithm.
When model object is created using retrieve_model(), then the value returned is
as saved in the Model Catalog.
- show_query(self)
- Function to return the underlying SQL query.
When model object is created using retrieve_model(), then None is returned.