December 2024 |
20.00.00.02 |
- New feature/functionality:
- teradatamlspk global functions (a usage sketch follows this list)
- years - Partition transform function: A transform for timestamps and dates to partition data into years.
- days - Partition transform function: A transform for timestamps and dates to partition data into days.
- months - Partition transform function: A transform for timestamps and dates to partition data into months.
- hours - Partition transform function: A transform for timestamps and dates to partition data into hours.
- udf - Creates a user defined function (UDF).
- conv - Convert a number in a string column from one base to another.
- log - Returns the first argument-based logarithm of the second argument.
- log2 - Returns the base-2 logarithm of the argument.
- date_from_unix_date - Create date from the number of days since 1970-01-01.
- extract - Extracts a part of the date/timestamp or interval source.
- datepart - Extracts a part of the date/timestamp or interval source.
- date_part - Extracts a part of the date/timestamp or interval source.
- make_dt_interval - Make DayTimeIntervalType duration from days, hours, mins and secs.
- make_timestamp - Create timestamp from years, months, days, hours, mins, secs and timezone fields.
- make_timestamp_ltz - Create the current timestamp with local time zone from years, months, days, hours, mins, secs and timezone fields.
- make_timestamp_ntz - Create local date-time from years, months, days, hours, mins, secs fields.
- make_ym_interval - Make year-month interval from years, months.
- make_date - Returns a column with a date built from the year, month and day columns.
- from_unixtime - Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp.
- unix_timestamp - Convert time string with given pattern to unix epoch.
- to_unix_timestamp - Convert time string with given pattern to unix epoch.
- to_timestamp - Converts a string column to timestamp.
- to_timestamp_ltz - Converts a string column to timestamp.
- to_timestamp_ntz - Converts a string column to timestamp.
- from_utc_timestamp - Converts column to utc timestamp from different timezone columns.
- to_utc_timestamp - Converts column to given timestamp from utc timezone columns.
- timestamp_micros - Creates timestamp from the number of microseconds since UTC epoch.
- timestamp_millis - Creates timestamp from the number of milliseconds since UTC epoch.
- timestamp_seconds - Converts the number of seconds from the Unix epoch to a timestamp.
- unix_micros - Returns the number of microseconds since 1970-01-01 00:00:00 UTC.
- unix_millis - Returns the number of milliseconds since 1970-01-01 00:00:00 UTC.
- unix_seconds - Returns the number of seconds since 1970-01-01 00:00:00 UTC.
- base64 - Computes the BASE64 encoding of a binary column and returns it as a string column.
- current_timezone - Returns the current session local timezone.
- format_string - Formats the arguments in printf-style and returns the result as a string column.
- elt - Returns the n-th input, e.g., returns input2 when n is 2. The function returns NULL if the index exceeds the length of the array.
- to_varchar - Convert col to a string based on the format.
- current_catalog - Returns the current catalog.
- equal_null - Returns the same result as the EQUAL(=) operator for non-null operands, but returns true if both are null, false if one of them is null.
- version - Returns the teradatamlspk version.
- parse_url - Extracts a part from a URL.
- reverse - Returns a reversed string or an array with elements in reverse order.
- convert_timezone - Converts the timestamp without time zone sourceTs from the sourceTz time zone to targetTz.
- call_udf - Calls a registered user defined function (UDF).
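A minimal sketch of a few of the new global functions. This assumes teradatamlspk mirrors the PySpark function signatures; the module paths, TeradataSession builder, and sample data are illustrative assumptions, not confirmed API.

```python
# Illustrative sketch only: module paths and session setup are assumed
# to mirror PySpark; the sample data is made up.
from teradatamlspk.sql import TeradataSession
from teradatamlspk.sql.functions import make_date, from_unixtime, timestamp_seconds, conv

spark = TeradataSession.builder.getOrCreate()
df = spark.createDataFrame([(2024, 12, 1, 1733011200, "ff")],
                           ["y", "m", "d", "epoch", "hexval"])

df.select(
    make_date(df.y, df.m, df.d),     # year/month/day columns -> DATE
    from_unixtime(df.epoch),         # epoch seconds -> timestamp string
    timestamp_seconds(df.epoch),     # epoch seconds -> timestamp
    conv(df.hexval, 16, 10),         # base-16 string -> base-10 string ("ff" -> "255")
).show()
```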
- teradatamlspk UDFRegistration
- register() - Registers a Python function as a user defined function (UDF).
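A hedged sketch of the UDF workflow, assuming teradatamlspk follows the PySpark register/call_udf signatures; the session setup and sample data are assumptions.

```python
from teradatamlspk.sql import TeradataSession
from teradatamlspk.sql.functions import call_udf

spark = TeradataSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# register() registers a Python function as a UDF under a callable name.
spark.udf.register("to_upper", lambda s: s.upper())

# call_udf() invokes the registered UDF by name on a column.
df.select(call_udf("to_upper", df.name)).show()
```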
- teradatamlspk DataFrameColumn a.k.a. ColumnExpression
- eqNullSafe() - Equality test that is safe for null values.
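A short sketch of eqNullSafe(), assuming PySpark's null-safe equality semantics carry over; df here is a hypothetical DataFrame with a nullable column val.

```python
# Plain equality (val == None) yields NULL for every row;
# eqNullSafe treats NULL == NULL as True and NULL == value as False.
df.select(df.val.eqNullSafe(None)).show()
```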
- teradatamlspk MLlib Functions
- RegexTokenizer() - Extracts tokens based on the pattern.
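A sketch of RegexTokenizer, assuming teradatamlspk.ml.feature mirrors pyspark.ml.feature; the module path and the "text" column are assumptions.

```python
from teradatamlspk.ml.feature import RegexTokenizer

# Split the "text" column into tokens on non-word characters.
tokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W")
tokens = tokenizer.transform(df)
```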
- pyspark2teradataml
- pyspark2teradataml utility now accepts a directory containing PySpark scripts as input.
- pyspark2teradataml utility now accepts a PySpark notebook as input.
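A hedged sketch of the conversion utility; the import path and the file and directory names are assumptions.

```python
from teradatamlspk import pyspark2teradataml

pyspark2teradataml("my_script.py")       # single PySpark script
pyspark2teradataml("my_notebook.ipynb")  # PySpark notebook (new in this release)
pyspark2teradataml("pyspark_scripts/")   # directory of scripts (new in this release)
```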
- Updates
- spark.conf.set - Supports setting the time zone for the session (see the sketch after this list).
- spark.conf.unset - Supports unsetting the session time zone, reverting to the previous time zone set by the user.
- DataFrame.select(), DataFrame.withColumn(), and DataFrame.withColumns() now accept functions such as like, ilike, isNull, isNotNull, contains, startswith, and endswith, as well as boolean and binary expressions, without a when clause.
- DataFrameColumn.cast() and DataFrameColumn.astype() functions support TimestampNTZType, DayTimeIntervalType, YearMonthIntervalType.
- DataFrame.createTempView() and DataFrame.createOrReplaceTempView() functions now drop the view at the end of the session.
- DataFrame.agg() and GroupedData.agg() functions support aggregate functions generated using arithmetic operators.
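The sketch referenced above, given the spark session and df from the earlier sketches. The PySpark-style runtime conf API and the spark.sql.session.timeZone key are assumptions.

```python
# Set the session time zone, then unset to revert to the previous zone.
spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
spark.conf.unset("spark.sql.session.timeZone")

# select() now accepts boolean expressions directly, without a when clause.
df.select(df.name.like("a%"), df.name.isNull())
```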
- Bug Fixes
- DataFrame.withColumnRenamed() and DataFrame.withColumnsRenamed() now work when a column is renamed to the name of an existing column, irrespective of case.
- DataFrame.join() now works like PySpark when a column name or a list of column names is passed to the ON clause.
|
August 2024 |
20.00.00.01 |
- teradatamlspk DataFrame
- write() - Supports writing the DataFrame to the local file system, to Vantage, or to cloud storage (see the sketch below).
- writeTo() - Supports writing the DataFrame to a Vantage table.
- rdd - Returns the same DataFrame.
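A sketch of the new write paths, assuming the PySpark writer APIs carry over; the output path and table name are illustrative.

```python
df.write.csv("/tmp/out")              # write to the local file system
df.writeTo("sales_summary").create()  # persist the DataFrame as a Vantage table
```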
- teradatamlspk DataFrameColumn (ColumnExpression)
- desc_nulls_first - Returns a sort expression based on the descending order of the given column name, and null values appear before non-null values.
- desc_nulls_last - Returns a sort expression based on the descending order of the given column name, and null values appear after non-null values.
- asc_nulls_first - Returns a sort expression based on the ascending order of the given column name, and null values appear before non-null values.
- asc_nulls_last - Returns a sort expression based on the ascending order of the given column name, and null values appear after non-null values.
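A sketch of the null-ordering sort expressions, mirroring the PySpark column API; df and its val column are assumptions.

```python
df.orderBy(df.val.desc_nulls_last()).show()   # NULLs sort after non-NULL values
df.orderBy(df.val.asc_nulls_first()).show()   # NULLs sort before non-NULL values
```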
- Updates
- DataFrame.fillna() and DataFrame.na.fill() now support input arguments of the same data type or of compatible types.
- DataFrame.agg() and GroupedData.agg() functions support Column as input and '*' for 'count'.
- DataFrameColumn.cast() and DataFrameColumn.alias() now accept string literals, which are case insensitive.
- Optimized performance for DataFrame.show().
- Classification Summary, TrainingSummary object, and MulticlassClassificationEvaluator now support the weightedTruePositiveRate and weightedFalsePositiveRate metrics.
- Arithmetic operations can be performed on window aggregates (see the sketch after this list).
- Added new function time_difference to return the difference between two timestamps in seconds.
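The sketch referenced above: '*' with count and arithmetic on a window aggregate, assuming PySpark-style Window and functions modules; the module paths and the name/val columns are assumptions.

```python
from teradatamlspk.sql import Window
from teradatamlspk.sql.functions import count, sum as sum_

df.agg(count("*"))  # '*' is now supported for 'count'

# Arithmetic on a window aggregate: each row's share of its group total.
w = Window.partitionBy("name")
df.withColumn("pct", df.val / sum_("val").over(w) * 100)
```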
- Bug fixes:
- DataFrame.head() now returns a list when n is 1.
- DataFrame.union() and DataFrame.unionAll() now perform the union of rows based on column position.
- DataFrame.groupBy() and DataFrame.groupby() now accept columns as positional arguments as well, for example df.groupBy("col1", "col2").
- MLlib function attributes numClasses and intercept now return values.
- An appropriate error is raised if an invalid file is passed to pyspark2teradataml.
- The when function now accepts a Column as well as a literal for the value argument (see the sketch below).
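A sketch of the when() fix, assuming the PySpark when/otherwise signatures; df and its val and other columns are assumptions.

```python
from teradatamlspk.sql.functions import when

# Fall back to another column instead of a literal when val is NULL.
df.withColumn("v2", when(df.val.isNull(), df.other).otherwise(df.val))
```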
|
March 2024 |
20.00.00.00 |
Initial release.
- A pyspark2teradataml utility function to automatically convert PySpark scripts to teradataml format.
- Supports the following:
- 85 DataFrame APIs with syntax similar to PySpark DataFrame APIs.
- 22 DataFrameColumn APIs with syntax similar to PySpark DataFrameColumn APIs.
- 200 functions with syntax similar to PySpark functions.
- 69 machine learning functions with syntax similar to PySpark machine learning functions.
|