Key Feature Additions and Changes

Teradata® pyspark2teradataml User Guide

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Teradata Package for Python
Release Number: 20.00
Published: December 2024
Product Category: Teradata Vantage
The following table lists the key feature additions and changes in the Teradata product pyspark2teradataml.
December 2024, Release 20.00.00.02
  • New feature/functionality:
    • teradatamlspk DataFrameReader

      table() - Returns the specified table as a DataFrame.

    • teradatamlspk DataFrameWriterV2
      • partitionedBy - Partitions the output table created by create, createOrReplace, or replace using the given columns or transforms.
      • option - Adds an output option while writing a DataFrame to a data source.
      • options - Adds output options while writing a DataFrame to a data source.
  • teradatamlspk global functions
    • years - Partition transform function: A transform for timestamps and dates to partition data into years.
    • days - Partition transform function: A transform for timestamps and dates to partition data into days.
    • months - Partition transform function: A transform for timestamps and dates to partition data into months.
    • hours - Partition transform function: A transform for timestamps and dates to partition data into hours.
    • udf - Creates a user defined function (UDF).
    • conv - Convert a number in a string column from one base to another.
    • log - Returns the logarithm of the second argument, using the first argument as the base.
    • log2 - Returns the base-2 logarithm of the argument.
    • date_from_unix_date - Create date from the number of days since 1970-01-01.
    • extract - Extracts a part of the date/timestamp or interval source.
    • datepart - Extracts a part of the date/timestamp or interval source.
    • date_part - Extracts a part of the date/timestamp or interval source.
    • make_dt_interval - Make DayTimeIntervalType duration from days, hours, mins and secs.
    • make_timestamp - Create timestamp from years, months, days, hours, mins, secs and timezone fields.
    • make_timestamp_ltz - Create the current timestamp with local time zone from years, months, days, hours, mins, secs and timezone fields.
    • make_timestamp_ntz - Create local date-time from years, months, days, hours, mins, secs fields.
    • make_ym_interval - Make year-month interval from years, months.
    • make_date - Returns a column with a date built from the year, month and day columns.
    • from_unixtime - Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp.
    • unix_timestamp - Convert time string with given pattern to unix epoch.
    • to_unix_timestamp - Convert time string with given pattern to unix epoch.
    • to_timestamp - Converts a string column to timestamp.
    • to_timestamp_ltz - Converts a string column to timestamp.
    • to_timestamp_ntz - Converts a string column to timestamp.
    • from_utc_timestamp - Converts column to utc timestamp from different timezone columns.
    • to_utc_timestamp - Converts column to given timestamp from utc timezone columns.
    • timestamp_micros - Creates timestamp from the number of microseconds since UTC epoch.
    • timestamp_millis - Creates timestamp from the number of milliseconds since UTC epoch.
    • timestamp_seconds - Converts the number of seconds from the Unix epoch to a timestamp.
    • unix_micros - Returns the number of microseconds since 1970-01-01 00:00:00 UTC.
    • unix_millis - Returns the number of milliseconds since 1970-01-01 00:00:00 UTC.
    • unix_seconds - Returns the number of seconds since 1970-01-01 00:00:00 UTC.
    • base64 - Computes the BASE64 encoding of a binary column and returns it as a string column.
    • current_timezone - Returns the current session local timezone.
    • format_string - Formats the arguments in printf-style and returns the result as a string column.
    • elt - Returns the n-th input, e.g., returns input2 when n is 2. The function returns NULL if the index exceeds the length of the array.
    • to_varchar - Convert col to a string based on the format.
    • current_catalog - Returns the current catalog.
    • equal_null - Returns the same result as the EQUAL(=) operator for non-null operands, but returns true if both operands are null and false if one of them is null.
    • version - Returns the teradatamlspk version.
    • parse_url - Extracts a part from a URL.
    • reverse - Returns a reversed string with reverse order of elements.
    • convert_timezone - Converts the timestamp without time zone sourceTs from the sourceTz time zone to targetTz.
    • call_udf - Calls a registered user defined function (UDF).
  • teradatamlspk UDFRegistration

    register() - Registers a user defined function (UDF).

  • teradatamlspk DataFrameColumn a.k.a. ColumnExpression

    eqNullSafe() - Equality test that is safe for null values.

  • teradatamlspk MLlib Functions

    RegexTokenizer() - Extracts tokens based on the pattern.

  • pyspark2teradataml

    pyspark2teradataml utility accepts a directory containing PySpark scripts as input.

    pyspark2teradataml utility accepts a PySpark notebook as input.

  • Updates
    • spark.conf.set - Supports setting the time zone for the session.
    • spark.conf.unset - Supports resetting the time zone to the previous time zone set by the user.
    • DataFrame.select(), DataFrame.withColumn(), and DataFrame.withColumns() now accept functions such as like, ilike, isNull, isNotNull, contains, startswith, and endswith, as well as boolean and binary expressions, without a when clause.
    • DataFrameColumn.cast() and DataFrameColumn.astype() support TimestampNTZType, DayTimeIntervalType, and YearMonthIntervalType.
    • DataFrame.createTempView() and DataFrame.createOrReplaceTempView() now drop the view at the end of the session.
    • DataFrame.agg() and GroupedData.agg() support aggregate functions generated using arithmetic operators.
  • Bug Fixes

    DataFrame.withColumnRenamed() and DataFrame.withColumnsRenamed() now work when a column is renamed to the name of an existing column, irrespective of case.

    DataFrame.join() now works like PySpark when a column name or a list of column names is passed to the on clause.
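The null-safe equality behavior shared by the new equal_null function and DataFrameColumn.eqNullSafe() can be illustrated with a small plain-Python sketch. This is not the teradatamlspk implementation, only the comparison rule it follows, with None standing in for SQL NULL:

```python
def equal_null(a, b):
    # Null-safe equality: True when both operands are NULL,
    # False when exactly one is NULL, ordinary equality otherwise.
    # None stands in for SQL NULL in this sketch.
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b
```

Unlike the plain EQUAL(=) operator, which yields NULL whenever either operand is NULL, this comparison always produces a true or false result.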
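Several of the new epoch-related functions (date_from_unix_date, timestamp_seconds, unix_seconds) share one convention: counting days or seconds from 1970-01-01 00:00:00 UTC. A minimal plain-Python sketch of that convention, using illustrative stand-in functions rather than the teradatamlspk API itself:

```python
from datetime import date, datetime, timedelta, timezone

def date_from_unix_date(days):
    # Build a date from the number of days since 1970-01-01,
    # as date_from_unix_date computes.
    return date(1970, 1, 1) + timedelta(days=days)

def timestamp_seconds(secs):
    # Seconds since the Unix epoch -> UTC timestamp.
    return datetime.fromtimestamp(secs, tz=timezone.utc)

def unix_seconds(ts):
    # Timestamp -> whole seconds since the Unix epoch.
    return int(ts.timestamp())
```

timestamp_micros, timestamp_millis, unix_micros, and unix_millis follow the same convention at microsecond and millisecond resolution.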
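The base conversion that conv performs on a string column can be sketched in plain Python as follows; this is an illustrative equivalent of the computation, not the teradatamlspk implementation:

```python
def conv(num_str, from_base, to_base):
    # Convert the string representation of a number from one base
    # to another, mirroring what conv computes per value.
    digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    value = int(num_str, from_base)
    if value == 0:
        return "0"
    out = []
    while value:
        value, rem = divmod(value, to_base)
        out.append(digits[rem])
    return "".join(reversed(out))
```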

August 2024, Release 20.00.00.01
  • teradatamlspk DataFrame
    • write() - Supports writing the DataFrame to the local file system, to Vantage, or to cloud storage.
    • writeTo() - Supports writing the DataFrame to a Vantage table.
    • rdd - Returns the same DataFrame.
  • teradatamlspk DataFrameColumn (ColumnExpression)
    • desc_nulls_first - Returns a sort expression based on the descending order of the given column name, and null values appear before non-null values.
    • desc_nulls_last - Returns a sort expression based on the descending order of the given column name, and null values appear after non-null values.
    • asc_nulls_first - Returns a sort expression based on the ascending order of the given column name, and null values appear before non-null values.
    • asc_nulls_last - Returns a sort expression based on the ascending order of the given column name, and null values appear after non-null values.
  • Updates
    • DataFrame.fillna() and DataFrame.na.fill() now require input arguments of the same or compatible data types.
    • DataFrame.agg() and GroupedData.agg() support Column as input and '*' for 'count'.
    • DataFrameColumn.cast() and DataFrameColumn.alias() now accept case-insensitive string literals.
    • Optimized performance for DataFrame.show().
    • Classification Summary, TrainingSummary object, and MulticlassClassificationEvaluator now support the weightedTruePositiveRate and weightedFalsePositiveRate metrics.
    • Arithmetic operations can be performed on window aggregates.
    • Added new function time_difference to return the difference between two timestamps in seconds.
  • Bug fixes:
    • DataFrame.head() returns a list when n is 1.
    • DataFrame.union() and DataFrame.unionAll() now perform a union of rows based on column position.
    • DataFrame.groupBy() and DataFrame.groupby() now accept columns as positional arguments as well, for example df.groupBy("col1", "col2").
    • MLlib Functions attributes numClasses and intercept now return values.
    • An appropriate error is raised if an invalid file is passed to pyspark2teradataml.
    • The when function now accepts a Column, in addition to a literal, for the value argument.
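The null-ordering sort expressions added in this release (desc_nulls_first, desc_nulls_last, asc_nulls_first, asc_nulls_last) all follow one simple rule: sort the non-null values, then place nulls before or after them. A plain-Python sketch of two of these orderings, with None standing in for NULL and function names used only for illustration:

```python
def desc_nulls_last(values):
    # Order non-null values descending, then place None (standing in
    # for NULL) after them, mirroring the desc_nulls_last ordering.
    non_null = sorted((v for v in values if v is not None), reverse=True)
    return non_null + [None] * sum(v is None for v in values)

def asc_nulls_first(values):
    # Place None first, then non-null values ascending,
    # mirroring the asc_nulls_first ordering.
    non_null = sorted(v for v in values if v is not None)
    return [None] * sum(v is None for v in values) + non_null
```

desc_nulls_first and asc_nulls_last are the same two sorts with the null block moved to the other end.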
March 2024, Release 20.00.00.00: Initial release.
  • A pyspark2teradataml utility function to automatically convert PySpark scripts to teradataml format.
  • Supports the following:
    • 85 DataFrame APIs with syntax similar to the PySpark DataFrame APIs.
    • 22 DataFrameColumn APIs with syntax similar to the PySpark DataFrameColumn APIs.
    • 200 functions with syntax similar to the PySpark functions.
    • 69 machine learning functions with syntax similar to the PySpark machine learning functions.