PySpark API Supportability Matrix: DataFrame APIs - Teradata Package for Python

Teradata® pyspark2teradataml User Guide

Deployment: VantageCloud, VantageCore
Edition: Enterprise, IntelliFlex, VMware
Product: Teradata Package for Python
Release Number: 20.00
Published: December 2024
Product Category: Teradata Vantage
PySpark API Name Supported Notes
alias  
count  
crossJoin If both DataFrames share common column names, teradatamlspk shows those columns with an "l" or "r" prefix. The order of the columns may also vary.
join If both DataFrames share common column names, teradatamlspk shows those columns with an "l" or "r" prefix. The order of the columns may also vary. See the join example after this matrix.
  • This does not apply to "semi", "left_semi", "leftsemi", "anti", "leftanti", and "left_anti" joins; for these join types the output matches PySpark even when both DataFrames share the same column names.
  • This does not apply when the on clause is a string or a list of strings; in that case the output matches PySpark even when both DataFrames share the same column names.
columns  
distinct  
dropDuplicates  
drop_duplicates  
dropna  
dtypes Output shows Teradata types, not PySpark types.
exceptAll  
intersect  
intersectAll  
limit  
subtract  
tail  
toPandas  
union  
unionAll  
toLocalIterator  
head  
filter  
where  
randomSplit The seed argument is ignored in teradatamlspk; you can specify it, but it is not used in processing. See the randomSplit/sample example after this matrix.
sample The seed argument is ignored in teradatamlspk; you can specify it, but it is not used in processing.
withColumnRenamed  
withColumnsRenamed  
corr PySpark includes NULLs when calculating the correlation between two columns, whereas teradatamlspk excludes them. See the corr/cov example after this matrix.
cov PySpark includes NULLs when calculating the covariance between two columns, whereas teradatamlspk excludes them.
take  
select  
sort
  • Changes made by this API are not propagated to subsequent APIs.
  • ColumnExpressions are not supported; only column names are supported. See the sort/orderBy example after this matrix.
orderBy
  • Changes made by this API are not propagated to subsequent APIs.
  • ColumnExpressions are not supported; only column names are supported.
first  
unionByName  
cache teradatamlspk returns the same DataFrame.
checkpoint teradatamlspk returns the same DataFrame.
localCheckpoint teradatamlspk returns the same DataFrame.
persist teradatamlspk returns the same DataFrame.
unpersist teradatamlspk returns the same DataFrame.
collect  
schema The nullable parameter in StructField always shows True.
toDF  
summary  
describe  
colRegex PySpark matches columns using Scala/Java regex syntax, whereas teradatamlspk uses Python regex syntax.
isEmpty  
show  
unpivot Output DataFrame column names may vary when compared to PySpark DataFrame columns.
melt Output DataFrame column names may vary when compared to PySpark DataFrame columns.
createGlobalTempView  
createTempView Vantage has no concept of a temporary view. teradatamlspk creates a view; drop the view at the end of the session.
createOrReplaceTempView Vantage has no concept of a temporary view. teradatamlspk creates a view; drop the view at the end of the session.
registerTempTable  
sortWithinPartitions Functionality is not applicable for Vantage. Hence, teradatamlspk returns a DataFrame sorted by the specified columns.
hint Functionality is not applicable for Vantage. Hence, teradatamlspk returns the same DataFrame.
coalesce Functionality is not applicable for Vantage. Hence, teradatamlspk returns the same DataFrame.
repartition Functionality is not applicable for Vantage. Hence, teradatamlspk returns the same DataFrame.
repartitionByRange Functionality is not applicable for Vantage. Hence, teradatamlspk returns the same DataFrame.
sameSemantics Functionality is not applicable for Vantage. Hence, teradatamlspk always returns False.
semanticHash Functionality is not applicable for Vantage. Hence, teradatamlspk returns 0.
inputFiles Functionality is not applicable for Vantage. Hence, teradatamlspk returns an empty list.
selectExpr Column names may vary when compared to PySpark.
drop  
isLocal Functionality is not applicable for Vantage. Hence, teradatamlspk returns False.
isStreaming Functionality is not applicable for Vantage. Hence, teradatamlspk returns False.
printSchema  
replace When a replacement is not possible, PySpark ignores it instead of raising an error. For example, if you replace values in a numeric column with a string, PySpark ignores the replacement, but teradatamlspk raises an error. See the replace example after this matrix.
crosstab Column names may vary based on the data in the DataFrame. The order of columns might also vary.
foreach  
foreachPartition  
cube PySpark performs aggregation on columns used for grouping, whereas teradatamlspk ignores the aggregation of grouping columns.
rollup PySpark performs aggregation on columns used for grouping, whereas teradatamlspk ignores the aggregation of grouping columns.
fillna All input arguments must be of the same data type, or their types must be compatible. For example, if value is an integer and subset contains a string column, PySpark ignores the replacement, but teradatamlspk raises an error. Drop incompatible columns or cast them to compatible types. See the fillna example after this matrix.
transform  
groupBy  
agg The functions count_distinct and countDistinct accept only one column as input. See the agg example after this matrix.
__getattr__  
__getitem__  
na  
sampleBy  
stat  
withColumn  
withColumns  
DataFrameNaFunctions.drop  
DataFrameNaFunctions.fill All input arguments must be of the same data type, or their types must be compatible. For example, if value is an integer and subset contains a string column, PySpark ignores the replacement, but teradatamlspk raises an error. Drop incompatible columns or cast them to compatible types. See the fillna example after this matrix.
DataFrameNaFunctions.replace  
DataFrameStatFunctions.corr  
DataFrameStatFunctions.cov  
DataFrameStatFunctions.crosstab  
DataFrameStatFunctions.sampleBy
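
join example. A minimal sketch of the column-prefixing behavior noted for crossJoin and join. It uses the standard PySpark API with made-up DataFrames and column names; teradatamlspk is assumed to accept the same calls, differing only in the output columns as described above.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    left = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
    right = spark.createDataFrame([(1, "x"), (3, "y")], ["id", "val"])

    # The on clause is a Column condition, not a string, so the
    # prefixing behavior applies.
    joined = left.join(right, on=left.id == right.id, how="inner")

    # PySpark keeps both same-named columns as-is: ['id', 'val', 'id', 'val'].
    # teradatamlspk prefixes the common columns with "l" and "r", and the
    # column order may differ.
    print(joined.columns)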
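
sort/orderBy example. A minimal sketch of the sort/orderBy restriction, using the standard PySpark API and illustrative data. Passing column names is expected to work in both libraries; column expressions work in PySpark only.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame([(3, "c"), (1, "a"), (2, "b")], ["id", "name"])

    # Portable form: pass column names.
    df.sort("id").show()

    # PySpark-only form: column expressions such as col("id").desc() are
    # not supported by teradatamlspk's sort/orderBy.
    df.sort(col("id").desc()).show()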
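
randomSplit/sample example. A sketch of the seed behavior, written against the standard PySpark API with an illustrative DataFrame. In PySpark the seed makes the result reproducible; teradatamlspk accepts the argument but does not use it, so repeated runs may return different rows.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.range(100)

    # seed is honored by PySpark but ignored by teradatamlspk.
    train, test = df.randomSplit([0.8, 0.2], seed=42)
    sampled = df.sample(fraction=0.1, seed=42)

    print(train.count(), test.count(), sampled.count())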
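
corr/cov example. A sketch of the NULL-handling difference, using the standard PySpark API and made-up data containing NULLs. Per the matrix, PySpark includes rows with NULLs in the calculation while teradatamlspk excludes them, so the two results can differ on data like this.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(1.0, 10.0), (2.0, None), (3.0, 30.0), (None, 40.0)],
        ["x", "y"],
    )

    # Results may differ between PySpark and teradatamlspk because of the
    # rows containing NULLs.
    print(df.corr("x", "y"))
    print(df.cov("x", "y"))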
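
replace example. A sketch of the replace difference, using the standard PySpark API with illustrative column names. The replacement targets a numeric column with a string value; per the matrix, PySpark silently skips it while teradatamlspk raises an error.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame([(1, "a"), (2, "b")], ["num_col", "str_col"])

    # A string replacement cannot be applied to num_col. PySpark leaves
    # num_col untouched; teradatamlspk raises an error for the same call.
    df.replace("a", "z", subset=["num_col", "str_col"]).show()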
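
fillna example. A sketch of the type-compatibility requirement for fillna and DataFrameNaFunctions.fill, using the standard PySpark API and made-up data. Keeping each subset type-consistent is the portable pattern suggested by the note; the column names here are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(1, None, "x"), (None, 2.5, None)], ["i", "f", "s"]
    )

    # fillna(0, subset=["i", "s"]) would mix an integer value with a string
    # column: PySpark skips the string column, teradatamlspk raises an error.
    # Portable approach: fill type-consistent subsets separately (or cast
    # the columns first).
    filled = df.fillna(0, subset=["i", "f"]).fillna("missing", subset=["s"])
    filled.show()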
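
agg example. A sketch of the single-column restriction for count_distinct/countDistinct in agg, using the standard PySpark API with illustrative data. The multi-column form shown in the comment is valid PySpark but is not accepted by teradatamlspk.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import countDistinct

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 1)], ["grp", "val"])

    # Portable form: one column per countDistinct call.
    df.agg(countDistinct("grp"), countDistinct("val")).show()

    # PySpark-only form (multiple columns in a single call):
    # df.agg(countDistinct("grp", "val")).show()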