RegexTokenizer | teradatamlspk | pyspark2teradataml - Teradata Package for Python

Teradata® pyspark2teradataml User Guide

Deployment
VantageCloud
VantageCore
Edition
Enterprise
IntelliFlex
VMware
Product
Teradata Package for Python
Release Number
20.00
Published
December 2024
Product Category
Teradata Vantage

Assume a DataFrame as shown here.

>>> df.show()
+-------+
|textcol|
+-------+
|A\tB\tc|
|  x y Z|
| m N\nO|
+-------+

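For reference, a DataFrame with this content can be created in PySpark roughly as follows (a minimal sketch, assuming an active SparkSession named spark; the column name textcol matches the example):

>>> df = spark.createDataFrame(
...     [('A\tB\tc',), ('x y Z',), ('m N\nO',)],   # single string column containing tab, space, and newline separators
...     ['textcol'])
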
The following examples show the difference between the PySpark RegexTokenizer function and the teradatamlspk RegexTokenizer function.

PySpark

>>> from pyspark.ml.feature import RegexTokenizer
>>> PyRT = RegexTokenizer(inputCol = 'textcol', outputCol = 'text_out')
>>> PyRT.setPattern('\t')
>>> PyRT.transform(df).collect()
[Row(textcol='A\tB\tc', text_out=['a', 'b', 'c']),
 Row(textcol='x y Z', text_out=['x y z']),
 Row(textcol='m N\nO', text_out=['m n\no'])]

teradatamlspk

>>> from teradatamlspk.ml.feature import RegexTokenizer
>>> TdRT = RegexTokenizer(inputCol = 'textcol', outputCol = 'text_out')
>>> TdRT.setPattern('\t')
>>> TdRT.transform(df).collect()
[Row(textcol='A\tB\tc', text_out='a'),
 Row(textcol='A\tB\tc', text_out='b'),
 Row(textcol='A\tB\tc', text_out='c'),
 Row(textcol='x y Z', text_out='x y z'),
 Row(textcol='m N\nO', text_out='m n\no')]

The differences

  • PySpark returns the tokens as an array of strings in a single output row, but teradatamlspk does not return an array. The output column contains the tokenized words in multiple rows, one token per row (see the sketch after this list).
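
If the PySpark array output needs to be flattened into the same one-token-per-row shape that teradatamlspk produces, the array column can be exploded. A minimal sketch using pyspark.sql.functions.explode (row order from collect() is not guaranteed):

>>> from pyspark.sql.functions import explode
>>> # Explode the token array so each token occupies its own row.
>>> PyRT.transform(df).select('textcol', explode('text_out').alias('text_out')).collect()
[Row(textcol='A\tB\tc', text_out='a'),
 Row(textcol='A\tB\tc', text_out='b'),
 Row(textcol='A\tB\tc', text_out='c'),
 Row(textcol='x y Z', text_out='x y z'),
 Row(textcol='m N\nO', text_out='m n\no')]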