Assume a DataFrame as shown here.
>>> df.show()
+-------+
|textcol|
+-------+
|A\tB\tc|
| x y Z|
| m N\nO|
+-------+
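Such a DataFrame can be created as in the following minimal sketch, which assumes an active SparkSession named spark (not part of the original example):
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> # Each tuple is one row; '\t' and '\n' are literal tab and newline characters.
>>> df = spark.createDataFrame([('A\tB\tc',), ('x y Z',), ('m N\nO',)], ['textcol'])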
The following examples show the difference between the PySpark RegexTokenizer and the teradatamlspk RegexTokenizer.
PySpark
>>> from pyspark.ml.feature import RegexTokenizer
>>> PyRT = RegexTokenizer(inputCol = 'textcol', outputCol = 'text_out')
>>> PyRT.setPattern('\t')
>>> PyRT.transform(df).collect()
[Row(textcol='A\tB\tc', text_out=['a', 'b', 'c']), Row(textcol='x y Z', text_out=['x y z']), Row(textcol='m N\nO', text_out=['m n\no'])]
teradatamlspk
>>> from teradatamlspk.ml.feature import RegexTokenizer
>>> TdRT = RegexTokenizer(inputCol = 'textcol', outputCol = 'text_out')
>>> TdRT.setPattern('\t')
>>> TdRT.transform(df).collect()
[Row(textcol='A\tB\tc', text_out='a'), Row(textcol='A\tB\tc', text_out='b'), Row(textcol='A\tB\tc', text_out='c'), Row(textcol='x y Z', text_out='x y z'), Row(textcol='m N\nO', text_out='m n\no')]
The differences
- PySpark returns the tokens as an array in a single output column, but teradatamlspk does not return an array. The output column contains the tokenized words spread across multiple rows, one row per token. The per-token rows can be regrouped on the client side, as shown in the sketch below.
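If the PySpark-style shape is needed, one option is to regroup the collected rows in plain Python. This is a minimal sketch that reuses the TdRT and df objects from the example above and assumes the returned Row objects support attribute access as shown in the collect() output; the printed result is derived from the rows listed above:
>>> from collections import defaultdict
>>> grouped = defaultdict(list)  # maps each input string to its list of tokens
>>> for row in TdRT.transform(df).collect():
...     grouped[row.textcol].append(row.text_out)
...
>>> dict(grouped)
{'A\tB\tc': ['a', 'b', 'c'], 'x y Z': ['x y z'], 'm N\nO': ['m n\no']}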