TD_ColumnTransformer Input Table: titanic_train
PassengerID | Pclass | Name | Gender | Age | SibSp | Parch | Fare | Cabin | Embarked
149 | 2 | Navratil, Michael | M | 36 | 0 | 2 | 26.0 | B21 | S
152 | 1 | Pearson, Mrs. Thomas | F | Null | 1 | 0 | 66.6 | C2 | S
581 | 2 | Christian, Miss Juliana | F | 25 | 1 | 1 | 30.0 | Null | S
663 | 1 | Collier, Dr. Edwin | M | 47 | 0 | 0 | 25.70 | A23 | S
704 | 3 | Gavin, Mr. Herbert | M | 25 | 0 | 0 | 7.74 | Null | Q
Create getCabin Input table
-- Step 1: extract the word that precedes a period in Name (for example, Mrs or Dr) into NTitle.
DROP TABLE getSubtitles;
CREATE MULTISET TABLE getSubtitles AS (
    SELECT * FROM Unpack (
        ON titanic_train
        USING
        TargetColumn ('Name')
        OutputColumns ('NTitle')
        OutputDatatypes ('Varchar')
        Delimiter ('$')
        Regex ('([A-Za-z]+)\.')
    ) AS dt
) WITH DATA;

-- Step 2: overwrite cabin with its first character (the deck letter).
DROP TABLE getCabin;
CREATE MULTISET TABLE getCabin AS (
    SELECT * FROM TD_strApply (
        ON getSubtitles AS inputtable
        USING
        TargetColumns ('cabin')
        StringOperation ('getNchars')
        StringLength (1)
        Accumulate ('[:]','-cabin')
        InPlace ('True')
    ) AS dt
) WITH DATA;
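As a rough illustration of what the two preprocessing steps produce, the following Python sketch applies the same regular expression and the same one-character truncation to a few sample rows from titanic_train. It mimics the effect of the Unpack and TD_strApply calls, not their implementation. The helper names extract_title and first_char are hypothetical, and two behaviors are assumptions: that the pattern is searched anywhere in the string, and that nulls and non-matching names come back as None.

```python
import re

# Same pattern as the Regex argument of the Unpack call above.
TITLE_PATTERN = re.compile(r"([A-Za-z]+)\.")


def extract_title(name):
    """Return the word preceding the first period in name, or None (assumed)."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None


def first_char(cabin):
    """Keep only the first character of cabin; nulls pass through as None (assumed)."""
    return cabin[:1] if cabin else None


# (Name, Cabin) pairs taken from the titanic_train sample rows.
rows = [
    ("Pearson, Mrs. Thomas", "C2"),
    ("Christian, Miss Juliana", None),
    ("Collier, Dr. Edwin", "A23"),
    ("Gavin, Mr. Herbert", None),
]

for name, cabin in rows:
    print(name, "->", extract_title(name), first_char(cabin))
```

Names without a period, such as "Christian, Miss Juliana", yield no title in this sketch; how the database function reports that case is not shown in the sample output.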
TD_ColumnTransformer SQL Call
SELECT * FROM TD_ColumnTransformer (
    ON getCabin AS inputtable
    ON imputeFit AS SimpleImputeFitTable DIMENSION
    ON NonLinearCombineFit AS NonLinearCombineFitTable DIMENSION
    ON ordinalFit AS OrdinalEncodingFitTable DIMENSION
    ON onehotfittable AS OneHotEncodingFitTable DIMENSION
    ON ScaleFit AS ScaleFitTable DIMENSION
) AS dt ORDER BY 1,2,3,4,5,6,7;
TD_ColumnTransformer Output
NTitle passenger survived pclass gender age sibsp parch ticket fare embarked cabin FamilySize cabin_A cabin_B cabin_C cabin_other
----------- ----------- ----------- ----------- ----------- ----------- ----------- ----------- -------------------- ---------------------- ----------- ----- ---------------------- ----------- ----------- ----------- -----------
-1 8 0 3 1 2 3 1 349909 4.11356604308324E-002 2 ? 5.00000000000000E 000 0 0 0 1
-1 17 0 3 1 2 4 1 382652 5.68482139999047E-002 1 ? 6.00000000000000E 000 0 0 0 1
.... ...... .... .... .... .... .... ...... .... .... .... .... .... .... .... ....
.... ...... .... .... .... .... .... ...... .... .... .... .... .... .... .... ....
5 888 1 1 2 19 0 0 112053 5.85561002574126E-002 2 B 1.00000000000000E 000 0 1 0 0
5 889 0 3 2 28 1 2 W./C. 6607 4.57713517012109E-002 2 ? 4.00000000000000E 000 0 0 0 1
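The derived columns in the displayed rows can be checked by hand. In each visible row, FamilySize equals sibsp + parch + 1, and exactly one of the cabin_A, cabin_B, cabin_C, and cabin_other indicators is set. The Python sketch below verifies this. The formula is inferred from the displayed values only, not taken from the NonLinearCombineFit definition (which this section does not show), and the function names are hypothetical.

```python
# (sibsp, parch, cabin, FamilySize, cabin_A, cabin_B, cabin_C, cabin_other)
# copied from the four visible rows of the output above.
displayed_rows = [
    (3, 1, "?", 5.0, 0, 0, 0, 1),
    (4, 1, "?", 6.0, 0, 0, 0, 1),
    (0, 0, "B", 1.0, 0, 1, 0, 0),
    (1, 2, "?", 4.0, 0, 0, 0, 1),
]


def family_size(sibsp, parch):
    """Inferred formula: sibsp + parch + 1 (the passenger)."""
    return float(sibsp + parch + 1)


def one_hot_cabin(cabin, categories=("A", "B", "C")):
    """Return indicators for A, B, C, plus a final catch-all 'other' slot."""
    flags = [1 if cabin == c else 0 for c in categories]
    return flags + [0 if any(flags) else 1]


for sibsp, parch, cabin, fam, a, b, c, other in displayed_rows:
    assert family_size(sibsp, parch) == fam
    assert one_hot_cabin(cabin) == [a, b, c, other]
print("all displayed rows are consistent")
```

The "?" in the cabin column is how the listing displays a null, which is why those rows fall into cabin_other.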
Processing time when the functions run serially compared with a single TD_ColumnTransformer call, by data set size:
Data Set | Serial Processing in Seconds | TD_ColumnTransformer Processing in Seconds
10M | 89 | 29
20M | 167 | 49
30M | 332 | 98
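To put the timings in relative terms, the short Python sketch below divides the serial time by the TD_ColumnTransformer time for each data set size. The numbers are copied from the table; the function name speedup is hypothetical.

```python
# (data set size, serial seconds, TD_ColumnTransformer seconds) from the table above.
timings = [
    ("10M", 89, 29),
    ("20M", 167, 49),
    ("30M", 332, 98),
]


def speedup(serial_seconds, transformer_seconds):
    """Ratio of serial time to TD_ColumnTransformer time."""
    return serial_seconds / transformer_seconds


for size, serial, combined in timings:
    print(f"{size}: {speedup(serial, combined):.2f}x")
```

For these three measurements the ratio stays between roughly 3.1 and 3.4, so the advantage holds as the data set grows.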