Use Pipeline through teradataml OpenSourceML Functions | teradatamlspk - Use Pipeline through teradataml Open-Source Machine Learning Functions - Teradata Package for Python

Teradata® pyspark2teradataml User Guide

Deployment
VantageCloud
VantageCore
Edition
Enterprise
IntelliFlex
VMware
Product
Teradata Package for Python
Release Number
20.00
Published
March 2024
Language
English (United States)
Last Update
2024-04-11
dita:mapPath
oeg1710443196055.ditamap
dita:ditavalPath
ayr1485454803741.ditaval
dita:id
oeg1710443196055
Product Category
Teradata Vantage
  1. Create DataFrame.
    Create teradatamlspk DataFrame. Assume the similar data is available in Vantage in a table "housing_data".
    >>> housing_df = teradata_session.createDataFrame("house_data")
    >>> housing_df.show()
    +-------------------+------------------+------+--------+--------+------+--------+------------------+--------+
    |          longitude|          latitude|medage|totrooms|totbdrms|   pop|houshlds|            medinc|   medhv|
    +-------------------+------------------+------+--------+--------+------+--------+------------------+--------+
    |-121.29000091552734|  38.0099983215332|  10.0|    69.0|    16.0|  50.0|    20.0|              3.75|120800.0|
    | -121.2699966430664| 38.02000045776367|  32.0|   342.0|    58.0| 138.0|    52.0| 2.982100009918213|155000.0|
    |-121.30000305175781| 38.04999923706055|  52.0|   122.0|    26.0|  62.0|    25.0| 1.149999976158142|112500.0|
    |            -121.25| 38.04999923706055|  25.0|  1967.0|   362.0|1035.0|   361.0| 3.573499917984009|106800.0|
    |            -121.25|  38.0099983215332|  16.0|  2397.0|   501.0|1053.0|   557.0| 2.699399948120117|112500.0|
    | -121.2300033569336|38.040000915527344|  32.0|  1829.0|   262.0| 677.0|   243.0| 6.180500030517578|247900.0|
    |-121.22000122070312|38.040000915527344|  42.0|   343.0|    50.0| 116.0|    49.0| 5.537600040435791|212500.0|
    |            -121.25|38.029998779296875|  29.0|  2465.0|   327.0| 859.0|   315.0|6.6605000495910645|220700.0|
    |-121.22000122070312|              38.0|  35.0|  1841.0|   300.0| 783.0|   285.0| 2.816699981689453|162100.0|
    |-121.23999786376953|  38.0099983215332|  22.0|  1526.0|   299.0| 790.0|   300.0|2.4342000484466553|125000.0|
    |            -121.25|              38.0|  21.0|   446.0|    73.0| 182.0|    57.0|2.8958001136779785|135000.0|
    |-121.23999786376953|              38.0|  25.0|  1471.0|   300.0| 721.0|   304.0|2.4688000679016113|126800.0|
    | -121.2300033569336|  37.9900016784668|  38.0|   523.0|    80.0| 226.0|    72.0| 5.569300174713135|153100.0|
    | -121.2300033569336| 37.97999954223633|  27.0|   849.0|   137.0| 373.0|   131.0| 5.036200046539307|181300.0|
    |-121.19999694824219|37.970001220703125|  39.0|   440.0|    83.0| 270.0|    97.0| 6.058199882507324|157700.0|
    | -122.2300033569336|37.880001068115234|  41.0|   880.0|   129.0| 322.0|   126.0| 8.325200080871582|452600.0|
    |-122.22000122070312| 37.86000061035156|  21.0|  7099.0|  1106.0|2401.0|  1138.0| 8.301400184631348|358500.0|
    |-122.23999786376953|37.849998474121094|  52.0|  1467.0|   190.0| 496.0|   177.0| 7.257400035858154|352100.0|
    |            -122.25|37.849998474121094|  52.0|  1274.0|   235.0| 558.0|   219.0| 5.643099784851074|341300.0|
    |            -122.25|37.849998474121094|  52.0|  1627.0|   280.0| 565.0|   259.0|3.8461999893188477|342200.0|
    +-------------------+------------------+------+--------+--------+------+--------+------------------+--------+
    only showing top 20 rows
    
  2. Scale down the column 'medhv'.
    >>> housing_df = housing_df.withColumn("medhv", housing_df.medhv/1000)
    >>> housing_df.show()
    +-------------------+------------------+------+--------+--------+------+--------+------------------+-----+
    |          longitude|          latitude|medage|totrooms|totbdrms|   pop|houshlds|            medinc|medhv|
    +-------------------+------------------+------+--------+--------+------+--------+------------------+-----+
    |-121.29000091552734|  38.0099983215332|  10.0|    69.0|    16.0|  50.0|    20.0|              3.75|120.8|
    | -121.2699966430664| 38.02000045776367|  32.0|   342.0|    58.0| 138.0|    52.0| 2.982100009918213|155.0|
    |-121.30000305175781| 38.04999923706055|  52.0|   122.0|    26.0|  62.0|    25.0| 1.149999976158142|112.5|
    |            -121.25| 38.04999923706055|  25.0|  1967.0|   362.0|1035.0|   361.0| 3.573499917984009|106.8|
    |            -121.25|  38.0099983215332|  16.0|  2397.0|   501.0|1053.0|   557.0| 2.699399948120117|112.5|
    | -121.2300033569336|38.040000915527344|  32.0|  1829.0|   262.0| 677.0|   243.0| 6.180500030517578|247.9|
    |-121.22000122070312|38.040000915527344|  42.0|   343.0|    50.0| 116.0|    49.0| 5.537600040435791|212.5|
    |            -121.25|38.029998779296875|  29.0|  2465.0|   327.0| 859.0|   315.0|6.6605000495910645|220.7|
    |-121.22000122070312|              38.0|  35.0|  1841.0|   300.0| 783.0|   285.0| 2.816699981689453|162.1|
    |-121.23999786376953|  38.0099983215332|  22.0|  1526.0|   299.0| 790.0|   300.0|2.4342000484466553|125.0|
    |            -121.25|              38.0|  21.0|   446.0|    73.0| 182.0|    57.0|2.8958001136779785|135.0|
    |-121.23999786376953|              38.0|  25.0|  1471.0|   300.0| 721.0|   304.0|2.4688000679016113|126.8|
    | -121.2300033569336|  37.9900016784668|  38.0|   523.0|    80.0| 226.0|    72.0| 5.569300174713135|153.1|
    | -121.2300033569336| 37.97999954223633|  27.0|   849.0|   137.0| 373.0|   131.0| 5.036200046539307|181.3|
    |-121.19999694824219|37.970001220703125|  39.0|   440.0|    83.0| 270.0|    97.0| 6.058199882507324|157.7|
    | -122.2300033569336|37.880001068115234|  41.0|   880.0|   129.0| 322.0|   126.0| 8.325200080871582|452.6|
    |-122.22000122070312| 37.86000061035156|  21.0|  7099.0|  1106.0|2401.0|  1138.0| 8.301400184631348|358.5|
    |-122.23999786376953|37.849998474121094|  52.0|  1467.0|   190.0| 496.0|   177.0| 7.257400035858154|352.1|
    |            -122.25|37.849998474121094|  52.0|  1274.0|   235.0| 558.0|   219.0| 5.643099784851074|341.3|
    |            -122.25|37.849998474121094|  52.0|  1627.0|   280.0| 565.0|   259.0|3.8461999893188477|342.2|
    +-------------------+------------------+------+--------+--------+------+--------+------------------+-----+
    only showing top 20 rows
    
  3. Prepare the stages of Pipeline.
    teradatamlspk machine learning functions accept multiple columns as input.

    Thus, you do not need to convert feature columns to vectors. And, Pipeline has only two stages.

    teradataml open-source machine learning function is used for pipeline. Import the td_sklearn function from teradataml.
    >>> from teradataml import td_sklearn
    >>> ssc = td_sklearn.StandardScaler(with_mean=False, with_std=True)
    >>> lr = td_sklearn.LinearRegression()
    PySpark LinearRegression is initiated with arguments. However, none of those arguments are supported in scikit-learn. Thus, those arguments are not passed here.
  4. Initiate Pipeline with the stages in Step 3.
    >>> steps = [('scaler', ssc), ('lr', lr)]
    >>> pipe = td_sklearn.Pipeline(steps = steps)
    >>> pipe
    Pipeline(steps=[('scaler', StandardScaler(with_mean=False)),
                    ('lr', LinearRegression())])
    
  5. Prepare test and train data.

    Convert the teradatamlspk DataFrame to teradataml DataFrame since teradataml Pipeline object is used.

    >>> train_data, test_data = housing_df.randomSplit([.8,.2])
    >>> train_data = train_data.toTeradataml()
    >>> test_data = test_data.toTeradataml()
  6. Fit the Pipeline model and predict the values.
    • Pipeline.fit accepts features and labels in X and y arguments.
    • Pipeline.predict returns teradataml DataFrame. So, convert it to teradatamlspk DataFrame.
    >>> X=train_data.select(['longitude', 'latitude', 'medage', 'totrooms', 'totbdrms', 'pop', 'houshlds', 'medinc'])
    >>> y=train_data.select('medhv')
    
    >>> test_X=test_data.select(['longitude', 'latitude', 'medage', 'totrooms', 'totbdrms', 'pop', 'houshlds', 'medinc'])
    
    >>> pipe.fit(X, y)
    Pipeline(steps=[('scaler', StandardScaler(with_mean=False)),
                    ('lr', LinearRegression())])
    >>> predicted_df = pipe.predict(test_X)

    Convert teradataml DataFrame to teradatamlspk DataFrame.

    >>> from teradatamlspk.sql import DataFrame
    >>> DataFrame(predicted_df).show()
    +-----------------+----------------+------+--------+--------+-------+--------+----------------+------------------+
    |        longitude|        latitude|medage|totrooms|totbdrms|    pop|houshlds|          medinc|pipeline_predict_1|
    +-----------------+----------------+------+--------+--------+-------+--------+----------------+------------------+
    |-118.089996337891| 33.939998626709|  33.0|  1976.0|   404.0| 1379.0|   395.0|3.85419988632202|206.02160441530123|
    |-118.089996337891|           33.75|  32.0|  6239.0|   974.0| 2615.0|   950.0|6.61880016326904| 334.5972707090791|
    |-118.089996337891|33.9099998474121|  14.0|  2369.0|   604.0| 1546.0|   464.0|3.79690003395081| 199.3325153559972|
    |-118.089996337891|33.9000015258789|  37.0|  1147.0|   258.0|  742.0|   242.0|4.04610013961792|  228.622439734017|
    |-118.089996337891|33.7999992370605|  36.0|  1724.0|   322.0|  838.0|   328.0|4.48309993743896| 253.3557367881317|
    |-118.089996337891|34.6800003051758|   4.0| 23386.0|  4171.0|10493.0|  3671.0|4.02110004425049|203.56967265075446|
    |-114.470001220703|34.4000015258789|  19.0|  7650.0|  1901.0| 1129.0|   463.0|1.82000005245209| 67.38934713415028|
    |-122.540000915527|37.9300003051758|  43.0|  2998.0|   470.0|  970.0|   430.0|5.53849983215332| 323.9588973585096|
    |-122.540000915527|37.7599983215332|  45.0|  1592.0|   325.0|  920.0|   322.0|3.96000003814697|260.24239345547403|
    |-117.349998474121|33.2000007629395|  32.0|  1251.0|   220.0|  700.0|   232.0|3.98749995231628|215.95321673868193|
    |-117.349998474121|33.2299995422363|   4.0|  1837.0|   287.0|  934.0|   277.0|3.89580011367798| 176.2611374003559|
    |-117.349998474121|33.9500007629395|  28.0|  1650.0|   210.0|  557.0|   211.0|7.66319990158081|326.08117186226536|
    |-117.349998474121|34.0900001525879|  14.0|  5983.0|  1224.0| 3255.0|  1150.0|2.59019994735718|120.56159630979982|
    |-117.349998474121|34.2000007629395|   5.0|  9269.0|  1605.0| 4916.0|  1519.0|4.43669986724854| 146.8090945965846|
    |-117.349998474121|34.0099983215332|  23.0|  3707.0|   769.0| 1938.0|   658.0|2.72499990463257|134.06109946141441|
    |-117.349998474121| 33.689998626709|  11.0|  1229.0|   236.0|  581.0|   190.0|3.10199999809265|141.39928849102444|
    |-118.089996337891|33.9500007629395|  32.0|  1083.0|   206.0|  737.0|   218.0|3.55830001831055|195.53150806768326|
    |-118.089996337891|33.9700012207031|  35.0|  2664.0|   541.0| 2033.0|   491.0|3.73259997367859|188.61314692021506|
    |-118.089996337891|33.8899993896484|  27.0|  3399.0|   882.0| 2465.0|   811.0|3.09899997711182|189.91245231345965|
    |-118.089996337891|33.9199981689453|  36.0|   847.0|   185.0|  713.0|   194.0|4.85419988632202| 251.2192014151351|
    +-----------------+----------------+------+--------+--------+-------+--------+----------------+------------------+
    only showing top 20 rows