- Create DataFrame.
Create a teradatamlspk DataFrame. Assume that similar data is available in Vantage in a table named "house_data".
>>> housing_df = teradata_session.createDataFrame("house_data")
>>> housing_df.show()
+-------------------+------------------+------+--------+--------+------+--------+------------------+--------+
|          longitude|          latitude|medage|totrooms|totbdrms|   pop|houshlds|            medinc|   medhv|
+-------------------+------------------+------+--------+--------+------+--------+------------------+--------+
|-121.29000091552734|  38.0099983215332|  10.0|    69.0|    16.0|  50.0|    20.0|              3.75|120800.0|
| -121.2699966430664| 38.02000045776367|  32.0|   342.0|    58.0| 138.0|    52.0| 2.982100009918213|155000.0|
|-121.30000305175781| 38.04999923706055|  52.0|   122.0|    26.0|  62.0|    25.0| 1.149999976158142|112500.0|
|            -121.25| 38.04999923706055|  25.0|  1967.0|   362.0|1035.0|   361.0| 3.573499917984009|106800.0|
|            -121.25|  38.0099983215332|  16.0|  2397.0|   501.0|1053.0|   557.0| 2.699399948120117|112500.0|
| -121.2300033569336|38.040000915527344|  32.0|  1829.0|   262.0| 677.0|   243.0| 6.180500030517578|247900.0|
|-121.22000122070312|38.040000915527344|  42.0|   343.0|    50.0| 116.0|    49.0| 5.537600040435791|212500.0|
|            -121.25|38.029998779296875|  29.0|  2465.0|   327.0| 859.0|   315.0|6.6605000495910645|220700.0|
|-121.22000122070312|              38.0|  35.0|  1841.0|   300.0| 783.0|   285.0| 2.816699981689453|162100.0|
|-121.23999786376953|  38.0099983215332|  22.0|  1526.0|   299.0| 790.0|   300.0|2.4342000484466553|125000.0|
|            -121.25|              38.0|  21.0|   446.0|    73.0| 182.0|    57.0|2.8958001136779785|135000.0|
|-121.23999786376953|              38.0|  25.0|  1471.0|   300.0| 721.0|   304.0|2.4688000679016113|126800.0|
| -121.2300033569336|  37.9900016784668|  38.0|   523.0|    80.0| 226.0|    72.0| 5.569300174713135|153100.0|
| -121.2300033569336| 37.97999954223633|  27.0|   849.0|   137.0| 373.0|   131.0| 5.036200046539307|181300.0|
|-121.19999694824219|37.970001220703125|  39.0|   440.0|    83.0| 270.0|    97.0| 6.058199882507324|157700.0|
| -122.2300033569336|37.880001068115234|  41.0|   880.0|   129.0| 322.0|   126.0| 8.325200080871582|452600.0|
|-122.22000122070312| 37.86000061035156|  21.0|  7099.0|  1106.0|2401.0|  1138.0| 8.301400184631348|358500.0|
|-122.23999786376953|37.849998474121094|  52.0|  1467.0|   190.0| 496.0|   177.0| 7.257400035858154|352100.0|
|            -122.25|37.849998474121094|  52.0|  1274.0|   235.0| 558.0|   219.0| 5.643099784851074|341300.0|
|            -122.25|37.849998474121094|  52.0|  1627.0|   280.0| 565.0|   259.0|3.8461999893188477|342200.0|
+-------------------+------------------+------+--------+--------+------+--------+------------------+--------+
only showing top 20 rows
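If the "house_data" table does not exist in Vantage yet, one way to stage it is to load a local copy of the data with pandas and write it to a table with teradataml's copy_to_sql function. This is only a minimal sketch: the file name "housing.csv" is a hypothetical path, and it assumes a teradataml connection to Vantage is already established.
>>> import pandas as pd
>>> from teradataml import copy_to_sql
>>> local_df = pd.read_csv("housing.csv")
>>> copy_to_sql(df=local_df, table_name="house_data", if_exists="replace")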
- Scale down the column 'medhv'.
>>> housing_df = housing_df.withColumn("medhv", housing_df.medhv/1000)
>>> housing_df.show()
+-------------------+------------------+------+--------+--------+------+--------+------------------+-----+
|          longitude|          latitude|medage|totrooms|totbdrms|   pop|houshlds|            medinc|medhv|
+-------------------+------------------+------+--------+--------+------+--------+------------------+-----+
|-121.29000091552734|  38.0099983215332|  10.0|    69.0|    16.0|  50.0|    20.0|              3.75|120.8|
| -121.2699966430664| 38.02000045776367|  32.0|   342.0|    58.0| 138.0|    52.0| 2.982100009918213|155.0|
|-121.30000305175781| 38.04999923706055|  52.0|   122.0|    26.0|  62.0|    25.0| 1.149999976158142|112.5|
|            -121.25| 38.04999923706055|  25.0|  1967.0|   362.0|1035.0|   361.0| 3.573499917984009|106.8|
|            -121.25|  38.0099983215332|  16.0|  2397.0|   501.0|1053.0|   557.0| 2.699399948120117|112.5|
| -121.2300033569336|38.040000915527344|  32.0|  1829.0|   262.0| 677.0|   243.0| 6.180500030517578|247.9|
|-121.22000122070312|38.040000915527344|  42.0|   343.0|    50.0| 116.0|    49.0| 5.537600040435791|212.5|
|            -121.25|38.029998779296875|  29.0|  2465.0|   327.0| 859.0|   315.0|6.6605000495910645|220.7|
|-121.22000122070312|              38.0|  35.0|  1841.0|   300.0| 783.0|   285.0| 2.816699981689453|162.1|
|-121.23999786376953|  38.0099983215332|  22.0|  1526.0|   299.0| 790.0|   300.0|2.4342000484466553|125.0|
|            -121.25|              38.0|  21.0|   446.0|    73.0| 182.0|    57.0|2.8958001136779785|135.0|
|-121.23999786376953|              38.0|  25.0|  1471.0|   300.0| 721.0|   304.0|2.4688000679016113|126.8|
| -121.2300033569336|  37.9900016784668|  38.0|   523.0|    80.0| 226.0|    72.0| 5.569300174713135|153.1|
| -121.2300033569336| 37.97999954223633|  27.0|   849.0|   137.0| 373.0|   131.0| 5.036200046539307|181.3|
|-121.19999694824219|37.970001220703125|  39.0|   440.0|    83.0| 270.0|    97.0| 6.058199882507324|157.7|
| -122.2300033569336|37.880001068115234|  41.0|   880.0|   129.0| 322.0|   126.0| 8.325200080871582|452.6|
|-122.22000122070312| 37.86000061035156|  21.0|  7099.0|  1106.0|2401.0|  1138.0| 8.301400184631348|358.5|
|-122.23999786376953|37.849998474121094|  52.0|  1467.0|   190.0| 496.0|   177.0| 7.257400035858154|352.1|
|            -122.25|37.849998474121094|  52.0|  1274.0|   235.0| 558.0|   219.0| 5.643099784851074|341.3|
|            -122.25|37.849998474121094|  52.0|  1627.0|   280.0| 565.0|   259.0|3.8461999893188477|342.2|
+-------------------+------------------+------+--------+--------+------+--------+------------------+-----+
only showing top 20 rows
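Equivalently, the scaling can be written with a column expression instead of attribute-style access. This sketch assumes teradatamlspk exposes col in teradatamlspk.sql.functions, mirroring pyspark.sql.functions; do not run it in addition to the step above, since that would scale the column twice.
>>> from teradatamlspk.sql.functions import col
>>> housing_df = housing_df.withColumn("medhv", col("medhv") / 1000)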
- Prepare the stages of the Pipeline.
teradatamlspk machine learning functions accept multiple columns as input, so you do not need to assemble the feature columns into a vector, and the Pipeline has only two stages (a PySpark comparison is sketched after the Pipeline is created below).
The teradataml open-source machine learning functions are used for the pipeline. Import td_sklearn from teradataml.
>>> from teradataml import td_sklearn
>>> ssc = td_sklearn.StandardScaler(with_mean=False, with_std=True)
>>> lr = td_sklearn.LinearRegression()
The PySpark LinearRegression is usually initialized with arguments; none of those arguments are supported by scikit-learn, so they are not passed here.
- Initiate the Pipeline with the stages prepared in Step 3.
>>> steps = [('scaler', ssc), ('lr', lr)]
>>> pipe = td_sklearn.Pipeline(steps = steps)
>>> pipe
Pipeline(steps=[('scaler', StandardScaler(with_mean=False)), ('lr', LinearRegression())])
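For comparison, the same workflow in PySpark typically needs a third stage, because the feature columns must first be assembled into a single vector column before scaling and regression. The sketch below is illustrative only; the column and stage names ("features", "scaled_features") are not part of this example.
>>> from pyspark.ml import Pipeline
>>> from pyspark.ml.feature import VectorAssembler, StandardScaler
>>> from pyspark.ml.regression import LinearRegression
>>> assembler = VectorAssembler(inputCols=['longitude', 'latitude', 'medage', 'totrooms', 'totbdrms', 'pop', 'houshlds', 'medinc'], outputCol="features")
>>> scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withMean=False, withStd=True)
>>> lr = LinearRegression(featuresCol="scaled_features", labelCol="medhv")
>>> pyspark_pipe = Pipeline(stages=[assembler, scaler, lr])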
- Prepare the train and test data.
Convert the teradatamlspk DataFrames to teradataml DataFrames, since a teradataml Pipeline object is used.
>>> train_data, test_data = housing_df.randomSplit([.8,.2])
>>> train_data = train_data.toTeradataml()
>>> test_data = test_data.toTeradataml()
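If a reproducible split is needed, the randomSplit call above can instead be given a seed, and the sizes checked before the conversion. This is a sketch that assumes teradatamlspk's randomSplit accepts a seed argument and the DataFrame exposes count, mirroring PySpark; the seed value 42 is arbitrary.
>>> train_data, test_data = housing_df.randomSplit([.8, .2], seed=42)
>>> train_data.count(), test_data.count()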
- Fit the Pipeline model and predict the values.
- Pipeline.fit accepts the features and labels through the X and y arguments.
- Pipeline.predict returns a teradataml DataFrame, so convert it back to a teradatamlspk DataFrame.
>>> X = train_data.select(['longitude', 'latitude', 'medage', 'totrooms', 'totbdrms', 'pop', 'houshlds', 'medinc'])
>>> y = train_data.select('medhv')
>>> test_X = test_data.select(['longitude', 'latitude', 'medage', 'totrooms', 'totbdrms', 'pop', 'houshlds', 'medinc'])
>>> pipe.fit(X, y)
Pipeline(steps=[('scaler', StandardScaler(with_mean=False)), ('lr', LinearRegression())])
>>> predicted_df = pipe.predict(test_X)
Convert the teradataml DataFrame to a teradatamlspk DataFrame.
>>> from teradatamlspk.sql import DataFrame
>>> DataFrame(predicted_df).show()
+-----------------+----------------+------+--------+--------+-------+--------+----------------+------------------+
|        longitude|        latitude|medage|totrooms|totbdrms|    pop|houshlds|          medinc|pipeline_predict_1|
+-----------------+----------------+------+--------+--------+-------+--------+----------------+------------------+
|-118.089996337891| 33.939998626709|  33.0|  1976.0|   404.0| 1379.0|   395.0|3.85419988632202|206.02160441530123|
|-118.089996337891|           33.75|  32.0|  6239.0|   974.0| 2615.0|   950.0|6.61880016326904| 334.5972707090791|
|-118.089996337891|33.9099998474121|  14.0|  2369.0|   604.0| 1546.0|   464.0|3.79690003395081| 199.3325153559972|
|-118.089996337891|33.9000015258789|  37.0|  1147.0|   258.0|  742.0|   242.0|4.04610013961792|  228.622439734017|
|-118.089996337891|33.7999992370605|  36.0|  1724.0|   322.0|  838.0|   328.0|4.48309993743896| 253.3557367881317|
|-118.089996337891|34.6800003051758|   4.0| 23386.0|  4171.0|10493.0|  3671.0|4.02110004425049|203.56967265075446|
|-114.470001220703|34.4000015258789|  19.0|  7650.0|  1901.0| 1129.0|   463.0|1.82000005245209| 67.38934713415028|
|-122.540000915527|37.9300003051758|  43.0|  2998.0|   470.0|  970.0|   430.0|5.53849983215332| 323.9588973585096|
|-122.540000915527|37.7599983215332|  45.0|  1592.0|   325.0|  920.0|   322.0|3.96000003814697|260.24239345547403|
|-117.349998474121|33.2000007629395|  32.0|  1251.0|   220.0|  700.0|   232.0|3.98749995231628|215.95321673868193|
|-117.349998474121|33.2299995422363|   4.0|  1837.0|   287.0|  934.0|   277.0|3.89580011367798| 176.2611374003559|
|-117.349998474121|33.9500007629395|  28.0|  1650.0|   210.0|  557.0|   211.0|7.66319990158081|326.08117186226536|
|-117.349998474121|34.0900001525879|  14.0|  5983.0|  1224.0| 3255.0|  1150.0|2.59019994735718|120.56159630979982|
|-117.349998474121|34.2000007629395|   5.0|  9269.0|  1605.0| 4916.0|  1519.0|4.43669986724854| 146.8090945965846|
|-117.349998474121|34.0099983215332|  23.0|  3707.0|   769.0| 1938.0|   658.0|2.72499990463257|134.06109946141441|
|-117.349998474121| 33.689998626709|  11.0|  1229.0|   236.0|  581.0|   190.0|3.10199999809265|141.39928849102444|
|-118.089996337891|33.9500007629395|  32.0|  1083.0|   206.0|  737.0|   218.0|3.55830001831055|195.53150806768326|
|-118.089996337891|33.9700012207031|  35.0|  2664.0|   541.0| 2033.0|   491.0|3.73259997367859|188.61314692021506|
|-118.089996337891|33.8899993896484|  27.0|  3399.0|   882.0| 2465.0|   811.0|3.09899997711182|189.91245231345965|
|-118.089996337891|33.9199981689453|  36.0|   847.0|   185.0|  713.0|   194.0|4.85419988632202| 251.2192014151351|
+-----------------+----------------+------+--------+--------+-------+--------+----------------+------------------+
only showing top 20 rows
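To gauge how well the fitted pipeline generalizes, it can be scored on the held-out data. This sketch assumes the td_sklearn Pipeline wrapper exposes scikit-learn's score method (the coefficient of determination, R squared, for regression) and accepts the features and labels as separate teradataml DataFrames, as fit does; test_y is a name introduced here for illustration.
>>> test_y = test_data.select('medhv')
>>> pipe.score(test_X, test_y)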