Use the build_dataset() method to build a dataset based on the selected features for the specified entity.
Required Parameters
- entity
- Specifies the name of the entity, or the Entity object, to be included in the dataset.
- selected_features
- Specifies the features to be included in the dataset and the version of each feature.
Each key is the name of a feature and each value is the version of that feature. Use FeatureCatalog.list_feature_versions() to get the list of features and their versions.
- view_name
- Specifies the name of the view to be created for the dataset.
Optional Parameters
- description
- Specifies the description for the dataset.
- include_historic_records
- Specifies whether to include historic data in the dataset.
Default value: False
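In the examples on this page, every feature ingested by one feature process run uses that run's process ID as its version, so the selected_features mapping can be assembled from a list of feature names. The following is a minimal sketch in plain Python; the helper name features_at_version is hypothetical and is not part of teradataml.

```python
def features_at_version(feature_names, version):
    """Map each feature name to the same version string.

    Hypothetical helper for illustration only; not a teradataml function.
    """
    return {name: version for name in feature_names}


# Process ID copied from the example output on this page.
process_id = 'a9f29a4e-3f75-11f0-b43b-f020ff57c62c'
selected = features_at_version(['Jan', 'Feb'], process_id)
print(selected)
```

The resulting dictionary has the same shape as the selected_features argument passed to build_dataset() in the examples.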
Example setup
Ingest sales data to the feature catalog configured for repo 'vfs_v1'.
>>> from teradataml import DataFrame, FeatureProcess, load_example_data
>>> load_example_data('dataframe', 'sales')
>>> df = DataFrame("sales")
>>> df
              Feb    Jan    Mar    Apr    datetime
accounts
Red Inc     200.0  150.0  140.0    NaN  04/01/2017
Blue Inc     90.0   50.0   95.0  101.0  04/01/2017
Alpha Co    210.0  200.0  215.0  250.0  04/01/2017
Orange Inc  210.0    NaN    NaN  250.0  04/01/2017
Yellow Inc   90.0    NaN    NaN    NaN  04/01/2017
Jones LLC   200.0  150.0  140.0  180.0  04/01/2017
Create a feature store.
>>> from teradataml import FeatureStore
>>> fs = FeatureStore(repo='vfs_v1', data_domain='sales')
Repo vfs_v1 does not exist. Run FeatureStore.setup() to create the repo and setup FeatureStore.
Set up the feature store for this repository.
>>> fs.setup()
True
Initiate FeatureProcess to ingest features.
>>> fp = FeatureProcess(repo='vfs_v1', data_domain='sales', object=df, entity='accounts', features=['Jan', 'Feb', 'Mar', 'Apr'])
Run the feature process.
>>> fp.run()
Process 'a9f29a4e-3f75-11f0-b43b-f020ff57c62c' started.
Process 'a9f29a4e-3f75-11f0-b43b-f020ff57c62c' completed.
Example 1: Build dataset with features 'Jan', 'Feb' from repo 'vfs_v1' and sales data domain
Name the dataset 'ds_jan_feb'.
>>> from teradataml import DatasetCatalog
>>> dc = DatasetCatalog(repo='vfs_v1', data_domain='sales')
>>> dataset = dc.build_dataset(entity='accounts',
...                            selected_features={
...                                'Jan': 'a9f29a4e-3f75-11f0-b43b-f020ff57c62c',
...                                'Feb': 'a9f29a4e-3f75-11f0-b43b-f020ff57c62c'},
...                            view_name='ds_jan_feb',
...                            description='Dataset with Jan and Feb features')
>>> dataset
     accounts    Jan    Feb
0    Blue Inc   50.0   90.0
1    Alpha Co  200.0  210.0
2  Yellow Inc    NaN   90.0
3  Orange Inc    NaN  210.0
4   Jones LLC  150.0  200.0
5     Red Inc  150.0  200.0
Example 2: Build dataset with features 'Jan', 'Feb', 'Mar' from repo 'vfs_v1' and sales data domain
Name the dataset 'ds_jan_feb_mar'.
>>> dataset = dc.build_dataset(entity='accounts',
...                            selected_features={
...                                'Jan': 'a9f29a4e-3f75-11f0-b43b-f020ff57c62c',
...                                'Feb': 'a9f29a4e-3f75-11f0-b43b-f020ff57c62c',
...                                'Mar': 'a9f29a4e-3f75-11f0-b43b-f020ff57c62c'},
...                            view_name='ds_jan_feb_mar',
...                            description='Dataset with Jan, Feb and Mar features')
>>> dataset
     accounts    Jan    Feb    Mar
0  Yellow Inc    NaN   90.0    NaN
1    Alpha Co  200.0  210.0  215.0
2   Jones LLC  150.0  200.0  140.0
3    Blue Inc   50.0   90.0   95.0
4  Orange Inc    NaN  210.0    NaN
5     Red Inc  150.0  200.0  140.0
Example 3: Build dataset with features 'Feb', 'Jan' from repo 'vfs_v1' and 'sales' data domain
This example first creates a new table so that the data in the existing table is not modified.
Show the latest data.
>>> import time
>>> from datetime import datetime as dt, date as d
>>> from teradataml import execute_sql
Retrieve the record where accounts == 'Blue Inc'.
>>> df_test = df[df['accounts'] == 'Blue Inc']
>>> df_test
           Feb   Jan   Mar    Apr    datetime
accounts
Blue Inc  90.0  50.0  95.0  101.0  04/01/2017
Create a new table.
>>> df_test.to_sql('sales_test', if_exists='replace')
>>> test_df = DataFrame('sales_test')
>>> test_df
   accounts   Feb  Jan  Mar  Apr  datetime
0  Blue Inc  90.0   50   95  101  17/01/04
Create a feature process.
>>> fp = FeatureProcess(repo='vfs_v1',
...                     data_domain='sales',
...                     object=test_df,
...                     entity='accounts',
...                     features=['Jan', 'Feb'])
Run the feature process.
>>> fp.run()
Process '6cb49b4b-79d4-11f0-8c5e-b0dcef8381ea' started.
Process '6cb49b4b-79d4-11f0-8c5e-b0dcef8381ea' completed.
True
- Wait 20 seconds.
- Update the data.
- Run the feature process.
>>> time.sleep(20)
>>> execute_sql("update sales_test set Jan = Jan * 10, Feb = Feb * 10")
TeradataCursor uRowsHandle=269 bClosed=False
>>> fp.run()
Process '6cb49b4b-79d4-11f0-8c5e-b0dcef8381ea' started.
Process '6cb49b4b-79d4-11f0-8c5e-b0dcef8381ea' completed.
True
>>> time.sleep(20)
>>> execute_sql("update sales_test set Jan = Jan * 10, Feb = Feb * 10")
TeradataCursor uRowsHandle=397 bClosed=False
>>> fp.run()
Process '6cb49b4b-79d4-11f0-8c5e-b0dcef8381ea' started.
Process '6cb49b4b-79d4-11f0-8c5e-b0dcef8381ea' completed.
True
Build the dataset with features 'Feb', 'Jan' from repo 'vfs_v1' and 'sales' data domain, excluding the historic records.
>>> dc = DatasetCatalog(repo='vfs_v1', data_domain='sales')
>>> exclude_history = dc.build_dataset(entity='accounts',
...                                    selected_features={'Feb': fp.process_id,
...                                                       'Jan': fp.process_id},
...                                    view_name='exclude_history',
...                                    include_historic_records=False)
>>> exclude_history
   accounts     Feb   Jan
0  Blue Inc  9000.0  5000
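The latest values follow from the two updates above, each of which multiplies Jan and Feb by 10. The following plain-Python sketch replays only that arithmetic, with no database involved; the variable names are illustrative.

```python
# Starting values for 'Blue Inc' in the sales_test table.
row = {'Feb': 90.0, 'Jan': 50}
history = [dict(row)]

# Two updates, each multiplying both features by 10.
for _ in range(2):
    row = {name: value * 10 for name, value in row.items()}
    history.append(dict(row))

# The last entry corresponds to the values shown when
# include_historic_records=False.
latest = history[-1]
print(latest)
```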
Example 4: Build dataset with features 'Feb', 'Jan' from repo 'vfs_v1' and 'sales' data domain
Show the historic data.
>>> dc = DatasetCatalog(repo='vfs_v1', data_domain='sales')
>>> include_history = dc.build_dataset(entity='accounts',
...                                    selected_features={'Feb': fp.process_id,
...                                                       'Jan': fp.process_id},
...                                    view_name='include_history',
...                                    include_historic_records=True)
>>> include_history
   accounts     Feb   Jan
0  Blue Inc  9000.0  5000
1  Blue Inc    90.0    50
2  Blue Inc    90.0  5000
3  Blue Inc   900.0   500
4  Blue Inc   900.0  5000
5  Blue Inc   900.0    50
6  Blue Inc    90.0   500
7  Blue Inc  9000.0    50
8  Blue Inc  9000.0   500
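The nine rows above match every pairing of the three recorded Feb values with the three recorded Jan values. The following plain-Python sketch checks that observation against the printed output; it describes only the output shown here and makes no claim about how the view is defined internally.

```python
from itertools import product

# The three values each feature took across the three process runs.
feb_values = [90.0, 900.0, 9000.0]
jan_values = [50, 500, 5000]
expected = set(product(feb_values, jan_values))

# (Feb, Jan) pairs copied from the include_history output.
printed = {
    (9000.0, 5000), (90.0, 50), (90.0, 5000),
    (900.0, 500), (900.0, 5000), (900.0, 50),
    (90.0, 500), (9000.0, 50), (9000.0, 500),
}

print(printed == expected)
```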