build_dataset() | DatasetCatalog Method | Teradata Package for Python

Teradata® Package for Python User Guide

Deployment: VantageCloud, VantageCore
Edition: VMware, Enterprise, IntelliFlex
Product: Teradata Package for Python
Release Number: 20.00
Published: March 2025
Last Edition: 2025-12-05
Product Category: Teradata Vantage

Use the build_dataset() method to build a dataset based on the selected features for the specified entity.

Required Parameters

entity
Specifies the name of the entity, or the Entity object, to be included in the dataset.
selected_features
Specifies the features and their corresponding versions to be included in the dataset, as a dictionary.

Each key is the name of a feature and each value is the version of that feature. Refer to FeatureCatalog.list_feature_versions() to get the list of features and their versions.

view_name
Specifies the name of the view to be created for the dataset.
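
When every selected feature comes from the same feature process, as in the examples on this page, the selected_features dictionary can be generated instead of typed out. The following is a minimal sketch under that assumption. The helper name features_from_process is hypothetical and is not part of teradataml; the process ID is the one printed by fp.run() in the example setup.

```python
def features_from_process(feature_names, process_id):
    """Map each feature name to the same feature version (process ID).

    Hypothetical helper for illustration only; not part of teradataml.
    """
    if not feature_names:
        raise ValueError("feature_names must not be empty")
    return {name: process_id for name in feature_names}


# Process ID copied from the fp.run() output in the example setup.
process_id = 'a9f29a4e-3f75-11f0-b43b-f020ff57c62c'
selected_features = features_from_process(['Jan', 'Feb'], process_id)
print(selected_features)
```

The resulting dictionary has the same shape as the selected_features argument passed in Example 1.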

Optional Parameters

description
Specifies the description for the dataset.
include_historic_records
Specifies whether to include historic data in the dataset.

Default value: False

Example setup

Ingest sales data to the feature catalog configured for repo 'vfs_v1'.

>>> from teradataml import DataFrame, load_example_data, FeatureProcess
>>> load_example_data('dataframe', 'sales')
>>> df = DataFrame("sales")
>>> df
              Feb    Jan    Mar    Apr    datetime
accounts
Red Inc     200.0  150.0  140.0    NaN  04/01/2017
Blue Inc     90.0   50.0   95.0  101.0  04/01/2017
Alpha Co    210.0  200.0  215.0  250.0  04/01/2017
Orange Inc  210.0    NaN    NaN  250.0  04/01/2017
Yellow Inc   90.0    NaN    NaN    NaN  04/01/2017
Jones LLC   200.0  150.0  140.0  180.0  04/01/2017

Create a feature store.

>>> from teradataml import FeatureStore
>>> fs = FeatureStore(repo='vfs_v1', data_domain='sales')
Repo vfs_v1 does not exist. Run FeatureStore.setup() to create the repo and setup FeatureStore.

Set up the feature store for this repository.

>>> fs.setup()
True

Initiate FeatureProcess to ingest features.

>>> fp = FeatureProcess(repo='vfs_v1', data_domain='sales', object=df, entity='accounts', features=['Jan', 'Feb', 'Mar', 'Apr'])

Run the feature process.

>>> fp.run()
Process 'a9f29a4e-3f75-11f0-b43b-f020ff57c62c' started.
Process 'a9f29a4e-3f75-11f0-b43b-f020ff57c62c' completed.

Example 1: Build dataset with features 'Jan', 'Feb' from repo 'vfs_v1' and sales data domain

Name the dataset 'ds_jan_feb'.

>>> from teradataml import DatasetCatalog
>>> dc = DatasetCatalog(repo='vfs_v1', data_domain='sales')
>>> dataset = dc.build_dataset(entity='accounts',
...                            selected_features = {
...                                 'Jan': 'a9f29a4e-3f75-11f0-b43b-f020ff57c62c',
...                                 'Feb': 'a9f29a4e-3f75-11f0-b43b-f020ff57c62c'},
...                            view_name='ds_jan_feb',
...                            description='Dataset with Jan and Feb features')
>>> dataset
     accounts    Jan    Feb
0    Blue Inc   50.0   90.0
1    Alpha Co  200.0  210.0
2  Yellow Inc    NaN   90.0
3  Orange Inc    NaN  210.0
4   Jones LLC  150.0  200.0
5     Red Inc  150.0  200.0

Example 2: Build dataset with features 'Jan', 'Feb', 'Mar' from repo 'vfs_v1' and sales data domain

Name the dataset 'ds_jan_feb_mar'.

>>> dataset = dc.build_dataset(entity='accounts',
...                            selected_features = {
...                                 'Jan': 'a9f29a4e-3f75-11f0-b43b-f020ff57c62c',
...                                 'Feb': 'a9f29a4e-3f75-11f0-b43b-f020ff57c62c',
...                                 'Mar': 'a9f29a4e-3f75-11f0-b43b-f020ff57c62c'},
...                            view_name='ds_jan_feb_mar',
...                            description='Dataset with Jan, Feb and Mar features')
>>> dataset
     accounts    Jan    Feb    Mar
0  Yellow Inc    NaN   90.0    NaN
1    Alpha Co  200.0  210.0  215.0
2   Jones LLC  150.0  200.0  140.0
3    Blue Inc   50.0   90.0   95.0
4  Orange Inc    NaN  210.0    NaN
5     Red Inc  150.0  200.0  140.0

Example 3: Build dataset with features 'Feb', 'Jan' from repo 'vfs_v1' and 'sales' data domain, excluding historic records

This example creates a new table to avoid modifying the existing table data.

Show the latest data.

>>> import time
>>> from datetime import datetime as dt, date as d
>>> from teradataml import execute_sql

Retrieve the record where accounts == 'Blue Inc'.

>>> df_test = df[df['accounts'] == 'Blue Inc']
>>> df_test
              Feb    Jan    Mar    Apr    datetime
accounts
Blue Inc     90.0   50.0   95.0  101.0  04/01/2017

Create a new table.

>>> df_test.to_sql('sales_test', if_exists='replace')
>>> test_df = DataFrame('sales_test')
>>> test_df
   accounts   Feb  Jan  Mar  Apr  datetime
0  Blue Inc  90.0   50   95  101  17/01/04

Create a feature process.

>>> fp = FeatureProcess(repo='vfs_v1',
...                     data_domain='sales',
...                     object=test_df,
...                     entity='accounts',
...                     features=['Jan', 'Feb'])

Run the feature process.

>>> fp.run()
Process '6cb49b4b-79d4-11f0-8c5e-b0dcef8381ea' started.
Process '6cb49b4b-79d4-11f0-8c5e-b0dcef8381ea' completed.
True

This example repeats the following sequence twice so that several versions of the feature values are recorded:
  • Wait 20 seconds.
  • Update the data.
  • Run the feature process.

>>> time.sleep(20)
>>> execute_sql("update sales_test set Jan = Jan * 10, Feb = Feb * 10")
TeradataCursor uRowsHandle=269 bClosed=False
>>> fp.run()
Process '6cb49b4b-79d4-11f0-8c5e-b0dcef8381ea' started.
Process '6cb49b4b-79d4-11f0-8c5e-b0dcef8381ea' completed.
True
>>> time.sleep(20)
>>> execute_sql("update sales_test set Jan = Jan * 10, Feb = Feb * 10")
TeradataCursor uRowsHandle=397 bClosed=False
>>> fp.run()
Process '6cb49b4b-79d4-11f0-8c5e-b0dcef8381ea' started.
Process '6cb49b4b-79d4-11f0-8c5e-b0dcef8381ea' completed.
True
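
The repeated wait, update, run sequence above could also be written as a loop. The following is a sketch only: run_update_cycles is a hypothetical helper, not a teradataml function, and it takes plain callables so that it can be tried without a Vantage connection.

```python
import time


def run_update_cycles(update, run_process, cycles=2, wait_seconds=20,
                      sleep=time.sleep):
    """Repeat the wait, update, run sequence and collect the run results.

    Hypothetical helper; `update` and `run_process` are zero-argument callables.
    """
    results = []
    for _ in range(cycles):
        sleep(wait_seconds)            # pause between versions, as the example does
        update()                       # e.g. an execute_sql("update ...") call
        results.append(run_process())  # e.g. fp.run()
    return results


# Stand-ins so that the sketch runs without a database connection.
calls = []
results = run_update_cycles(update=lambda: calls.append("update"),
                            run_process=lambda: calls.append("run") or True,
                            cycles=2,
                            wait_seconds=0)
print(calls)
print(results)
```

With a live connection, the stand-ins would be replaced by the calls shown above, for example a lambda wrapping execute_sql("update sales_test set Jan = Jan * 10, Feb = Feb * 10") for update and fp.run for run_process.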

Build the dataset with features 'Feb' and 'Jan' from repo 'vfs_v1' and the 'sales' data domain, excluding historic records.

>>> dc = DatasetCatalog(repo='vfs_v1', data_domain='sales')
>>> exclude_history = dc.build_dataset(entity='accounts',
...                                    selected_features={'Feb': fp.process_id,
...                                                       'Jan': fp.process_id},
...                                    view_name='exclude_history',
...                                    include_historic_records=False)
>>> exclude_history
   accounts     Feb   Jan
0  Blue Inc  9000.0  5000

Example 4: Build dataset with features 'Feb', 'Jan' from repo 'vfs_v1' and 'sales' data domain, including historic records

Show the historic data.

>>> dc = DatasetCatalog(repo='vfs_v1', data_domain='sales')
>>> include_history = dc.build_dataset(entity='accounts',
...                                    selected_features={'Feb': fp.process_id,
...                                                       'Jan': fp.process_id},
...                                    view_name='include_history',
...                                    include_historic_records=True)
>>> include_history
   accounts     Feb   Jan
0  Blue Inc  9000.0  5000
1  Blue Inc    90.0    50
2  Blue Inc    90.0  5000
3  Blue Inc   900.0   500
4  Blue Inc   900.0  5000
5  Blue Inc   900.0    50
6  Blue Inc    90.0   500
7  Blue Inc  9000.0    50
8  Blue Inc  9000.0   500
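
In the output above, 'Feb' takes the values 90.0, 900.0, and 9000.0, 'Jan' takes the values 50, 500, and 5000, and each of the nine pairings appears once. The sketch below only reproduces that set of pairings with itertools.product to make the pattern easier to see. It is an observation about this particular output, not a description of how the view is computed.

```python
from itertools import product

# Values recorded for each feature across the three feature process runs.
feb_versions = [90.0, 900.0, 9000.0]
jan_versions = [50, 500, 5000]

# Every (Feb, Jan) pairing of the recorded values.
pairings = set(product(feb_versions, jan_versions))

# Rows copied from the include_history output, as (Feb, Jan) tuples.
observed = {
    (9000.0, 5000), (90.0, 50), (90.0, 5000),
    (900.0, 500), (900.0, 5000), (900.0, 50),
    (90.0, 500), (9000.0, 50), (9000.0, 500),
}

print(len(pairings))
print(pairings == observed)
```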