Teradata Python Package Function Reference - groupby_time

teradataml.dataframe.dataframe.DataFrame.groupby_time = groupby_time(self, timebucket_duration, value_expression=None, timecode_column=None, sequence_column=None, fill=None)

DESCRIPTION:
    Apply Group By Time to one or more columns of a teradataml DataFrame.
    The result always behaves like calling group by time. The outcome of this
    function can be used to run Time Series Aggregate functions.

PARAMETERS:
    timebucket_duration:
        Required Argument.
        Specifies the time unit duration of each timebucket for aggregation and is
        used to assign each potential timebucket a unique number.
        Permitted Values:
            ===============================================================================================
            | Time Units      | Formal Form     | Shorthand Equivalents for time_units                    |
            ===============================================================================================
            | Calendar Years  | CAL_YEARS(N)    | Ncy OR Ncyear OR Ncyears                                |
            -----------------------------------------------------------------------------------------------
            | Calendar Months | CAL_MONTHS(N)   | Ncm OR Ncmonth OR Ncmonths                              |
            -----------------------------------------------------------------------------------------------
            | Calendar Days   | CAL_DAYS(N)     | Ncd OR Ncday OR Ncdays                                  |
            -----------------------------------------------------------------------------------------------
            | Weeks           | WEEKS(N)        | Nw OR Nweek OR Nweeks                                   |
            -----------------------------------------------------------------------------------------------
            | Days            | DAYS(N)         | Nd OR Nday OR Ndays                                     |
            -----------------------------------------------------------------------------------------------
            | Hours           | HOURS(N)        | Nh OR Nhr OR Nhrs OR Nhour OR Nhours                    |
            -----------------------------------------------------------------------------------------------
            | Minutes         | MINUTES(N)      | Nm OR Nmins OR Nminute OR Nminutes                      |
            -----------------------------------------------------------------------------------------------
            | Seconds         | SECONDS(N)      | Ns OR Nsec OR Nsecs OR Nsecond OR Nseconds              |
            -----------------------------------------------------------------------------------------------
            | Milliseconds    | MILLISECONDS(N) | Nms OR Nmsec OR Nmsecs OR Nmillisecond OR Nmilliseconds |
            -----------------------------------------------------------------------------------------------
            | Microseconds    | MICROSECONDS(N) | Nus OR Nusec OR Nusecs OR Nmicrosecond OR Nmicroseconds |
            ===============================================================================================
            Where N is a 16-bit positive integer with a maximum value of 32767.

        Notes:
            1. When timebucket_duration is Calendar Days, it will group the columns in
               24 hour periods starting at 00:00:00.000000 and ending at
               23:59:59.999999 on the day identified by time zero.
            2. A DAYS time unit is a 24 hour span relative to any moment in time.
               For example, if time zero (in a teradataml DataFrame created on PTI
               tables) is equal to 2016-10-01 12:00:00, the day buckets are:
                   2016-10-01 12:00:00.000000 - 2016-10-02 11:59:59.999999.
               This spans multiple calendar days, but encompasses one 24 hour period
               representative of a day.
            3. The time units do not store values such as the year or the month.
               For example, CAL_YEARS(2017) does not set the year to 2017. It sets the
               timebucket_duration to intervals of 2017 years. Similarly,
               CAL_MONTHS(7) does not set the month to July. It sets the
               timebucket_duration to intervals of 7 months.
        Types: str
        Example: MINUTES(23) which is equal to 23 Minutes
                 CAL_MONTHS(5) which is equal to 5 calendar months

    value_expression:
        Optional Argument.
        Specifies a column used for grouping purposes not related to time.
        Types: str or List of Strings
        Example: col1 or ["col1", "col2"]

    timecode_column:
        Optional Argument.
        Specifies a column that serves as the timecode for a non-PTI table.
        This is the column used for resampling time series data.
        For teradataml DataFrame created on PTI table:
            TD_TIMECODE is used implicitly for PTI tables, but can also be specified
            explicitly by the user with this parameter.
        For teradataml DataFrame created on non-PTI table:
            A column name must be passed to this argument; otherwise, an exception
            is raised.

    sequence_column:
        Optional Argument.
        Specifies a column that is the sequence number.
        For teradataml DataFrame created on PTI table:
            It can be TD_SEQNO or any other column that acts as a sequence number.
        For teradataml DataFrame created on non-PTI table:
            sequence_column is a column that plays the role of TD_SEQNO, because
            non-PTI tables do not have TD_SEQNO.
        Types: str

    fill:
        Optional Argument.
        Specifies values for missing timebucket values.
        Permitted values: NULLS, PREV / PREVIOUS, NEXT, and any numeric_constant
            NULLS:
                The missing timebuckets are returned to the user with a null value
                for all aggregate results.
            numeric_constant:
                Any Teradata Database supported Numeric literal. The missing
                timebuckets are returned to the user with the specified constant
                value for all aggregate results. If the data type specified in the
                fill argument is incompatible with the input data type for an
                aggregate function, an error is reported.
            PREVIOUS/PREV:
                The missing timebuckets are returned to the user with the aggregate
                results populated by the value of the closest previous timebucket
                with a non-missing value. If the immediate predecessor of a missing
                timebucket is also missing, both buckets, and any other immediate
                predecessors with missing values, are loaded with the first
                preceding non-missing value. If a missing timebucket has no
                predecessor with a result (for example, if the timebucket is the
                first in the series or all the preceding timebuckets in the entire
                series are missing), the missing timebuckets are returned to the
                user with a null value for all aggregate results. The abbreviation
                PREV may be used instead of PREVIOUS.
            NEXT:
                The missing timebuckets are returned to the user with the aggregate
                results populated by the value of the closest succeeding timebucket
                with a non-missing value. If the immediate successor of a missing
                timebucket is also missing, both buckets, and any other immediate
                successors with missing values, are loaded with the first
                succeeding non-missing value. If a missing timebucket has no
                successor with a result (for example, if the timebucket is the last
                in the series or all the succeeding timebuckets in the entire
                series are missing), the missing timebuckets are returned to the
                user with a null value for all aggregate results.
        Types: str or int or float

NOTES:
    1. This API is similar to resample().
    2. Users can still apply teradataml DataFrame methods (filters/sort/etc) on top
       of the result.
    3. Consecutive grouping operations, i.e., groupby_time(), resample() and
       groupby(), are not permitted. The following cases raise an exception with
       the message "Invalid operation applied, check documentation for correct usage."
            a. df.groupby_time().groupby()
            b. df.groupby_time().resample()
            c. df.groupby_time().groupby_time()

RETURNS:
    teradataml DataFrameGroupBy Object

RAISES:
    TypeError, ValueError, TeradataMLException

EXAMPLES:
    >>> # Import the names used below (an active connection is assumed).
    ... from teradataml import DataFrame, load_example_data
    >>> # Load the example datasets
    ... load_example_data("dataframe", ["ocean_buoys", "ocean_buoys_nonpti"])
    >>>
    >>> # Create the required DataFrames.
    ... # DataFrame on non-sequenced PTI table
    ... ocean_buoys = DataFrame("ocean_buoys")
    >>> # Check DataFrame columns and let's peek at the data
    ... ocean_buoys.columns
    ['buoyid', 'TD_TIMECODE', 'temperature', 'salinity']
    >>> ocean_buoys.head()
                           TD_TIMECODE  temperature  salinity
    buoyid
    0       2014-01-06 08:10:00.000000        100.0        55
    0       2014-01-06 08:08:59.999999          NaN        55
    1       2014-01-06 09:01:25.122200         77.0        55
    1       2014-01-06 09:03:25.122200         79.0        55
    1       2014-01-06 09:01:25.122200         70.0        55
    1       2014-01-06 09:02:25.122200         71.0        55
    1       2014-01-06 09:03:25.122200         72.0        55
    0       2014-01-06 08:09:59.999999         99.0        55
    0       2014-01-06 08:00:00.000000         10.0        55
    0       2014-01-06 08:10:00.000000         10.0        55

    >>> # DataFrame on NON-PTI table
    ... ocean_buoys_nonpti = DataFrame("ocean_buoys_nonpti")
    >>> # Check DataFrame columns and let's peek at the data
    ... ocean_buoys_nonpti.columns
    ['buoyid', 'timecode', 'temperature', 'salinity']
    >>> ocean_buoys_nonpti.head()
                                buoyid  temperature  salinity
    timecode
    2014-01-06 08:09:59.999999       0         99.0        55
    2014-01-06 08:10:00.000000       0         10.0        55
    2014-01-06 09:01:25.122200       1         70.0        55
    2014-01-06 09:01:25.122200       1         77.0        55
    2014-01-06 09:02:25.122200       1         71.0        55
    2014-01-06 09:03:25.122200       1         72.0        55
    2014-01-06 09:02:25.122200       1         78.0        55
    2014-01-06 08:10:00.000000       0        100.0        55
    2014-01-06 08:08:59.999999       0          NaN        55
    2014-01-06 08:00:00.000000       0         10.0        55

    #
    # Example 1: Group by timebucket of 2 calendar years, using formal notation and buoyid column on
    #            DataFrame created on non-sequenced PTI table.
    #            Fill missing values with Nulls.
    #
    >>> ocean_buoys_grpby1 = ocean_buoys.groupby_time(timebucket_duration="CAL_YEARS(2)",
    ...                                               value_expression="buoyid", fill="NULLS")
    >>> number_of_values_to_column = {2: "temperature"}
    >>> ocean_buoys_grpby1.bottom(number_of_values_to_column).sort(["TIMECODE_RANGE", "buoyid"])
                                          TIMECODE_RANGE  GROUP BY TIME(CAL_YEARS(2))  buoyid  bottom2temperature
    0  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       0                  10
    1  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       0                  10
    2  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       1                  71
    3  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       1                  70
    4  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       2                  80
    5  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2       2                  81
    6  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2      44                  43
    7  ('2014-01-01 00:00:00.000000-00:00', '2016-01-...                            2      44                  43

    #
    # Example 2: Group by timebucket of 2 minutes, using shorthand notation to specify timebucket,
    #            on DataFrame created on non-PTI table. Fill missing values with Nulls.
    #            Time column must be specified for non-PTI table.
    #
    >>> ocean_buoys_nonpti_grpby2 = ocean_buoys_nonpti.groupby_time(timebucket_duration="2m",
    ...                                                             value_expression="buoyid",
    ...                                                             timecode_column="timecode", fill="NULLS")
    >>> number_of_values_to_column = {2: "temperature"}
    >>> ocean_buoys_nonpti_grpby2.bottom(number_of_values_to_column, with_ties=True).sort(["TIMECODE_RANGE",
    ...                                                                                    "buoyid"])
                                          TIMECODE_RANGE  GROUP BY TIME(MINUTES(2))  buoyid  bottom_with_ties2temperature
    0  ('2014-01-06 08:00:00.000000+00:00', '2014-01-...                   11574961       0                          10.0
    1  ('2014-01-06 08:02:00.000000+00:00', '2014-01-...                   11574962       0                           NaN
    2  ('2014-01-06 08:04:00.000000+00:00', '2014-01-...                   11574963       0                           NaN
    3  ('2014-01-06 08:06:00.000000+00:00', '2014-01-...                   11574964       0                           NaN
    4  ('2014-01-06 08:08:00.000000+00:00', '2014-01-...                   11574965       0                          99.0
    5  ('2014-01-06 08:10:00.000000+00:00', '2014-01-...                   11574966       0                         100.0
    6  ('2014-01-06 08:10:00.000000+00:00', '2014-01-...                   11574966       0                          10.0
    7  ('2014-01-06 09:00:00.000000+00:00', '2014-01-...                   11574991       1                          70.0
    8  ('2014-01-06 09:00:00.000000+00:00', '2014-01-...                   11574991       1                          77.0
    9  ('2014-01-06 09:02:00.000000+00:00', '2014-01-...                   11574992       1                          71.0

    #
    # Example 3: Group by timebucket of 2 minutes, using shorthand notation to specify timebucket,
    #            on DataFrame created on non-PTI table. Fill missing values with previous values.
    #            Time column must be specified for non-PTI table.
    #
    >>> ocean_buoys_nonpti_grpby2 = ocean_buoys_nonpti.groupby_time(timebucket_duration="2mins",
    ...                                                             value_expression="buoyid",
    ...                                                             timecode_column="timecode", fill="prev")
    >>> number_of_values_to_column = {2: "temperature"}
    >>> ocean_buoys_nonpti_grpby2.bottom(number_of_values_to_column, with_ties=True).sort(["TIMECODE_RANGE",
    ...                                                                                    "buoyid"])
                                          TIMECODE_RANGE  GROUP BY TIME(MINUTES(2))  buoyid  bottom_with_ties2temperature
    0  ('2014-01-06 08:00:00.000000+00:00', '2014-01-...                   11574961       0                            10
    1  ('2014-01-06 08:02:00.000000+00:00', '2014-01-...                   11574962       0                            10
    2  ('2014-01-06 08:04:00.000000+00:00', '2014-01-...                   11574963       0                            10
    3  ('2014-01-06 08:06:00.000000+00:00', '2014-01-...                   11574964       0                            10
    4  ('2014-01-06 08:08:00.000000+00:00', '2014-01-...                   11574965       0                            99
    5  ('2014-01-06 08:10:00.000000+00:00', '2014-01-...                   11574966       0                            10
    6  ('2014-01-06 08:10:00.000000+00:00', '2014-01-...                   11574966       0                           100
    7  ('2014-01-06 09:00:00.000000+00:00', '2014-01-...                   11574991       1                            77
    8  ('2014-01-06 09:00:00.000000+00:00', '2014-01-...                   11574991       1                            70
    9  ('2014-01-06 09:02:00.000000+00:00', '2014-01-...                   11574992       1                            71

    #
    # Example 4: Group by timebucket of 2 minutes, using shorthand notation to specify timebucket,
    #            on DataFrame created on non-PTI table. Fill missing values with numeric constant 12345.
    #            Time column must be specified for non-PTI table.
    #
    >>> ocean_buoys_nonpti_grpby2 = ocean_buoys_nonpti.groupby_time(timebucket_duration="2minute",
    ...                                                             value_expression="buoyid",
    ...                                                             timecode_column="timecode", fill=12345)
    >>> number_of_values_to_column = {2: "temperature"}
    >>> ocean_buoys_nonpti_grpby2.bottom(number_of_values_to_column, with_ties=True).sort(["TIMECODE_RANGE",
    ...                                                                                    "buoyid"])
                                          TIMECODE_RANGE  GROUP BY TIME(MINUTES(2))  buoyid  bottom_with_ties2temperature
    0  ('2014-01-06 08:00:00.000000+00:00', '2014-01-...                   11574961       0                            10
    1  ('2014-01-06 08:02:00.000000+00:00', '2014-01-...                   11574962       0                         12345
    2  ('2014-01-06 08:04:00.000000+00:00', '2014-01-...                   11574963       0                         12345
    3  ('2014-01-06 08:06:00.000000+00:00', '2014-01-...                   11574964       0                         12345
    4  ('2014-01-06 08:08:00.000000+00:00', '2014-01-...                   11574965       0                            99
    5  ('2014-01-06 08:10:00.000000+00:00', '2014-01-...                   11574966       0                            10
    6  ('2014-01-06 08:10:00.000000+00:00', '2014-01-...                   11574966       0                           100
    7  ('2014-01-06 09:00:00.000000+00:00', '2014-01-...                   11574991       1                            77
    8  ('2014-01-06 09:00:00.000000+00:00', '2014-01-...                   11574991       1                            70
    9  ('2014-01-06 09:02:00.000000+00:00', '2014-01-...                   11574992       1                            71
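
The shorthand notation for timebucket_duration and the behavior of the fill argument can be easier to follow with a small local model. The sketch below is plain Python and does not call teradataml or the database; it is not the library's implementation. The helper names to_formal_timebucket and fill_missing are hypothetical, the suffix table is copied from the Permitted Values table in this reference, and case-insensitive matching of the unit names is an assumption made for illustration.

```python
import re

# Shorthand suffixes copied from the Permitted Values table, keyed by formal unit.
_UNIT_SUFFIXES = {
    "CAL_YEARS": ("cy", "cyear", "cyears"),
    "CAL_MONTHS": ("cm", "cmonth", "cmonths"),
    "CAL_DAYS": ("cd", "cday", "cdays"),
    "WEEKS": ("w", "week", "weeks"),
    "DAYS": ("d", "day", "days"),
    "HOURS": ("h", "hr", "hrs", "hour", "hours"),
    "MINUTES": ("m", "mins", "minute", "minutes"),
    "SECONDS": ("s", "sec", "secs", "second", "seconds"),
    "MILLISECONDS": ("ms", "msec", "msecs", "millisecond", "milliseconds"),
    "MICROSECONDS": ("us", "usec", "usecs", "microsecond", "microseconds"),
}
_SUFFIX_TO_UNIT = {s: unit for unit, suffixes in _UNIT_SUFFIXES.items() for s in suffixes}


def to_formal_timebucket(duration):
    """Hypothetical helper: normalize e.g. '2mins' or 'minutes(2)' to 'MINUTES(2)'."""
    text = duration.strip()
    formal = re.fullmatch(r"([A-Za-z_]+)\((\d+)\)", text)
    short = re.fullmatch(r"(\d+)([A-Za-z]+)", text)
    # Assumption: unit names and suffixes are matched case-insensitively.
    if formal and formal.group(1).upper() in _UNIT_SUFFIXES:
        unit, n = formal.group(1).upper(), int(formal.group(2))
    elif short and short.group(2).lower() in _SUFFIX_TO_UNIT:
        unit, n = _SUFFIX_TO_UNIT[short.group(2).lower()], int(short.group(1))
    else:
        raise ValueError(f"Unrecognized timebucket_duration: {duration!r}")
    # N is a 16-bit positive integer with a maximum value of 32767.
    if not 1 <= n <= 32767:
        raise ValueError("N must be between 1 and 32767")
    return f"{unit}({n})"


def fill_missing(values, fill="NULLS"):
    """Hypothetical helper: model `fill` on a list of per-timebucket results.

    `values` is ordered by timebucket; None marks a missing timebucket.
    """
    if isinstance(fill, (int, float)):
        # numeric_constant: every missing timebucket gets the constant.
        return [fill if v is None else v for v in values]
    mode = fill.upper()
    if mode == "NULLS":
        return list(values)
    if mode in ("PREV", "PREVIOUS"):
        # Carry the closest preceding non-missing value forward.
        filled, last = [], None
        for v in values:
            if v is not None:
                last = v
            filled.append(last)
        return filled
    if mode == "NEXT":
        # NEXT is PREV applied to the reversed series.
        return fill_missing(values[::-1], "PREV")[::-1]
    raise ValueError(f"Unsupported fill value: {fill!r}")


print(to_formal_timebucket("2mins"))                     # MINUTES(2)
print(fill_missing([10, None, None, None, 99], "prev"))  # [10, 10, 10, 10, 99]
```

The second call mirrors the buoyid 0 rows of Example 3, where the three empty 2-minute buckets between 08:00 and 08:08 take the value 10 from the preceding bucket. A missing bucket with no predecessor (for PREV) or no successor (for NEXT) stays None, matching the null result described for those cases.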