- Regular Aggregate Mode
It computes the count, mean, std, min, percentiles, and max for numeric columns.
Default statistics include: "count", "mean", "std", "min", "percentile", "max".
If describe() is used on the output of any DataFrame API or groupby(), then it is used in regular aggregate mode. - Time Series Aggregate Mode
It computes the max, mean, min, std, median, mode, and percentiles for numeric columns.
Default statistics include: 'max', 'mean', 'min', 'std'.
If describe() is used on the output of groupby_time(), then it is used in time series aggregate mode, where time series aggregates are used to calculate the statistics.
- percentiles: A list of values between 0 and 1 used for computing percentiles.
The default value is [.25, .5, .75], which generates the 25th, 50th, and 75th percentiles.
- include: The values for this argument can be either 'None' or 'all', used to specify if non-numeric columns are included in the computation.
- If the value is 'all': Both numeric and non-numeric columns are included. The function computes count, mean, std, min, percentiles, and max for numeric columns, and computes count and unique for non-numeric columns.
- If the value is 'None': Only numeric columns are used for collecting statics.
Value 'all' is not applicable for Time Series Aggregate Mode. - verbose: Specifies a boolean value to be used for time series aggregation, stating whether to get verbose output or not. When this argument is set to 'True', function calculates median, mode, and percentile values on top of its default statistics.
Default and the only acceptable value for this argument when used in Regular Aggregate Mode is 'False'.
verbose as 'True' is not applicable for Regular Aggregate Mode.
- distinct: Specifies a boolean value to decide whether to consider duplicate rows in statistic calculation or not.
By default, this argument is set to 'False', which means that duplicate values are considered for statistic calculation.
When this is set to 'True', only distinct rows are considered for statistic calculation.
Examples: describe() as Regular Aggregate Function
Prerequisite:
>>> df = DataFrame('sales')
>>> df Feb Jan Mar Apr datetime accounts Alpha Co 210.0 200 215 250 04/01/2017 Red Inc 200.0 150 140 None 04/01/2017 Orange Inc 210.0 None None 250 04/01/2017 Jones LLC 200.0 150 140 180 04/01/2017 Yellow Inc 90.0 None None None 04/01/2017 Blue Inc 90.0 50 95 101 04/01/2017
Example: Generate Statistics for DataFrame "sales"
Use default values to computes count, mean, std, min, percentiles, and max for numeric columns.
>>> df.describe() Apr Feb Mar Jan func count 4 6 4 4 mean 195.25 166.667 147.5 137.5 std 70.971 59.554 49.749 62.915 min 101 90 95 50 25% 160.25 117.5 128.75 125 50% 215 200 140 150 75% 250 207.5 158.75 162.5 max 250 210 215 200
Example: Use Argument Percentiles to Compute 30th and 60th Percentiles
>>> df.describe(percentiles=[.3, .6]) Apr Feb Mar Jan func count 4 6 4 4 mean 195.25 166.667 147.5 137.5 std 70.971 59.554 49.749 62.915 min 101 90 95 50 30% 172.1 145 135.5 140 60% 236 200 140 150 max 250 210 215 200
Example: Use groupby to Compute Statistics for Specific Groups
>>> df1 = df.groupby(["datetime", "Feb"])
>>> df1.describe() Jan Mar Apr datetime Feb func 04/01/2017 90.0 25% 50 95 101 50% 50 95 101 75% 50 95 101 count 1 1 1 max 50 95 101 mean 50 95 101 min 50 95 101 std None None None 200.0 25% 150 140 180 50% 150 140 180 75% 150 140 180 count 2 2 1 max 150 140 180 mean 150 140 180 min 150 140 180 std 0 0 None 210.0 25% 200 215 250 50% 200 215 250 75% 200 215 250 count 1 1 2 max 200 215 250 mean 200 215 250 min 200 215 250 std None None 0
Example: Use Argument include Value 'all' to Compute Statistics for All Columns
Computes count, mean, std, min, percentiles, and max for numeric columns and computes count and unique for non-numeric columns.
>>> df.describe(include="all") accounts Feb Jan Mar Apr datetime func 25% None 117.5 125 128.75 160.25 None 75% None 207.5 162.5 158.75 250 None count 6 6 4 4 4 6 mean None 166.667 137.5 147.5 195.25 None max None 210 200 215 250 None min None 90 50 95 101 None 50% None 200 150 140 215 None std None 59.554 62.915 49.749 70.971 None unique 6 None None None None 1
Examples: describe() as Time Series Aggregate Function
See Example Setup to set up the environment for the following examples.
Example: Get Basic Statistics
Get the basic statistics for time series aggregation for all the numeric columns, use default settings. This example returns max, mean, min and std values.
>>> ocean_buoys_grpby.describe()
temperature salinity TIMECODE_RANGE GROUP BY TIME(CAL_YEARS(2)) buoyid func ('2014-01-01 00:00:00.000000-00:00', '2016-01-0... 2 0 max 100 55 mean 54.75 55 min 10 55 std 51.674 0 1 max 79 55 mean 74.5 55 min 70 55 std 3.937 0 2 max 82 55 mean 81 55 min 80 55 std 1 0 44 max 56 55 mean 48.077 55 min 43 55 std 5.766 0
Example: Get Verbose Statistics
Get the verbose statistics for time series aggregation for all the numeric columns, use default settings. This example returns max, mean, min, std, median, mode, 25th, 50th and 75th percentile.
>>> ocean_buoys_grpby.describe(verbose=True)
temperature salinity TIMECODE_RANGE GROUP BY TIME(CAL_YEARS(2)) buoyid func ('2014-01-01 00:00:00.000000-00:00', '2016-01-0... 2 0 25% 10 55 50% 54.5 55 75% 99.25 55 max 100 55 mean 54.75 55 median 54.5 55 min 10 55 mode 10 55 std 51.674 0 1 25% 71.25 55 50% 74.5 55 75% 77.75 55 max 79 55 mean 74.5 55 median 74.5 55 min 70 55 mode 71 55 mode 72 55 mode 77 55 mode 78 55 mode 79 55 mode 70 55 std 3.937 0 2 25% 80.5 55 50% 81 55 75% 81.5 55 max 82 55 mean 81 55 median 81 55 min 80 55 mode 80 55 mode 81 55 mode 82 55 std 1 0 44 25% 43 55 50% 43 55 75% 53 55 max 56 55 mean 48.077 55 median 43 55 min 43 55 mode 43 55 std 5.766 0
Example: Get Basic Statistics, Consider Only Unique Values
Get the basic statistics for time series aggregation for all the numeric columns, consider only unique values. This example returns max, mean, min and std values.
>>> ocean_buoys_grpby.describe(distinct=True)
temperature salinity TIMECODE_RANGE GROUP BY TIME(CAL_YEARS(2)) buoyid func ('2014-01-01 00:00:00.000000-00:00', '2016-01-0... 2 0 max 100 55 mean 69.667 55 min 10 55 std 51.675 None 1 max 79 55 mean 74.5 55 min 70 55 std 3.937 None 2 max 82 55 mean 81 55 min 80 55 std 1 None 44 max 56 55 mean 52.2 55 min 43 55 std 5.263 None
Example: Get Verbose Statistics, Select Nondefault Percentiles
Get the verbose statistics for time series aggregation for all the numeric columns. In this example, you select non-default percentiles 33rd and 66th. This example returns max, mean, min, std, median, mode, 33rd, and 66th percentile.
>>> ocean_buoys_grpby.describe(verbose=True, percentiles=[0.33, 0.66])
temperature salinity TIMECODE_RANGE GROUP BY TIME(CAL_YEARS(2)) buoyid func ('2014-01-01 00:00:00.000000-00:00', '2016-01-0... 2 0 33% 10 55 66% 97.22 55 max 100 55 mean 54.75 55 median 54.5 55 min 10 55 mode 10 55 std 51.674 0 1 33% 71.65 55 66% 77.3 55 max 79 55 mean 74.5 55 median 74.5 55 min 70 55 mode 70 55 mode 71 55 mode 77 55 mode 78 55 mode 79 55 mode 72 55 std 3.937 0 2 33% 80.66 55 66% 81.32 55 max 82 55 mean 81 55 median 81 55 min 80 55 mode 80 55 mode 81 55 mode 82 55 std 1 0 44 33% 43 55 66% 53 55 max 56 55 mean 48.077 55 median 43 55 min 43 55 mode 43 55 std 5.766 0