Analytics at Scale: Full Data Set Analysis - Teradata Vantage

Machine Learning Engine Analytic Function Reference

Product: Teradata Vantage
Release Number: 8.00, 1.0
Published: May 2019
Language: English (United States)
Last Update: 2019-11-22
dita:id: B700-4003

Teradata Vantage™ lets you efficiently perform analytical tasks on your full data set, in the database, rather than using samples or bulk-exporting data to a dedicated computing cluster.

Advantages of full data set analysis:
  • Accurate, reproducible results
  • Increased iteration speed

    Pushing analytics down into a massively parallel processing (MPP) system shortens the iteration cycle. Teradata is working with partners, including SAS Institute Inc., to make this process straightforward, and is developing functions where appropriate (for example, functionality that takes advantage of the MapReduce paradigm).

  • "Needle in a haystack," "false negative," and "exceptional cases" searches

    Very rare events can be found (and defined) only against the background of the entire data set. For example, consider defining "elite baseball player" by looking only at the 2008 SF Giants, as opposed to every player in Major League Baseball history.

  • Statistical significance

    Reliable analytics may require using a large portion of the data, which cannot fit on a typical single database machine.

  • Model tuning

    The parameters of predictive models depend on aggregate statistics of the entire data set (for example, residuals from the mean).

  • No meaningful way to sample

    Some data structures cannot be sampled in a meaningful way. Sampling a graph, for example, is not straightforward, especially if you are interested in critical behavior that appears only when a certain threshold of connections is reached.

  • Larger data sets are just different

    The resulting analytics are ultimately applied to the entire data set in the cluster. Algorithms developed on smaller data sets may not scale to the full data set and may have to be redeveloped.
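
The "needle in a haystack" point can be illustrated with a small sketch. The following Python example is not Teradata code and does not use any Vantage function. The data set size, the number of rare rows, the 1% sampling rate, and the function names are all hypothetical, and the sketch assumes Bernoulli sampling (each row is kept independently with the same probability). Under that assumption, the chance that a sample contains none of k rare rows is (1 - f) ** k, where f is the sample fraction.

```python
import random


def prob_sample_misses_all(sample_fraction: float, rare_count: int) -> float:
    """Chance that Bernoulli sampling keeps none of the rare rows.

    Each row is kept independently with probability sample_fraction,
    so all rare_count rows are dropped with probability (1 - f) ** k.
    """
    return (1.0 - sample_fraction) ** rare_count


def count_rare_in_sample(rare_flags, sample_fraction, rng):
    """Bernoulli-sample the rows and count the rare ones that survive."""
    return sum(1 for is_rare in rare_flags
               if rng.random() < sample_fraction and is_rare)


# Hypothetical data set: 100,000 rows, 10 of which are rare events.
TOTAL_ROWS = 100_000
RARE_COUNT = 10
rare_flags = [i < RARE_COUNT for i in range(TOTAL_ROWS)]

# A full scan always sees every rare row.
full_scan_hits = sum(rare_flags)

# A 1% sample usually sees none of them.
rng = random.Random(0)  # fixed seed so the run is repeatable
sample_hits = count_rare_in_sample(rare_flags, 0.01, rng)
miss_probability = prob_sample_misses_all(0.01, RARE_COUNT)

print(f"full scan found {full_scan_hits} rare rows")
print(f"1% sample found {sample_hits} rare rows")
print(f"P(1% sample finds none) = {miss_probability:.3f}")
```

With these assumed numbers, the probability that a 1% sample contains none of the 10 rare rows is about 0.904, while the full scan finds all of them every time.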