A time series is a collection of data observations made sequentially over time. Time series occur in medical, scientific, entertainment, and business domains.
Symbolic Aggregate Approximation (SAX) is an algorithm with low computational complexity that converts time series data into symbolic strings. Symbolic strings are easy to manipulate with functions such as Teradata Aster nPath and with hashing or regular-expression pattern-matching algorithms.
In data-mining tasks such as classification, clustering, and indexing, SAX performs comparably to more storage-intensive methods such as Discrete Wavelet Transform (DWT) and Discrete Fourier Transform (DFT).
SAX transforms a time series X of length n into a string of arbitrary length w, where w < n, using an alphabet A of size a > 2.
The SAX algorithm has two steps:
- Transform the original time series data into a piecewise aggregate approximation (PAA).
This transformation splits the time series data into w equal-sized intervals and represents each interval by the mean of its values, which reduces the dimensionality of the data. Each interval is then assigned one of a limited set of alphabetical symbols (letters) based on the data being examined. The symbol set is defined by dividing the range of observed values into regions, separated by thresholds derived from the normal distribution curve. Each region is represented by a symbol (a letter).
- Convert the PAA into a string of letters that represents the patterns occurring in the data over time.
The symbols that SAX creates correspond to time series features with equal probability, so they can be compared and used for further manipulation with reliable accuracy. Time series that are normalized to zero mean and unit variance (z-normalization) tend to follow the normal distribution. Using the properties of the Gaussian distribution, SAX selects equal-sized areas under the normal curve by looking up the coordinates of the cut lines that slice the area under the curve. In the context of the SAX algorithm, the x coordinates of these lines are called breakpoints.
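The two steps and the breakpoint selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the Teradata Aster implementation. The function names (`znormalize`, `paa`, `breakpoints`, `sax`) are hypothetical. The sketch assumes that n is evenly divisible by w, and it computes the breakpoints from the inverse normal CDF instead of reading them from a lookup table. It also assumes that a value falling exactly on a breakpoint belongs to the higher region.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev  # NormalDist needs Python 3.8+
import string


def znormalize(series):
    """Rescale a series to zero mean and unit variance."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        # A constant series has no shape; map every point to 0.
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def paa(series, w):
    """Step 1: replace each of w equal-sized intervals with its mean."""
    n = len(series)
    if w <= 0 or n % w != 0:
        raise ValueError("this sketch assumes len(series) is a multiple of w")
    size = n // w
    return [mean(series[i * size:(i + 1) * size]) for i in range(w)]


def breakpoints(a):
    """Return the a - 1 cut points that split the standard normal curve
    into a regions of equal area (probability 1/a each)."""
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf(k / a) for k in range(1, a)]


def sax(series, w, a):
    """Step 2: map each PAA value to the letter of the region it falls in."""
    if not 2 < a <= len(string.ascii_lowercase):
        raise ValueError("alphabet size a must satisfy 2 < a <= 26")
    cuts = breakpoints(a)
    segments = paa(znormalize(series), w)
    # bisect_right counts the breakpoints <= value, which is the region index.
    return "".join(string.ascii_lowercase[bisect_right(cuts, v)] for v in segments)


if __name__ == "__main__":
    # n = 8 observations reduced to a string of w = 4 letters over a = 4 symbols.
    print(sax([1, 2, 3, 4, 5, 6, 7, 8], w=4, a=4))  # prints "abcd"
```

For a steadily rising series the letters rise as well, which is the kind of pattern that string-oriented tools such as regular expressions can then match.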