TD_TextParser Function | TextParser | Teradata Vantage - TD_TextParser - Analytics Database

Database Analytic Functions

Deployment
VantageCloud
VantageCore
Edition
Enterprise
IntelliFlex
VMware
Product
Analytics Database
Release Number
17.20
Published
June 2022
ft:lastEdition
2025-04-01
Product Category
Teradata Vantage™

A text parser, also known as a text tokenizer, is a software component that breaks a text into its constituent parts, such as words, phrases, sentences, or other meaningful units. Text parsing is an important technique in natural language processing (NLP) and is used in a wide range of applications, from search engines and chatbots to email filters and data analysis tools.

In text analytics, a text parser is often used as the first step in processing text data to extract useful insights. By breaking the text into smaller units, a parser makes it easier to analyze the text and identify patterns, trends, and relationships among the data.

Text parsers can be simple or complex, depending on the type of text data being processed and the level of detail required for analysis. For example, a basic text parser might split a sentence into individual words, while a more advanced parser might recognize parts of speech, identify named entities, or recognize patterns in the text that suggest a particular sentiment or tone.

By breaking text into its constituent parts and analyzing its structure, text parsers enable a variety of tasks, from information extraction and sentiment analysis to machine translation and chatbot dialog generation. Overall, a text parser is a powerful tool for extracting structured information from unstructured or semi-structured text, making it easier for analysts and data scientists to work with large amounts of text and gain insights from it.

The TD_TextParser function performs the following operations:
  • Tokenizes text using single-character delimiter values or a PCRE regular expression as the token delimiter
  • Removes punctuation from the text and converts the text to lowercase
  • Removes stop words from the text and converts the remaining words to their root forms
  • Creates a row for each word in the output table
  • Performs stemming; that is, the function identifies the common root form of a word by removing or replacing word suffixes
  • Counts the occurrences of each token or stem
  • Obtains a comma-separated list of positions for each token occurrence
  • Outputs all parsed tokens in a single row
The stems resulting from stemming may not be actual words. For example, the stem for 'communicate' is 'commun' and the stem for 'early' is 'earli' (trailing 'y' is replaced by 'i').
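
The operations listed above can be illustrated outside the database with a minimal Python sketch. This is not the TD_TextParser implementation or its syntax. The function name `parse_text`, its parameters, the default whitespace delimiter, and the choice to number positions from 1 before stop words are dropped are all assumptions made for illustration. Stemming is left as a pluggable callable that defaults to no stemming, because the exact stemming rules are not described here.

```python
import re

# Anything that is not a word character or whitespace is treated as punctuation (assumption).
PUNCTUATION = re.compile(r"[^\w\s]")


def parse_text(text, delimiter=r"\s+", stop_words=(), stem=lambda word: word):
    """Return (token, count, positions) tuples for one piece of text.

    delimiter  -- regular expression used as the token delimiter (hypothetical default)
    stop_words -- iterable of lowercase words to drop
    stem       -- callable mapping a word to its stem (identity by default)
    """
    stop = set(stop_words)
    positions_by_token = {}
    position = 0
    for raw in re.split(delimiter, text):
        # Strip punctuation and lowercase each token after splitting.
        token = PUNCTUATION.sub("", raw).lower()
        if not token:
            continue
        position += 1  # 1-based position among non-empty tokens
        if token in stop:
            continue
        positions_by_token.setdefault(stem(token), []).append(position)
    # One output row per distinct token: count plus comma-separated positions.
    return [
        (token, len(positions), ",".join(str(p) for p in positions))
        for token, positions in positions_by_token.items()
    ]


rows = parse_text("The cat sat. The cat ran!", stop_words={"the"})
for row in rows:
    print(row)
# ('cat', 2, '2,5')
# ('sat', 1, '3')
# ('ran', 1, '6')
```

Each returned tuple corresponds to one output row per word. Joining the tokens into a single string instead would correspond to emitting all parsed tokens in a single row.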