TextChunker Example: SentenceExtractor and POSTagger Output as Input

Machine Learning Engine Analytic Function Reference

Product

Teradata Vantage

Release Number

8.10

1.1

Published

October 2019

Language

English (United States)

Last Update

2019-12-31

dita:id

B700-4003

Product Category

Teradata Vantage™

Input

paragraphs_input
paraid	paratopic	paratext
1	Decision Trees	Decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the items target value. It is one of the predictive modeling approaches used in statistics, data mining and machine learning. Tree models where the target variable can take a finite set of values are called classification trees. In these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.
2	Simple Regression	In statistics, simple linear regression is the least squares estimator of a linear regression model with a single explanatory variable. In other words, simple linear regression fits a straight line through the set of n points in such a way that makes the sum of squared residuals of the model (that is, vertical distances between the points of the data set and the fitted line) as small as possible.
3	Logistic Regression	Logistic regression was developed by statistician David Cox in 1958[2][3] (although much work was done in the single independent variable case almost two decades earlier). The binary logistic model is used to estimate the probability of a binary response based on one or more predictor (or independent) variables (features). As such it is not a classification method. It could be called a qualitative response/discrete choice model in the terminology of economics.
4	Cluster analysis	Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. Cluster analysis itself is not one specific algorithm, but the general task to solve. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them.
5	Association rule learning	Association rule learning is a method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using different measures of interestingness. Based on the concept of strong rules, Rakesh Agrawal et al.[2] introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule {onions, potatoes} => {burger} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat.

SQL Call

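TextChunker requires each sentence to have a unique identifier, and the input to TextChunker must be partitioned by that identifier. This call builds the identifier, sentence_id, as paraid*1000+sentence_sn, which stays unique as long as every sentence_sn value is below 1000.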

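-- Outer call: group the tagged words of each sentence into chunks (NP, VP, PP, and so on).
SELECT * FROM TextChunker (
  ON (
    -- Middle call: tag each word of each sentence with its part of speech.
    SELECT * FROM POSTagger (
      ON (
        -- Inner call: split each paragraph into sentences and derive sentence_id.
        SELECT paraid*1000+sentence_sn AS sentence_id, sentence FROM SentenceExtractor (
          ON paragraphs_input
          USING
          TextColumn ('paratext')
          Accumulate ('paraid')
        ) AS dt1
      )
      USING
      TextColumn ('sentence')
      Accumulate ('sentence_id')
    ) AS dt2
  ) PARTITION BY sentence_id ORDER BY word_sn
  USING
  WordColumn('word')
  POSColumn('pos_tag')
) AS dt;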
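
Because TextChunker depends on sentence_id being unique, it can help to check the derived identifier before running the full call. The query below is a minimal sketch of such a check, not part of the documented example: it reuses the SentenceExtractor call from the query above unchanged, and the alias ids and the column name occurrences are names chosen here for illustration. It assumes that the function call can be nested inside a derived table in the same way the call above nests it.

```sql
-- Sketch: list any sentence_id value that is produced more than once.
-- An empty result means every sentence has its own identifier.
SELECT sentence_id, COUNT(*) AS occurrences
FROM (
  -- Same SentenceExtractor call and sentence_id expression as in the example above.
  SELECT paraid*1000+sentence_sn AS sentence_id
  FROM SentenceExtractor (
    ON paragraphs_input
    USING
    TextColumn ('paratext')
    Accumulate ('paraid')
  ) AS dt1
) AS ids  -- "ids" is an illustrative alias
GROUP BY sentence_id
HAVING COUNT(*) > 1;
```

If this query returns rows, two sentences share an identifier, and a different expression for sentence_id (for example, a larger multiplier than 1000) would be needed before partitioning the TextChunker input.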

Output

 partition_key chunk_sn chunk                                                                                                chunk_tag 
 ------------- -------- ---------------------------------------------------------------------------------------------------- --------- 
          1001        1 decision tree learning                                                                               NP       
          1001        2 uses                                                                                                 VP       
          1001        3 a decision tree                                                                                      NP       
          1001        4 as                                                                                                   PP       
          1001        5 a predictive model                                                                                   NP       
          1001        6 which                                                                                                NP       
          1001        7 maps                                                                                                 VP       
          1001        8 observations                                                                                         NP       
          1001        9 about                                                                                                PP       
          1001       10 an item                                                                                              NP       
          1001       11 to                                                                                                   PP       
          1001       12 conclusions                                                                                          NP       
          1001       13 about                                                                                                PP       
          1001       14 the items target value                                                                               NP       
          1001       15 .                                                                                                    O        
          1001       16 it                                                                                                   NP       
          1001       17 is                                                                                                   VP       
          1001       18 one                                                                                                  NP       
          1001       19 of                                                                                                   PP       
          1001       20 the predictive modelling approaches                                                                  NP       
          1001       21 used                                                                                                 VP       
          1001       22 in                                                                                                   PP       
          1001       23 statistics , data mining and machine learning . tree models                                          NP       
          1001       24 where                                                                                                ADVP     
          1001       25 the target variable                                                                                  NP       
          1001       26 can take                                                                                             VP       
          1001       27 a finite set                                                                                         NP       
          1001       28 of                                                                                                   PP       
          1001       29 values                                                                                               NP       
          1001       30 are called                                                                                           VP       
          1001       31 classification trees                                                                                 NP       
          1001       32 .                                                                                                    O        
          1001       33 in                                                                                                   PP       
          1001       34 these tree structures                                                                                NP       
          1001       35 ,                                                                                                    O        
          1001       36 leaves                                                                                               VP       
          1001       37 represent class labels and branches                                                                  NP       
          1001       38 represent                                                                                            VP       
          1001       39 conjunctions                                                                                         NP       
          1001       40 of                                                                                                   PP       
          1001       41 features                                                                                             NP       
          1001       42 that                                                                                                 NP       
          1001       43 lead                                                                                                 VP       
          1001       44 to                                                                                                   PP       
          1001       45 those class labels . decision trees                                                                  NP       
          1001       46 where                                                                                                ADVP     
          1001       47 the target variable                                                                                  NP       
          1001       48 can take                                                                                             VP       
          1001       49 continuous values                                                                                    NP       
          1001       50 ( typically real numbers                                                                             NP       
          1001       51 )                                                                                                    NP       
          1001       52 are called                                                                                           VP       
          1001       53 regression trees                                                                                     NP       
          1001       54 .                                                                                                    O        
          2001        1 in                                                                                                   PP       
          2001        2 statistics                                                                                           NP       
          2001        3 ,                                                                                                    O        
          2001        4 simple linear regression                                                                             NP       
          2001        5 is                                                                                                   VP       
          2001        6 the least squares estimator                                                                          NP       
          2001        7 of                                                                                                   PP       
          2001        8 a linear regression model                                                                            NP       
          2001        9 with                                                                                                 PP       
          2001       10 a single explanatory variable .                                                                      NP       
          2001       11 in                                                                                                   PP       
          2001       12 other words                                                                                          NP       
          2001       13 ,                                                                                                    O        
          2001       14 simple linear regression                                                                             NP       
          2001       15 fits                                                                                                 VP       
          2001       16 a straight line                                                                                      NP       
          2001       17 through                                                                                              PP       
          2001       18 the set                                                                                              NP       
          2001       19 of                                                                                                   PP       
          2001       20 n points                                                                                             NP       
          2001       21 in                                                                                                   PP       
          2001       22 such a way                                                                                           NP       
          2001       23 that                                                                                                 NP       
          2001       24 makes                                                                                                VP       
          2001       25 the sum                                                                                              NP       
          2001       26 of                                                                                                   PP       
          2001       27 squared residuals                                                                                    NP       
          2001       28 of                                                                                                   PP       
          2001       29 the model (                                                                                          NP       
          2001       30 that                                                                                                 NP       
          2001       31 is                                                                                                   VP       
          2001       32 , vertical distances                                                                                 NP       
          2001       33 between                                                                                              PP       
          2001       34 the points                                                                                           NP       
          2001       35 of                                                                                                   PP       
          2001       36 the data                                                                                             NP       
          2001       37 set                                                                                                  VP       
          2001       38 and                                                                                                  O        
          2001       39 the fitted line                                                                                      NP       
          2001       40 )                                                                                                    VP       
          2001       41 as small                                                                                             ADJP     
          2001       42 as                                                                                                   PP       
          2001       43 possible                                                                                             ADJP     
          2001       44 .                                                                                                    O        
          3001        1 logistic regression                                                                                  NP       
          3001        2 was developed                                                                                        VP       
          3001        3 by                                                                                                   PP       
          3001        4 statistician david cox                                                                               NP       
          3001        5 in                                                                                                   PP       
          3001        6 1958[2][3](although much work                                                                        NP       
          3001        7 was done                                                                                             VP       
          3001        8 in                                                                                                   PP       
          3001        9 the single independent variable case                                                                 NP       
          3001       10 almost                                                                                               ADVP     
          3001       11 two decades                                                                                          NP       
          3001       12 earlier)                                                                                             VP       
          3001       13 .                                                                                                    O        
          3001       14 the binary logistic model                                                                            NP       
          3001       15 is used to estimate                                                                                  VP       
          3001       16 the probability                                                                                      NP       
          3001       17 of                                                                                                   PP       
          3001       18 a binary response                                                                                    NP       
          3001       19 based                                                                                                VP       
          3001       20 on                                                                                                   PP       
          3001       21 one or more predictor ( or independent ) variables ( features) .                                     NP       
          3001       22 as                                                                                                   PP       
          3001       23 such                                                                                                 ADJP     
          3001       24 it                                                                                                   NP       
          3001       25 is                                                                                                   VP       
          3001       26 not                                                                                                  O        
          3001       27 a classification method                                                                              NP       
          3001       28 .                                                                                                    VP       
          3001       29 it                                                                                                   NP       
          3001       30 could be called                                                                                      VP       
          3001       31 a qualitative response/discrete choice model                                                         NP       
          3001       32 in                                                                                                   PP       
          3001       33 the terminology                                                                                      NP       
          3001       34 of                                                                                                   PP       
          3001       35 economics                                                                                            NP       
          3001       36 .                                                                                                    O        
          4001        1 cluster analysis or clustering                                                                       NP       
          4001        2 is                                                                                                   VP       
          4001        3 the task                                                                                             NP       
          4001        4 of                                                                                                   PP       
          4001        5 grouping                                                                                             VP       
          4001        6 a set                                                                                                NP       
          4001        7 of                                                                                                   PP       
          4001        8 objects                                                                                              NP       
          4001        9 in                                                                                                   PP       
          4001       10 such a way                                                                                           NP       
          4001       11 that                                                                                                 NP       
          4001       12 objects                                                                                              VP       
          4001       13 in                                                                                                   PP       
          4001       14 the same group                                                                                       NP       
          4001       15 ( called                                                                                             VP       
          4001       16 a cluster )                                                                                          NP       
          4001       17 are                                                                                                  VP       
          4001       18 more similar                                                                                         ADJP     
          4001       19 (                                                                                                    O        
          4001       20 in                                                                                                   PP       
          4001       21 some sense                                                                                           NP       
          4001       22 or                                                                                                   O        
          4001       23 another )                                                                                            NP       
          4001       24 to                                                                                                   PP       
          4001       25 each other                                                                                           NP       
          4001       26 than                                                                                                 PP       
          4001       27 to                                                                                                   PP       
          4001       28 those                                                                                                NP       
          4001       29 in                                                                                                   PP       
          4001       30 other groups                                                                                         NP       
          4001       31 ( clusters)                                                                                          NP       
          4001       32 .                                                                                                    O        
          4001       33 it                                                                                                   NP       
          4001       34 is                                                                                                   VP       
          4001       35 a main task                                                                                          NP       
          4001       36 of                                                                                                   PP       
          4001       37 exploratory data mining                                                                              NP       
          4001       38 ,                                                                                                    O        
          4001       39 and                                                                                                  O        
          4001       40 a common technique                                                                                   NP       
          4001       41 for                                                                                                  PP       
          4001       42 statistical data analysis                                                                            NP       
          4001       43 , used                                                                                               VP       
          4001       44 in                                                                                                   PP       
          4001       45 many fields                                                                                          NP       
          4001       46 ,                                                                                                    O        
          4001       47 including                                                                                            PP       
          4001       48 machine learning                                                                                     NP       
          4001       49 ,                                                                                                    O        
          4001       50 pattern recognition , image analysis , information retrieval , and bioinformatics . cluster analysis NP       
          4001       51 itself                                                                                               NP       
          4001       52 is                                                                                                   VP       
          4001       53 not                                                                                                  O        
          4001       54 one specific algorithm                                                                               NP       
          4001       55 ,                                                                                                    O        
          4001       56 but                                                                                                  O        
          4001       57 the general task                                                                                     NP       
          4001       58 to be solved                                                                                         VP       
          4001       59 .                                                                                                    O        
          4001       60 it                                                                                                   NP       
          4001       61 can be achieved                                                                                      VP       
          4001       62 by                                                                                                   PP       
          4001       63 various algorithms                                                                                   NP       
          4001       64 that                                                                                                 NP       
          4001       65 differ                                                                                               VP       
          4001       66 significantly                                                                                        ADVP     
          4001       67 in                                                                                                   PP       
          4001       68 their notion                                                                                         NP       
          4001       69 of                                                                                                   PP       
          4001       70 what                                                                                                 NP       
          4001       71 constitutes                                                                                          VP       
          4001       72 a cluster                                                                                            NP       
          4001       73 and                                                                                                  O        
          4001       74 how                                                                                                  ADVP     
          4001       75 to efficiently find                                                                                  VP       
          4001       76 them                                                                                                 NP       
          4001       77 .                                                                                                    O        
          5001        1 association rule learning                                                                            NP       
          5001        2 is                                                                                                   VP       
          5001        3 a method                                                                                             NP       
          5001        4 for                                                                                                  PP       
          5001        5 discovering                                                                                          VP       
          5001        6 interesting relations                                                                                NP       
          5001        7 between                                                                                              PP       
          5001        8 variables                                                                                            NP       
          5001        9 in                                                                                                   PP       
          5001       10 large databases                                                                                      NP       
          5001       11 .                                                                                                    O        
          5001       12 it                                                                                                   NP       
          5001       13 is intended to identify                                                                              VP       
          5001       14 strong rules                                                                                         NP       
          5001       15 discovered                                                                                           VP       
          5001       16 in                                                                                                   PP       
          5001       17 databases                                                                                            NP       
          5001       18 using                                                                                                VP       
          5001       19 different measures                                                                                   NP       
          5001       20 of                                                                                                   PP       
          5001       21 interestingness                                                                                      NP       
          5001       22 . based                                                                                              VP       
          5001       23 on                                                                                                   PP       
          5001       24 the concept                                                                                          NP       
          5001       25 of                                                                                                   PP       
          5001       26 strong rules                                                                                         NP       
          5001       27 ,                                                                                                    O        
          5001       28 rakesh agrawal et al.[2 ] introduced association rules                                               NP       
          5001       29 for                                                                                                  PP       
          5001       30 discovering regularities                                                                             NP       
          5001       31 between                                                                                              PP       
          5001       32 products                                                                                             NP       
          5001       33 in                                                                                                   PP       
          5001       34 large-scale transaction data                                                                         NP       
          5001       35 recorded                                                                                             VP       
          5001       36 by                                                                                                   PP       
          5001       37 point-of-sale ( pos ) systems                                                                        NP       
          5001       38 in                                                                                                   PP       
          5001       39 supermarkets                                                                                         NP       
          5001       40 .                                                                                                    O        
          5001       41 for                                                                                                  PP       
          5001       42 example                                                                                              NP       
          5001       43 ,                                                                                                    O        
          5001       44 the rule { onions , potatoes}=>{burger                                                               NP       
          5001       45 } found                                                                                              VP       
          5001       46 in                                                                                                   PP       
          5001       47 the sales data                                                                                       NP       
          5001       48 of                                                                                                   PP       
          5001       49 a supermarket                                                                                        NP       
          5001       50 would indicate                                                                                       VP       
          5001       51 that                                                                                                 SBAR     
          5001       52 if                                                                                                   SBAR     
          5001       53 a customer                                                                                           NP       
          5001       54 buys                                                                                                 VP       
          5001       55 onions                                                                                               NP       
          5001       56 and                                                                                                  O        
          5001       57 potatoes                                                                                             VP       
          5001       58 together                                                                                             ADVP     
          5001       59 ,                                                                                                    O        
          5001       60 they                                                                                                 NP       
          5001       61 are                                                                                                  VP       
          5001       62 likely                                                                                               ADJP     
          5001       63 to also buy                                                                                          VP       
          5001       64 hamburger meat                                                                                       NP       
          5001       65 .                                                                                                    O
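
The chunk tags in the last column follow the usual chunking labels (NP for noun phrase, VP for verb phrase, PP for prepositional phrase, O for a token outside any chunk). As a minimal sketch of how such output could be summarized after export, the following Python snippet parses the first eleven rows of partition 5001, copied from the output above, then counts the tags and collects the noun phrases. The dictionary keys (partition_key, chunk_id, chunk, chunk_tag) and the parse_row helper are hypothetical names chosen for this illustration, not names confirmed by this page.

```python
from collections import Counter

# Rows copied from the output above (partition 5001, chunks 1-11).
ROWS = """\
5001        1 association rule learning    NP
5001        2 is                           VP
5001        3 a method                     NP
5001        4 for                          PP
5001        5 discovering                  VP
5001        6 interesting relations        NP
5001        7 between                      PP
5001        8 variables                    NP
5001        9 in                           PP
5001       10 large databases              NP
5001       11 .                            O
"""


def parse_row(line):
    """Split one whitespace-aligned row into its four fields.

    Assumes the first two fields are integers and the last field is the
    tag; everything in between is the chunk text. Key names are
    hypothetical labels for this sketch.
    """
    parts = line.split()
    return {
        "partition_key": int(parts[0]),
        "chunk_id": int(parts[1]),
        "chunk": " ".join(parts[2:-1]),
        "chunk_tag": parts[-1],
    }


rows = [parse_row(line) for line in ROWS.splitlines() if line.strip()]

# Count how often each chunk tag occurs in the sample.
tag_counts = Counter(row["chunk_tag"] for row in rows)

# Collect the noun-phrase chunks in their original order.
noun_phrases = [row["chunk"] for row in rows if row["chunk_tag"] == "NP"]

print(dict(tag_counts))
print(noun_phrases)
```

Splitting on whitespace and rejoining the middle fields keeps multi-word chunks intact, but it also collapses any repeated spaces inside a chunk, which is acceptable for a summary like this one.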

To download a zip file of all examples, along with a SQL script file that creates their input tables, use the attachment in the left sidebar.