Part 1 of the example template file declares three extractor classes (Defaul_Token, Begin_with_Uppercase, and com.asterdata.ner.SuffixExtractor) with serial numbers 0, 1, and 2, respectively. (Serial numbers must start at 0 and increase by 1.)
Defaul_Token and Begin_with_Uppercase are predefined extractor classes.
The third class, com.asterdata.ner.SuffixExtractor, is an example of a user-defined class. User-defined classes must be created in Java and must implement the Extractor interface, which is:
package com.asterdata.sqlmr.text_analysis.ner;

import java.io.Serializable;
import java.util.List;

/**
 * Implement this interface to define a
 * function that generates features from a sequence
 */
public interface Extractor extends Serializable
{
    /**
     * Extract the feature of a token.
     * @param sequence the token sequence
     * @param i the index of the current token in the sequence
     * @return the feature flag
     */
    String extract(List<String> sequence, int i);
}
The Java class SuffixExtractor in this example returns the last character of the current token. The code for this class is:
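package com.asterdata.ner;

import java.util.List;

import com.asterdata.sqlmr.text_analysis.ner.Extractor;

public class SuffixExtractor implements Extractor
{
    @Override
    public String extract(List<String> sequence, int i)
    {
        // The feature is the last character of the current token.
        String token = sequence.get(i);
        return token.substring(token.length() - 1);
    }
}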
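The class above assumes that every token has at least one character: for an empty string, token.substring(token.length() - 1) throws a StringIndexOutOfBoundsException. Whether the function can ever pass an empty token is not stated here, so the following is only a defensive sketch. The names SafeSuffixDemo and SafeSuffixExtractor, the nested stand-in Extractor interface (included only so that the example compiles on its own), and the choice to return an empty string for an empty token are all assumptions.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

public class SafeSuffixDemo {
    // Stand-in for com.asterdata.sqlmr.text_analysis.ner.Extractor,
    // declared locally only so this sketch compiles on its own.
    interface Extractor extends Serializable {
        String extract(List<String> sequence, int i);
    }

    // Hypothetical defensive variant of SuffixExtractor.
    static class SafeSuffixExtractor implements Extractor {
        @Override
        public String extract(List<String> sequence, int i) {
            String token = sequence.get(i);
            // Assumption: an empty token yields an empty feature
            // instead of an exception.
            if (token.isEmpty()) {
                return "";
            }
            return token.substring(token.length() - 1);
        }
    }

    public static void main(String[] args) {
        Extractor suffix = new SafeSuffixExtractor();
        List<String> tokens = Arrays.asList("More", "", ".");
        for (int i = 0; i < tokens.size(); i++) {
            // Brackets make the empty feature visible in the output.
            System.out.println("[" + suffix.extract(tokens, i) + "]");
        }
    }
}
```

For non-empty tokens this variant returns the same feature as SuffixExtractor, so it could be substituted without changing the features shown in this example.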
Suppose that the function applies the extractor classes in the example template file to the input text "More restaurants open in San Diego." For the token "More":
- Defaul_Token returns the token itself, "More".
- Begin_with_Uppercase returns "T" because the token begins with an uppercase letter.
- com.asterdata.ner.SuffixExtractor returns "e", the last character of the token.
This table shows the features that each extractor class returns for the entire input text:
Defaul_Token | Begin_with_Uppercase | com.asterdata.ner.SuffixExtractor |
---|---|---|
More | T | e |
restaurants | F | s |
open | F | n |
in | F | n |
San | T | n |
Diego | T | o |
. | F | . |
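
The table above can be reproduced with a short, self-contained sketch. It is not the function's actual implementation: the class ExtractorTableDemo, the method featureRows, and the three constants are hypothetical, the nested Extractor interface is a local stand-in for the real one, and the behavior of Defaul_Token and Begin_with_Uppercase is inferred only from the descriptions given for the token "More". The sketch also assumes that the input text has already been split into the seven tokens shown in the table.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExtractorTableDemo {
    // Stand-in for com.asterdata.sqlmr.text_analysis.ner.Extractor,
    // declared locally only so this sketch compiles on its own.
    interface Extractor extends Serializable {
        String extract(List<String> sequence, int i);
    }

    // Assumed behavior of Defaul_Token: return the token unchanged.
    static final Extractor DEFAULT_TOKEN = (sequence, i) -> sequence.get(i);

    // Assumed behavior of Begin_with_Uppercase: "T" if the first
    // character is an uppercase letter, otherwise "F".
    static final Extractor BEGIN_WITH_UPPERCASE = (sequence, i) -> {
        String token = sequence.get(i);
        boolean upper = !token.isEmpty() && Character.isUpperCase(token.charAt(0));
        return upper ? "T" : "F";
    };

    // Same logic as SuffixExtractor: the last character of the token.
    static final Extractor SUFFIX = (sequence, i) -> {
        String token = sequence.get(i);
        return token.substring(token.length() - 1);
    };

    // Applies the extractors in serial-number order (0, 1, 2) to every
    // token and returns one "a | b | c" row per token.
    static List<String> featureRows(List<String> tokens) {
        List<Extractor> extractors =
            Arrays.asList(DEFAULT_TOKEN, BEGIN_WITH_UPPERCASE, SUFFIX);
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            List<String> features = new ArrayList<>();
            for (Extractor extractor : extractors) {
                features.add(extractor.extract(tokens, i));
            }
            rows.add(String.join(" | ", features));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Assumption: the input text has already been tokenized.
        List<String> tokens = Arrays.asList(
            "More", "restaurants", "open", "in", "San", "Diego", ".");
        for (String row : featureRows(tokens)) {
            System.out.println(row);
        }
    }
}
```

Each extractor receives the whole token sequence together with the index of the current token, which matches the extract(sequence, i) signature of the interface. An extractor that needs context could therefore also inspect neighboring tokens.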