Parsing English-language text involves the following steps:
- Punctuating sentences
- Breaking a sentence into words (tokenizing it)
- Removing stop words
- Stemming words (reducing them to their root forms)
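The steps above can be sketched in a few lines of Python. This is only an illustration, not the Text_Parser implementation. The first step is read here as splitting text at sentence-ending punctuation, which is an assumption. The stop-word list and every function name are hypothetical, and `toy_stem` is a deliberately crude suffix stripper standing in for a real stemming algorithm.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "over"}


def split_sentences(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Lowercase a sentence and return its alphabetic words."""
    return re.findall(r"[a-z]+", sentence.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(word):
    """Strip one common suffix. A stand-in only, NOT the Porter2 algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def parse(text):
    """Return one list of stemmed, non-stop-word tokens per sentence."""
    result = []
    for sentence in split_sentences(text):
        tokens = remove_stop_words(tokenize(sentence))
        result.append([toy_stem(t) for t in tokens])
    return result


if __name__ == "__main__":
    print(parse("The dogs are barking. A cat jumped over the wall!"))
```

A real stemmer applies many more rules than the four suffixes handled here, so the output of `toy_stem` will differ from a production stemmer on most vocabulary.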
The Text_Parser function reads a document into a memory buffer and creates a hash table. The dictionary for the document must fit in available memory. In practice this is rarely a constraint: a million-word dictionary with an average word length of ten bytes requires only about 10 MB for the words themselves.
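As a rough illustration of the memory figure, the following Python sketch builds a dictionary from an in-memory buffer and adds up the bytes taken by the distinct words. The function names are hypothetical, Python's `Counter` stands in for the hash table, and the byte count covers only the word text, not the per-entry overhead that any hash table adds.

```python
import re
from collections import Counter


def build_dictionary(buffer):
    """Map each distinct lowercase word in the buffer to its occurrence count."""
    return Counter(re.findall(r"[a-z]+", buffer.lower()))


def word_bytes(dictionary):
    """Bytes used by the distinct words alone (UTF-8), excluding table overhead."""
    return sum(len(word.encode("utf-8")) for word in dictionary)


# The figure quoted in the text: 1,000,000 words at 10 bytes each.
estimate_bytes = 1_000_000 * 10  # 10,000,000 bytes, i.e. 10 MB (decimal)

if __name__ == "__main__":
    dictionary = build_dictionary("the cat and the hat")
    print(dictionary["the"], word_bytes(dictionary), estimate_bytes)
```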
The Text_Parser function uses Porter2 as the stemming algorithm.
For general information about tokenization, see:
http://en.wikipedia.org/wiki/Lexical_analysis#Tokenizer
For general information about stemming, see: