For general information about tokenization, see http://en.wikipedia.org/wiki/Lexical_analysis#Tokenizer.
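As a quick illustration of the idea (a minimal sketch, not drawn from the linked article), a tokenizer splits a character stream into classified tokens. The token names and regexes below are invented for this example:

```python
import re

# Hypothetical token classes for a tiny arithmetic language.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),          # integer literals
    ("IDENT",  r"[A-Za-z_]\w*"), # identifiers
    ("OP",     r"[+\-*/=()]"),   # single-character operators
    ("SKIP",   r"\s+"),          # whitespace, discarded
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Yield (kind, value) pairs; raise on unrecognized characters."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"Unexpected character {text[pos]!r} at {pos}")
        pos = m.end()
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

print(list(tokenize("x = 42 + y")))
# [('IDENT', 'x'), ('OP', '='), ('NUMBER', '42'), ('OP', '+'), ('IDENT', 'y')]
```

Real tokenizers add positions (line/column) to each token for error reporting, but the core loop is the same: match the next token class at the current offset and advance.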