- TextColumn
- Specify the name of the input table column that contains the text to tokenize.
- InputLanguage
- [Optional] Specify the language of the text in text_column:

  | Option | Description |
  |---|---|
  | 'en' (Default) | English |
  | 'zh_CN' | Simplified Chinese |
  | 'zh_TW' | Traditional Chinese |
  | 'jp' | Japanese |

- ModelFile
- [Optional] Specify the name of the model file that the function uses for tokenizing. The model must be a conditional random fields (CRF) model, and model_file must already be installed on the ML Engine. If you omit this argument, or if model_file is not installed, the function uses white space to separate English words and an embedded dictionary to tokenize Chinese text. If you specify InputLanguage('jp'), the function ignores this argument.
- OutputDelimiter
- [Optional] Specify the delimiter, a string, for separating tokens in the output.
- OutputByWord
- Specify whether to output one token in each row.
- Accumulate
- [Optional] Specify the names of the input table columns to copy to the output table.
- UserDictionaryFile
- [Optional] Specify the name of the user dictionary to use to correct results specified by the model. If you specify both this argument and a dictionary table (dict), the function uses the union of user_dictionary_file and dict as its dictionary. TextTokenizer Input describes the format of user_dictionary_file and dict. If the function finds more than one matched term, it selects the longest term for the first match.
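
The longest-term rule described for UserDictionaryFile can be pictured with a small Python sketch. This is not the function's implementation. It shows one common reading of the rule, greedy forward longest-match segmentation, and assumes the dictionary is a plain set of terms. The name `longest_match_tokenize`, the sample dictionary, and the `/` separator used for printing are all hypothetical choices made for this illustration.

```python
def longest_match_tokenize(text, dictionary):
    """Greedy forward longest-match segmentation (illustrative sketch only).

    `dictionary` is assumed to be a set of terms; this is not the
    TextTokenizer implementation, only one reading of the longest-term rule.
    """
    max_len = max((len(term) for term in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a term matches.
        # A single character is always accepted so the scan can advance.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical user dictionary with overlapping terms.
user_dictionary = {"北京", "北京大学", "大学", "学生"}
tokens = longest_match_tokenize("北京大学学生", user_dictionary)
print("/".join(tokens))  # prints: 北京大学/学生
```

At the first position both "北京" and "北京大学" match, and the sketch keeps the longer term, which is the behavior the last sentence of the UserDictionaryFile description suggests.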