$lang
$lang :string
The language currently being used, e.g. zh_CN, ja.
A Stochastic Finite-State Word-Segmenter.
This class contains necessary tools to segment terms from sentences.
Currently only Chinese is supported. To add a new language:
Add a switch case in the constructor.
Define the following functions:
isExceptionImpl: see the class function 'isException' for more information.
isPunctuationImpl: see the class function 'isPunctuation' for more information.
isNotCurrentLangImpl: see the class function 'notCurrentLang' for more information.
A Chinese example is provided in the constructor.
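The steps above might look roughly like the following. This is a minimal sketch, not the class's actual constructor: the class name SegmenterSketch, the accepted language tag, and the three regular-expression rules are assumptions made for illustration only.

```php
<?php
// Hypothetical sketch, not Yioop's actual class: the per-language
// callbacks are selected by a switch in the constructor.
class SegmenterSketch
{
    public $lang;
    public $isExceptionImpl;
    public $isPunctuationImpl;
    public $isNotCurrentLangImpl;

    public function __construct(string $lang)
    {
        $this->lang = $lang;
        switch ($lang) {
            case "zh_CN":
                // Assumed rule: terms made only of numeric characters
                // are exceptions and should not be indexed.
                $this->isExceptionImpl = function ($term) {
                    return preg_match('/^\p{N}+$/u', $term) === 1;
                };
                // Assumed rule: the term consists only of punctuation.
                $this->isPunctuationImpl = function ($term) {
                    return preg_match('/^\p{P}+$/u', $term) === 1;
                };
                // Assumed rule: no Han character means not Chinese.
                $this->isNotCurrentLangImpl = function ($term) {
                    return preg_match('/\p{Han}/u', $term) !== 1;
                };
                break;
            default:
                throw new InvalidArgumentException(
                    "Unsupported language: $lang");
        }
    }
}
```

Supporting another language in this sketch would mean adding one more case that assigns the same three callbacks.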
$cache_pct :\seekquarry\yioop\library\number
Percentage of entries to cache. The value should be between 0 and 1.0. Set it to a small number when running on memory-limited machines; the trade-off is reduced speed.
Comparison from a test of Chinese segmentation on the pku dataset:
cache_pct = 0: peak memory 26.288 MB, time 43.803 s
cache_pct = 1: peak memory 151.46 MB, time 1.540 s
cache_pct = 0.06 (default): peak memory 98.97 MB, time 5.094 s
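The source does not describe how the cache works internally. As a rough illustration of the memory/speed trade-off only, the sketch below keeps at most a cache_pct fraction of all entries in memory and falls back to a slower store for the rest. The class BoundedTermCache, its slow_reads counter, and the array standing in for the on-disk dictionary are all hypothetical.

```php
<?php
// Hypothetical sketch: keep at most $cache_pct of all dictionary
// entries in memory; everything else is re-read from a slower store.
class BoundedTermCache
{
    private $store;        // stands in for the on-disk dictionary
    private $cache = [];   // in-memory copies of some entries
    private $max_entries;  // upper bound derived from $cache_pct
    public $slow_reads = 0;

    public function __construct(array $store, float $cache_pct = 0.06)
    {
        if ($cache_pct < 0 || $cache_pct > 1.0) {
            throw new InvalidArgumentException(
                "cache_pct must be between 0 and 1.0");
        }
        $this->store = $store;
        $this->max_entries = (int) floor($cache_pct * count($store));
    }

    public function get(string $term)
    {
        if (array_key_exists($term, $this->cache)) {
            return $this->cache[$term]; // fast path
        }
        $this->slow_reads++; // slow path: go back to the store
        $value = $this->store[$term] ?? null;
        if ($value !== null && count($this->cache) < $this->max_entries) {
            $this->cache[$term] = $value;
        }
        return $value;
    }
}
```

With cache_pct set to 0 every lookup takes the slow path, which uses the least memory; with 1 every entry can be kept in memory after its first lookup.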
isException( $term): true
Checks whether the term passed in is an exception term. Not all valid terms should be indexed; e.g., there are infinitely many possible combinations of numbers.
isExceptionImpl should be defined in the constructor if needed.
$term | the string to be checked |
true if $term is an exception term, false otherwise
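One way to read "should be defined in the constructor if needed" is that the public check delegates to an optional callback. The following is a minimal sketch under that assumption; the class ExceptionCheckSketch and the numeric rule are hypothetical, and returning false when no callback is set is a guess rather than documented behavior.

```php
<?php
// Hypothetical sketch: a public check that delegates to an optional,
// language-specific callback.
class ExceptionCheckSketch
{
    public $isExceptionImpl = null;

    public function isException(string $term): bool
    {
        $impl = $this->isExceptionImpl;
        if (is_callable($impl)) {
            return (bool) $impl($term);
        }
        // Assumption: with no callback defined, nothing is an exception.
        return false;
    }
}

$checker = new ExceptionCheckSketch();
// Assumed rule for illustration: purely numeric terms are exceptions.
$checker->isExceptionImpl = function ($term) {
    return preg_match('/^\p{N}+$/u', $term) === 1;
};
$numeric_is_exception = $checker->isException("12345"); // true
$word_is_exception = $checker->isException("北京");      // false
```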
train(mixed $text_files,string $format = "default"): boolean
Generate a term dictionary file for later segmentation
mixed | $text_files | a file name (string) or an array of file names to train on; words in the files must be separated by spaces |
string | $format | currently only default and CTB are supported |
true on success
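The format of the generated dictionary file is not described here. As a rough idea of what training on space-separated text can involve, the sketch below only counts how often each term occurs. The function buildTermCounts is hypothetical and is not the train() method; it covers plain space-separated input only and does not handle the CTB format.

```php
<?php
// Hypothetical helper, not Yioop's train(): count how often each term
// occurs in files whose words are already separated by spaces.
function buildTermCounts($text_files): array
{
    // Accept either a single file name or an array of file names.
    $files = is_array($text_files) ? $text_files : [$text_files];
    $counts = [];
    foreach ($files as $file) {
        $lines = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
        if ($lines === false) {
            continue; // unreadable file: skipped in this sketch
        }
        foreach ($lines as $line) {
            $terms = preg_split('/\s+/u', $line, -1, PREG_SPLIT_NO_EMPTY);
            foreach ($terms as $term) {
                $counts[$term] = ($counts[$term] ?? 0) + 1;
            }
        }
    }
    return $counts;
}
```

A real dictionary would then be derived from these counts and written to disk for later segmentation.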
segmentFiles( $text_files,boolean $return_string = false): string
This function is used to segment a list of files
$text_files | can be a file name or a list of file names to be segmented |
boolean | $return_string | if true, return the segmented string; otherwise print it to stdout (the user can redirect the output to a file with > filename) |
segmented words separated by spaces, or true/false
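The segmentation algorithm itself is not spelled out in this reference. One common way to segment with a term dictionary is to choose the split that maximizes the product of unigram term probabilities, using dynamic programming. The sketch below illustrates that idea only and is not the class's implementation; the function segmentSketch, the maximum term length of 4 characters, and the penalty for unknown characters are assumptions.

```php
<?php
// Hypothetical sketch, not Yioop's algorithm: choose the split of a
// sentence that maximizes the product of unigram term probabilities.
function segmentSketch(string $sentence, array $counts): string
{
    $chars = preg_split('//u', $sentence, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($chars);
    $total = max(1, array_sum($counts));
    $max_len = 4; // assumed maximum term length in characters
    // $best[$i]: best log-probability of the first $i characters
    // $back[$i]: start index of the last term in that best split
    $best = array_fill(0, $n + 1, -INF);
    $back = array_fill(0, $n + 1, 0);
    $best[0] = 0.0;
    for ($i = 1; $i <= $n; $i++) {
        for ($j = max(0, $i - $max_len); $j < $i; $j++) {
            $term = implode("", array_slice($chars, $j, $i - $j));
            if (isset($counts[$term])) {
                $score = log($counts[$term] / $total);
            } elseif ($i - $j === 1) {
                // Unknown single character: assumed fixed penalty.
                $score = log(0.5 / $total);
            } else {
                continue; // unknown multi-character string: not a term
            }
            if ($best[$j] + $score > $best[$i]) {
                $best[$i] = $best[$j] + $score;
                $back[$i] = $j;
            }
        }
    }
    // Walk the back-pointers from the end to recover the terms.
    $terms = [];
    for ($i = $n; $i > 0; $i = $back[$i]) {
        $len = $i - $back[$i];
        array_unshift($terms, implode("", array_slice($chars, $back[$i], $len)));
    }
    return implode(" ", $terms);
}

$counts = ["我" => 2, "爱" => 2, "北京" => 1, "你" => 1];
$segmented = segmentSketch("我爱北京", $counts); // "我 爱 北京"
```

The known two-character term 北京 wins over two unknown single characters because its log-probability is higher than the sum of two penalties.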