\seekquarry\yioop\library\StochasticTermSegmenter

A Stochastic Finite-State Word-Segmenter.

This class contains the tools needed to segment sentences into terms.

Currently only Chinese is supported. To add a new language:

Add a switch case in the constructor.
Define the following functions:
isExceptionImpl: see the class function 'isException' for more information
isPunctuationImpl: see the class function 'isPunctuation' for more information
isNotCurrentLangImpl: see the class function 'notCurrentLang' for more information

A Chinese example is provided in the constructor.
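The steps for adding a language can be sketched as follows. This is a minimal illustration and not the Yioop source: the class name SketchSegmenter, the private $impl array, the regular expressions, and the fallback to false for languages without callbacks are all assumptions made for the example.

```php
<?php
// Hypothetical sketch, not the Yioop source. SketchSegmenter, $impl and the
// regular expressions below are illustrative assumptions.
class SketchSegmenter
{
    /** @var string locale tag, e.g. "zh_CN" */
    public $lang;
    /** @var array language-specific callbacks keyed by name */
    private $impl = [];

    public function __construct($lang)
    {
        $this->lang = $lang;
        switch ($lang) {
            case "zh_CN":
                // Exception terms: here, terms made up only of number characters
                $this->impl["isExceptionImpl"] = function ($term) {
                    return preg_match('/^\p{N}+$/u', $term) === 1;
                };
                $this->impl["isPunctuationImpl"] = function ($term) {
                    return preg_match('/^\p{P}+$/u', $term) === 1;
                };
                // True when no character of the term is a Han character
                $this->impl["isNotCurrentLangImpl"] = function ($term) {
                    return preg_match('/^[^\p{Han}]+$/u', $term) === 1;
                };
                break;
            // A new language would add its own case here.
        }
    }

    public function isException($term)
    {
        return $this->check("isExceptionImpl", $term);
    }

    public function isPunctuation($term)
    {
        return $this->check("isPunctuationImpl", $term);
    }

    public function notCurrentLang($term)
    {
        return $this->check("isNotCurrentLangImpl", $term);
    }

    // Assumption: a language that defines no callback answers false
    private function check($name, $term)
    {
        return isset($this->impl[$name]) ? ($this->impl[$name])($term) : false;
    }
}

$zh = new SketchSegmenter("zh_CN");
var_dump($zh->isException("2024"));     // bool(true)
var_dump($zh->notCurrentLang("hello")); // bool(true)
var_dump($zh->notCurrentLang("中文"));   // bool(false)
```

Keeping the language-specific checks behind three small callbacks means the segmentation logic itself does not need to change when a language is added.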

Summary

Public methods: __construct(), __call(), __get(), __set(), isException(), isPunctuation(), notCurrentLang(), train(), segmentFiles(), segmentText(), segmentSentence()
Public properties: $lang, $non_char_preg, $unknown_term_score, $dictionary_file
Constants: none
Protected methods: none
Protected properties: none
Private methods: getScore(), add()
Private properties: $cache_pct, $cache

File: src/library/StochasticTermSegmenter.php
Package: Default
Class hierarchy: \seekquarry\yioop\library\StochasticTermSegmenter

Properties

$lang

$lang :string

The language currently being used, e.g., zh_CN or ja

Type

string

$non_char_preg

$non_char_preg :string

Regular expression used to determine whether none of the characters in a term belong to the current language. Recommended expressions:
Chinese: \p{Han}
Japanese: \x{4E00}-\x{9FBF}\x{3040}-\x{309F}\x{30A0}-\x{30FF}
Korean: \x{3130}-\x{318F}\x{AC00}-\x{D7AF}

Type

string

$unknown_term_score

$unknown_term_score :float

Default score for any unknown term

Type

float

$dictionary_file

$dictionary_file :array

A dictionary file that contains the statistical information of the terms

Type

array

$cache_pct

$cache_pct :\seekquarry\yioop\library\number

Fraction of entries to cache. The value should be between 0 and 1.0. Set it to a small number when running on memory-limited machines.

As a general comparison between settings of 0 and 1: in a Chinese segmentation test on the pku dataset, peak memory usage was 26.288 MB vs. 151.46 MB. The trade-off is speed: in the same test, the run time was 43.803 s vs. 1.540 s.

The default value is 0.06, for which the run time and peak memory were 5.094 s and 98.97 MB.

Type

\seekquarry\yioop\library\number—from 0 - 1.0
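To make the trade-off concrete, the figures quoted for the pku test can be turned into ratios. The short calculation below uses only those figures; the array layout, the labels none, default and full, and the helper compare() are hypothetical.

```php
<?php
// Figures quoted in the $cache_pct description for the pku test.
// "none" is $cache_pct = 0, "default" is 0.06, "full" is 1.
$runs = [
    "none" => ["seconds" => 43.803, "peak_mb" => 26.288],
    "default" => ["seconds" => 5.094, "peak_mb" => 98.97],
    "full" => ["seconds" => 1.540, "peak_mb" => 151.46],
];

// How much faster, and how much more memory, $to needs relative to $from
function compare(array $from, array $to)
{
    return [
        "speedup" => $from["seconds"] / $to["seconds"],
        "memory_ratio" => $to["peak_mb"] / $from["peak_mb"],
    ];
}

foreach (["default", "full"] as $setting) {
    $ratio = compare($runs["none"], $runs[$setting]);
    printf("%s: %.1fx faster, %.1fx peak memory\n", $setting,
        $ratio["speedup"], $ratio["memory_ratio"]);
}
// default: 8.6x faster, 3.8x peak memory
// full: 28.4x faster, 5.8x peak memory
```

On these numbers the default setting recovers most of the speed of a full cache while using about two thirds of its peak memory.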

$cache

$cache :array

Cache holding runtime data for the segmentation

Type

array

Methods

__construct()

__construct(string  $lang,  $cache_pct = 0.06)

Constructs an instance of this class, used for segmenting strings into the words of a locale using a probabilistic approach to evaluate candidate segmentations.

Parameters

string	$lang	a string indicating the language
	$cache_pct	fraction of entries to cache, between 0 and 1.0 (see the $cache_pct property)

__call()

__call(string  $method,array  $args): mixed

__call for calling dynamic methods

Parameters

string	$method	method of this class to call
array	$args	arguments to pass to method

Returns

mixed —

result of method calculation

__get()

__get(string  $var_name): mixed

__get for getting dynamic variables

Parameters

string	$var_name	variable to retrieve

Returns

mixed —

result of retrieval

__set()

__set(string  $var_name,mixed  $value)

__set for assigning dynamic variables

Parameters

string	$var_name	variable to assign
mixed	$value	value to assign to it
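The three magic methods above are PHP's standard hooks for undefined properties and methods. The sketch below shows one common way they fit together, storing closures as dynamic variables and dispatching calls to them. The class DynamicHolder and its $vars array are hypothetical and are not taken from the Yioop source.

```php
<?php
// Hypothetical sketch: DynamicHolder is not a Yioop class.
class DynamicHolder
{
    /** @var array dynamically assigned variables, including closures */
    private $vars = [];

    // Called when an inaccessible or undefined property is assigned
    public function __set($var_name, $value)
    {
        $this->vars[$var_name] = $value;
    }

    // Called when an inaccessible or undefined property is read
    public function __get($var_name)
    {
        return $this->vars[$var_name] ?? null;
    }

    // Called when an undefined method is invoked; runs a stored closure
    public function __call($method, $args)
    {
        $callable = $this->vars[$method] ?? null;
        if (!is_callable($callable)) {
            throw new BadMethodCallException("Undefined method: $method");
        }
        return call_user_func_array($callable, $args);
    }
}

$holder = new DynamicHolder();
$holder->lang = "zh_CN";                            // goes through __set
$holder->isPunctuationImpl = function ($term) {
    return preg_match('/^\p{P}+$/u', $term) === 1;
};
var_dump($holder->lang);                            // string(5) "zh_CN"
var_dump($holder->isPunctuationImpl("!"));          // bool(true), via __call
```

Routing assignments through __set also avoids creating undeclared properties directly on the object.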

isException()

isException(  $term): true

Checks whether the term passed in is an exception term. Not all valid terms should be indexed.

For example, there are infinitely many combinations of digits. isExceptionImpl should be defined in the constructor if needed.

Parameters

	$term	the string to be checked

Returns

true —

if $term is an exception term, false otherwise

isPunctuation()

isPunctuation(  $term): true

Checks whether the term passed in is punctuation. isPunctuationImpl should be defined in the constructor if needed.

Parameters

	$term	the string to be checked

Returns

true —

if $term is punctuation, false otherwise

notCurrentLang()

notCurrentLang(  $term): boolean

Checks whether none of the characters in the term belong to the current language

Parameters

	$term	the string to be checked

Returns

boolean —

true if none of the characters in $term belong to the current language, false otherwise

train()

train(mixed  $text_files,string  $format = "default"): boolean

Generate a term dictionary file for later segmentation

Parameters

mixed	$text_files	a file name or an array of file names to train on; words in the files must be separated by spaces
string	$format	currently only "default" and "CTB" are supported

Returns

boolean —

true on success
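Since the training input consists of words already separated by spaces, the core of dictionary building can be pictured as counting term frequencies. The sketch below is an assumption about that idea only: countTerms() is a hypothetical helper, it works on in-memory lines instead of files, and it covers neither the on-disk dictionary format nor the CTB input format.

```php
<?php
// Hypothetical helper: counts how often each space-separated term occurs.
function countTerms(array $lines)
{
    $counts = [];
    foreach ($lines as $line) {
        // Split on any run of whitespace and drop empty pieces
        $terms = preg_split('/\s+/u', $line, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($terms as $term) {
            $counts[$term] = ($counts[$term] ?? 0) + 1;
        }
    }
    return $counts;
}

// Two pre-segmented training sentences (made-up sample data)
$corpus = ["我 喜欢 北京", "北京 很 大"];
$dictionary = countTerms($corpus);
print_r($dictionary);
```

In this sample the term 北京 is counted twice and every other term once.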

segmentFiles()

segmentFiles(  $text_files,boolean  $return_string = false): string

This function is used to segment a list of files

Parameters

	$text_files	a file name or a list of file names to be segmented
boolean	$return_string	if true, return the segmented string; otherwise print it to stdout (the user can redirect it to a file with > filename)

Returns

string —

the segmented words separated by spaces, or true/false

segmentText()

segmentText(string  $text,boolean  $return_string = false): string

Segments text. Words are separated by spaces.

Parameters

string	$text	the text to be segmented
boolean	$return_string	if true, return the segmented string; otherwise print it

Returns

string —

the segmented words separated by spaces, or true/false
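The effect of $return_string can be illustrated with a small stand-in. In this sketch the function segmentTextSketch() and the per-character callback are hypothetical, and returning true after printing is an assumption based on the "or true/false" wording of the return description.

```php
<?php
// Hypothetical stand-in for the return/print behaviour; not the Yioop code.
function segmentTextSketch($text, callable $segment_sentence,
    $return_string = false)
{
    $lines = [];
    // \R matches any newline sequence, so each line is segmented on its own
    foreach (preg_split('/\R/u', $text) as $line) {
        $lines[] = implode(" ", $segment_sentence($line));
    }
    $result = implode("\n", $lines);
    if ($return_string) {
        return $result;
    }
    echo $result, "\n";
    return true; // assumption: signal success when the text was printed
}

// Trivial segmenter used only for the demo: one word per character
$by_char = function ($sentence) {
    return preg_split('//u', $sentence, -1, PREG_SPLIT_NO_EMPTY);
};

$segmented = segmentTextSketch("你好\n世界", $by_char, true);
echo $segmented, "\n";
```

With $return_string left at false the same text is written to standard output, which is convenient for command-line use with output redirection.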

segmentSentence()

segmentSentence(string  $sentence): array

Segments a sentence into an array of words.

The sentence must NOT contain any newline characters.

Parameters

string	$sentence	a string without newlines to be segmented

Returns

array —

of segmented words
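A probabilistic segmenter of this kind is often implemented as a dynamic program that picks the lowest-cost split of the sentence. The sketch below is one such formulation and is not claimed to be the algorithm used by this class: the function segmentSketch(), the cost of minus the log of the relative frequency, the fixed cost for unknown single characters, the maximum word length, and the sample frequencies are all assumptions.

```php
<?php
// Hypothetical sketch of lowest-cost segmentation; not the Yioop algorithm.
function segmentSketch($sentence, array $freq, $unknown_score = 20.0,
    $max_len = 4)
{
    // Split into UTF-8 characters without needing the mbstring extension
    $chars = preg_split('//u', $sentence, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($chars);
    $total = array_sum($freq);
    // $best[$i]: lowest total cost of segmenting the first $i characters
    $best = array_fill(0, $n + 1, INF);
    // $back[$i]: start position of the last word in that best segmentation
    $back = array_fill(0, $n + 1, 0);
    $best[0] = 0.0;
    for ($end = 1; $end <= $n; $end++) {
        for ($start = max(0, $end - $max_len); $start < $end; $start++) {
            $term = implode("", array_slice($chars, $start, $end - $start));
            if (isset($freq[$term])) {
                $cost = -log($freq[$term] / $total); // frequent terms are cheap
            } elseif ($end - $start == 1) {
                $cost = $unknown_score; // unknown single character
            } else {
                continue; // unknown multi-character strings are not candidates
            }
            if ($best[$start] + $cost < $best[$end]) {
                $best[$end] = $best[$start] + $cost;
                $back[$end] = $start;
            }
        }
    }
    // Walk the back-pointers from the end to recover the words in order
    $words = [];
    for ($end = $n; $end > 0; $end = $back[$end]) {
        $length = $end - $back[$end];
        array_unshift($words,
            implode("", array_slice($chars, $back[$end], $length)));
    }
    return $words;
}

// Made-up frequencies for the demo
$freq = ["北京" => 5, "大学" => 4, "北京大学" => 3, "生" => 2, "大学生" => 3];
print_r(segmentSketch("北京大学生", $freq)); // 北京, 大学生
```

With these sample frequencies the split 北京 / 大学生 costs about 2.96 while 北京大学 / 生 costs about 3.87, so the first is chosen even though 北京大学 is itself a dictionary term.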

getScore()

getScore(integer  $frequency): float

Calculates the score of a word from its frequency

Parameters

integer	$frequency	the frequency of a word

Returns

float —

the score of the term.

add()

add(string  $key,string  $value,array  $array)

Adds a term to the dictionary

Parameters

string	$key	the term to be inserted
string	$value	the frequency to be inserted
array	$array	the array to insert into