\seekquarry\yioop\library\ContextWeightedPosTagger

Machine-learning-based part-of-speech tagger. Typically, ContextWeightedPosTagger.php is used to train a tagger for a language from a tagged dataset. Once training is complete, the tagger can predict the tags for the terms in a string or an array of terms.

Instructions for adding a new language: add a switch case in the constructor and define the function getKeyImpl. See the class function 'getKey' for more information.

Summary

Public methods: notCurrentLang(), __construct(), __call(), __get(), __set(), getKey(), processTexts(), train(), predict(), tag()
Public properties: $lang, $w, $b
Constants: No constants found
Protected methods: No protected methods found
Protected properties: No protected properties found
Private methods: getIndex(), getB(), setB(), getW(), saveWeights(), loadWeights(), packB(), unpackB(), packW(), unpackW()
Private properties: $min_w, $max_w, $tag_set, $unknown_word_possible_tags

Properties

$lang

$lang : string

Current language. Only tested on Simplified Chinese; might be extensible to other languages in the future.

Type

string

$w

$w : array

The weights used to predict the part-of-speech tag via y = wx + b. Generated by the training method.

Type

array

$b

$b : array

The bias used to predict the part-of-speech tag via y = wx + b. Generated by the training method.

Type

array

$min_w

$min_w : float

minimum allowed value for a weight component

Type

float

$max_w

$max_w : float

maximum allowed value for a weight component

Type

float
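
As an illustration of how bounds such as $min_w and $max_w might be applied, the sketch below clamps weight components into a range. The function name clampWeight and the bound values are assumptions made for this example; the documentation above does not say where or how the class enforces the bounds.

```php
<?php
// Hypothetical helper, not part of ContextWeightedPosTagger.
function clampWeight(float $value, float $min_w, float $max_w): float
{
    // Values below $min_w are raised to $min_w; values above $max_w
    // are lowered to $max_w; everything in between is unchanged.
    return max($min_w, min($max_w, $value));
}

// Illustrative bounds only; the real values are set by the class.
$min_w = -1.0;
$max_w = 1.0;
$clamped = array_map(
    fn($v) => clampWeight($v, $min_w, $max_w),
    [-3.2, 0.25, 7.0]
);
```

Bounding each component is one common way to keep weights inside a range that a compact packed representation can store.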

$tag_set

$tag_set : array

The set of all possible tags. Generated by the training method.

Type

array — [tag => tag index]

$unknown_word_possible_tags

$unknown_word_possible_tags : array

The tags from which the tag of an unknown word should be picked

Type

array

Methods

notCurrentLang()

notCurrentLang(  $term) : true

Checks whether none of the characters in the term belong to the current language

Parameters

$term

the string to be checked

Returns

true —

if none of the characters in $term belong to the current language, false otherwise
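
For Simplified Chinese, one plausible way to perform such a check is with the Unicode Han script property. The sketch below is an assumption about how the test could be written, not the method's actual code, and containsNoHan is a hypothetical name.

```php
<?php
// Hypothetical stand-in for notCurrentLang() when the current language
// is Chinese; the real method may use a different test.
function containsNoHan(string $term): bool
{
    // \p{Han} matches any Han-script character; the /u flag makes the
    // pattern and subject UTF-8. preg_match() returns 0 when nothing
    // matches, i.e. when no character of $term is a Han character.
    return preg_match('/\p{Han}/u', $term) === 0;
}
```

Under this reading, a mixed term such as "a中" yields false, because at least one of its characters does belong to the current language.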

__construct()

__construct(string  $lang, boolean  $packed = true) 

The constructor of the POS tagger. Extending it to other languages requires some work: define $this->getKeyImpl and $this->rule_defined_key. See the Chinese example.

Parameters

string $lang

describes the current language

boolean $packed

describes what the weights and bias will look like (packed or not)

__call()

__call(string  $method, array  $args) : mixed

__call for calling dynamic methods

Parameters

string $method

method of this class to call

array $args

arguments to pass to method

Returns

mixed —

result of method calculation

__get()

__get(string  $var_name) : mixed

__get for getting dynamic variables

Parameters

string $var_name

variable to retrieve

Returns

mixed —

result of retrieval

__set()

__set(string  $var_name, mixed  $value) 

__set for assigning dynamic variables

Parameters

string $var_name

variable to assign

mixed $value

value to assign to it

getKey()

getKey(string  $term) : mixed

Maps a term to its corresponding key in the weight, bias, string arrays

Parameters

string $term

is the term to be checked

Returns

mixed —

either the int key for those matrices, or just the term itself if the getKeyImpl function has not been defined for the current language
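
The fallback described above can be sketched as follows. The names lookupKey and $toy_impl are hypothetical, and the CRC32-based mapping is only a stand-in for whatever a language's getKeyImpl actually computes.

```php
<?php
// Hypothetical sketch of the fallback described for getKey(): use a
// language-specific mapping when one is defined, otherwise return the
// term itself.
function lookupKey(string $term, ?callable $get_key_impl)
{
    if ($get_key_impl === null) {
        return $term; // no getKeyImpl for this language
    }
    return $get_key_impl($term);
}

// Toy mapping for illustration only: an integer checksum of the term.
$toy_impl = fn(string $t): int => crc32($t);

$with_impl = lookupKey("新", $toy_impl);    // an int key
$without_impl = lookupKey("新", null);      // the term itself
```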

processTexts()

processTexts(mixed  $text_files, string  $term_tag_separator = "_", function  $term_callback = null, function  $tag_callback = null) : array

Converts training data consisting of tagged sentences, whose terms have the form term_tag, into a pair of arrays [[terms_in_sentence], [tags_in_sentence]]

Parameters

mixed $text_files

can be a file or an array of file names

string $term_tag_separator

separator used to separate term and tag for terms in input sentence

function $term_callback

callback function applied to a term before adding term to sentence term array

function $tag_callback

callback function applied to a part of speech tag before adding tag to sentence tag array

Returns

array —

of separated sentences, each sentence having the format [[terms...], [tags...]]. Currently, the training data needs to be in Chinese Treebank format: a term followed by an underscore and then the tag, e.g. "新_VA 的_DEC 南斯拉夫_NR 会国_NN". Adapting to other languages requires some modifications.
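
The conversion of one tagged sentence into a [[terms], [tags]] pair can be sketched as below. The helper splitTaggedSentence is hypothetical and handles a single sentence string; file reading and the term and tag callbacks are left out, and whitespace-separated tokens are an assumption based on the example above.

```php
<?php
// Hypothetical helper illustrating the conversion; not the library's code.
function splitTaggedSentence(string $sentence, string $separator = "_"): array
{
    $terms = [];
    $tags = [];
    // Tokens are assumed to be separated by whitespace.
    $tokens = preg_split('/\s+/u', trim($sentence), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $token) {
        // Split at the last separator so a term may itself contain one.
        $pos = strrpos($token, $separator);
        if ($pos === false) {
            continue; // skip tokens that carry no tag
        }
        $terms[] = substr($token, 0, $pos);
        $tags[] = substr($token, $pos + strlen($separator));
    }
    return [$terms, $tags];
}

$pair = splitTaggedSentence("新_VA 的_DEC 南斯拉夫_NR 会国_NN");
```

Here $pair[0] holds the terms and $pair[1] the tags, in sentence order.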

train()

train(mixed  $text_files, string  $term_tag_separator = "_", float  $learning_rate = 0.1, integer  $num_epoch = 1200, function  $term_callback = null, function  $tag_callback = null, boolean  $resume = false) 

Uses text files containing tagged sentences to create a matrix so that, from a term together with the two terms before it and the two terms after it, the odds of each of its possible parts of speech can be calculated

Parameters

mixed $text_files

with training data. These can be a file or an array of file names. For now these files are assumed to be in Chinese Treebank format.

string $term_tag_separator

separator used to separate term and tag for terms in input sentence

float $learning_rate

learning rate used when cycling over the data to minimize the cross-entropy loss in the prediction of the tag of the middle term.

integer $num_epoch

maximum number of times to cycle through the complete data set. The default value of 1200 seems to avoid overfitting

function $term_callback

callback function applied to a term before adding term to sentence term array as part of processing and training with a sentence.

function $tag_callback

callback function applied to a part of speech tag before adding tag to sentence tag array as part of processing and training with a sentence.

boolean $resume

if true, read the weight file and continue training; if false, start from the beginning
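
To make the role of $learning_rate and the cross-entropy loss concrete, here is a minimal sketch of one gradient step for a single 5-term context. It assumes a softmax over per-tag linear scores, which the description above suggests but does not state outright; the function names and the nested-array layout of $w are illustrative, not the class's internal storage.

```php
<?php
// Softmax over an array of scores (max subtracted for numerical stability).
function softmax(array $scores): array
{
    $max = max($scores);
    $exps = array_map(fn($s) => exp($s - $max), $scores);
    $sum = array_sum($exps);
    return array_map(fn($e) => $e / $sum, $exps);
}

// One gradient step for one training example.
// Assumed layouts: $w[$tag][$position][$term] and $b[$tag].
function sgdStep(array &$w, array &$b, array $context, int $gold, float $rate): float
{
    // Linear score per tag: bias plus one weight per context position.
    $scores = [];
    foreach ($b as $tag => $bias) {
        $scores[$tag] = $bias;
        foreach ($context as $position => $term) {
            $scores[$tag] += $w[$tag][$position][$term] ?? 0.0;
        }
    }
    $probs = softmax($scores);
    foreach ($probs as $tag => $p) {
        // Gradient of the cross-entropy loss with respect to a tag's
        // score: p - 1 for the gold tag, p for every other tag.
        $grad = $p - ($tag === $gold ? 1.0 : 0.0);
        $b[$tag] -= $rate * $grad;
        foreach ($context as $position => $term) {
            $old = $w[$tag][$position][$term] ?? 0.0;
            $w[$tag][$position][$term] = $old - $rate * $grad;
        }
    }
    return -log($probs[$gold]); // loss measured before the update
}

// Toy example: two tags (0 and 1), gold tag 1, positions -2..2.
$w = [];
$b = [0 => 0.0, 1 => 0.0];
$context = [-2 => "<s>", -1 => "新", 0 => "的", 1 => "业绩", 2 => "</s>"];
$loss_before = sgdStep($w, $b, $context, 1, 0.1);
$loss_after = sgdStep($w, $b, $context, 1, 0.1);
```

Repeating the step on the same example lowers the loss, which is the behaviour the epochs in train() rely on across the whole data set.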

predict()

predict(mixed  $sentence) : array

Predicts the part of speech tag for each term in a sentence

Parameters

mixed $sentence

is an array of segmented words/terms or a string with words/terms separated by spaces

Returns

array —

of tags for these terms

tag()

tag(string  $text, boolean  $return_string = false) : mixed

Function to tag each term in a supplied input text.

Parameters

string $text

string to tag each term of

boolean $return_string

if true, the result of tagging the string is returned; otherwise, it is echoed to the default output

Returns

mixed —

the string result of tagging $text, if $return_string is true; otherwise, the value true. E.g., 中国_NR 人民_NN 将_AD 满怀信心_VV 地_DEV 开创_VV 新_VA 的_DEC 业绩_NN 。_PU
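
The output shape in the example above can be reproduced with a few lines. The helper formatTagged is hypothetical and only joins terms and tags that are already known; it does not perform any tagging, and the underscore separator is taken from the example.

```php
<?php
// Hypothetical formatter showing the output shape only.
function formatTagged(array $terms, array $tags, string $separator = "_"): string
{
    $pieces = [];
    foreach ($terms as $i => $term) {
        // Each term is joined to its tag, e.g. "中国" . "_" . "NR".
        $pieces[] = $term . $separator . $tags[$i];
    }
    // Term_tag pairs are separated by single spaces.
    return implode(" ", $pieces);
}

$line = formatTagged(["中国", "人民", "将"], ["NR", "NN", "AD"]);
```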

getIndex()

getIndex(integer  $index, array  $terms) : integer

Given a sentence (array $terms), find the key for the term at position $index

Parameters

integer $index

position of term to get key for

array $terms

an array of terms typically from and in the order of a sentence

Returns

integer —

key position in weights and bias arrays

getB()

getB(integer  $tag_index) : float

Get the bias value for a tag

Parameters

integer $tag_index

the index of tag's value within the bias string

Returns

float —

bias value for tag

setB()

setB(integer  $tag_index, float  $value) 

Set the bias value for tag

Parameters

integer $tag_index

the index of tag's value within the bias string

float $value

bias value to associate to tag

getW()

getW(string  $term, integer  $position, integer  $tag_index) : float

Get the weight value for term at position for tag

Parameters

string $term

to get weight of

integer $position

of term within the current 5-gram

integer $tag_index

index of the particular tag we are trying to see the term's weight for

Returns

float

saveWeights()

saveWeights() 

Save the trained weights to disk

loadWeights()

loadWeights(boolean  $for_training = false) 

Load the trained weights from disk

Parameters

boolean $for_training

whether we are loading the weights to continue training (true) or we are using the weights only for prediction.

packB()

packB() : string

Pack the bias

Returns

string —

the bias vector packed as a string

unpackB()

unpackB() : array

Unpack the bias

Returns

array —

the bias vector unpacked from a string
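
The documentation does not specify the byte layout of the packed bias string. As a sketch of the general idea, the example below uses PHP's built-in pack() and unpack() with 8-byte little-endian doubles (format code "e"); the encoding and the helper names packFloats, unpackFloats and readFloatAt are assumptions for illustration.

```php
<?php
// Hypothetical encoding: each value as an 8-byte little-endian double.
function packFloats(array $values): string
{
    return pack("e*", ...$values);
}

function unpackFloats(string $packed): array
{
    // unpack() returns a 1-indexed array; array_values() re-indexes from 0.
    return array_values(unpack("e*", $packed));
}

// Read a single value by index without unpacking the whole string,
// similar in spirit to looking up one tag's bias by its tag index.
function readFloatAt(string $packed, int $index): float
{
    // The third argument of unpack() is a byte offset into the string.
    return unpack("e", $packed, $index * 8)[1];
}

$bias = [0.5, -1.25, 2.0];
$packed = packFloats($bias);
$restored = unpackFloats($packed);
$second = readFloatAt($packed, 1);
```

A fixed-width encoding like this lets a single entry be read at a computed offset, which is one reason to index into a packed string rather than a PHP array.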

packW()

packW(integer  $key) : string

Pack the weights matrix to a string for a particular part of speech key

Parameters

integer $key

index corresponding to a part of speech according to $this->tag_set

Returns

string —

the packed weights matrix

unpackW()

unpackW(integer  $key) : array

Unpack the weight matrix for a given part of speech key. This is a 5 x term_set_size matrix; the 5 rows correspond to locations -2, -1, 0, 1, 2 in a 5-gram.

An (i, j) entry roughly gives the probability of the j-th term in location i having the part of speech given by $key

Parameters

integer $key

in weight set corresponding to a part of speech

Returns

array —

of weights corresponding to that key
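
To show how a 5 x term_set_size matrix per tag could be used with y = wx + b, the sketch below scores each tag for the middle term of a short sentence and picks the highest. It treats matrix entries as additive scores, which is an interpretation of the description above; scoreTag and the toy data are hypothetical.

```php
<?php
// Score one tag for the term at $index: y = wx + b, where x is the one-hot
// encoding of the five context terms, so wx reduces to five table lookups.
function scoreTag(array $matrix, float $bias, array $keys, int $index): float
{
    $score = $bias;
    // Row 0 holds offset -2, row 1 offset -1, ..., row 4 offset +2.
    foreach ([-2, -1, 0, 1, 2] as $row => $offset) {
        $key = $keys[$index + $offset] ?? null; // null past either sentence end
        if ($key !== null) {
            $score += $matrix[$row][$key] ?? 0.0;
        }
    }
    return $score;
}

// Toy data: 2 tags, term keys 0..2, all weights 0 except two entries.
$zero = array_fill(0, 5, array_fill(0, 3, 0.0));
$w = [0 => $zero, 1 => $zero];
$w[1][2][1] = 0.75;  // tag 1 favours term key 1 in the middle (row 2)
$w[0][1][0] = 0.25;  // tag 0 favours term key 0 one place to the left (row 1)
$b = [0 => 0.0, 1 => 0.125];
$keys = [0, 1, 2];   // a three-term sentence, already mapped to keys

$scores = [];
foreach ($w as $tag => $matrix) {
    $scores[$tag] = scoreTag($matrix, $b[$tag], $keys, 1);
}
arsort($scores);                       // highest score first, keys preserved
$best_tag = array_key_first($scores);  // tag index with the highest score
```

Positions that fall outside the sentence are simply skipped here; how the class itself handles sentence boundaries is not described above.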