\seekquarry\yioop\library\ContextWeightedPosTagger

Machine-learning-based part-of-speech tagger. Typically, ContextWeightedPosTagger.php is used to train a tagger for a language from a tagged dataset. Once training is complete, the tagger can predict the tags for the terms in a string or an array of terms.

Instructions for adding a new language: add a switch case in the constructor and define the function getKeyImpl. See the class function 'getKey' for more information.

Summary

Public methods: notCurrentLang(), __construct(), __call(), __get(), __set(), getKey(), processTexts(), train(), predict(), tag()
Public properties: $lang, $w, $b
Constants: none found

Protected methods: none found
Protected properties: none found

Private methods: getIndex(), getB(), setB(), getW(), saveWeights(), loadWeights(), packB(), unpackB(), packW(), unpackW()
Private properties: $min_w, $max_w, $tag_set, $unknown_word_possible_tags

File: src/library/ContextWeightedPosTagger.php
Package: Default
Class hierarchy: \seekquarry\yioop\library\ContextWeightedPosTagger

Properties

$lang

$lang : string

Current language. Only tested on Simplified Chinese; might be extensible to other languages in the future.

Type

string

$w

$w : array

The weights for predicting the part-of-speech tag, y = wx + b. Generated by the training method.

Type

array

$b

$b : array

The bias for predicting the part-of-speech tag, y = wx + b. Generated by the training method.

Type

array

$min_w

$min_w : float

minimum allowed value for a weight component

Type

float

$max_w

$max_w : float

maximum allowed value for a weight component

Type

float

$tag_set

$tag_set : array

The set of all possible tags. Generated by the training method.

Type

array — [tag => tag index]

$unknown_word_possible_tags

$unknown_word_possible_tags : array

The tags from which the tag of an unknown word should be picked

Type

array

Methods

notCurrentLang()

notCurrentLang(  $term) : true

Checks whether none of the characters in the term belong to the current language

Parameters

$term	the string to be checked

Returns

true —

if none of the characters in $term belong to the current language; false otherwise
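
The check described above can be illustrated with a small standalone function. This is only a sketch under the assumption that the current language is Simplified Chinese and that "belongs to the language" means "is a Han character"; the function name sketchNotChinese is hypothetical and is not part of the class.

```php
<?php
/**
 * Hypothetical stand-in for notCurrentLang() when the language is
 * Simplified Chinese: returns true only if $term has no Han characters.
 */
function sketchNotChinese(string $term): bool
{
    // \p{Han} matches any character in the Han script; the u flag
    // makes the pattern operate on UTF-8 code points
    return preg_match('/\p{Han}/u', $term) === 0;
}
```

A term that mixes scripts, such as "abc中", contains at least one Han character, so the sketch returns false for it.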

__construct()

__construct(string  $lang, boolean  $packed = true)

The constructor of the POS tagger. Extending it to other languages requires some work: define $this->getKeyImpl and $this->rule_defined_key. See the Chinese example.

Parameters

string	$lang	describes the current language
boolean	$packed	describes the format in which the weights and bias are stored

__call()

__call(string  $method, array  $args) : mixed

__call for calling dynamic methods

Parameters

string	$method	method of this class to call
array	$args	arguments to pass to method

Returns

mixed —

result of method calculation

__get()

__get(string  $var_name) : mixed

__get for getting dynamic variables

Parameters

string	$var_name	variable to retrieve

Returns

mixed —

result of retrieval

__set()

__set(string  $var_name, mixed  $value)

__set for assigning dynamic variables

Parameters

string	$var_name	variable to assign
mixed	$value	value to assign to it

getKey()

getKey(string  $term) : mixed

Maps a term to its corresponding key in the weight, bias, string arrays

Parameters

string	$term	the term to be checked

Returns

mixed —

either the int key for those matrices, or just the term itself if the getKeyImpl function has not been defined for the current language
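
The relationship between the constructor's switch case, __call(), and getKey() can be sketched as follows. Everything here except the names getKey and getKeyImpl is an assumption made for illustration: the class SketchKeyMapper, the "zh-CN" locale tag, and the use of crc32() as a key function are hypothetical and do not reflect the actual implementation.

```php
<?php
/**
 * Hypothetical miniature of the getKey()/getKeyImpl relationship.
 * The class name, locale tag, and key function are illustrative only.
 */
class SketchKeyMapper
{
    /** @var callable|null language-specific key function, if any */
    public $getKeyImpl = null;

    public function __construct(string $lang)
    {
        switch ($lang) {
            case "zh-CN":
                // Assumed behavior: every non-Han term shares key 0
                $this->getKeyImpl = function ($term) {
                    return preg_match('/\p{Han}/u', $term) === 0 ?
                        0 : crc32($term);
                };
                break;
        }
    }
    /** Lets a closure stored in a property be called like a method */
    public function __call($method, $args)
    {
        return call_user_func_array($this->$method, $args);
    }
    /** Returns an int key, or the term itself if no mapping exists */
    public function getKey(string $term)
    {
        if ($this->getKeyImpl === null) {
            return $term;
        }
        // No real method has this name, so PHP routes it to __call()
        return $this->getKeyImpl($term);
    }
}
```

With this design, supporting another language only needs one more case that assigns a different closure; languages without a case fall back to using the term itself as the key.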

processTexts()

processTexts(mixed  $text_files, string  $term_tag_separator = "_", function  $term_callback = null, function  $tag_callback = null) : array

Converts training data from tagged sentences, whose terms have the form term_tag, into pairs of arrays [[terms_in_sentence], [tags_in_sentence]]

Parameters

mixed	$text_files	can be a file or an array of file names
string	$term_tag_separator	separator used to separate term and tag for terms in input sentence
function	$term_callback	callback function applied to a term before adding term to sentence term array
function	$tag_callback	callback function applied to a part of speech tag before adding tag to sentence tag array

Returns

array —

of separated sentences, each sentence having the format [[terms...], [tags...]]. Currently, the training data needs to fit the Chinese Treebank format: a term followed by an underscore and then the tag, e.g. "新_VA 的_DEC 南斯拉夫_NR 会国_NN". Adapting to other languages requires some modifications.
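
The conversion described above can be sketched on in-memory strings. This is a minimal illustration, not the method's actual code: the helper name sketchSplitTagged is hypothetical, it takes sentence strings rather than file names, and splitting on the last separator in each token is an assumption.

```php
<?php
/**
 * Splits tagged sentences into [[terms...], [tags...]] pairs.
 * Hypothetical helper: works on strings rather than on files.
 */
function sketchSplitTagged(array $lines, string $separator = "_",
    ?callable $term_callback = null, ?callable $tag_callback = null): array
{
    $sentences = [];
    foreach ($lines as $line) {
        $terms = [];
        $tags = [];
        $pairs = preg_split('/\s+/u', trim($line), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($pairs as $pair) {
            // split on the last separator so a term may contain it
            $pos = strrpos($pair, $separator);
            if ($pos === false) {
                continue; // skip tokens that carry no tag
            }
            $term = substr($pair, 0, $pos);
            $tag = substr($pair, $pos + strlen($separator));
            $terms[] = $term_callback ? $term_callback($term) : $term;
            $tags[] = $tag_callback ? $tag_callback($tag) : $tag;
        }
        if ($terms) {
            $sentences[] = [$terms, $tags];
        }
    }
    return $sentences;
}
```

The optional callbacks mirror $term_callback and $tag_callback: for instance, passing 'strtolower' as the tag callback would normalize the case of every tag before it is stored.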

train()

train(mixed  $text_files, string  $term_tag_separator = "_", float  $learning_rate = 0.1, integer  $num_epoch = 1200, function  $term_callback = null, function  $tag_callback = null, boolean  $resume = false)

Uses text files containing tagged sentences to create a matrix so that, from a term and its context (the two terms before it and the two terms after it), the odds of each of the term's possible parts of speech can be calculated

Parameters

mixed	$text_files	with training data. These can be a file or an array of file names. For now these files are assumed to be in Chinese Treebank format.
string	$term_tag_separator	separator used to separate term and tag for terms in input sentence
float	$learning_rate	learning rate used when cycling over the data to minimize the cross-entropy loss in the prediction of the tag of the middle term.
integer	$num_epoch	maximum number of times to cycle through the complete data set. The default value of 1200 seems to avoid overfitting
function	$term_callback	callback function applied to a term before adding term to sentence term array as part of processing and training with a sentence.
function	$tag_callback	callback function applied to a part of speech tag before adding tag to sentence tag array as part of processing and training with a sentence.
boolean	$resume	if true, read the weight file and continue training; if false, start from the beginning
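
One way to picture a single training update is a softmax over per-tag scores followed by a gradient step on the cross-entropy loss. The sketch below is written under several assumptions that the documentation does not confirm: the weights are laid out as $w[tag][offset][term], the bias as $b[tag], the clamping bounds of -10 and 10 are placeholders for $min_w and $max_w, and all function names are hypothetical.

```php
<?php
/** Scores each tag for the term at $index from its 5-gram context. */
function sketchScores(array $terms, int $index, array $w, array $b): array
{
    $scores = [];
    foreach ($b as $tag => $bias) {
        $score = $bias;
        for ($offset = -2; $offset <= 2; $offset++) {
            $term = $terms[$index + $offset] ?? null;
            if ($term !== null) {
                $score += $w[$tag][$offset][$term] ?? 0.0;
            }
        }
        $scores[$tag] = $score;
    }
    return $scores;
}
/** Numerically stable softmax; keys of $scores are preserved. */
function sketchSoftmax(array $scores): array
{
    $max = max($scores);
    $exps = array_map(function ($s) use ($max) {
        return exp($s - $max);
    }, $scores);
    $sum = array_sum($exps);
    return array_map(function ($e) use ($sum) {
        return $e / $sum;
    }, $exps);
}
/** One gradient step on the cross-entropy loss for a single term. */
function sketchTrainStep(array $terms, int $index, int $gold, array &$w,
    array &$b, float $rate, float $min_w = -10.0, float $max_w = 10.0): void
{
    $probs = sketchSoftmax(sketchScores($terms, $index, $w, $b));
    foreach ($probs as $tag => $p) {
        // gradient of the loss with respect to this tag's score
        $grad = $p - ($tag === $gold ? 1.0 : 0.0);
        $b[$tag] -= $rate * $grad;
        for ($offset = -2; $offset <= 2; $offset++) {
            $term = $terms[$index + $offset] ?? null;
            if ($term === null) {
                continue; // context position falls outside the sentence
            }
            $old = $w[$tag][$offset][$term] ?? 0.0;
            // keep each weight component within [$min_w, $max_w]
            $w[$tag][$offset][$term] =
                max($min_w, min($max_w, $old - $rate * $grad));
        }
    }
}
```

Repeating such a step over every term of every sentence, for up to $num_epoch passes, corresponds to the cycling over the data that the parameters above describe.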

predict()

predict(mixed  $sentence) : array

Predicts the part of speech tag for each term in a sentence

Parameters

mixed	$sentence	an array of segmented words/terms, or a string with words/terms separated by spaces

Returns

array —

of tags for these terms
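
Prediction can be pictured as computing y = wx + b for every tag and keeping the tag with the highest score. The following sketch assumes a hypothetical layout of $w[tag index][offset][term] and $b[tag index], with $tag_set mapping tag => tag index as documented for that property; the function name sketchPredict is not part of the class.

```php
<?php
/**
 * Returns the highest-scoring tag for each term of $sentence.
 * Hypothetical layout: $w[tag_index][offset][term], $b[tag_index].
 */
function sketchPredict($sentence, array $w, array $b, array $tag_set): array
{
    // accept either a space-separated string or an array of terms
    $terms = is_array($sentence) ? array_values($sentence) :
        preg_split('/\s+/u', trim($sentence), -1, PREG_SPLIT_NO_EMPTY);
    $names = array_flip($tag_set); // tag index => tag
    $result = [];
    foreach ($terms as $index => $unused) {
        $best_tag = null;
        $best_score = -INF;
        foreach ($b as $tag => $bias) {
            $score = $bias;
            for ($offset = -2; $offset <= 2; $offset++) {
                $term = $terms[$index + $offset] ?? null;
                if ($term !== null) {
                    $score += $w[$tag][$offset][$term] ?? 0.0;
                }
            }
            if ($score > $best_score) {
                $best_score = $score;
                $best_tag = $tag;
            }
        }
        $result[] = $names[$best_tag];
    }
    return $result;
}
```

Because only the ranking of the scores matters for picking a tag, this sketch skips the softmax that a training step would need.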

tag()

tag(string  $text, boolean  $return_string = false) : mixed

Function to tag each term in a supplied input text.

Parameters

string	$text	string to tag each term of
boolean	$return_string	if true, the result of tagging the string is returned; otherwise, it is echoed to the default output

Returns

mixed —

the string result of tagging $text, if $return_string is true; otherwise, the value true. For example: 中国_NR 人民_NN 将_AD 满怀信心_VV 地_DEV 开创_VV 新_VA 的_DEC 业绩_NN 。_PU
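
The two return modes can be sketched as below. This is an illustration only: it assumes the input text is already split into terms by whitespace, and $predictor is a hypothetical callable (array of terms in, array of tags out) standing in for the tagger's own prediction step.

```php
<?php
/**
 * Joins terms with predicted tags as term_tag pairs.
 * $predictor is a hypothetical callable: array of terms => array of tags.
 */
function sketchTag(string $text, callable $predictor,
    bool $return_string = false)
{
    $terms = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $tags = $predictor($terms);
    $pairs = [];
    foreach ($terms as $i => $term) {
        $pairs[] = $term . "_" . $tags[$i];
    }
    $out = implode(" ", $pairs);
    if ($return_string) {
        return $out;
    }
    echo $out; // written to the default output
    return true;
}
```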

getIndex()

getIndex(integer  $index, array  $terms) : integer

Given a sentence (array $terms), find the key for the term at position $index

Parameters

integer	$index	position of term to get key for
array	$terms	an array of terms typically from and in the order of a sentence

Returns

integer —

key position in weights and bias arrays

getB()

getB(integer  $tag_index) : float

Get the bias value for a tag

Parameters

integer	$tag_index	the index of tag's value within the bias string

Returns

float —

bias value for tag

setB()

setB(integer  $tag_index, float  $value)

Set the bias value for tag

Parameters

integer	$tag_index	the index of tag's value within the bias string
float	$value	bias value to associate to tag

getW()

getW(string  $term, integer  $position, integer  $tag_index) : float

Get the weight value for term at position for tag

Parameters

string	$term	to get weight of
integer	$position	of term within the current 5-gram
integer	$tag_index	index of the particular tag we are trying to see the term's weight for

Returns

float

saveWeights()

saveWeights()

Save the trained weights to disk

loadWeights()

loadWeights(boolean  $for_training = false)

Load the trained weights from disk

Parameters

boolean	$for_training	whether we are loading the weights to continue training (true) or we are using the weights only for prediction.

packB()

packB() : string

Pack the bias

Returns

string —

the bias vector packed as a string

unpackB()

unpackB() : array

Unpack the bias

Returns

array —

the bias vector unpacked from a string
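
Storing a vector of floats as a string, and reading one value back by index as getB() is described as doing, can be sketched with PHP's pack() and unpack(). The 4-byte little-endian float format ("g") used here is an assumption; the documentation does not state which encoding the class uses, and the function names are hypothetical.

```php
<?php
/** Packs a list of floats into a binary string, 4 bytes per value. */
function sketchPackFloats(array $values): string
{
    // "g" = 32-bit float in little-endian byte order
    return pack("g*", ...array_values($values));
}
/** Reads the float stored at position $i without unpacking the rest. */
function sketchFloatAt(string $packed, int $i): float
{
    return unpack("g", substr($packed, 4 * $i, 4))[1];
}
/** Unpacks the whole string back to a zero-indexed array of floats. */
function sketchUnpackFloats(string $packed): array
{
    // unpack() returns a 1-indexed array, so reindex from 0
    return array_values(unpack("g*", $packed));
}
```

A packed string takes far less memory than a PHP array of floats, at the cost of 32-bit precision and a substr() call on each read.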

packW()

packW(integer  $key) : string

Pack the weights matrix to a string for a particular part of speech key

Parameters

integer	$key	index corresponding to a part of speech according to $this->tag_set

Returns

string —

the packed weights matrix

unpackW()

unpackW(integer  $key) : array

Unpack the weight matrix for a given part of speech key. This is a 5 x term_set_size matrix; the 5 rows correspond to locations -2, -1, 0, 1, 2 in a 5-gram.

An (i, j) entry roughly gives the probability of the jth term in location i having the part of speech given by $key

Parameters

integer	$key	in weight set corresponding to a part of speech

Returns

array —

of weights corresponding to that key