$lang
$lang : string
Current language. Only tested on Simplified Chinese; might be extensible to other languages in the future.
Machine-learning-based part-of-speech tagger. Typically, ContextWeightedPosTagger.php can be used to train a tagger for a language from a dataset. Once training is complete, it can be used to predict the tags for terms in a string or array of terms.
Instructions for adding a new language: add a switch case in the constructor and define the function getKeyImpl. See the class function 'getKey' for more information.
__construct(string $lang, boolean $packed = true)
The constructor of the POS tagger. Extending it to other languages requires some work: define $this->getKeyImpl and $this->rule_defined_key. See the Chinese example.
string | $lang | describes the current language |
boolean | $packed | describes the form the weights and biases take |
getKey(string $term) : mixed
Maps a term to its corresponding key in the weight, bias, string arrays
string | $term | is the term to be checked |
either the int key for those matrices or just the term itself if the getKeyImpl function has not been defined for the current language
processTexts(mixed $text_files, string $term_tag_separator = "_", \seekquarry\yioop\library\function $term_callback = null, \seekquarry\yioop\library\function $tag_callback = null) : array
Converts training data from tagged sentences, with terms of the form term_tag, into a pair of arrays [[terms_in_sentence], [tags_in_sentence]]
mixed | $text_files | can be a file or an array of file names |
string | $term_tag_separator | separator used to separate term and tag for terms in input sentence |
\seekquarry\yioop\library\function | $term_callback | callback function applied to a term before adding term to sentence term array |
\seekquarry\yioop\library\function | $tag_callback | callback function applied to a part of speech tag before adding tag to sentence tag array |
array of separated sentences, each sentence having the format [[terms...], [tags...]]. Currently, the training data needs to fit the Chinese Treebank format: a term followed by an underscore and then the tag, e.g. "新_VA 的_DEC 南斯拉夫_NR 会国_NN". To adapt to other languages, some modifications are needed.
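The conversion described for processTexts can be sketched for a single sentence as follows. This is a minimal illustration, not the library's code: the function name splitTaggedSentence is hypothetical, file reading and the term and tag callbacks are omitted, and splitting on the last separator in each token is an assumption.

```php
<?php
/**
 * Hypothetical helper (not part of the library): splits one tagged
 * sentence into parallel arrays of terms and tags.
 */
function splitTaggedSentence(string $line, string $separator = "_"): array
{
    $terms = [];
    $tags = [];
    // Tokens are separated by whitespace; empty pieces are dropped
    $tokens = preg_split('/\s+/u', trim($line), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $token) {
        // Split on the last separator so a term containing it stays whole
        $pos = strrpos($token, $separator);
        if ($pos === false) {
            continue; // token carries no tag, so skip it
        }
        $terms[] = substr($token, 0, $pos);
        $tags[] = substr($token, $pos + strlen($separator));
    }
    return [$terms, $tags];
}

[$terms, $tags] = splitTaggedSentence("新_VA 的_DEC 南斯拉夫_NR");
// $terms is ["新", "的", "南斯拉夫"] and $tags is ["VA", "DEC", "NR"]
```

A term callback or tag callback would be applied to each piece just before it is appended to its array.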
train(mixed $text_files, string $term_tag_separator = "_", float $learning_rate = 0.1, integer $num_epoch = 1200, \seekquarry\yioop\library\function $term_callback = null, \seekquarry\yioop\library\function $tag_callback = null, boolean $resume = false)
Uses text files containing tagged sentences to create a matrix so that, from a term and its context of the two terms before it and the two terms after it, the odds of each of its possible parts of speech can be calculated
mixed | $text_files | with training data. These can be a file or an array of file names. For now these files are assumed to be in Chinese Treebank format. |
string | $term_tag_separator | separator used to separate term and tag for terms in input sentence |
float | $learning_rate | learning rate when cycling over data trying to minimize the cross-entropy loss in the prediction of the tag of the middle term. |
integer | $num_epoch | maximum number of times to cycle through the complete data set. Default value of 1200 seems to avoid overfitting |
\seekquarry\yioop\library\function | $term_callback | callback function applied to a term before adding term to sentence term array as part of processing and training with a sentence. |
\seekquarry\yioop\library\function | $tag_callback | callback function applied to a part of speech tag before adding tag to sentence tag array as part of processing and training with a sentence. |
boolean | $resume | if true, read the weight file and continue training; if false, start from the beginning |
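To make the role of $learning_rate and the cross-entropy loss concrete, here is a rough sketch of one gradient step for a softmax model over tags. It is written under stated assumptions and is not the class's actual training loop: the names softmax and sgdStep, the nested-array layout of $weights, and the representation of a 5-gram as [row, term] pairs are all hypothetical.

```php
<?php
/** Softmax over scores keyed by tag, shifted by the max for stability. */
function softmax(array $scores): array
{
    $max = max($scores);
    $exps = [];
    foreach ($scores as $tag => $score) {
        $exps[$tag] = exp($score - $max);
    }
    $sum = array_sum($exps);
    foreach ($exps as $tag => $value) {
        $exps[$tag] = $value / $sum;
    }
    return $exps;
}

/**
 * One hypothetical gradient step on a single training example.
 * $features lists the [row, term] pairs active in the 5-gram window,
 * where rows 0..4 stand for offsets -2..2.
 */
function sgdStep(array &$weights, array &$bias, array $features,
    string $gold_tag, float $learning_rate = 0.1): float
{
    $scores = [];
    foreach ($bias as $tag => $b) {
        $scores[$tag] = $b;
        foreach ($features as [$row, $term]) {
            $scores[$tag] += $weights[$tag][$row][$term] ?? 0.0;
        }
    }
    $probs = softmax($scores);
    foreach ($probs as $tag => $p) {
        // Cross-entropy gradient w.r.t. a tag's score: p - 1 for the
        // gold tag, p for every other tag
        $grad = $p - ($tag === $gold_tag ? 1.0 : 0.0);
        $bias[$tag] -= $learning_rate * $grad;
        foreach ($features as [$row, $term]) {
            $old = $weights[$tag][$row][$term] ?? 0.0;
            $weights[$tag][$row][$term] = $old - $learning_rate * $grad;
        }
    }
    return -log($probs[$gold_tag]); // loss measured before the update
}

$weights = [];
$bias = ["NN" => 0.0, "VV" => 0.0];
$features = [[2, "业绩"]]; // the middle term only, for brevity
$first_loss = sgdStep($weights, $bias, $features, "NN");
$second_loss = sgdStep($weights, $bias, $features, "NN");
```

Repeating the step on the same example lowers the loss, which is the behavior an epoch loop bounded by $num_epoch relies on.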
predict(mixed $sentence) : array
Predicts the part of speech tag for each term in a sentence
mixed | $sentence | is an array of segmented words/terms or a string with words/terms separated by spaces |
array of tags for these terms
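One way to picture prediction from a 5-gram context is a per-tag score that adds a bias to the weights of the terms at offsets -2 to 2, then picks the highest-scoring tag. The sketch below assumes that linear form. The function predictTags and the nested-array layout of $weights are hypothetical, and the toy weights are made up for illustration.

```php
<?php
/**
 * Hypothetical scoring sketch: $weights[$tag][$row][$term] is the weight
 * of a term at window row 0..4 (offsets -2..2); $bias[$tag] is a float.
 */
function predictTags(array $terms, array $weights, array $bias): array
{
    $result = [];
    $n = count($terms);
    for ($i = 0; $i < $n; $i++) {
        $best_tag = null;
        $best_score = -INF;
        foreach ($bias as $tag => $b) {
            $score = $b;
            for ($offset = -2; $offset <= 2; $offset++) {
                $j = $i + $offset;
                if ($j < 0 || $j >= $n) {
                    continue; // window runs past the sentence edge
                }
                // Unseen (tag, row, term) combinations contribute nothing
                $score += $weights[$tag][$offset + 2][$terms[$j]] ?? 0.0;
            }
            if ($score > $best_score) {
                $best_score = $score;
                $best_tag = $tag;
            }
        }
        $result[] = $best_tag;
    }
    return $result;
}

// Toy weights, invented for this example
$weights = [
    "DEC" => [2 => ["的" => 2.0, "业绩" => -1.0], 3 => ["业绩" => 0.5]],
    "NN" => [2 => ["业绩" => 2.0, "的" => -1.0]],
];
$bias = ["DEC" => 0.0, "NN" => 0.0];
$predicted = predictTags(["的", "业绩"], $weights, $bias);
// $predicted is ["DEC", "NN"]
```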
tag(string $text, boolean $return_string = false) : mixed
Function to tag each term in a supplied input text.
string | $text | string to tag each term of |
boolean | $return_string | if true, the result of tagging the string is returned; otherwise, it is echoed to the default output |
the string result of tagging $text if $return_string is true; otherwise, the value true. E.g. 中国_NR 人民_NN 将_AD 满怀信心_VV 地_DEV 开创_VV 新_VA 的_DEC 业绩_NN 。_PU
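The output format in the example can be produced by pairing each term with its tag and joining the pairs with spaces. A minimal sketch, assuming parallel arrays of terms and tags; the helper name formatTagged is hypothetical and not part of the class:

```php
<?php
/**
 * Hypothetical helper: joins parallel term and tag arrays into a
 * "term_tag term_tag ..." string.
 */
function formatTagged(array $terms, array $tags,
    string $separator = "_"): string
{
    $pieces = [];
    foreach ($terms as $i => $term) {
        $pieces[] = $term . $separator . $tags[$i];
    }
    return implode(" ", $pieces);
}

$tagged = formatTagged(["中国", "人民"], ["NR", "NN"]);
echo $tagged, "\n"; // prints 中国_NR 人民_NN
```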
getIndex(integer $index, array $terms) : integer
Given a sentence (array $terms), find the key for the term at position $index
integer | $index | position of term to get key for |
array | $terms | an array of terms typically from and in the order of a sentence |
key position in weights and bias arrays
getW(string $term, integer $position, integer $tag_index) : float
Get the weight value for term at position for tag
string | $term | to get weight of |
integer | $position | of term within the current 5-gram |
integer | $tag_index | index of the particular tag we are trying to see the term's weight for |
unpackW(integer $key) : array
Unpack the weight matrix for a given part-of-speech key. This is a 5 x term_set_size matrix; the 5 rows correspond to locations -2, -1, 0, 1, 2 in a 5-gram.
An (i, j) entry roughly gives the probability of the j-th term in location i having the part of speech given by $key
integer | $key | in weight set corresponding to a part of speech |
array of weights corresponding to that key
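The storage format behind unpackW is not specified here, so the following is only an illustration of how a 5 x term_set_size matrix could be packed into a binary string and restored. It assumes row-major 32-bit floats in machine byte order. The functions packMatrix and unpackMatrix are hypothetical, built on PHP's pack() and unpack().

```php
<?php
/** Flattens a matrix row by row and packs it as 32-bit floats. */
function packMatrix(array $matrix): string
{
    $flat = array_merge(...$matrix);
    return pack("f*", ...$flat);
}

/** Restores a matrix with $num_terms columns from a packed string. */
function unpackMatrix(string $packed, int $num_terms): array
{
    // unpack() returns a 1-indexed array; array_values re-indexes from 0
    $flat = array_values(unpack("f*", $packed));
    return array_chunk($flat, $num_terms);
}

// Rows stand for offsets -2, -1, 0, 1, 2; columns for three toy terms
$matrix = [
    [0.5, 0.25, 1.0],
    [1.5, -0.5, 2.0],
    [3.0, 0.125, -2.0],
    [0.75, 4.0, -1.0],
    [-0.25, 8.0, 0.5],
];
$packed = packMatrix($matrix);
$restored = unpackMatrix($packed, 3);
```

The toy values are exactly representable as 32-bit floats, so the round trip returns them unchanged; arbitrary doubles would be rounded to single precision.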