$stop_words
$stop_words :
A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This is also used for language detection.
Hindi-specific tokenization code. In particular, it has a stemmer, which is my attempt at porting Ljiljana Dolamic's (University of Neuchatel, www.unine.ch/info/clef/) Java stemming algorithm (http://members.unine.ch/jacques.savoy/clef/HindiStemmerLight.java.txt). Given a word, its stem is that part of the word that is common to all its inflected variants. For example, tall is common to tall, taller, tallest. A stemmer takes a word and tries to produce its stem.
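To make the idea of a stem concrete, here is a minimal Python sketch of suffix stripping on the English example above. It is not the Hindi light stemmer and not the Yioop implementation; the suffix list and the name toy_stem are assumptions chosen only for illustration.

```python
# Illustrative only: a toy English suffix stripper, not the Hindi light stemmer.
SUFFIXES = ("est", "er")  # hypothetical suffix list, longest first


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = [toy_stem(w) for w in ("tall", "taller", "tallest")]
print(stems)  # all three variants reduce to "tall"
```

A light stemmer for Hindi follows the same pattern, but with a suffix list suited to Hindi inflections.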
segment(string $pre_segment) : string
Stub function which could be used for a word segmenter.
Such a segmenter on input "thisisabunchofwords" would output "this is a bunch of words".
string | $pre_segment | before segmentation |
should return a string with words separated by spaces; in this case it does nothing
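Since the method itself is a stub, the following Python sketch shows one way a dictionary-based segmenter could behave on the example input. The function name segment_with_vocab and the vocabulary are assumptions made for illustration; nothing here is taken from the Yioop source.

```python
def segment_with_vocab(text, vocab):
    """Split text into vocabulary words; return text unchanged if impossible."""
    n = len(text)
    # best[i] holds a list of words covering text[:i], or None if unreachable.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            if best[start] is not None and text[start:end] in vocab:
                best[end] = best[start] + [text[start:end]]
                break
    return " ".join(best[n]) if best[n] is not None else text


VOCAB = {"this", "is", "a", "bunch", "of", "words"}  # hypothetical word list
print(segment_with_vocab("thisisabunchofwords", VOCAB))
```

When no segmentation exists, the sketch returns its input unchanged, which mirrors the do-nothing behavior of the stub.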
tagPartsOfSpeechPhrase(string $phrase, boolean $with_tokens = true) : string
The method takes as input a phrase and returns a string with each term tagged with a part of speech.
string | $phrase | text to add part-of-speech tags to |
boolean | $with_tokens | whether to include the terms and the tags in the output string or just the part of speech tags |
$tagged_phrase which is a string of format term~pos
tagTokenizePartOfSpeech(string $text) : string
Uses the lexicon to assign a tag to each token and then uses a rule-based approach to assign the most likely tag to each token.
string | $text | input phrase which is to be tagged |
$result which is an array of token => tag
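The two-stage idea (lexicon lookup, then rules to settle ambiguous tokens) can be sketched in Python as follows. The lexicon entries, the single disambiguation rule, and the list-of-dictionaries result layout are all assumptions for illustration; the actual lexicon and rules used by the tokenizer are not described here.

```python
# Hypothetical lexicon: token -> candidate tags, most frequent first.
LEXICON = {
    "राम": ["NN"],         # Ram (a name)
    "आम": ["NN", "JJ"],    # mango / common
    "खाता": ["NN", "VB"],  # account / eats
    "है": ["VB"],          # is (auxiliary)
}


def tag_tokens(text):
    """Tag whitespace-separated tokens; unknown tokens get an empty tag."""
    tokens = text.split()
    tagged = []
    for i, token in enumerate(tokens):
        candidates = LEXICON.get(token, [])
        tag = candidates[0] if candidates else ""
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        # Hypothetical rule: prefer the verb reading right before the auxiliary.
        if "VB" in candidates and next_token == "है":
            tag = "VB"
        tagged.append({"token": token, "tag": tag})
    return tagged


for entry in tag_tokens("राम आम खाता है"):
    print(entry["token"], entry["tag"])
```

Tokens missing from the lexicon keep an empty tag in this sketch, leaving them to a later pass.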
tagUnknownWords(array $partially_tagged_text) : array
This method tags the remaining words in a partially tagged text array.
array | $partially_tagged_text | term array representing a text passage. Each element in the array is in turn an associative array [token => token_value, tag => tag_value (may be empty)] |
text passage array where all empty tags now have values
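A minimal Python sketch of the input and output shapes is shown below. The fallback of tagging every unknown word as "NN" is an assumption made purely for illustration; the heuristics the method actually applies are not documented in this section.

```python
def tag_unknown_words(partially_tagged_text, default_tag="NN"):
    """Return a copy in which every empty tag is replaced by default_tag."""
    return [
        {"token": entry["token"], "tag": entry["tag"] or default_tag}
        for entry in partially_tagged_text
    ]


passage = [
    {"token": "राम", "tag": "NN"},
    {"token": "दिल्ली", "tag": ""},  # not found in the lexicon
    {"token": "गया", "tag": "VB"},
]
print(tag_unknown_words(passage))
```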
taggedPartOfSpeechTokensToString(array $tagged_tokens, boolean $with_tokens = true) : string
This method is used to simplify the different part-of-speech tags to a common form.
array | $tagged_tokens | which is an array of tokens assigned tags. |
boolean | $with_tokens | whether to include the terms and the tags in the output string or just the part of speech tags |
$tagged_phrase which is a string of the form token~pos
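The token~pos output format and the effect of $with_tokens can be sketched in Python as follows. The tag-simplification table and the use of a single space between terms are assumptions for illustration; only the token~pos shape comes from the description above.

```python
# Hypothetical mapping from fine-grained tags to a common form.
SIMPLIFY = {"NNP": "NN", "NNS": "NN", "VBD": "VB", "JJR": "JJ"}


def tagged_tokens_to_string(tagged_tokens, with_tokens=True):
    """Join tagged tokens as token~pos, or as bare tags if with_tokens is False."""
    parts = []
    for entry in tagged_tokens:
        tag = SIMPLIFY.get(entry["tag"], entry["tag"])
        parts.append(f'{entry["token"]}~{tag}' if with_tokens else tag)
    return " ".join(parts)


tokens = [{"token": "राम", "tag": "NNP"}, {"token": "खाता", "tag": "VB"}]
print(tagged_tokens_to_string(tokens))
print(tagged_tokens_to_string(tokens, with_tokens=False))
```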
parseTypeList(array& $cur_node, array $tagged_phrase, string $type) : string
Starting at the $cur_node in a $tagged_phrase parse tree for a Hindi sentence, create a phrase string for each of the next nodes which belong to part of speech group $type.
array& | $cur_node | node within parse tree |
array | $tagged_phrase | parse tree for phrase |
string | $type | self::$noun_type, self::$verb_type, etc |
phrase string involving only terms of that $type
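One way to picture this behavior is the Python sketch below, which collects consecutive tokens whose tag belongs to a group and advances the position. Because Python has no by-reference integers, the sketch returns the new position instead of updating $cur_node in place; the contents of NOUN_TYPE are an assumption.

```python
NOUN_TYPE = {"NN", "NNP", "PRP"}  # hypothetical part-of-speech group


def parse_type_list(cur_node, tagged_phrase, type_group):
    """Collect consecutive tokens in type_group; return (phrase, new position)."""
    words = []
    while (cur_node < len(tagged_phrase)
           and tagged_phrase[cur_node]["tag"] in type_group):
        words.append(tagged_phrase[cur_node]["token"])
        cur_node += 1
    return " ".join(words), cur_node


phrase = [
    {"token": "राम", "tag": "NNP"},
    {"token": "कुमार", "tag": "NNP"},
    {"token": "गया", "tag": "VB"},
]
print(parse_type_list(0, phrase, NOUN_TYPE))
```

If the token at the starting position is not in the group, the sketch returns an empty string and leaves the position unchanged.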
parseNoun(array $tagged_phrase, array $tree) : array
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun if possible
array | $tagged_phrase | an array of pairs of the form ("token" => token_for_term, "tag" => part_of_speech_tag_for_term) |
array | $tree | that consists of ["cur_node" => current parse position in $tagged_phrase] |
has fields "cur_node", the index of how far we parsed $tagged_phrase, and "NN", a subarray with a token node for the noun string that was parsed
parseVerb(array $tagged_phrase, array $tree) : array
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb if possible
array | $tagged_phrase | an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term) |
array | $tree | that consists of ["cur_node" => current parse position in $tagged_phrase] |
has fields "cur_node", the index of how far we parsed $tagged_phrase, and "VB", a subarray with a token node for the verb string that was parsed
parseAdjective(array $tagged_phrase, array $tree) : array
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an adjective if possible
array | $tagged_phrase | an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term) |
array | $tree | that consists of ["cur_node" => current parse position in $tagged_phrase] |
has fields "cur_node", the index of how far we parsed $tagged_phrase, and "JJ", a subarray with a token node for the adjective that was parsed
parsePostpositionPhrase(array $tagged_phrase, array $tree, integer $index = 1) : array
Takes a part-of-speech tagged phrase and parse-tree with a parse-from position and builds a parse tree for a sequence of postpositional phrases if possible
array | $tagged_phrase | an array of pairs of the form ("token" => token_for_term, "tag" => part_of_speech_tag_for_term) |
array | $tree | that consists of ["cur_node" => current parse position in $tagged_phrase] |
integer | $index | position in array to start from |
has fields "cur_node", the index of how far we parsed $tagged_phrase
parseNounPhrase(array $tagged_phrase, array $tree) : array
Takes a part-of-speech tagged phrase and parse-tree with a parse-from position and builds a parse tree for a noun phrase if possible
array | $tagged_phrase | an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term) |
array | $tree | that consists of ["cur_node" => current parse position in $tagged_phrase] |
has fields "cur_node", the index of how far we parsed $tagged_phrase; "JJ", with value an adjective subtree; and "NN", with value a noun subtree
parseVerbPhrase(array $tagged_phrase, array $tree) : array
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb phrase if possible
array | $tagged_phrase | an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term) |
array | $tree | that consists of ["cur_node" => current parse position in $tagged_phrase] |
has fields "cur_node", the index of how far we parsed $tagged_phrase, and "VP", a subarray whose possible fields include "VB", with a verb subtree as its value
parseWholePhrase(array $tagged_phrase, $tree = array()) : array
Given a part-of-speech tagged phrase array, generates a parse tree for the phrase using a recursive descent parser.
array | $tagged_phrase | an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term) |
$tree | this parameter is ignored but kept so as to match other methods in the recursive descent parser, such as parseNounPhrase |
used to represent a tree. The array has the field $tree["cur_node"], the index of how far we parsed $tagged_phrase, and up to three subtree fields: $tree["NP"], which contains a subtree for a subject phrase; $tree["POST"], which contains a subtree for an object phrase; and $tree["VP"], which contains a subtree for a predicate phrase.
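The recursive descent structure described above (each parser takes a tagged phrase and a tree holding "cur_node", and returns a tree with an advanced position) can be sketched in Python roughly as follows. This is a simplified illustration under stated assumptions, not the Yioop parser: the "PSP" postposition tag, the flat string-valued subtrees, and all function names are hypothetical.

```python
def parse_run(tagged_phrase, tree, tags, label):
    """Consume consecutive tokens tagged in tags and store them under label."""
    cur = tree["cur_node"]
    words = []
    while cur < len(tagged_phrase) and tagged_phrase[cur]["tag"] in tags:
        words.append(tagged_phrase[cur]["token"])
        cur += 1
    result = {"cur_node": cur}
    if words:
        result[label] = " ".join(words)
    return result


def parse_noun_phrase(tagged_phrase, tree):
    """Optional adjectives followed by at least one noun."""
    adjective = parse_run(tagged_phrase, tree, {"JJ"}, "JJ")
    noun = parse_run(tagged_phrase, adjective, {"NN"}, "NN")
    if "NN" not in noun:
        return {"cur_node": tree["cur_node"]}  # no match, do not advance
    return {**adjective, **noun}


def parse_postposition_phrase(tagged_phrase, tree):
    """A noun phrase followed by a postposition (hypothetical tag PSP)."""
    noun_phrase = parse_noun_phrase(tagged_phrase, tree)
    post = parse_run(tagged_phrase, noun_phrase, {"PSP"}, "PSP")
    if "NN" not in noun_phrase or "PSP" not in post:
        return {"cur_node": tree["cur_node"]}
    return {**noun_phrase, **post}


def parse_whole_phrase(tagged_phrase):
    """Try subject (NP), object (POST), then predicate (VP), in that order."""
    tree = {"cur_node": 0}
    steps = (
        ("NP", parse_noun_phrase),
        ("POST", parse_postposition_phrase),
        ("VP", lambda p, t: parse_run(p, t, {"VB"}, "VB")),
    )
    for label, parser in steps:
        subtree = parser(tagged_phrase, {"cur_node": tree["cur_node"]})
        if subtree["cur_node"] > tree["cur_node"]:
            tree[label] = subtree
            tree["cur_node"] = subtree["cur_node"]
    return tree


# "The small boy goes to the big school", with hypothetical tags.
tokens = ["छोटा", "लड़का", "बड़े", "स्कूल", "को", "जाता", "है"]
tags = ["JJ", "NN", "JJ", "NN", "PSP", "VB", "VB"]
phrase = [{"token": t, "tag": g} for t, g in zip(tokens, tags)]
tree = parse_whole_phrase(phrase)
print(tree)
```

Each sub-parser either advances "cur_node" and contributes a subtree or leaves the position untouched, which is what lets the top-level parser skip phrase types that are absent.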
extractTripletsPhrases(array $word_and_phrase_list) : array
Scans a word list for phrases. For each phrase found, generates a list of question and answer pairs at two levels of granularity: CONCISE (using all terms in the original phrase) and RAW (removing adjectives, etc.).
array | $word_and_phrase_list | of statements |
with two fields: QUESTION_LIST consisting of (SUBJECT, COMPLEMENT) where one of the components has been replaced with a question marker.
extractDeepestSpeechPartPhrase(array $tree, string $pos) : string
Takes a phrase tree $tree and a part of speech $pos, and returns the deepest $pos-only path in the tree.
array | $tree | phrase to extract type from |
string | $pos | the part of speech to extract |
the label of deepest $pos only path in $tree
extractSubjectParseTree( $tree) : array
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW: the former has the subject of the original phrase (as a string), and the latter has the important parts of the subject.
$tree |
with two fields CONCISE and RAW as described above
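The difference between the two fields can be illustrated with the Python sketch below. It assumes a simplified tree in which the "NP" subtree stores its adjective and noun as plain strings under "JJ" and "NN"; that layout and the function name are assumptions, and the real method may walk a deeper tree.

```python
def extract_subject(tree):
    """CONCISE keeps adjectives and nouns of the subject; RAW keeps only nouns."""
    noun_phrase = tree.get("NP", {})
    concise = " ".join(noun_phrase[k] for k in ("JJ", "NN") if k in noun_phrase)
    raw = noun_phrase.get("NN", "")
    return {"CONCISE": concise, "RAW": raw}


tree = {"cur_node": 2, "NP": {"cur_node": 2, "JJ": "छोटा", "NN": "लड़का"}}
print(extract_subject(tree))
```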
extractPredicateParseTree( $tree) : array
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW: the former has the predicate of the original phrase (as a string), and the latter has the important parts of the predicate.
$tree |
with two fields CONCISE and RAW as described above
extractObjectParseTree( $tree) : array
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW: the former has the object of the original phrase (as a string), and the latter has the important parts of the object.
$tree |
with two fields CONCISE and RAW as described above
extractTripletsParseTree(array $parse_tree) : array
Takes a parse tree of a phrase and computes subject, predicate, and object arrays. Each of these arrays consists of two components, CONCISE and RAW: CONCISE corresponds to something more similar to the words in the original phrase, and RAW to the case where extraneous words have been removed.
array | $parse_tree | a parse tree for a sentence |
triplet array
rearrangeTripletsByType(array $sub_pred_obj_triplets) : array
Takes a triplets array with subject, predicate, object fields with CONCISE and RAW subfields and rearranges it to have two fields CONCISE and RAW with subject, predicate, object, and QUESTION_ANSWER_LIST subfields
array | $sub_pred_obj_triplets | in format described above |
$processed_triplets in format described above
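The rearrangement amounts to swapping the two levels of nesting, which the Python sketch below illustrates. It covers only the subject, predicate, and object subfields and leaves out QUESTION_ANSWER_LIST; the lowercase key names and the sample phrases are assumptions.

```python
def rearrange_triplets_by_type(sub_pred_obj_triplets):
    """Turn {part: {type: phrase}} into {type: {part: phrase}}."""
    rearranged = {}
    for part, by_type in sub_pred_obj_triplets.items():
        for kind, phrase in by_type.items():
            rearranged.setdefault(kind, {})[part] = phrase
    return rearranged


triplets = {
    "subject": {"CONCISE": "छोटा लड़का", "RAW": "लड़का"},
    "predicate": {"CONCISE": "जाता है", "RAW": "जाता है"},
    "object": {"CONCISE": "बड़े स्कूल को", "RAW": "स्कूल"},
}
print(rearrange_triplets_by_type(triplets))
```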
extractTripletByType(array $sub_pred_obj_triplets, string $type) : array
Takes a triplets array with subject, predicate, object fields with CONCISE, RAW subfields and produces triplets with $type subfield where $type is one of CONCISE and RAW and with subject, predicate, object and QUESTION_ANSWER_LIST subfields
array | $sub_pred_obj_triplets | in format described above |
string | $type | either CONCISE or RAW |
$triplets in format described above
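As a rough Python illustration of selecting one granularity and attaching question and answer pairs, the sketch below replaces each component in turn with a question marker. The marker string, the question layout (subject, predicate, object joined by spaces), and the dictionary shape of QUESTION_ANSWER_LIST are all assumptions; the format used by the actual method is not specified in this section.

```python
QUESTION_MARKER = "qqq"  # hypothetical placeholder token
PARTS = ("subject", "predicate", "object")


def extract_triplet_by_type(sub_pred_obj_triplets, kind):
    """Pick the kind (CONCISE or RAW) variant and build question/answer pairs."""
    triplet = {part: sub_pred_obj_triplets[part][kind] for part in PARTS}
    question_answers = {}
    for hidden in PARTS:
        question = " ".join(
            QUESTION_MARKER if part == hidden else triplet[part]
            for part in PARTS
        )
        question_answers[question] = triplet[hidden]
    triplet["QUESTION_ANSWER_LIST"] = question_answers
    return triplet


triplets = {
    "subject": {"CONCISE": "छोटा लड़का", "RAW": "लड़का"},
    "predicate": {"CONCISE": "जाता है", "RAW": "जाता है"},
    "object": {"CONCISE": "बड़े स्कूल को", "RAW": "स्कूल"},
}
print(extract_triplet_by_type(triplets, "RAW"))
```

Each question is the triplet with one component hidden, and the hidden component is stored as its answer.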
parseQuestion(string $tagged_question, integer $index) : array
Takes a tagged question string that starts with "Who" and returns the question triplet from the question string.
string | $tagged_question | part-of-speech tagged question |
integer | $index | current index in statement |
parsed triplet