$no_stem_list
$no_stem_list : array
Words we don't want to be stemmed
This class has a collection of methods for English locale specific tokenization. In particular, it has a stemmer, a stop word remover (for use mainly in word cloud creation), and a part of speech tagger (for question answering). The stemmer is my stab at implementing the Porter Stemmer algorithm presented at http://tartarus.org/~martin/PorterStemmer/def.txt. The code is based on the non-thread safe C version given by Martin Porter.
Since PHP is single-threaded this should be okay. Here, given a word, its stem is that part of the word that is common to all its inflected variants. For example, tall is common to tall, taller, tallest. A stemmer takes a word and tries to produce its stem.
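To make the idea of a stem concrete, here is a toy suffix stripper. It is deliberately not the Porter algorithm: the function name toyStem and its two-entry suffix list are assumptions made up for this illustration only.

```php
<?php
// Toy suffix stripper, NOT the Porter algorithm: it only illustrates what
// "stem" means. The suffix list is an assumption chosen for this example.
function toyStem(string $word): string
{
    foreach (["est", "er"] as $suffix) {
        $stem_length = strlen($word) - strlen($suffix);
        // Require at least three remaining letters so short words survive
        if ($stem_length >= 3 && substr($word, $stem_length) === $suffix) {
            return substr($word, 0, $stem_length);
        }
    }
    return $word;
}

foreach (["tall", "taller", "tallest"] as $word) {
    echo $word, " => ", toyStem($word), "\n";
}
```

All three inputs map to tall. A real stemmer such as Porter's applies many more rules, guarded by conditions on the shape of the remaining stem.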
segment(string $pre_segment) : string
Stub function which could be used for a word segmenter.
Such a segmenter on input thisisabunchofwords would output this is a bunch of words
string | $pre_segment | before segmentation |
Should return a string with words separated by spaces; in this case it does nothing.
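For illustration, a segmenter of the kind described could be sketched with greedy longest-match against a word list. The function greedySegment and its $dictionary parameter are hypothetical; the stub documented here takes only $pre_segment and returns it unchanged. The sketch works on bytes, so it assumes ASCII input.

```php
<?php
// Greedy longest-match segmenter. $dictionary is a hypothetical word list;
// the stub documented above does not use one.
function greedySegment(string $pre_segment, array $dictionary): string
{
    $known = array_flip($dictionary);
    $length = strlen($pre_segment);
    $words = [];
    $start = 0;
    while ($start < $length) {
        // Fall back to a single character if no dictionary word matches
        $match = $pre_segment[$start];
        for ($end = $length; $end > $start; $end--) {
            $candidate = substr($pre_segment, $start, $end - $start);
            if (isset($known[$candidate])) {
                $match = $candidate;
                break;
            }
        }
        $words[] = $match;
        $start += strlen($match);
    }
    return implode(" ", $words);
}

echo greedySegment("thisisabunchofwords",
    ["this", "is", "a", "bunch", "of", "words"]), "\n";
```

Greedy matching can pick a long word that blocks a valid split later in the string, so a production segmenter would more likely score alternative splits rather than commit to the first match.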
canonicalizePunctuatedTerms(string& $string)
This method tries to handle punctuation in terms specific to the English language such as abbreviations.
string& | $string | a string of words, etc. which might involve such terms |
tagPartsOfSpeechPhrase(string $phrase, boolean $with_tokens = true) : string
Takes a phrase and tags each term in it with its part of speech.
So each term in the original phrase gets mapped to term~part_of_speech. This tagger is based on a Brill tagger. It uses a lexicon consisting of words from the Brown corpus together with a list of the part of speech tags that each word had in the Brown Corpus. These are used to get an initial part of speech (if a word was not present then we assume it is a noun). From this a fixed set of rules is used to modify the initial tag if necessary.
string | $phrase | text to add part of speech tags to |
boolean | $with_tokens | whether to include the terms and the tags in the output string or just the part of speech tags |
$tagged_phrase phrase where each term has ~part_of_speech appended ($with_tokens == true) or just space separated part_of_speech (!$with_tokens)
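The two stages described above (lexicon lookup with a noun default, then rule-based correction) can be sketched as follows. This is a minimal illustration, not the class's implementation: toyTagPhrase, the three-word lexicon, and the single transformation rule are all assumptions made for the example.

```php
<?php
function toyTagPhrase(string $phrase, bool $with_tokens = true): string
{
    // Hypothetical miniature lexicon: word => initial tag. The real lexicon
    // is built from the Brown corpus; these three entries are made up.
    $lexicon = ["the" => "DT", "can" => "MD", "rusted" => "VBD"];
    $terms = preg_split('/\s+/', trim($phrase), -1, PREG_SPLIT_NO_EMPTY);
    $tags = [];
    foreach ($terms as $term) {
        // Initial tag from the lexicon; unknown words default to noun
        $tags[] = $lexicon[strtolower($term)] ?? "NN";
    }
    // One illustrative transformation rule (an assumption, not taken from
    // the real rule set): a modal right after a determiner becomes a noun
    for ($i = 1; $i < count($tags); $i++) {
        if ($tags[$i] === "MD" && $tags[$i - 1] === "DT") {
            $tags[$i] = "NN";
        }
    }
    $out = [];
    foreach ($terms as $i => $term) {
        $out[] = $with_tokens ? $term . "~" . $tags[$i] : $tags[$i];
    }
    return implode(" ", $out);
}

echo toyTagPhrase("the can rusted"), "\n";
echo toyTagPhrase("the can rusted", false), "\n";
```

With tokens the sketch yields the~DT can~NN rusted~VBD, and without tokens DT NN VBD: the lexicon first proposes MD for can, and the rule rewrites it because a determiner precedes it.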
tagTokenizePartOfSpeech(string $text) : array
Split input text into terms and output an array with one element per term, that element consisting of array with the term token and the part of speech tag.
string | $text | string to tag and tokenize |
array of pairs of the form ("token" => token_for_term, "tag" => part_of_speech_tag_for_term), one for each token in $text
compressSentence(string $sentence_to_compress) : string
Take in a sentence and try to compress it to a smaller version that "retains the most important information and remains grammatically correct" (Jing 2000).
string | $sentence_to_compress | the sentence to compress |
the compressed sentence
rearrangeTripletsByType(array $sub_pred_obj_triplets) : array
Takes a triplets array with subject, predicate, object fields with CONCISE and RAW subfields and rearranges it to have two fields CONCISE and RAW with subject, predicate, object, and QUESTION_ANSWER_LIST subfields
array | $sub_pred_obj_triplets | in format described above |
$processed_triplets in format described above
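The rearrangement amounts to swapping the two levels of array keys. A minimal sketch, assuming lowercase subject, predicate, object keys and string values (both assumptions; the exact key spelling is not given here), and leaving out QUESTION_ANSWER_LIST because its construction is not described in this section:

```php
<?php
// Pivot [part][level] into [level][part]. pivotTriplets is a hypothetical
// name; the QUESTION_ANSWER_LIST subfield is intentionally not built here.
function pivotTriplets(array $sub_pred_obj_triplets): array
{
    $processed_triplets = [];
    foreach ($sub_pred_obj_triplets as $part => $levels) {
        foreach ($levels as $level => $value) {
            $processed_triplets[$level][$part] = $value;
        }
    }
    return $processed_triplets;
}

// Example values are invented for the illustration
$triplets = [
    "subject" => ["CONCISE" => "the tall tree", "RAW" => "tree"],
    "predicate" => ["CONCISE" => "quickly grows", "RAW" => "grows"],
    "object" => ["CONCISE" => "green leaves", "RAW" => "leaves"],
];
print_r(pivotTriplets($triplets));
```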
parseTypeList(array& $cur_node, array $tagged_phrase, string $type) : string
Starting at the $cur_node in a $tagged_phrase parse tree for an English sentence, create a phrase string for each of the next nodes which belong to part of speech group $type.
array& | $cur_node | node within parse tree |
array | $tagged_phrase | parse tree for phrase |
string | $type | self::$noun_type, self::$verb_type, etc |
phrase string involving only terms of that $type
parseAdjective(array $tagged_phrase, array $tree) : array
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an adjective if possible
array | $tagged_phrase | an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term) |
array | $tree | that consists of ["cur_node" => current parse position in $tagged_phrase] |
has fields "cur_node", the index of how far we parsed $tagged_phrase, and "JJ", a subarray with a token node for the adjective that was parsed
parseDeterminer(array $tagged_phrase, array $tree) : array
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a determiner if possible
array | $tagged_phrase | an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term) |
array | $tree | that consists of ["cur_node" => current parse position in $tagged_phrase] |
has fields "cur_node", the index of how far we parsed $tagged_phrase, and "DT", a subarray with a token node for the determiner that was parsed
parseNoun(array $tagged_phrase, array $tree) : array
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun if possible
array | $tagged_phrase | an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term) |
array | $tree | that consists of ["cur_node" => current parse position in $tagged_phrase] |
has fields "cur_node", the index of how far we parsed $tagged_phrase, and "NN", a subarray with a token node for the noun string that was parsed
parseVerb(array $tagged_phrase, array $tree) : array
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb if possible
array | $tagged_phrase | an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term) |
array | $tree | that consists of ["cur_node" => current parse position in $tagged_phrase] |
has fields "cur_node", the index of how far we parsed $tagged_phrase, and "VB", a subarray with a token node for the verb string that was parsed
parsePrepositionalPhrases(array $tagged_phrase, array $tree, integer $index = 1) : array
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a sequence of prepositional phrases if possible
array | $tagged_phrase | an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term) |
array | $tree | that consists of ["cur_node" => current parse position in $tagged_phrase] |
integer | $index | which term in $tagged_phrase to start to try to parse a preposition from |
has fields "cur_node", the index of how far we parsed $tagged_phrase, followed by additional possible fields (here i represents the ith clause found): "IN_i" with value a preposition subtree, "DT_i" with value a determiner subtree, "JJ_i" with value an adjective subtree, "NN_i" with value an additional noun subtree
parseNounPhrase(array $tagged_phrase, array $tree) : array
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun phrase if possible
array | $tagged_phrase | an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term) |
array | $tree | that consists of ["cur_node" => current parse position in $tagged_phrase] |
has fields "cur_node", the index of how far we parsed $tagged_phrase, and "NP", a subarray with possible fields "DT" with value a determiner subtree, "JJ" with value an adjective subtree, "NN" with value a noun tree
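The parse methods in this group share one pattern: read the position from the tree's current-node field, try to consume tokens with the expected tags, and return a tree with the advanced position plus any subtrees found. The sketch below illustrates that pattern for a noun phrase under simplifying assumptions: at most one determiner and one adjective, prefix matching on tags, and a subtree shape of ["token" => ...]. The names takeTag and sketchParseNounPhrase are hypothetical, and the real method may accept more forms.

```php
<?php
// Consume one token whose tag starts with $tag, if present (hypothetical
// helper; matching tags by prefix is an assumption)
function takeTag(array $tagged_phrase, int $cur_node, string $tag): ?array
{
    $next = $tagged_phrase[$cur_node] ?? null;
    if ($next !== null && strpos($next["tag"], $tag) === 0) {
        return [$tag => ["token" => $next["token"]]];
    }
    return null;
}

function sketchParseNounPhrase(array $tagged_phrase, array $tree): array
{
    $cur_node = $tree["cur_node"];
    $np = [];
    // Optional determiner, optional adjective, then a required noun
    foreach (["DT", "JJ"] as $tag) {
        $found = takeTag($tagged_phrase, $cur_node, $tag);
        if ($found !== null) {
            $np += $found;
            $cur_node++;
        }
    }
    $noun = takeTag($tagged_phrase, $cur_node, "NN");
    if ($noun === null) {
        // Back off: no noun phrase here, so the position does not advance
        return ["cur_node" => $tree["cur_node"]];
    }
    return ["cur_node" => $cur_node + 1, "NP" => $np + $noun];
}

$phrase = [
    ["token" => "the", "tag" => "DT"],
    ["token" => "tall", "tag" => "JJ"],
    ["token" => "tree", "tag" => "NN"],
];
print_r(sketchParseNounPhrase($phrase, ["cur_node" => 0]));
```

Returning the unchanged position on failure lets a caller, such as a whole-phrase parser, detect that nothing was consumed and try another production.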
parseVerbPhrase(array $tagged_phrase, array $tree) : array
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb phrase if possible
array | $tagged_phrase | an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term) |
array | $tree | that consists of ["cur_node" => current parse position in $tagged_phrase] |
has fields "cur_node", the index of how far we parsed $tagged_phrase, and "VP", a subarray with possible fields "VB" with value a verb subtree, "NP" with value a noun phrase subtree
parseWholePhrase(array $tagged_phrase, $tree) : array
Given a part-of-speech tagged phrase array, generates a parse tree for the phrase using a recursive descent parser.
array | $tagged_phrase | an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term) |
$tree | that consists of ["cur_node" => current parse position in $tagged_phrase] |
array used to represent a tree. The array has up to three fields: $tree["cur_node"], the index of how far we parsed our $tagged_phrase; $tree["NP"], which contains a subtree for a noun phrase; $tree["VP"], which contains a subtree for a verb phrase
parseAuxClause(array $tagged_phrase, array $tree) : array
Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an auxiliary clause if possible
array | $tagged_phrase | an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term) |
array | $tree | that consists of ["cur_node" => current parse position in $tagged_phrase] |
has fields "cur_node" index of how far we parsed $tagged_phrase
extractTripletsParseTree(array $tree) : array
Takes a parse tree of a phrase and computes subject, predicate, and object arrays. Each of these arrays consists of two components, CONCISE and RAW: CONCISE corresponding to something more similar to the words in the original phrase and RAW to the case where extraneous words have been removed
array | $tree | a parse tree for a sentence |
triplet array
extractTripletsPhrases(array $word_and_phrase_list) : array
Scans a word list for phrases. For phrases found, generates a list of question and answer pairs at two levels of granularity: CONCISE (using all terms in the original phrase) and RAW (removing adjectives, etc.).
array | $word_and_phrase_list | of statements |
with two fields: QUESTION_LIST consisting of triplets (SUBJECT, PREDICATES, OBJECT) where one of the components has been replaced with a question marker.
extractDeepestSpeechPartPhrase(array $tree, string $pos) : string
Takes a phrase tree $tree and a part of speech $pos and returns the deepest $pos-only path in the tree.
array | $tree | phrase to extract type from |
string | $pos | the part of speech to extract |
the label of deepest $pos only path in $tree
extractObjectParseTree( $tree) : array
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former having the object of the original phrase (as a string), the latter having the important parts of the object
$tree |
with two fields CONCISE and RAW as described above
extractPredicateParseTree( $tree) : array
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former having the predicate of the original phrase (as a string), the latter having the important parts of the predicate
$tree |
with two fields CONCISE and RAW as described above
extractSubjectParseTree( $tree) : array
Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former having the subject of the original phrase (as a string), the latter having the important parts of the subject
$tree |
with two fields CONCISE and RAW as described above
parseWhoQuestion(string $tagged_question, integer $index) : array
Takes a tagged question string that starts with Who and returns a question triplet from the question string
string | $tagged_question | part-of-speech tagged question |
integer | $index | current index in statement |
parsed triplet
parseWHPlusQuestion(string $tagged_question, $index) : array
Takes a tagged question string that starts with Wh+ (except Who) and returns a question triplet from the question string. Unlike the Who case, here we assume there is an auxiliary verb followed by a noun phrase, then the rest of the verb phrase. For example, Where is soccer played?
string | $tagged_question | part-of-speech tagged question |
$index | current index in statement |
parsed triplet suitable for query look-up
extractTripletByType(array $sub_pred_obj_triplets, string $type) : array
Takes a triplets array with subject, predicate, object fields with CONCISE, RAW subfields and produces a triplets array with a $type subfield (where $type is one of CONCISE and RAW) and with subject, predicate, object, and QUESTION_ANSWER_LIST subfields
array | $sub_pred_obj_triplets | in format described above |
string | $type | either CONCISE or RAW |
$triplets in format described above
cvc(integer $i) : boolean
Checks whether the letters at the indices $i-2, $i-1, $i in the buffer have the form consonant - vowel - consonant and also whether the second consonant is not w, x or y. This is used when trying to restore an e at the end of a short word, e.g.
cav(e), lov(e), hop(e), crim(e), but snow, box, tray.
integer | $i | position to check in buffer for consonant-vowel-consonant |
whether the letters at indices have the given form
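The check can be sketched as below, following the consonant definition in Porter's published C version, where y counts as a consonant at the start of a word or after a vowel. Unlike the documented method, which reads from the stemmer's internal buffer, this standalone sketch takes the word as an explicit parameter; the names isConsonant and sketchCvc are hypothetical.

```php
<?php
// True if $word[$i] is a consonant; "y" counts as a consonant at the start
// of the word or after a vowel, as in Porter's definition
function isConsonant(string $word, int $i): bool
{
    $ch = $word[$i];
    if (strpos("aeiou", $ch) !== false) {
        return false;
    }
    if ($ch === "y") {
        return $i === 0 ? true : !isConsonant($word, $i - 1);
    }
    return true;
}

// Consonant-vowel-consonant test ending at index $i, where the final
// consonant must not be w, x or y
function sketchCvc(string $word, int $i): bool
{
    if ($i < 2 || !isConsonant($word, $i) || isConsonant($word, $i - 1)
        || !isConsonant($word, $i - 2)) {
        return false;
    }
    return strpos("wxy", $word[$i]) === false;
}

foreach (["cav", "lov", "hop", "crim", "snow", "box", "tray"] as $word) {
    $i = strlen($word) - 1;
    echo $word, ": ", (sketchCvc($word, $i) ? "true" : "false"), "\n";
}
```

The first four words pass the check, so a trailing e may be restored; snow, box and tray fail it because their final consonant is w, x or y.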
taggedPartOfSpeechTokensToString(array $tagged_tokens, boolean $with_tokens = true) : string
Takes an array of pairs (token, tag) that came from a phrase and builds a new phrase where terms look like token~tag.
array | $tagged_tokens | array of pairs as might come from tagTokenizePartOfSpeech |
boolean | $with_tokens | whether to include the terms and the tags in the output string or just the part of speech tags |
a phrase with terms in the format token~tag ($with_tokens == true) or space separated tags (!$with_tokens).
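A minimal sketch of this conversion, assuming each pair uses the "token" and "tag" keys described for tagTokenizePartOfSpeech. The name sketchTokensToString is hypothetical and the sample tags are illustrative.

```php
<?php
// Join (token, tag) pairs into "token~tag" terms, or into bare tags when
// $with_tokens is false
function sketchTokensToString(array $tagged_tokens,
    bool $with_tokens = true): string
{
    $parts = [];
    foreach ($tagged_tokens as $pair) {
        $parts[] = $with_tokens
            ? $pair["token"] . "~" . $pair["tag"]
            : $pair["tag"];
    }
    return implode(" ", $parts);
}

$tagged_tokens = [
    ["token" => "soccer", "tag" => "NN"],
    ["token" => "is", "tag" => "VBZ"],
    ["token" => "played", "tag" => "VBN"],
];
echo sketchTokensToString($tagged_tokens), "\n";
echo sketchTokensToString($tagged_tokens, false), "\n";
```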