\seekquarry\yioop\locale\en_US\resources\Tokenizer

This class has a collection of methods for English locale specific tokenization. In particular, it has a stemmer, a stop word remover (for use mainly in word cloud creation), and a part of speech tagger (for question answering). The stemmer is my stab at implementing the Porter Stemmer algorithm presented at http://tartarus.org/~martin/PorterStemmer/def.txt. The code is based on the non-thread-safe C version given by Martin Porter.

Since PHP is single-threaded, this should be okay. Here, given a word, its stem is that part of the word that is common to all its inflected variants. For example, tall is common to tall, taller, tallest. A stemmer takes a word and tries to produce its stem.

Summary

Public methods
__construct()
segment()
stopwordsRemover()
canonicalizePunctuatedTerms()
tagPartsOfSpeechPhrase()
tagTokenizePartOfSpeech()
isQuestion()
stem()
compressSentence()
rearrangeTripletsByType()
parseTypeList()
parseAdjective()
parseDeterminer()
parseNoun()
parseVerb()
parsePrepositionalPhrases()
parseNounPhrase()
parseVerbPhrase()
parseWholePhrase()
parseAuxClause()
extractTripletsParseTree()
extractTripletsPhrases()
extractDeepestSpeechPartPhrase()
extractObjectParseTree()
extractPredicateParseTree()
extractSubjectParseTree()
parseWhoQuestion()
parseWHPlusQuestion()
questionParser()
extractTripletByType()

Public properties
$no_stem_list
$semantic_rewrites
$question_token
$adjective_type
$adverb_type
$conjunction_type
$determiner_type
$noun_type
$verb_type
$stop_words

Constants
No constants found

Protected methods
No protected methods found

Protected properties
No protected properties found

Private methods
stemPhrase()
cons()
m()
vowelinstem()
doublec()
cvc()
ends()
setto()
r()
step1ab()
step1c()
step2()
step3()
step4()
step5()
taggedPartOfSpeechTokensToString()

Private properties
$buffer
$k
$j

Properties

$no_stem_list

$no_stem_list : array

Words we don't want to be stemmed

Type

array

$semantic_rewrites

$semantic_rewrites : array

Phrases we would like Yioop to rewrite before performing a query

Type

array

$question_token

$question_token : 

Any unique identifier corresponding to the component of a triplet which can be answered using a question answer list

Type

$adjective_type

$adjective_type : 

List of adjective-like parts of speech that might appear in lexicon file

Type

$adverb_type

$adverb_type : 

List of adverb-like parts of speech that might appear in lexicon file

Type

$conjunction_type

$conjunction_type : 

List of conjunction-like parts of speech that might appear in lexicon file

Type

$determiner_type

$determiner_type : 

List of determiner-like parts of speech that might appear in lexicon file

Type

$noun_type

$noun_type : 

List of noun-like parts of speech that might appear in lexicon file

Type

$verb_type

$verb_type : 

List of verb-like parts of speech that might appear in lexicon file

Type

$stop_words

$stop_words : 

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries

Type

$buffer

$buffer : string

Storage used in computing the stem

Type

string

$k

$k : integer

Index of the current end of the word at the current state of computing its stem

Type

integer

$j

$j : integer

Index to start of the suffix of the word being considered for manipulation

Type

integer

Methods

__construct()

__construct() 

Do any global set up for tokenizer (none in the case of en-US)

segment()

segment(string  $pre_segment) : string

Stub function which could be used for a word segmenter.

Such a segmenter on input thisisabunchofwords would output this is a bunch of words

Parameters

string $pre_segment

before segmentation

Returns

string —

a string with the words separated by spaces; for en-US the method is a stub and does nothing
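To make the intended behaviour concrete, here is a minimal sketch of what a dictionary-based segmenter could look like. It is written in Python for brevity and is not Yioop code: the en-US method is only a stub, and the word list `WORDS` and the function name `segment_words` are hypothetical.

```python
# Hypothetical dictionary; a real segmenter would need a far larger word list.
WORDS = {"this", "is", "a", "bunch", "of", "words"}

def segment_words(text):
    """Split unspaced text into dictionary words using dynamic programming."""
    n = len(text)
    # best[i] holds one segmentation of text[:i], or None if none was found.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in WORDS:
                best[end] = best[start] + [piece]
                break
    # Fall back to the unchanged input when no full segmentation exists.
    return " ".join(best[n]) if best[n] is not None else text

print(segment_words("thisisabunchofwords"))
```

Returning the input unchanged on failure matches the stub's documented behaviour of doing nothing.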

stopwordsRemover()

stopwordsRemover(mixed  $data) : mixed

Removes the stop words from the page (used for Word Cloud generation)

Parameters

mixed $data

either a string or an array of strings to remove stop words from

Returns

mixed —

$data with no stop words
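The string-or-array behaviour described above can be sketched as follows. This is an illustrative Python version, not the PHP implementation; `STOP_WORDS` is a tiny hypothetical sample, and splitting on whitespace is an assumption about how terms are delimited.

```python
# Hypothetical sample; the real $stop_words list is much longer.
STOP_WORDS = {"a", "and", "is", "of", "the"}

def remove_stop_words(data):
    """Accept a string or a list of strings, like the mixed $data parameter."""
    if isinstance(data, list):
        return [remove_stop_words(item) for item in data]
    # Compare case-insensitively but keep the original casing of kept terms.
    kept = [term for term in data.split() if term.lower() not in STOP_WORDS]
    return " ".join(kept)

print(remove_stop_words("The history of the web"))
```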

canonicalizePunctuatedTerms()

canonicalizePunctuatedTerms(string&  $string) 

This method tries to handle punctuation in terms specific to the English language such as abbreviations.

Parameters

string& $string

a string of words, etc which might involve such terms

tagPartsOfSpeechPhrase()

tagPartsOfSpeechPhrase(string  $phrase, boolean  $with_tokens = true) : string

Takes a phrase and tags each term in it with its part of speech.

So each term in the original phrase gets mapped to term~part_of_speech. This tagger is based on a Brill tagger. It makes use of a lexicon consisting of words from the Brown Corpus, together with a list of the part of speech tags that each word had in the Brown Corpus. These are used to get an initial part of speech (if a word is not present in the lexicon, we assume it is a noun). From this, a fixed set of rules is used to modify the initial tag if necessary.

Parameters

string $phrase

text to add part of speech tags to

boolean $with_tokens

whether to include the terms and the tags in the output string or just the part of speech tags

Returns

string —

$tagged_phrase phrase where each term has ~part_of_speech appended ($with_tokens == true) or just space separated part_of_speech (!$with_tokens)
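The two-stage Brill approach described above (lexicon lookup with a noun default, then contextual rewrite rules) can be sketched like this. It is a Python illustration under stated assumptions: the `LEXICON` entries, the single rewrite rule, and the name `tag_phrase` are hypothetical and are not taken from Yioop's lexicon file or rule set.

```python
# Hypothetical mini-lexicon: word -> tags seen for it, most likely tag first.
LEXICON = {
    "the": ["DT"],
    "dog": ["NN"],
    "wants": ["VBZ"],
    "to": ["TO"],
    "run": ["NN", "VB"],
}

def tag_phrase(phrase, with_tokens=True):
    tokens = phrase.split()
    # Stage 1: initial tag from the lexicon; unknown words default to noun.
    tags = [LEXICON.get(token.lower(), ["NN"])[0] for token in tokens]
    # Stage 2: one illustrative contextual rule, NN -> VB after a TO tag.
    for i in range(1, len(tags)):
        if tags[i - 1] == "TO" and tags[i] == "NN":
            tags[i] = "VB"
    if with_tokens:
        return " ".join(f"{token}~{tag}" for token, tag in zip(tokens, tags))
    return " ".join(tags)

print(tag_phrase("the dog wants to run"))
```

Here "run" starts as NN from the lexicon and is rewritten to VB because it follows "to", which is the kind of correction the fixed rule set is meant to make.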

tagTokenizePartOfSpeech()

tagTokenizePartOfSpeech(string  $text) : array

Splits input text into terms and outputs an array with one element per term, each element being an array with the term token and the part of speech tag.

Parameters

string $text

string to tag and tokenize

Returns

array —

of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term), one for each token in $text

isQuestion()

isQuestion(  $phrase) : boolean

Takes a phrase query entered by the user and returns true if it is a question and false if not

Parameters

$phrase

any statement

Returns

boolean —

true if the statement is a question

stem()

stem(string  $word) : string

Computes the stem of an English word

For example, jumps, jumping, jumpy, all have jump as a stem

Parameters

string $word

the string to stem

Returns

string —

the stem of $word

compressSentence()

compressSentence(string  $sentence_to_compress) : string

Take in a sentence and try to compress it to a smaller version that "retains the most important information and remains grammatically correct" (Jing 2000).

Parameters

string $sentence_to_compress

the sentence to compress

Returns

string —

the compressed sentence

rearrangeTripletsByType()

rearrangeTripletsByType(array  $sub_pred_obj_triplets) : array

Takes a triplets array with subject, predicate, object fields with CONCISE and RAW subfields and rearranges it to have two fields CONCISE and RAW with subject, predicate, object, and QUESTION_ANSWER_LIST subfields

Parameters

array $sub_pred_obj_triplets

in format described above

Returns

array —

$processed_triplets in format described above
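The rearrangement amounts to swapping the two nesting levels. A minimal Python sketch, assuming lowercase keys "subject", "predicate", "object" and omitting QUESTION_ANSWER_LIST because its construction is not described here; the function name is hypothetical.

```python
PARTS = ("subject", "predicate", "object")
KINDS = ("CONCISE", "RAW")

def rearrange_by_type(triplets):
    """Turn part -> kind -> value into kind -> part -> value."""
    rearranged = {kind: {} for kind in KINDS}
    for part in PARTS:
        for kind in KINDS:
            rearranged[kind][part] = triplets[part][kind]
    return rearranged

example = {
    "subject": {"CONCISE": "the quick fox", "RAW": "fox"},
    "predicate": {"CONCISE": "eats", "RAW": "eats"},
    "object": {"CONCISE": "ripe grapes", "RAW": "grapes"},
}
print(rearrange_by_type(example)["RAW"])
```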

parseTypeList()

parseTypeList(array&  $cur_node, array  $tagged_phrase, string  $type) : string

Starting at the $cur_node in a $tagged_phrase parse tree for an English sentence, create a phrase string for each of the next nodes which belong to part of speech group $type.

Parameters

array& $cur_node

node within parse tree

array $tagged_phrase

parse tree for phrase

string $type

self::$noun_type, self::$verb_type, etc

Returns

string —

phrase string involving only terms of that $type

parseAdjective()

parseAdjective(array  $tagged_phrase, array  $tree) : array

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an adjective if possible

Parameters

array $tagged_phrase

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

array $tree

that consists of ["cur_node" => current parse position in $tagged_phrase]

Returns

array —

has fields: "cur_node", the index of how far we parsed $tagged_phrase; and "JJ", a subarray with a token node for the adjective that was parsed

parseDeterminer()

parseDeterminer(array  $tagged_phrase, array  $tree) : array

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a determiner if possible

Parameters

array $tagged_phrase

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

array $tree

that consists of ["cur_node" => current parse position in $tagged_phrase]

Returns

array —

has fields: "cur_node", the index of how far we parsed $tagged_phrase; and "DT", a subarray with a token node for the determiner that was parsed

parseNoun()

parseNoun(array  $tagged_phrase, array  $tree) : array

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun if possible

Parameters

array $tagged_phrase

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

array $tree

that consists of ["cur_node" => current parse position in $tagged_phrase]

Returns

array —

has fields: "cur_node", the index of how far we parsed $tagged_phrase; and "NN", a subarray with a token node for the noun string that was parsed

parseVerb()

parseVerb(array  $tagged_phrase, array  $tree) : array

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb if possible

Parameters

array $tagged_phrase

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

array $tree

that consists of ["cur_node" => current parse position in $tagged_phrase]

Returns

array —

has fields: "cur_node", the index of how far we parsed $tagged_phrase; and "VB", a subarray with a token node for the verb string that was parsed

parsePrepositionalPhrases()

parsePrepositionalPhrases(array  $tagged_phrase, array  $tree, integer  $index = 1) : array

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a sequence of prepositional phrases if possible

Parameters

array $tagged_phrase

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

array $tree

that consists of ["cur_node" => current parse position in $tagged_phrase]

integer $index

which term in $tagged_phrase to start to try to parse a preposition from

Returns

array —

has the field "cur_node", the index of how far we parsed $tagged_phrase, followed by additional possible fields (here i represents the ith clause found): "IN_i" with value a preposition subtree, "DT_i" with value a determiner subtree, "JJ_i" with value an adjective subtree, and "NN_i" with value an additional noun subtree

parseNounPhrase()

parseNounPhrase(array  $tagged_phrase, array  $tree) : array

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun phrase if possible

Parameters

array $tagged_phrase

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

array $tree

that consists of ["cur_node" => current parse position in $tagged_phrase]

Returns

array —

has fields: "cur_node", the index of how far we parsed $tagged_phrase; and "NP", a subarray with possible fields "DT" with value a determiner subtree, "JJ" with value an adjective subtree, and "NN" with value a noun tree

parseVerbPhrase()

parseVerbPhrase(array  $tagged_phrase, array  $tree) : array

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb phrase if possible

Parameters

array $tagged_phrase

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

array $tree

that consists of ["cur_node" => current parse position in $tagged_phrase]

Returns

array —

has fields: "cur_node", the index of how far we parsed $tagged_phrase; and "VP", a subarray with possible fields "VB" with value a verb subtree and "NP" with value a noun phrase subtree

parseWholePhrase()

parseWholePhrase(array  $tagged_phrase,   $tree) : array

Given a part-of-speech tagged phrase array, generates a parse tree for the phrase using a recursive descent parser.

Parameters

array $tagged_phrase

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

$tree

that consists of ["cur_node" => current parse position in $tagged_phrase]

Returns

array —

used to represent a tree. The array has up to three fields: $tree["cur_node"], the index of how far we parsed our $tagged_phrase; $tree["NP"], which contains a subtree for a noun phrase; and $tree["VP"], which contains a subtree for a verb phrase
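The recursive descent idea, where each parse function starts at "cur_node" and reports how far it got, can be sketched as follows. This is a simplified Python illustration, not the PHP code: it assumes a grammar of just NP (optional DT, optional JJ, one NN) followed by VP (VB plus optional NP), stores plain tokens rather than token nodes, and matches tags by prefix instead of using the $noun_type style lists.

```python
def parse_noun_phrase(tagged, tree):
    i, np = tree["cur_node"], {}
    if i < len(tagged) and tagged[i]["tag"] == "DT":
        np["DT"] = tagged[i]["token"]
        i += 1
    if i < len(tagged) and tagged[i]["tag"].startswith("JJ"):
        np["JJ"] = tagged[i]["token"]
        i += 1
    if i < len(tagged) and tagged[i]["tag"].startswith("NN"):
        np["NN"] = tagged[i]["token"]
        return {"cur_node": i + 1, "NP": np}
    # No noun found: report failure by leaving the parse position unchanged.
    return {"cur_node": tree["cur_node"]}

def parse_verb_phrase(tagged, tree):
    i = tree["cur_node"]
    if i >= len(tagged) or not tagged[i]["tag"].startswith("VB"):
        return {"cur_node": i}
    vp = {"VB": tagged[i]["token"]}
    sub = parse_noun_phrase(tagged, {"cur_node": i + 1})
    if "NP" in sub:
        vp["NP"] = sub["NP"]
    return {"cur_node": sub["cur_node"], "VP": vp}

def parse_whole_phrase(tagged, tree):
    np = parse_noun_phrase(tagged, tree)
    vp = parse_verb_phrase(tagged, {"cur_node": np["cur_node"]})
    result = {"cur_node": vp["cur_node"]}
    result.update({key: sub[key] for key, sub in (("NP", np), ("VP", vp)) if key in sub})
    return result

tagged = [
    {"token": "the", "tag": "DT"}, {"token": "quick", "tag": "JJ"},
    {"token": "fox", "tag": "NN"}, {"token": "eats", "tag": "VBZ"},
    {"token": "grapes", "tag": "NNS"},
]
print(parse_whole_phrase(tagged, {"cur_node": 0}))
```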

parseAuxClause()

parseAuxClause(array  $tagged_phrase, array  $tree) : array

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an auxiliary clause if possible

Parameters

array $tagged_phrase

an array of pairs of the form ("token" => token_for_term, "tag"=> part_of_speech_tag_for_term)

array $tree

that consists of ["cur_node" => current parse position in $tagged_phrase]

Returns

array —

has fields "cur_node" index of how far we parsed $tagged_phrase

extractTripletsParseTree()

extractTripletsParseTree(array  $tree) : array

Takes a parse tree of a phrase and computes subject, predicate, and object arrays. Each of these arrays consists of two components, CONCISE and RAW: CONCISE corresponds to something more similar to the words in the original phrase, and RAW to the case where extraneous words have been removed

Parameters

array $tree

a parse tree for a sentence

Returns

array —

triplet array

extractTripletsPhrases()

extractTripletsPhrases(array  $word_and_phrase_list) : array

Scans a word list for phrases. For phrases found, generates a list of question and answer pairs at two levels of granularity: CONCISE (using all terms in the original phrase) and RAW (removing adjectives, etc.).

Parameters

array $word_and_phrase_list

of statements

Returns

array —

with two fields: QUESTION_LIST consisting of triplets (SUBJECT, PREDICATES, OBJECT) where one of the components has been replaced with a question marker.

extractDeepestSpeechPartPhrase()

extractDeepestSpeechPartPhrase(array  $tree, string  $pos) : string

Takes a phrase tree $tree and a part-of-speech $pos and returns the deepest $pos-only path in the tree.

Parameters

array $tree

phrase to extract type from

string $pos

the part of speech to extract

Returns

string —

the label of deepest $pos only path in $tree

extractObjectParseTree()

extractObjectParseTree(  $tree) : array

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former having the object of the original phrase (as a string), the latter having the important parts of the object

Parameters

$tree

Returns

array —

with two fields CONCISE and RAW as described above

extractPredicateParseTree()

extractPredicateParseTree(  $tree) : array

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former having the predicate of the original phrase (as a string), the latter having the important parts of the predicate

Parameters

$tree

Returns

array —

with two fields CONCISE and RAW as described above

extractSubjectParseTree()

extractSubjectParseTree(  $tree) : array

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW, the former having the subject of the original phrase (as a string), the latter having the important parts of the subject

Parameters

$tree

Returns

array —

with two fields CONCISE and RAW as described above

parseWhoQuestion()

parseWhoQuestion(string  $tagged_question, integer  $index) : array

Takes a tagged question string that starts with Who and returns the question triplet from the question string

Parameters

string $tagged_question

part-of-speech tagged question

integer $index

current index in statement

Returns

array —

parsed triplet

parseWHPlusQuestion()

parseWHPlusQuestion(string  $tagged_question,   $index) : array

Takes a tagged question string that starts with a Wh+ word other than Who and returns the question triplet from the question string. Unlike the WHO case, here we assume there is an auxiliary verb followed by a noun phrase, then the rest of the verb phrase. For example, Where is soccer played?

Parameters

string $tagged_question

part-of-speech tagged question

$index

current index in statement

Returns

array —

parsed triplet suitable for query look-up

questionParser()

questionParser(string  $question) : array

Takes any question that starts with a WH question word and returns the triplet from the question

Parameters

string $question

question to parse

Returns

array —

question triplet

extractTripletByType()

extractTripletByType(array  $sub_pred_obj_triplets, string  $type) : array

Takes a triplets array with subject, predicate, object fields with CONCISE, RAW subfields and produces a triplets array with a $type subfield (where $type is one of CONCISE and RAW) and with subject, predicate, object, and QUESTION_ANSWER_LIST subfields

Parameters

array $sub_pred_obj_triplets

in format described above

string $type

either CONCISE or RAW

Returns

array —

$triplets in format described above

stemPhrase()

stemPhrase(string  $phrase) : string

Given an English phrase produces a phrase where each of the terms has been stemmed

Parameters

string $phrase

phrase to stem

Returns

string —

in which each term has been stemmed according to the English stemmer

cons()

cons(integer  $i) : boolean

Checks to see if the ith character in the buffer is a consonant

Parameters

integer $i

the character to check

Returns

boolean —

whether the ith character is a consonant
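As a sketch of the consonant test, the following Python function follows the rule in Porter's reference C code, where y counts as a consonant only at the start of the word or after a vowel. Passing the word as an argument (instead of reading $buffer) and the name `is_consonant` are choices made for this illustration.

```python
def is_consonant(word, i):
    """Return True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at position 0 or when it follows a vowel.
        return True if i == 0 else not is_consonant(word, i - 1)
    return True

print(is_consonant("toy", 2), is_consonant("syzygy", 1))
```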

m()

m() 

m() measures the number of consonant sequences between 0 and j. If c is a consonant sequence and v a vowel sequence, and [.] indicates arbitrary presence,

  [c][v]        gives 0
  [c]vc[v]      gives 1
  [c]vcvc[v]    gives 2
  [c]vcvcvc[v]  gives 3
  ...
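One way to compute this measure is to map each letter to c or v and count the places where a vowel is immediately followed by a consonant. The Python sketch below does that for a whole string rather than for positions 0 to j of $buffer; `measure` and `is_consonant` are illustrative names, and the test words are the examples from Porter's algorithm description.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return True if i == 0 else not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vc transitions, i.e. the m in the form [c](vc)^m[v]."""
    forms = ["c" if is_consonant(stem, i) else "v" for i in range(len(stem))]
    return sum(1 for a, b in zip(forms, forms[1:]) if a == "v" and b == "c")

for word in ("tree", "trouble", "troubles"):
    print(word, measure(word))
```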

vowelinstem()

vowelinstem() : boolean

Checks if 0,...,$j contains a vowel

Returns

boolean —

whether it does or not

doublec()

doublec(integer  $j) : boolean

Checks if $j,($j-1) contain a double consonant.

Parameters

integer $j

position to check in buffer for double consonant

Returns

boolean —

if it does or not
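A short Python sketch of the double consonant check, modelled on Porter's reference C code; the word is passed in explicitly here instead of being read from $buffer, and the function names are illustrative.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return True if i == 0 else not is_consonant(word, i - 1)
    return True

def ends_in_double_consonant(word, j):
    """True if word[j-1] and word[j] are the same consonant."""
    if j < 1 or word[j] != word[j - 1]:
        return False
    return is_consonant(word, j)

print(ends_in_double_consonant("matt", 3), ends_in_double_consonant("see", 2))
```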

cvc()

cvc(integer  $i) : boolean

Checks whether the letters at the indices $i-2, $i-1, $i in the buffer have the form consonant - vowel - consonant and also if the second c is not w, x or y. This is used when trying to restore an e at the end of a short word, e.g.

  cav(e), lov(e), hop(e), crim(e), but
  snow, box, tray.

Parameters

integer $i

position to check in buffer for consonant-vowel-consonant

Returns

boolean —

whether the letters at indices have the given form
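The consonant-vowel-consonant test can be sketched in Python as below, again following Porter's reference C code and passing the word explicitly; the names are illustrative. The checks use the examples listed above.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return True if i == 0 else not is_consonant(word, i - 1)
    return True

def is_cvc(word, i):
    """True if word[i-2..i] is consonant-vowel-consonant, last not w, x, y."""
    if i < 2:
        return False
    if not is_consonant(word, i) or is_consonant(word, i - 1) \
            or not is_consonant(word, i - 2):
        return False
    return word[i] not in "wxy"

for word in ("cav", "hop", "snow", "box", "tray"):
    print(word, is_cvc(word, len(word) - 1))
```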

ends()

ends(string  $s) : boolean

Checks if the buffer currently ends with the string $s

Parameters

string $s

string to use for check

Returns

boolean —

whether buffer currently ends with $s

setto()

setto(string  $s) 

setto($s) sets (j+1),...,k to the characters in the string $s, readjusting k.

Parameters

string $s

string to modify the end of buffer with
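To show how $buffer, $k and $j interact in ends() and setto(), here is a small Python sketch. It assumes the behaviour of Porter's reference C code, in which a successful ends() sets j to the index just before the matched suffix; the class name `StemBuffer` is hypothetical.

```python
class StemBuffer:
    def __init__(self, word):
        self.buffer = word
        self.k = len(word) - 1  # index of the current end of the word
        self.j = self.k         # index just before the suffix under consideration

    def ends(self, s):
        """If the buffer ends with s, record where the suffix starts."""
        if not self.buffer[: self.k + 1].endswith(s):
            return False
        self.j = self.k - len(s)
        return True

    def setto(self, s):
        """Replace positions j+1..k with s and readjust k."""
        self.buffer = self.buffer[: self.j + 1] + s
        self.k = self.j + len(s)

b = StemBuffer("ponies")
if b.ends("ies"):
    b.setto("i")
print(b.buffer, b.k, b.j)
```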

r()

r(string  $s) 

Sets the ending in the buffer to $s if the number of consonant sequences between 0 and $j, as measured by m(), is positive.

Parameters

string $s

what to change the suffix to

step1ab()

step1ab() 

step1ab() gets rid of plurals and -ed or -ing. e.g.

   caresses  ->  caress
   ponies    ->  poni
   ties      ->  ti
   caress    ->  caress
   cats      ->  cat

   feed      ->  feed
   agreed    ->  agree
   disabled  ->  disable

   matting   ->  mat
   mating    ->  mate
   meeting   ->  meet
   milling   ->  mill
   messing   ->  mess

   meetings  ->  meet
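The plural-removal part of this step (often called step 1a) is simple enough to sketch on its own. The Python function below covers only that part, not the -ed and -ing handling, and works on a plain string instead of $buffer; the name `step1a` is illustrative.

```python
def step1a(word):
    """Remove plural endings: sses -> ss, ies -> i, ss unchanged, s dropped."""
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word

for word in ("caresses", "ponies", "ties", "caress", "cats"):
    print(word, "->", step1a(word))
```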

step1c()

step1c() 

step1c() turns terminal y to i when there is another vowel in the stem.
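A minimal Python sketch of this rule, assuming the stem is everything before the final y and using Porter's definition of a consonant; the function names are illustrative and the word is passed as a plain string.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return True if i == 0 else not is_consonant(word, i - 1)
    return True

def step1c(word):
    """Turn a terminal y into i when the preceding stem contains a vowel."""
    stem_length = len(word) - 1
    has_vowel = any(not is_consonant(word, i) for i in range(stem_length))
    if word.endswith("y") and has_vowel:
        return word[:-1] + "i"
    return word

print(step1c("happy"), step1c("sky"))
```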

step2()

step2() 

step2() maps double suffixes to single ones, so -ization ( = -ize plus -ation) maps to -ize, etc. Note that the string before the suffix must give m() > 0.

step3()

step3() 

step3() deals with -ic-, -full, -ness etc. It uses a similar strategy to step2.

step4()

step4() 

step4() takes off -ant, -ence etc., in context <c>vcvc<v>.

step5()

step5() 

step5() removes a final -e if m() > 1, and changes -ll to -l if m() > 1.

taggedPartOfSpeechTokensToString()

taggedPartOfSpeechTokensToString(array  $tagged_tokens, boolean  $with_tokens = true) : string

Takes an array of pairs (token, tag) that came from phrase and builds a new phrase where terms look like token~tag.

Parameters

array $tagged_tokens

array of pairs as might come from tagTokenizePartOfSpeech()

boolean $with_tokens

whether to include the terms and the tags in the output string or just the part of speech tags

Returns

string —

$tagged_phrase, a phrase with terms in the format token~tag ($with_tokens == true) or space separated tags (!$with_tokens).