\seekquarry\yioop\locale\hi\resources\Tokenizer

Hindi-specific tokenization code. In particular, it has a stemmer. The stemmer is my stab at porting Ljiljana Dolamic's (University of Neuchatel, www.unine.ch/info/clef/) Java stemming algorithm: http://members.unine.ch/jacques.savoy/clef/HindiStemmerLight.java.txt Here, given a word, its stem is that part of the word that is common to all its inflected variants. For example, tall is common to tall, taller, tallest. A stemmer takes a word and tries to produce its stem.

Summary

Methods
Public: stopwordsRemover(), segment(), stem(), tagPartsOfSpeechPhrase(), tagTokenizePartOfSpeech(), tagUnknownWords(), taggedPartOfSpeechTokensToString(), parseTypeList(), parseNoun(), parseVerb(), parseAdjective(), parsePostpositionPhrase(), parseNounPhrase(), parseVerbPhrase(), parseWholePhrase(), extractTripletsPhrases(), extractDeepestSpeechPartPhrase(), extractSubjectParseTree(), extractPredicateParseTree(), extractObjectParseTree(), extractTripletsParseTree(), rearrangeTripletsByType(), extractTripletByType(), parseQuestion(), isQuestion(), questionParser()
Protected: none
Private: removeSuffix()

Properties
Public: $stop_words, $verb_type, $noun_type, $adjective_type, $postpositional_type, $question_pattern, $question_token, $no_stem_list
Protected: none
Private: none

Constants
None

Properties

$stop_words

$stop_words : 

A list of frequently occurring terms for this locale which should be excluded from certain kinds of queries. This list is also used for language detection.

Type

$verb_type

$verb_type : array

List of verb-like parts of speech that might appear in lexicon

Type

array

$noun_type

$noun_type : array

List of noun-like parts of speech that might appear in lexicon

Type

array

$adjective_type

$adjective_type : array

List of adjective-like parts of speech that might appear in lexicon

Type

array

$postpositional_type

$postpositional_type : array

List of postpositional-like parts of speech that might appear in lexicon

Type

array

$question_pattern

$question_pattern : array

List of questions in Hindi

Type

array

$question_token

$question_token : string

Any unique identifier corresponding to the component of a triplet which can be answered using a question answer list

Type

string

$no_stem_list

$no_stem_list : array

Words we don't want to be stemmed

Type

array

Methods

stopwordsRemover()

stopwordsRemover(mixed  $data) : mixed

Removes the stop words from the page (used for Word Cloud generation and language detection)

Parameters

mixed $data

either a string or an array of strings to remove stop words from

Returns

mixed —

$data with no stop words
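As a rough illustration of the behavior described above, the following sketch removes stop words from either a string or an array of strings. The function name removeStopWordsSketch and the short $example_stop_words list are hypothetical; the class's real $stop_words list and its matching rules may differ.

```php
<?php
// Hypothetical stop word list for illustration; not the class's $stop_words.
$example_stop_words = ["का", "की", "के", "है", "और"];

/**
 * Removes stop words from a string, or from each string in an array.
 * A sketch only: terms are split on whitespace and compared exactly.
 */
function removeStopWordsSketch($data, array $stop_words)
{
    if (is_array($data)) {
        return array_map(function ($item) use ($stop_words) {
            return removeStopWordsSketch($item, $stop_words);
        }, $data);
    }
    // The u flag makes the pattern treat the input as UTF-8.
    $terms = preg_split('/\s+/u', trim($data), -1, PREG_SPLIT_NO_EMPTY);
    $kept = array_filter($terms, function ($term) use ($stop_words) {
        return !in_array($term, $stop_words, true);
    });
    return implode(" ", $kept);
}

echo removeStopWordsSketch("राम का घर नया है", $example_stop_words), "\n";
```

Splitting on whitespace rather than using regular expression word boundaries avoids having to reason about how Devanagari vowel signs interact with \b.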

segment()

segment(string  $pre_segment) : string

Stub function which could be used for a word segmenter.

Such a segmenter on input thisisabunchofwords would output this is a bunch of words

Parameters

string $pre_segment

before segmentation

Returns

string —

should return a string with words separated by spaces; in this case it does nothing

stem()

stem(string  $word) : string

Computes the stem of a Hindi word

Parameters

string $word

the string to stem

Returns

string —

the stem of $word
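The general idea of a light, suffix-stripping stemmer with a no-stem list can be sketched as follows. This is not the ported algorithm itself: the function name stemSketch, the suffix list, the no-stem entry, and the minimum stem length of two characters are all assumptions made for illustration. The sketch also assumes the mbstring extension is available.

```php
<?php
// Illustrative values only; not the lists the class actually uses.
$example_no_stem_list = ["गीता"];
$example_suffixes = ["ियों", "ाओं", "ों", "ें", "ी", "े", "ा"];

/**
 * Strips the longest matching suffix from $word unless the word is in
 * the no-stem list. At least two characters of stem are always kept.
 */
function stemSketch(string $word, array $no_stem_list, array $suffixes): string
{
    if (in_array($word, $no_stem_list, true)) {
        return $word;
    }
    // Try longer suffixes first so a long suffix wins over its own tail.
    usort($suffixes, function ($a, $b) {
        return mb_strlen($b, "UTF-8") - mb_strlen($a, "UTF-8");
    });
    $word_len = mb_strlen($word, "UTF-8");
    foreach ($suffixes as $suffix) {
        $suffix_len = mb_strlen($suffix, "UTF-8");
        if ($word_len - $suffix_len >= 2 &&
            mb_substr($word, -$suffix_len, null, "UTF-8") === $suffix) {
            return mb_substr($word, 0, $word_len - $suffix_len, "UTF-8");
        }
    }
    return $word;
}

echo stemSketch("किताबों", $example_no_stem_list, $example_suffixes), "\n";
echo stemSketch("गीता", $example_no_stem_list, $example_suffixes), "\n";
```

The second call shows why a no-stem list is useful: without it, the final vowel sign of the name would be stripped like any other suffix.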

tagPartsOfSpeechPhrase()

tagPartsOfSpeechPhrase(string  $phrase, boolean  $with_tokens = true) : string

The method takes as input a phrase and returns a string with each term tagged with a part of speech.

Parameters

string $phrase

text to add part-of-speech tags to

boolean $with_tokens

whether to include the terms and the tags in the output string or just the part of speech tags

Returns

string —

$tagged_phrase which is a string of format term~pos

tagTokenizePartOfSpeech()

tagTokenizePartOfSpeech(string  $text) : string

Uses the lexicon to assign a tag to each token and then uses a rule-based approach to assign the most likely tag to each token

Parameters

string $text

input phrase which is to be tagged

Returns

string —

$result which is an array of token => tag

tagUnknownWords()

tagUnknownWords(array  $partially_tagged_text) : array

This method tags the remaining words in a partially tagged text array.

Parameters

array $partially_tagged_text

term array representing a text passage. Each element in the array is in turn an associative array [token => token_value, tag => tag_value (may be empty)]

Returns

array —

text passage array where all empty tags now have values
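The input and output shapes described above can be illustrated with a deliberately simple sketch that gives every untagged term a default tag. The function name tagUnknownWordsSketch and the default tag "NN" are assumptions; the actual method may apply more elaborate rules to choose a tag.

```php
<?php
/**
 * Fills in every empty tag with $default_tag.
 * A sketch of the data shape only, not of the real tagging rules.
 */
function tagUnknownWordsSketch(array $partially_tagged_text,
    string $default_tag = "NN"): array
{
    foreach ($partially_tagged_text as $i => $term) {
        if (empty($term["tag"])) {
            $partially_tagged_text[$i]["tag"] = $default_tag;
        }
    }
    return $partially_tagged_text;
}

$partially_tagged = [
    ["token" => "राम", "tag" => "NN"],
    ["token" => "फल", "tag" => ""],
];
print_r(tagUnknownWordsSketch($partially_tagged));
```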

taggedPartOfSpeechTokensToString()

taggedPartOfSpeechTokensToString(array  $tagged_tokens, boolean  $with_tokens = true) : string

This method is used to simplify the different tags of speech to a common form

Parameters

array $tagged_tokens

which is an array of tokens assigned tags.

boolean $with_tokens

whether to include the terms and the tags in the output string or just the part of speech tags

Returns

string —

$tagged_phrase which is a string of the form token~pos
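A minimal sketch of producing the token~pos output format, with and without tokens, might look like this. The function name taggedTokensToStringSketch, the $simplify map, and the tags NNP and VM are hypothetical; the tag simplification the real method performs is not documented here.

```php
<?php
/**
 * Joins tagged tokens into a "token~pos token~pos ..." string, or into
 * a string of tags only when $with_tokens is false.
 * $simplify maps detailed tags to a common form (assumed mapping).
 */
function taggedTokensToStringSketch(array $tagged_tokens,
    bool $with_tokens = true, array $simplify = []): string
{
    $parts = [];
    foreach ($tagged_tokens as $pair) {
        $tag = trim($pair["tag"]);
        $tag = $simplify[$tag] ?? $tag;
        $parts[] = $with_tokens ? trim($pair["token"]) . "~" . $tag : $tag;
    }
    return implode(" ", $parts);
}

$tagged = [
    ["token" => "राम", "tag" => "NNP"],
    ["token" => "गया", "tag" => "VM"],
];
$simplify = ["NNP" => "NN", "VM" => "VB"];
echo taggedTokensToStringSketch($tagged, true, $simplify), "\n";
echo taggedTokensToStringSketch($tagged, false, $simplify), "\n";
```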

parseTypeList()

parseTypeList(array  &$cur_node, array  $tagged_phrase, string  $type) : string

Starting at the $cur_node in a $tagged_phrase parse tree for a Hindi sentence, create a phrase string for each of the next nodes which belong to part of speech group $type.

Parameters

array &$cur_node

node within parse tree

array $tagged_phrase

parse tree for phrase

string $type

self::$noun_type, self::$verb_type, etc

Returns

string —

phrase string involving only terms of that $type
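The scan described above can be sketched as a loop that advances a by-reference position while the tags belong to the requested group. In this sketch $cur_node is assumed to be an integer index into $tagged_phrase and $type an array of tags; both are assumptions that differ from the documented types, and the function name parseTypeListSketch is hypothetical.

```php
<?php
/**
 * Collects consecutive tokens, starting at $cur_node, whose tag is in
 * $type. $cur_node is advanced past the tokens that were consumed.
 */
function parseTypeListSketch(int &$cur_node, array $tagged_phrase,
    array $type): string
{
    $tokens = [];
    $phrase_len = count($tagged_phrase);
    while ($cur_node < $phrase_len) {
        $tag = trim($tagged_phrase[$cur_node]["tag"]);
        if (!in_array($tag, $type, true)) {
            break;
        }
        $tokens[] = trim($tagged_phrase[$cur_node]["token"]);
        $cur_node++;
    }
    return implode(" ", $tokens);
}

$tagged_phrase = [
    ["token" => "सुंदर", "tag" => "JJ"],
    ["token" => "लाल", "tag" => "JJ"],
    ["token" => "फूल", "tag" => "NN"],
];
$cur_node = 0;
$adjectives = parseTypeListSketch($cur_node, $tagged_phrase, ["JJ"]);
$nouns = parseTypeListSketch($cur_node, $tagged_phrase, ["NN"]);
echo $adjectives, " | ", $nouns, " | ", $cur_node, "\n";
```

Passing the position by reference lets each caller pick up exactly where the previous scan stopped.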

parseNoun()

parseNoun(array  $tagged_phrase, array  $tree) : array

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a noun if possible

Parameters

array $tagged_phrase

an array of pairs of the form ("token" => token_for_term, "tag" => part_of_speech_tag_for_term)

array $tree

that consists of ["cur_node" => current parse position in $tagged_phrase]

Returns

array —

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "NN", a subarray with a token node for the noun string that was parsed

parseVerb()

parseVerb(array  $tagged_phrase, array  $tree) : array

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb if possible

Parameters

array $tagged_phrase

an array of pairs of the form ("token" => token_for_term, "tag" => part_of_speech_tag_for_term)

array $tree

that consists of ["cur_node" => current parse position in $tagged_phrase]

Returns

array —

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "VB", a subarray with a token node for the verb string that was parsed

parseAdjective()

parseAdjective(array  $tagged_phrase, array  $tree) : array

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for an adjective if possible

Parameters

array $tagged_phrase

an array of pairs of the form ("token" => token_for_term, "tag" => part_of_speech_tag_for_term)

array $tree

that consists of ["cur_node" => current parse position in $tagged_phrase]

Returns

array —

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "JJ", a subarray with a token node for the adjective that was parsed

parsePostpositionPhrase()

parsePostpositionPhrase(array  $tagged_phrase, array  $tree, integer  $index = 1) : array

Takes a part-of-speech tagged phrase and parse-tree with a parse-from position and builds a parse tree for a sequence of postpositional phrases if possible

Parameters

array $tagged_phrase

an array of pairs of the form ("token" => token_for_term, "tag" => part_of_speech_tag_for_term)

array $tree

that consists of ["cur_node" => current parse position in $tagged_phrase]

integer $index

position in array to start from

Returns

array —

has fields including "cur_node", the index of how far we parsed $tagged_phrase

parseNounPhrase()

parseNounPhrase(array  $tagged_phrase, array  $tree) : array

Takes a part-of-speech tagged phrase and parse-tree with a parse-from position and builds a parse tree for a noun phrase if possible

Parameters

array $tagged_phrase

an array of pairs of the form ("token" => token_for_term, "tag" => part_of_speech_tag_for_term)

array $tree

that consists of ["cur_node" => current parse position in $tagged_phrase]

Returns

array —

has fields "cur_node", the index of how far we parsed $tagged_phrase; "JJ", with value an adjective subtree; and "NN", with value a noun subtree

parseVerbPhrase()

parseVerbPhrase(array  $tagged_phrase, array  $tree) : array

Takes a part-of-speech tagged phrase and pre-tree with a parse-from position and builds a parse tree for a verb phrase if possible

Parameters

array $tagged_phrase

an array of pairs of the form ("token" => token_for_term, "tag" => part_of_speech_tag_for_term)

array $tree

that consists of ["cur_node" => current parse position in $tagged_phrase]

Returns

array —

has fields "cur_node", the index of how far we parsed $tagged_phrase, and "VP", a subarray with possible fields "VB", with value a verb subtree

parseWholePhrase()

parseWholePhrase(array  $tagged_phrase,   $tree = array()) : array

Given a part-of-speech tagged phrase array, generates a parse tree for the phrase using a recursive descent parser.

Parameters

array $tagged_phrase

an array of pairs of the form ("token" => token_for_term, "tag" => part_of_speech_tag_for_term)

$tree

this parameter is ignored but kept so as to match other methods in the recursive descent parser, such as parseNounPhrase()

Returns

array —

used to represent a tree. Besides $tree["cur_node"], the index of how far we parsed our $tagged_phrase, the array has up to three fields: $tree["NP"] contains a subtree for a subject phrase, $tree["POST"] contains a subtree for an object phrase, and $tree["VP"] contains a subtree for a predicate phrase
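To make the recursive descent structure concrete, here is a small sketch in which each phrase type has its own function and every function reports how far it parsed through "cur_node". The toy grammar (NP := JJ* NN+, POST := PSP NP, VP := VB+), the PSP tag, and all function names are assumptions chosen for brevity; they are not the grammar or the tag set the class implements.

```php
<?php
// Collects consecutive tokens whose tag is in $tags, advancing $cur_node.
function collectTaggedSketch(array $tagged_phrase, int &$cur_node,
    array $tags): string
{
    $tokens = [];
    while ($cur_node < count($tagged_phrase) &&
        in_array($tagged_phrase[$cur_node]["tag"], $tags, true)) {
        $tokens[] = $tagged_phrase[$cur_node]["token"];
        $cur_node++;
    }
    return implode(" ", $tokens);
}

// NP := JJ* NN+ ; returns an unchanged "cur_node" if no noun is found.
function parseNounPhraseSketch(array $tagged_phrase, array $tree): array
{
    $cur_node = $tree["cur_node"];
    $adjectives = collectTaggedSketch($tagged_phrase, $cur_node, ["JJ"]);
    $nouns = collectTaggedSketch($tagged_phrase, $cur_node, ["NN"]);
    if ($nouns === "") {
        return ["cur_node" => $tree["cur_node"]];
    }
    $result = ["cur_node" => $cur_node, "NN" => $nouns];
    if ($adjectives !== "") {
        $result["JJ"] = $adjectives;
    }
    return $result;
}

// Sentence := NP [POST] [VP] under the toy grammar above.
function parseWholePhraseSketch(array $tagged_phrase): array
{
    $tree = ["cur_node" => 0];
    $subject = parseNounPhraseSketch($tagged_phrase, $tree);
    if ($subject["cur_node"] > $tree["cur_node"]) {
        $tree["cur_node"] = $subject["cur_node"];
        unset($subject["cur_node"]);
        $tree["NP"] = $subject;
    }
    $cur_node = $tree["cur_node"];
    $postposition = collectTaggedSketch($tagged_phrase, $cur_node, ["PSP"]);
    if ($postposition !== "") {
        $object = parseNounPhraseSketch($tagged_phrase,
            ["cur_node" => $cur_node]);
        if ($object["cur_node"] > $cur_node) {
            $tree["cur_node"] = $object["cur_node"];
            unset($object["cur_node"]);
            $tree["POST"] = ["PSP" => $postposition, "NP" => $object];
        }
    }
    $cur_node = $tree["cur_node"];
    $verbs = collectTaggedSketch($tagged_phrase, $cur_node, ["VB"]);
    if ($verbs !== "") {
        $tree["VP"] = ["VB" => $verbs];
        $tree["cur_node"] = $cur_node;
    }
    return $tree;
}

$tagged_phrase = [
    ["token" => "राम", "tag" => "NN"],
    ["token" => "ने", "tag" => "PSP"],
    ["token" => "फल", "tag" => "NN"],
    ["token" => "खाया", "tag" => "VB"],
];
print_r(parseWholePhraseSketch($tagged_phrase));
```

Each parse function leaves "cur_node" unchanged when it fails to match, which is how the caller can tell that nothing was consumed.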

extractTripletsPhrases()

extractTripletsPhrases(array  $word_and_phrase_list) : array

Scans a word list for phrases. For phrases found, generates a list of question and answer pairs at two levels of granularity: CONCISE (using all terms in the original phrase) and RAW (removing adjectives, etc.).

Parameters

array $word_and_phrase_list

of statements

Returns

array —

with two fields: QUESTION_LIST consisting of (SUBJECT, COMPLEMENT) where one of the components has been replaced with a question marker.

extractDeepestSpeechPartPhrase()

extractDeepestSpeechPartPhrase(array  $tree, string  $pos) : string

Takes a phrase tree $tree and a part of speech $pos and returns the deepest $pos-only path in the tree.

Parameters

array $tree

phrase to extract type from

string $pos

the part of speech to extract

Returns

string —

the label of deepest $pos only path in $tree
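One way to read "deepest $pos-only path" is: keep following the $pos key while it leads to a subtree, and return the string found at the end. The sketch below implements that reading; the function name extractDeepestSketch and the nested-array tree shape are assumptions, and the real method may represent trees differently.

```php
<?php
/**
 * Follows the $pos key downward while it leads to a subtree, then
 * returns the string stored under $pos at the deepest level reached.
 * Returns "" if no such string exists.
 */
function extractDeepestSketch(array $tree, string $pos): string
{
    $node = $tree;
    while (isset($node[$pos]) && is_array($node[$pos])) {
        $node = $node[$pos];
    }
    if (isset($node[$pos]) && is_string($node[$pos])) {
        return $node[$pos];
    }
    return "";
}

$tree = ["NN" => ["JJ" => "मीठा", "NN" => ["NN" => "फल"]]];
echo extractDeepestSketch($tree, "NN"), "\n";
```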

extractSubjectParseTree()

extractSubjectParseTree(  $tree) : array

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW: the former has the subject of the original phrase (as a string), the latter has the important parts of the subject

Parameters

$tree

Returns

array —

with two fields CONCISE and RAW as described above

extractPredicateParseTree()

extractPredicateParseTree(  $tree) : array

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW: the former has the predicate of the original phrase (as a string), the latter has the important parts of the predicate

Parameters

$tree

Returns

array —

with two fields CONCISE and RAW as described above

extractObjectParseTree()

extractObjectParseTree(  $tree) : array

Takes a parse tree of a phrase or statement and returns an array with two fields, CONCISE and RAW: the former has the object of the original phrase (as a string), the latter has the important parts of the object

Parameters

$tree

Returns

array —

with two fields CONCISE and RAW as described above

extractTripletsParseTree()

extractTripletsParseTree(array  $parse_tree) : array

Takes a parse tree of a phrase and computes subject, predicate, and object arrays. Each of these arrays consists of two components, CONCISE and RAW: CONCISE corresponds to something more similar to the words in the original phrase, and RAW to the case where extraneous words have been removed

Parameters

array $parse_tree

a parse tree for a sentence

Returns

array —

triplet array

rearrangeTripletsByType()

rearrangeTripletsByType(array  $sub_pred_obj_triplets) : array

Takes a triplets array with subject, predicate, object fields with CONCISE and RAW subfields and rearranges it to have two fields CONCISE and RAW with subject, predicate, object, and QUESTION_ANSWER_LIST subfields

Parameters

array $sub_pred_obj_triplets

in format described above

Returns

array —

$processed_triplets in format described above

extractTripletByType()

extractTripletByType(array  $sub_pred_obj_triplets, string  $type) : array

Takes a triplets array with subject, predicate, object fields with CONCISE, RAW subfields and produces triplets with $type subfield where $type is one of CONCISE and RAW and with subject, predicate, object and QUESTION_ANSWER_LIST subfields

Parameters

array $sub_pred_obj_triplets

in format described above

string $type

either CONCISE or RAW

Returns

array —

$triplets in format described above
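The rearrangement performed by rearrangeTripletsByType() and extractTripletByType() amounts to swapping the two levels of nesting. The sketch below shows that swap; the function names, the "qqq" question token, and the layout of QUESTION_ANSWER_LIST (question string with one component replaced, mapped to the replaced answer) are assumptions made for illustration.

```php
<?php
/**
 * Builds [$type => [subject, predicate, object, QUESTION_ANSWER_LIST]]
 * from [subject => [CONCISE, RAW], predicate => ..., object => ...].
 */
function extractTripletByTypeSketch(array $sub_pred_obj_triplets,
    string $type, string $question_token = "qqq"): array
{
    $parts = [];
    foreach (["subject", "predicate", "object"] as $part) {
        $parts[$part] = $sub_pred_obj_triplets[$part][$type] ?? "";
    }
    $question_answer_list = [];
    foreach ($parts as $part => $answer) {
        if ($answer === "") {
            continue;
        }
        // Replace one component with the token; the rest form the question.
        $question = $parts;
        $question[$part] = $question_token;
        $question_answer_list[implode(" ", $question)] = $answer;
    }
    $parts["QUESTION_ANSWER_LIST"] = $question_answer_list;
    return [$type => $parts];
}

// Produces both the CONCISE and the RAW views in one array.
function rearrangeTripletsByTypeSketch(array $sub_pred_obj_triplets): array
{
    return array_merge(
        extractTripletByTypeSketch($sub_pred_obj_triplets, "CONCISE"),
        extractTripletByTypeSketch($sub_pred_obj_triplets, "RAW")
    );
}

$triplets = [
    "subject" => ["CONCISE" => "राम", "RAW" => "राम"],
    "predicate" => ["CONCISE" => "खाया", "RAW" => "खाया"],
    "object" => ["CONCISE" => "मीठा फल", "RAW" => "फल"],
];
print_r(rearrangeTripletsByTypeSketch($triplets));
```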

parseQuestion()

parseQuestion(string  $tagged_question, integer  $index) : array

Takes a tagged question string that starts with Who and returns the question triplet from the question string

Parameters

string $tagged_question

part-of-speech tagged question

integer $index

current index in statement

Returns

array —

parsed triplet

isQuestion()

isQuestion(  $phrase) : boolean

Takes a phrase query entered by a user and returns true if it is a question and false if not

Parameters

$phrase

any statement

Returns

boolean —

true if the statement is a question, false otherwise
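A simple way to approximate this check is to look for a known question word among the terms of the phrase. In the sketch below, the function name isQuestionSketch and the short $example_question_words list are hypothetical; the class's $question_pattern may be structured differently, for example as regular expressions.

```php
<?php
// Hypothetical question words; not the contents of $question_pattern.
$example_question_words = ["क्या", "कौन", "कब"];

/**
 * Returns true if any whitespace- or punctuation-delimited term of
 * $phrase is one of $question_words.
 */
function isQuestionSketch(string $phrase, array $question_words): bool
{
    // Split on whitespace, the question mark, and the danda (।).
    $terms = preg_split('/[\s?।]+/u', $phrase, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($terms as $term) {
        if (in_array($term, $question_words, true)) {
            return true;
        }
    }
    return false;
}

var_dump(isQuestionSketch("राम कौन है?", $example_question_words));
var_dump(isQuestionSketch("राम घर गया।", $example_question_words));
```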

questionParser()

questionParser(string  $question) : array

Takes a question and returns the triplet from the question

Parameters

string $question

question to parse

Returns

array —

question triplet

removeSuffix()

removeSuffix(string  $word) : string

Removes common Hindi suffixes

Parameters

string $word

to remove suffixes from

Returns

string —

result of suffix removal