\seekquarry\yioop\library\Thesaurus

Class used to reorder the last 10 links computed by PhraseModel based on thesaurus semantic information. For English, thesaurus semantic information can be provided by WordNet, a lexical database of English available at http://wordnet.princeton.edu/. To enable it, define WORDNET_EXEC in your local_config file.

The idea behind thesaurus reordering is as follows. A query is first tagged for parts of speech. Each term is then looked up in the thesaurus for those parts of speech. Representative phrases for those term senses are extracted from the ranked thesaurus output, and a set of rewrites of the original query is created. By looking up the number of times these rewrites occur in the searched index, the top two phrases that represent the original query are computed. The BM25 similarity of these phrases is then scored against each of the 10 output summaries of PhraseModel and used to reorder the results.

To add thesaurus reordering for a different locale, two methods need to be written in that locale's tokenizer.php file:

tagPartsOfSpeechPhrase($phrase), which on an input phrase returns a string where each term_i in the phrase has been replaced with term_i~pos, where pos is a two-character part of speech: NN, VB, AJ, AV, or NA (if none of the previous apply).

scoredThesaurusMatches($term, $word_type, $whole_query), which takes a term from an original whole_query that has been tagged as one of the types VB (for verb), NN (for noun), AJ (for adjective), AV (for adverb), or NA (for anything else). It outputs a sequence of (score => array of thesaurus terms) associations, each score representing one word sense of the term.

Given that these methods have been implemented, if the use_thesaurus field of that language's tokenizer is set to true, the thesaurus will be used.
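To make the term_i~pos output format concrete, here is a minimal Python sketch of the tagging contract described above. It is only an illustration, not the Yioop tokenizer code (which is PHP): the function name tag_parts_of_speech_phrase and the tiny TOY_LEXICON lookup table are hypothetical stand-ins for a real part-of-speech tagger.

```python
# Hypothetical toy lexicon; a real locale tokenizer would use a proper tagger.
TOY_LEXICON = {"quick": "AJ", "dog": "NN", "runs": "VB", "quickly": "AV"}


def tag_parts_of_speech_phrase(phrase):
    """Return the phrase with each term rewritten as term~pos."""
    tagged_terms = []
    for term in phrase.split():
        # Fall back to NA when none of NN, VB, AJ, AV applies.
        pos = TOY_LEXICON.get(term.lower(), "NA")
        tagged_terms.append(f"{term}~{pos}")
    return " ".join(tagged_terms)


print(tag_parts_of_speech_phrase("quick dog runs home"))
# prints: quick~AJ dog~NN runs~VB home~NA
```

Each tagged term can then be looked up in the thesaurus under its part of speech.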

Summary

Public methods:
getSimilarPhrases()
scorePhrasesSummaries()
getInitialSuggestions()
numDocsIndex()
changeCaseOfStringArray()
calculateBM25()
calculateTFBM25()
calculateTermFreq()
calculateIDF()

Public properties: none
Constants: none
Protected methods and properties: none
Private methods and properties: none

Methods

getSimilarPhrases()

getSimilarPhrases(string  $orig_query, string  $index_name, string  $lang, integer  $threshold = 10) : array

Extracts similar phrases to the input query using thesaurus results.

Part-of-speech tagging is performed on the input, and the tagged output is looked up in the thesaurus. Using this, a ranked list of alternate query phrases is created. For those phrases, counts in the Yioop index are calculated and the top two phrases are selected.

Parameters

string $orig_query

input query from user

string $index_name

selected index for search engine

string $lang

locale tag for the query

integer $threshold

once the count in the posting list for any word reaches this threshold, counting stops and that count is returned


Returns

array —

of the top two phrases
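The selection step can be sketched as follows. This Python fragment is a hedged illustration rather than Yioop's PHP code: top_two_phrases and the count_docs callback are hypothetical names, and capping each count at the threshold and breaking ties by candidate order are assumptions.

```python
def top_two_phrases(candidates, count_docs, threshold=10):
    """Rank candidate rewrites by (capped) document count; keep the best two."""
    # Assumption: counts above the threshold are treated as equal to it.
    counts = {phrase: min(count_docs(phrase), threshold) for phrase in candidates}
    # sorted() is stable, so ties keep the original thesaurus ranking.
    ranked = sorted(candidates, key=lambda phrase: counts[phrase], reverse=True)
    return ranked[:2]


# Toy stand-in for index lookups.
toy_counts = {"fast car": 3, "quick automobile": 7, "rapid auto": 0}
candidates = ["fast car", "quick automobile", "rapid auto"]
print(top_two_phrases(candidates, lambda p: toy_counts.get(p, 0)))
# prints: ['quick automobile', 'fast car']
```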

scorePhrasesSummaries()

scorePhrasesSummaries(array  $similar_phrases, array  $summaries) : array

Gets array of BM25 scores for given input array of summaries and thesaurus generated queries

Parameters

array $similar_phrases

an array of thesaurus generated queries

array $summaries

an array of summaries generated at crawl time


Returns

array —

of BM25 scores, one for each document, based on the thesaurus similar phrases
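Once one score per summary is available, reordering the results amounts to sorting by score. The Python sketch below shows only that step; the helper name reorder_by_scores is hypothetical, and descending order with ties keeping their original rank is an assumption about how the scores are applied.

```python
def reorder_by_scores(results, scores):
    """Return results sorted by descending score; ties keep original order."""
    order = sorted(range(len(results)), key=lambda i: scores[i], reverse=True)
    return [results[i] for i in order]


print(reorder_by_scores(["doc A", "doc B", "doc C"], [0.1, 0.9, 0.5]))
# prints: ['doc B', 'doc C', 'doc A']
```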

getInitialSuggestions()

getInitialSuggestions(string  $query, string  $lang) : string

Computes suggested related phrases from the thesaurus based on part-of-speech tagging done on each query term.

Parameters

string $query

query entered by user

string $lang

locale tag for the query


Returns

string —

array $suggestion consisting of phrases suggested to be similar in meaning to some sense of the query

numDocsIndex()

numDocsIndex(string  $phrase, integer  $threshold, string  $index_name, string  $lang) : integer

Returns the number of documents in an index that a phrase occurs in.

If the phrase occurs in more than threshold documents, the search is cut off.

Parameters

string $phrase

to look up in index

integer $threshold

once the count in the posting list for any word reaches this threshold, counting stops and that count is returned

string $index_name

selected index for search engine

string $lang

locale tag for the query


Returns

integer —

number of documents phrase occurs in
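The threshold cut-off can be illustrated with a small Python sketch. It scans an in-memory list of documents instead of a Yioop index, so the name num_docs_with_phrase, the substring match, and stopping exactly when the count reaches the threshold are all assumptions made for illustration.

```python
def num_docs_with_phrase(phrase, documents, threshold):
    """Count documents containing phrase, stopping early at threshold."""
    needle = phrase.lower()
    count = 0
    for document in documents:
        if needle in document.lower():
            count += 1
            if count >= threshold:
                break  # enough evidence; skip the rest of the documents
    return count


docs = ["the quick fox", "Quick fox jumps", "slow dog", "a quick fox again"]
print(num_docs_with_phrase("quick fox", docs, threshold=2))   # prints: 2
print(num_docs_with_phrase("quick fox", docs, threshold=10))  # prints: 3
```

The early exit keeps lookups cheap for very common phrases, where only "at least threshold" matters.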

changeCaseOfStringArray()

changeCaseOfStringArray(array  $summaries) : array

Lower cases an array of strings

Parameters

array $summaries

strings to put into lower case


Returns

array —

with strings converted to lower case

calculateBM25()

calculateBM25(array  $idf, array  $tf,   $num_terms,   $num_summaries) 

Computes the BM25 score of each document in an array, given that the idf and tf scores for these documents have already been computed.

Parameters

array $idf

inverse doc frequency for given query array

array $tf

term frequency for given query array

$num_terms

number of terms that make up input query

$num_summaries

number of input summaries
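Given precomputed idf and tf values, a BM25 score per summary is commonly the sum over query terms of idf times normalized tf. The Python sketch below assumes that combination, with idf indexed by term and tf indexed by term and then summary; the name bm25_scores is hypothetical and this is not taken from the Yioop source.

```python
def bm25_scores(idf, tf, num_terms, num_summaries):
    """Sum idf[i] * tf[i][j] over terms i for each summary j."""
    scores = [0.0] * num_summaries
    for i in range(num_terms):
        for j in range(num_summaries):
            scores[j] += idf[i] * tf[i][j]
    return scores


idf = [1.0, 0.5]                  # one value per query term
tf = [[2.0, 0.0], [1.0, 4.0]]     # rows: terms, columns: summaries
print(bm25_scores(idf, tf, num_terms=2, num_summaries=2))
# prints: [2.5, 2.0]
```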


calculateTFBM25()

calculateTFBM25(array  $summaries, array  $terms) : array

Calculates the BM25 normalized term frequency of a set of terms in a collection of text summaries

Parameters

array $summaries

list of summary strings to compute the BM25 term frequency with respect to

array $terms

terms to compute the BM25 term frequency for


Returns

array —

$tfbm25 a 2D array with rows indexed by terms and columns indexed by summaries, where each entry is the tfbm25 score for that term in that document
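For reference, the textbook BM25 term-frequency normalization is tf * (k1 + 1) / (tf + k1 * (1 - b + b * len / avg_len)). The Python sketch below applies that standard formula; the constants k1 = 1.2 and b = 0.75, whitespace tokenization, and the name bm25_tf are assumptions, and Yioop's actual constants may differ.

```python
def bm25_tf(summaries, terms, k1=1.2, b=0.75):
    """Return table[i][j]: BM25-normalized frequency of terms[i] in summaries[j]."""
    docs = [summary.lower().split() for summary in summaries]
    if not docs:
        return [[] for _ in terms]
    # Guard against all-empty summaries to avoid dividing by zero.
    avg_len = (sum(len(doc) for doc in docs) / len(docs)) or 1.0
    table = []
    for term in terms:
        row = []
        for doc in docs:
            raw = doc.count(term.lower())
            # Length normalization: long documents are penalized.
            norm = k1 * (1 - b + b * len(doc) / avg_len)
            row.append(raw * (k1 + 1) / (raw + norm))
        table.append(row)
    return table


table = bm25_tf(["a a b", "b c d"], ["a", "z"])
print(table[1])  # prints: [0.0, 0.0]
```

In this toy input both summaries have average length, so the first row's entry for "a" in the first summary is 2 * 2.2 / (2 + 1.2).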

calculateTermFreq()

calculateTermFreq(array  $summaries, array  $terms) : array

Computes a 2D array of the number of occurrences of term i in document j.

Parameters

array $summaries

documents to compute frequencies in

array $terms

terms to compute frequencies for


Returns

array —

2D array as described above
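A minimal Python sketch of such a term-by-document count table follows. Counting whole whitespace-separated tokens after lower-casing is an assumption (the real method might count substrings instead), and term_freq is a hypothetical name.

```python
def term_freq(summaries, terms):
    """Return counts[i][j]: occurrences of terms[i] in summaries[j]."""
    docs = [summary.lower().split() for summary in summaries]
    return [[doc.count(term.lower()) for doc in docs] for term in terms]


print(term_freq(["red red blue", "blue green"], ["red", "blue"]))
# prints: [[2, 0], [1, 1]]
```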

calculateIDF()

calculateIDF(array  $summaries, array  $terms) : array

Gets the inverse document frequencies for a collection of terms in a set of documents.

IDF(term_i) = log_10(number of documents / number of documents containing term_i)

Parameters

array $summaries

documents to use in calculating IDF score

array $terms

terms to compute IDF score for


Returns

array —

$idf 1D array giving the inverse document frequency for each term
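The IDF formula above can be sketched in Python as follows. Matching on whitespace-separated tokens and returning 0.0 for a term that appears in no document (to avoid dividing by zero) are assumptions; inverse_doc_freq is a hypothetical name.

```python
import math


def inverse_doc_freq(summaries, terms):
    """Return log10(N / df) per term, where df = documents containing the term."""
    docs = [set(summary.lower().split()) for summary in summaries]
    num_docs = len(docs)
    idf = []
    for term in terms:
        doc_freq = sum(1 for doc in docs if term.lower() in doc)
        # Assumption: a term found nowhere gets 0.0 rather than an error.
        idf.append(math.log10(num_docs / doc_freq) if doc_freq else 0.0)
    return idf


summaries = ["apple"] + ["pear"] * 9   # 10 documents in total
print(inverse_doc_freq(summaries, ["apple", "kiwi"]))
# prints: [1.0, 0.0]
```

Rare terms such as "apple" (1 of 10 documents) get a high IDF, while terms present in most documents score close to zero.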