Constants

TOKENIZER

TOKENIZER

Constant storing the string

CONTROL_WORD_INDICATOR

CONTROL_WORD_INDICATOR

Indicates the control word for programming languages

REGEX_INITIAL_POSITION

REGEX_INITIAL_POSITION

Indicates the initial position used when applying regular expressions

Properties

$meta_words_list

$meta_words_list : array

A list of meta words that might be extracted from a query

Type

array

$programming_language_map

$programming_language_map : array

A map of programming language names used when tokenizing source code

Type

array

$tokenizers

$tokenizers : array

Tokenizer objects that have been loaded so far

Type

array

Methods

extractWordStringPageSummary()

extractWordStringPageSummary(array  $page) : string

Converts a summary of a web page into a string of space separated words

Parameters

array $page

associative array of page summary data. Contains title, description, and links fields

Returns

string —

the concatenated words extracted from the page summary
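
As a rough illustration of the conversion described above, the sketch below flattens a summary array into space separated words. It is not Yioop's implementation: the function name summaryToWordString, the literal keys title, description, and links, and the assumption that links maps URLs to anchor text are all hypothetical.

```php
<?php
// Hypothetical helper, not part of Yioop: flattens a page summary into
// one string of space separated words.
function summaryToWordString(array $page): string
{
    $parts = [];
    foreach (['title', 'description'] as $field) {
        if (!empty($page[$field])) {
            $parts[] = $page[$field];
        }
    }
    // Assumption: links maps url => anchor text; only the text is kept.
    foreach ($page['links'] ?? [] as $url => $text) {
        $parts[] = $text;
    }
    // Replace runs of characters that are neither letters nor digits
    // with a single space.
    $words = preg_replace('/[^\p{L}\p{N}]+/u', ' ', implode(' ', $parts));
    return trim($words);
}
```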

extractPhrases()

extractPhrases(string  $string, string  $lang = null, string  $index_name = null, boolean  $exact_match = false, integer  $threshold = \seekquarry\yioop\configs\MIN_RESULTS_TO_GROUP) : array

Extracts all phrases (sequences of adjacent words) from $string. Does not extract terms within those phrases. The array key indicates the position of the phrase.

Parameters

string $string

subject to extract phrases from

string $lang

locale tag for stemming

string $index_name

name of index to be used as a reference when extracting phrases

boolean $exact_match

whether the match has to be exact or not

integer $threshold

roughly, stops the extraction of more phrases once $threshold is exceeded (more than $threshold phrases might still be returned, because extraction only stops when the excess is detected)

Returns

array —

of phrases

extractPhrasesAndCount()

extractPhrasesAndCount(string  $string, string  $lang = null) : array

Extracts all phrases (sequences of adjacent words) from $string. Does not extract terms within those phrases. Returns an associative array of phrase => number of occurrences of phrase.

Parameters

string $string

subject to extract phrases from

string $lang

locale tag for stemming

Returns

array —

pairs of the form (phrase, number of occurrences)
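
To make the phrase => count shape of the result concrete, here is a minimal sketch. It assumes every single lower-cased word counts as a phrase and does no stemming, which is simpler than what the method above does. The name countPhrases is hypothetical.

```php
<?php
// Hypothetical sketch: counts occurrences of each word, treating every
// single word as a "phrase". No stemming or locale handling is done.
function countPhrases(string $text): array
{
    // Split on runs of characters that are neither letters nor digits.
    $terms = preg_split(
        '/[^\p{L}\p{N}]+/u',
        strtolower($text),
        -1,
        PREG_SPLIT_NO_EMPTY
    );
    // array_count_values maps each distinct term to its frequency.
    return array_count_values($terms);
}
```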

extractPhrasesInLists()

extractPhrasesInLists(string  $string, string  $lang = null) : array

Extracts all phrases (sequences of adjacent words) from $string. Also extracts the terms within those phrases.

Parameters

string $string

subject to extract phrases from

string $lang

locale tag used for stemming and other phrase processing

Returns

array —

word => list of positions at which the word occurred in the document

extractTermPositions()

extractTermPositions(string  $string, string  $lang) : array

Extracts from $string an associative array of terms and the positions of those terms within $string

Parameters

string $string

text to extract terms and their positions from

string $lang

locale of text

Returns

array —

associative array of terms and positions
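
The term => positions structure can be sketched as follows. This is an illustration only: the name termPositions is hypothetical, positions are word offsets counted from 0 (an assumption about the convention), and no stemming is applied.

```php
<?php
// Hypothetical sketch: maps each lower-cased term to the list of word
// offsets (counted from 0, an assumption) at which it occurs.
function termPositions(string $text): array
{
    $terms = preg_split(
        '/[^\p{L}\p{N}]+/u',
        strtolower($text),
        -1,
        PREG_SPLIT_NO_EMPTY
    );
    $positions = [];
    foreach ($terms as $offset => $term) {
        // Append this offset to the term's position list.
        $positions[$term][] = $offset;
    }
    return $positions;
}
```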

canonicalizePunctuatedTerms()

canonicalizePunctuatedTerms(string  &$string,   $lang = null) 

This method tries to convert acronyms, e-mail addresses, URLs, etc. into a format that does not involve punctuation that would be stripped as we extract phrases.

Parameters

string &$string

a string of words, etc. which might involve such terms

$lang

a language tag to use as part of the canonicalization process (not used right now)
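
The idea can be sketched as below, modifying the string by reference as the signature above does. The function name canonicalizeSketch and the replacement markers _at_ and _d_ are illustrative assumptions, not necessarily what Yioop emits, and only acronyms and e-mail addresses are handled.

```php
<?php
// Hypothetical sketch; the markers _at_ and _d_ are illustrative only.
function canonicalizeSketch(string &$string): void
{
    // Acronyms such as U.S.A.: drop the periods after single letters.
    $string = preg_replace_callback(
        '/\b(?:\p{L}\.){2,}/u',
        function (array $m): string {
            return str_replace('.', '', $m[0]);
        },
        $string
    );
    // E-mail addresses: replace @ and . with word-like markers so the
    // address survives punctuation stripping as a single term.
    $string = preg_replace_callback(
        '/[\w.+-]+@[\w-]+(?:\.[\w-]+)+/u',
        function (array $m): string {
            return str_replace(['@', '.'], ['_at_', '_d_'], $m[0]);
        },
        $string
    );
}
```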

hyphenateEntities()

hyphenateEntities(string  &$string,   $lang = null) 

Given a string, hyphenates words in the string which appear in a bloom filter for the given locale as phrases.

Parameters

string &$string

a string of words, etc which might involve such terms

$lang

a language tag to use as part of the canonicalization process

extractTermSentencePositionsTags()

extractTermSentencePositionsTags(string  $string, string  $lang = null, boolean  $extract_sentences = false) : array

Splits the string according to punctuation and white space, then extracts the terms (stems/char grams) and makes a position list for them. Then splits the string according to sentences and makes a position list for the sentences.

Parameters

string $string

to extract terms from

string $lang

IANA tag to look up stemmer under

boolean $extract_sentences

whether to extract sentences to be used by question answering system

Returns

array —

of terms and n word grams in the order they appeared in string

stemCharGramSegment()

stemCharGramSegment(string  $string, string  $lang, boolean  $to_string = false) : mixed

Given a string, splits it into terms by running any applicable segmenters, chargrammers, or stemmers of the given locale

Parameters

string $string

what to extract terms from

string $lang

locale tag to determine which stemming, chargramming, and segmentation need to be done

boolean $to_string

if the result should be imploded on space to a single string or left as an array of terms

Returns

mixed —

either an array of the terms computed from the string or a string where this array has been imploded on space

javaTokenizer()

javaTokenizer(string  $string, string  $lang) : array

Given a string, tokenizes it into Java tokens

Parameters

string $string

what to extract terms from

string $lang

indicates programming language

Returns

array —

the terms computed from the string

pythonTokenizer()

pythonTokenizer(string  $string, string  $lang) : array

Given a string, tokenizes it into Python tokens

Parameters

string $string

what to extract terms from

string $lang

indicates programming language

Returns

array —

the terms computed from the string

charGramTerms()

charGramTerms(array  $pre_terms, string  $lang) : array

Given an array of pre_terms, returns the character n-grams for the given terms, where n is the length Yioop uses for the language in question. If a stemmer is used for the language, then n-gramming is not done and this just returns an empty array. This method differs from getCharGramsTerm in that it may check for certain words and not char gram them. For example, it won't char gram URLs.

Parameters

array $pre_terms

the terms to make n-grams for

string $lang

locale tag to determine n to be used for n-gramming

Returns

array —

the n-grams for the terms in question

getCharGramsTerm()

getCharGramsTerm(array  $terms, string  $lang) : array

Returns the character n-grams for the given terms, where n is the length Yioop uses for the language in question. If a stemmer is used for the language, then n-gramming is not done and this just returns an empty array.

Parameters

array $terms

the terms to make n-grams for

string $lang

locale tag to determine n to be used for n-gramming

Returns

array —

the n-grams for the terms in question

getNGramsTerm()

getNGramsTerm(array  $terms, string  $n) : array

Returns the character n-grams of length n for the given terms.

Parameters

array $terms

the terms to make n-grams for

string $n

the n to use in n-gramming

Returns

array —

the n-grams for the terms in question
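
Character n-gramming of this kind can be sketched as follows. The name charNGrams is hypothetical, and keeping terms no longer than n as a single whole gram is an assumption rather than documented behavior.

```php
<?php
// Hypothetical sketch of character n-gramming; not Yioop's code.
function charNGrams(array $terms, int $n): array
{
    $ngrams = [];
    if ($n < 1) {
        return $ngrams;
    }
    foreach ($terms as $term) {
        // Split into UTF-8 characters rather than bytes.
        $chars = preg_split('//u', $term, -1, PREG_SPLIT_NO_EMPTY);
        if ($chars === false || count($chars) === 0) {
            continue; // skip invalid UTF-8 and empty terms
        }
        $len = count($chars);
        if ($len <= $n) {
            // Assumption: a term no longer than n is kept whole.
            $ngrams[] = $term;
            continue;
        }
        // Slide a window of n characters across the term.
        for ($i = 0; $i <= $len - $n; $i++) {
            $ngrams[] = implode('', array_slice($chars, $i, $n));
        }
    }
    return $ngrams;
}
```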

segmentSegment()

segmentSegment(string  $segment, string  $lang) 

Given a string to segment into words (where strings might not contain spaces), this function segments them according to the given locale's segmenter

Note: this method is not used when trying to extract keywords from urls. Instead, UrlParser::getWordsInHostUrl($url) is used.

Parameters

string $segment

string to split into terms

string $lang

IANA language tag used to look up the segmenter

stemTerms()

stemTerms(mixed  $string_or_array, string  $lang) : array

Splits the supplied string on white space, then stems each term using the stemmer for $lang, if one exists

Parameters

mixed $string_or_array

to extract stemmed terms from

string $lang

IANA tag to look up stemmer under

Returns

array —

stemmed terms if a stemmer exists; the unstemmed terms otherwise
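
The split-then-stem behavior can be sketched with a stemmer passed in as a callable. This is an illustration under assumptions: stemTermsSketch is a hypothetical name, and passing null stands in for "no stemmer exists for $lang".

```php
<?php
// Hypothetical sketch: $stemmer stands in for a locale's stemmer, and
// null means no stemmer is available for the language.
function stemTermsSketch($string_or_array, ?callable $stemmer = null): array
{
    // Accept either an array of terms or a white space separated string.
    $terms = is_array($string_or_array)
        ? array_values($string_or_array)
        : preg_split('/\s+/u', trim($string_or_array), -1,
            PREG_SPLIT_NO_EMPTY);
    if ($stemmer === null) {
        return $terms; // no stemmer: return the terms unchanged
    }
    return array_map($stemmer, $terms);
}
```

A toy stemmer that only strips a trailing "s", such as `function (string $t): string { return preg_replace('/s$/', '', $t); }`, is enough to exercise the sketch.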

stemTermsK()

stemTermsK(mixed  $string_or_array, string  $lang, string  $keep_empties) : array

Splits the supplied string on white space, then stems each term using the stemmer for $lang, if one exists

Parameters

mixed $string_or_array

to extract stemmed terms from

string $lang

IANA tag to look up stemmer under

string $keep_empties

whether to keep empty sentences or not

Returns

array —

stemmed terms if a stemmer exists; the unstemmed terms otherwise

getTokenizer()

getTokenizer(string  $lang) : object

Loads and instantiates a tokenizer object for a language, if one exists

Parameters

string $lang

IANA tag to look up tokenizer under

Returns

object —

tokenizer with methods to process strings for a language

calculateMetas()

calculateMetas(array  &$site) : array

Calculates the meta words to be associated with a given downloaded document. These words (for example, server:apache) will be associated with the document in the index even if the document itself did not contain them.

Parameters

array &$site

associative array containing info about a downloaded (or read from archive) document.

Returns

array —

of meta words to be associated with this document

calculateLinkMetas()

calculateLinkMetas(string  $url, string  $link_host, string  $link_text, string  $site_url, array  $url_info = array(), array  $link_word_lists = array()) : array

Used to compute all the meta ids for a given link with $url and $link_text that was on a site with $site_url.

Parameters

string $url

url of the link

string $link_host

url of the host name of the link

string $link_text

text of the anchor tag link came from

string $site_url

url of the page link was on

array $url_info

key value pairs which may have been generated as part of the page processor

array $link_word_lists

list of words used in anchor text associated with this link and their positions in the anchor text

Returns

array —

meta words associated with the link

reverseMaximalMatch()

reverseMaximalMatch(string  $segment, string  $locale, array  $additional_regexes = array()) : string

Used to split a string of text in the language given by $locale into space separated words. Ex: "acontinuousstringofwords" becomes "a continuous string of words". It operates by scanning from the end of the string to the front and splitting on the longest segment that is a word.

Parameters

string $segment

string to make into a string of space separated words

string $locale

IANA tag used to look up dictionary filter to use to do this segmenting

array $additional_regexes

additional regexes whose matches should be treated as a suffix

Returns

string —

space separated words
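
The end-to-front scan described above can be sketched as follows. This is a simplification under stated assumptions: a plain array keyed by word stands in for the locale's dictionary filter, $additional_regexes is ignored, and a single character is accepted as a fallback when no dictionary word ends at the current position. The name reverseMaximalMatchSketch is hypothetical.

```php
<?php
// Hypothetical sketch: $dictionary (word => anything) stands in for the
// locale's dictionary filter; additional regexes are not handled.
function reverseMaximalMatchSketch(string $segment, array $dictionary): string
{
    $chars = preg_split('//u', $segment, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return $segment; // invalid UTF-8: leave the input untouched
    }
    $words = [];
    $end = count($chars);
    while ($end > 0) {
        // Try the longest candidate ending at $end first. A single
        // character is always accepted so the loop is sure to advance.
        for ($start = 0; $start < $end; $start++) {
            $candidate = implode(
                '',
                array_slice($chars, $start, $end - $start)
            );
            if ($start === $end - 1 || isset($dictionary[$candidate])) {
                array_unshift($words, $candidate);
                $end = $start;
                break;
            }
        }
    }
    return implode(' ', $words);
}
```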

oneWord()

oneWord(string  $word_guess, string  $locale, array  $additional_regexes) : boolean

Checks if a given word guess is a single word with respect to a word detection bloom filter and regexes

Parameters

string $word_guess

word guess to be checked if a single word

string $locale

language in which to check whether the guess is a word

array $additional_regexes

used in checking for this locale if something should be considered a word

Returns

boolean —

true if a single word false otherwise

computeSafeSearchScore()

computeSafeSearchScore(array  $word_lists, integer  $len, string  $url = "") : integer

Scores documents according to the presence or absence of sexually explicit terms. Tries to work for several languages. Very crude classifier.

Parameters

array $word_lists

word => pos_list tuples

integer $len

length of text being examined in characters

string $url

optional url that the word_list came from, used to check against known porn sites

Returns

integer —

score of how explicit the document is

compressSentence()

compressSentence(string  $sentence_to_compress, string  $lang = null) : string

Calls the appropriate tokenizer's sentence compression method

Parameters

string $sentence_to_compress

the sentence to compress

string $lang

locale tag for stemming

Returns

string —

the compressed sentence