TOKENIZER
Constant storing the string
Library of functions used to manipulate words and phrases
extractWordStringPageSummary(array $page) : string
Converts a summary of a web page into a string of space separated words
array | $page | associative array of page summary data. Contains title, description, and links fields |
the concatenated words extracted from the page summary
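As a rough illustration of the behaviour described above, the following sketch joins the title, description, and link text of a summary into one space separated string. The function name `sketchWordStringPageSummary`, the plain `'title'`, `'description'`, and `'links'` keys, and the url => anchor text layout of the links are assumptions made for this example; the library's own field keys and processing may differ.

```php
<?php
// Hypothetical stand-in, not the library method. The keys 'title',
// 'description', and 'links' are assumed names for the summary fields.
function sketchWordStringPageSummary(array $page): string
{
    $parts = [];
    if (isset($page['title'])) {
        $parts[] = $page['title'];
    }
    if (isset($page['description'])) {
        $parts[] = $page['description'];
    }
    if (isset($page['links']) && is_array($page['links'])) {
        // Assumption: links maps url => anchor text; only the text is kept.
        foreach ($page['links'] as $url => $anchor_text) {
            $parts[] = $anchor_text;
        }
    }
    // Collapse runs of white space so the result is space separated.
    return trim(preg_replace('/\s+/u', " ", implode(" ", $parts)));
}

$page = [
    'title' => "Yioop  Search",
    'description' => "An open source\nsearch engine",
    'links' => ["https://www.seekquarry.com/" => "SeekQuarry home"],
];
echo sketchWordStringPageSummary($page), "\n";
// prints "Yioop Search An open source search engine SeekQuarry home"
```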
extractPhrases(string $string, string $lang = null, string $index_name = null, boolean $exact_match = false, integer $threshold = \seekquarry\yioop\configs\MIN_RESULTS_TO_GROUP) : array
Extracts all phrases (sequences of adjacent words) from $string. Does not extract terms within those phrases. The array key indicates the position of the phrase
string | $string | subject to extract phrases from |
string | $lang | locale tag for stemming |
string | $index_name | name of index to be used as a reference when extracting phrases |
boolean | $exact_match | whether the match has to be exact or not |
integer | $threshold | roughly, stops the extraction of further phrases once more than $threshold have been found (the result may still contain more than $threshold phrases, since extraction only stops once the excess is detected) |
array of phrases
extractPhrasesAndCount(string $string, string $lang = null) : array
Extracts all phrases (sequences of adjacent words) from $string. Does not extract terms within those phrases. Returns an associative array of phrase => number of occurrences of phrase
string | $string | subject to extract phrases from |
string | $lang | locale tag for stemming |
pairs of the form (phrase, number of occurrences)
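The shape of the returned array can be pictured with a simplified sketch that treats every single word as a phrase and skips stemming. The name `sketchTermsAndCount`, the ASCII-only lowercasing, and the splitting rule are assumptions for illustration, not the library's procedure.

```php
<?php
// Hypothetical sketch: counts single-word "phrases" only, with no stemming.
function sketchTermsAndCount(string $string): array
{
    // Split on anything that is not a letter or digit.
    // strtolower only lowercases ASCII letters, which is enough here.
    $terms = preg_split('/[^\p{L}\p{N}]+/u', strtolower($string), -1,
        PREG_SPLIT_NO_EMPTY);
    // array_count_values yields term => number of occurrences.
    return array_count_values($terms);
}

print_r(sketchTermsAndCount("The cat and the hat"));
// the => 2, cat => 1, and => 1, hat => 1
```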
extractPhrasesInLists(string $string, string $lang = null) : array
Extracts all phrases (sequences of adjacent words) from $string. Does extract terms within those phrases.
string | $string | subject to extract phrases from |
string | $lang | locale tag for stemming and other phrase processing |
word => list of positions at which the word occurred in the document
extractTermPositions(string $string, string $lang) : array
Extracts from $string an associative array of terms and the positions of those terms within $string
string | $string | text to extract terms and their positions from |
string | $lang | locale of text |
associative array of terms and positions
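A minimal sketch of a term => positions map follows. It assumes positions are zero-based word offsets and does no stemming; the function name `sketchTermPositions` is hypothetical, and the library may number positions differently.

```php
<?php
// Hypothetical sketch: positions are zero-based word offsets (an assumption).
function sketchTermPositions(string $string): array
{
    $terms = preg_split('/[^\p{L}\p{N}]+/u', strtolower($string), -1,
        PREG_SPLIT_NO_EMPTY);
    $positions = [];
    foreach ($terms as $index => $term) {
        // Append this offset to the list kept for the term.
        $positions[$term][] = $index;
    }
    return $positions;
}

print_r(sketchTermPositions("to be or not to be"));
// to => [0, 4], be => [1, 5], or => [2], not => [3]
```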
canonicalizePunctuatedTerms(\seekquarry\yioop\library\string& $string, $lang = null)
This method tries to convert acronyms, e-mail addresses, urls, etc. into a format that does not involve punctuation that would be stripped as we extract phrases.
\seekquarry\yioop\library\string& | $string | a string of words, etc which might involve such terms |
$lang | a language tag to use as part of the canonicalization process (currently unused) |
hyphenateEntities(\seekquarry\yioop\library\string& $string, $lang = null)
Given a string, hyphenates words in the string which appear in a bloom filter for the given locale as phrases.
\seekquarry\yioop\library\string& | $string | a string of words, etc which might involve such terms |
$lang | a language tag to use as part of the canonicalization process |
extractTermSentencePositionsTags(string $string, string $lang = null, boolean $extract_sentences = false) : array
Splits the string according to punctuation and white space, then extracts (stems/char grams) the terms and makes a position list for them. Then splits the string according to sentences and makes a position list for the sentences
string | $string | to extract terms from |
string | $lang | IANA tag to look up stemmer under |
boolean | $extract_sentences | whether to extract sentences to be used by question answering system |
array of terms and n word grams in the order they appeared in the string
stemCharGramSegment(string $string, string $lang, boolean $to_string = false) : mixed
Given a string splits it into terms by running any applicable segmenters, chargrammers, or stemmers of the given locale
string | $string | what to extract terms from |
string | $lang | locale tag to determine which stemmers, chargramming and segmentation needs to be done. |
boolean | $to_string | if the result should be imploded on space to a single string or left as an array of terms |
either an array of the terms computed from the string or a string where this array has been imploded on space
charGramTerms(array $pre_terms, string $lang) : array
Given an array of pre_terms, returns the character n-grams for the given terms, where n is the length Yioop uses for the language in question. If a stemmer is used for the language, then n-gramming is not done and this just returns an empty array. This method differs from getCharGramsTerm in that it may check certain words and not char gram them. For example, it won't char gram urls.
array | $pre_terms | the terms to make n-grams for |
string | $lang | locale tag to determine n to be used for n-gramming |
the n-grams for the terms in question
getCharGramsTerm(array $terms, string $lang) : array
Returns the character n-grams for the given terms, where n is the length Yioop uses for the language in question. If a stemmer is used for the language, then n-gramming is not done and this just returns an empty array.
array | $terms | the terms to make n-grams for |
string | $lang | locale tag to determine n to be used for n-gramming |
the n-grams for the terms in question
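Character n-gramming itself can be sketched as a sliding window over each term. In this sketch n is passed in directly rather than looked up from a locale, terms no longer than n are kept whole, and the name `sketchCharGrams` is hypothetical; all three are assumptions for illustration.

```php
<?php
// Hypothetical sketch of character n-gramming; $n is given directly
// instead of being derived from a locale tag.
function sketchCharGrams(array $terms, int $n): array
{
    $grams = [];
    foreach ($terms as $term) {
        // Split into UTF-8 characters so multibyte text is not cut mid-character.
        $chars = preg_split('//u', $term, -1, PREG_SPLIT_NO_EMPTY);
        $len = count($chars);
        if ($len <= $n) {
            // Assumption: short terms are kept as they are.
            $grams[] = $term;
            continue;
        }
        // Slide a window of width $n across the term.
        for ($i = 0; $i + $n <= $len; $i++) {
            $grams[] = implode("", array_slice($chars, $i, $n));
        }
    }
    return $grams;
}

print_r(sketchCharGrams(["search", "ab"], 3));
// sea, ear, arc, rch, ab
```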
segmentSegment(string $segment, string $lang)
Given a string to segment into words (where the string might not contain spaces), this function segments it according to the given locale's segmenter
Note: this method is not used when trying to extract keywords from urls. Instead, UrlParser::getWordsInHostUrl($url) is used.
string | $segment | string to split into terms |
string | $lang | IANA tag to look up segmenter under from some other language |
stemTerms(mixed $string_or_array, string $lang) : array
Splits the supplied string on white space, then stems each term according to the stemmer for $lang, if one exists
mixed | $string_or_array | to extract stemmed terms from |
string | $lang | IANA tag to look up stemmer under |
stemmed terms if a stemmer exists; unstemmed terms otherwise
stemTermsK(mixed $string_or_array, string $lang, string $keep_empties) : array
Splits the supplied string on white space, then stems each term according to the stemmer for $lang, if one exists
mixed | $string_or_array | to extract stemmed terms from |
string | $lang | IANA tag to look up stemmer under |
string | $keep_empties | whether to keep empty sentences or not |
stemmed terms if a stemmer exists; unstemmed terms otherwise
calculateMetas(\seekquarry\yioop\library\array& $site) : array
Calculates the meta words to be associated with a given downloaded document. These words will be associated with the document in the index (for example, server:apache) even if the document itself did not contain them.
\seekquarry\yioop\library\array& | $site | associative array containing info about a downloaded (or read from archive) document. |
array of meta words to be associated with this document
calculateLinkMetas(string $url, string $link_host, string $link_text, string $site_url, array $url_info = array(), array $link_word_lists = array()) : array
Used to compute all the meta ids for a given link with $url and $link_text that was on a site with $site_url.
string | $url | url of the link |
string | $link_host | url of the host name of the link |
string | $link_text | text of the anchor tag the link came from |
string | $site_url | url of the page the link was on |
array | $url_info | key value pairs which may have been generated as part of the page processor |
array | $link_word_lists | list of words used in anchor text associated with this link and their positions in the anchor text |
meta words associated with the link
reverseMaximalMatch(string $segment, string $locale, array $additional_regexes = array()) : string
Used to split a string of text in the language given by $locale into space separated words. Ex: "acontinuousstringofwords" becomes "a continuous string of words". It operates by scanning from the end of the string to the front and splitting on the longest segment that is a word.
string | $segment | string to make into a string of space separated words |
string | $locale | IANA tag used to look up dictionary filter to use to do this segmenting |
array | $additional_regexes | regexes matching text which should be treated as a suffix |
space separated words
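The scan described above can be sketched with a plain PHP array standing in for the locale's dictionary filter. The name `sketchReverseMaximalMatch`, the array-based dictionary, the byte-wise (ASCII only) indexing, and the fallback of emitting a single character when nothing matches are all assumptions for this example; the sketch also ignores $additional_regexes.

```php
<?php
// Hypothetical sketch: $dictionary is an array keyed by known words,
// used in place of the locale's dictionary filter.
function sketchReverseMaximalMatch(string $segment, array $dictionary): string
{
    $words = [];
    $end = strlen($segment); // byte length; this sketch assumes ASCII input
    while ($end > 0) {
        $matched = false;
        // Smaller $start means a longer candidate, so the longest word
        // ending at $end is found first.
        for ($start = 0; $start < $end; $start++) {
            $candidate = substr($segment, $start, $end - $start);
            if (isset($dictionary[$candidate])) {
                array_unshift($words, $candidate);
                $end = $start;
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            // Assumption: with no word ending here, emit one character.
            array_unshift($words, $segment[$end - 1]);
            $end--;
        }
    }
    return implode(" ", $words);
}

$dictionary = array_flip(["a", "continuous", "string", "of", "words"]);
echo sketchReverseMaximalMatch("acontinuousstringofwords", $dictionary), "\n";
// prints "a continuous string of words"
```

Scanning from the end and preferring the longest match means a short word such as "a" is only chosen after every longer candidate ending at the same place has been rejected.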
oneWord(string $word_guess, string $locale, array $additional_regexes) : boolean
Checks if a given word guess is a single word with respect to a word detection bloom filter and regexes
string | $word_guess | word guess to be checked if a single word |
string | $locale | locale in which to check whether the guess is a word |
array | $additional_regexes | used in checking for this locale if something should be considered a word |
true if a single word, false otherwise
computeSafeSearchScore(array $word_lists, integer $len, string $url = "") : integer
Scores documents according to the presence or absence of sexually explicit terms. Tries to work for several languages. Very crude classifier.
array | $word_lists | word => pos_list tuples |
integer | $len | length of text being examined in characters |
string | $url | optional url that the word lists came from, used to check against known porn sites |
$score of how explicit the document is
compressSentence(string $sentence_to_compress, string $lang = null) : \seekquarry\yioop\library\the
Calls the appropriate tokenizer's sentence compression method
string | $sentence_to_compress | the sentence to compress |
string | $lang | locale tag for stemming |
compressed sentence