MIN_DESCRIPTION_LENGTH
the minimum length of a description before we stop appending additional link doc summaries
This class is used to handle results for a given phrase search
__construct(string $db_name = \seekquarry\yioop\configs\DB_NAME,boolean $connect = true)
Sets up the database manager that will be used and name of the search engine database
string | $db_name | the name of the database for the search engine |
boolean | $connect | whether to connect to the database by default after making the datasource class |
getCrawlItem(string $url,array $machine_urls = null,string $index_name = ""): array
Get a summary of a document by the generation it is in and its offset into the corresponding WebArchive.
string | $url | of summary we are trying to look-up |
array | $machine_urls | an array of urls of yioop queue servers |
string | $index_name | timestamp of the index to do the lookup in |
summary data of the matching document
getCrawlItems(string $lookups,array $machine_urls = null,array $exclude_fields = array(),array $format_words = null,integer $description_length = self::DEFAULT_DESCRIPTION_LENGTH): array
Gets summaries for a set of documents by their url, or by a group of 5-tuples of the form (machine, key, index, generation, offset).
string | $lookups | things whose summaries we are trying to look up |
array | $machine_urls | an array of urls of yioop queue servers |
array | $exclude_fields | an array of fields which might be in the crawlItem but which should be excluded from the result. This will make the result smaller and so hopefully faster to transmit |
array | $format_words | words which should be highlighted in search snippets returned |
integer | $description_length | length of snippets to be returned for each search result |
array of summary data for the matching documents
networkGetCrawlItems(string $lookups,array $machine_urls,array $exclude_fields = array(),array $format_words = null,integer $description_length = self::DEFAULT_DESCRIPTION_LENGTH): array
In a multiple queue server setting, gets summaries for a set of documents by their url, or by a group of 5-tuples of the form (machine, key, index, generation, offset). This makes an execMachines call to make a network request to the CrawlController on each machine, which in turn calls getCrawlItems (and thence nonNetworkGetCrawlItems) on that machine. The results are then sent back to networkGetCrawlItems and aggregated.
string | $lookups | things whose summaries we are trying to look up |
array | $machine_urls | an array of urls of yioop queue servers |
array | $exclude_fields | an array of fields which might be in the crawlItem but which should be excluded from the result. This will make the result smaller and so hopefully faster to transmit |
array | $format_words | words which should be highlighted in search snippets returned |
integer | $description_length | length of snippets to be returned for each search result |
array of summary data for the matching documents
nonNetworkGetCrawlItems(string $lookups,array $exclude_fields = array(),array $format_words = null,integer $description_length = self::DEFAULT_DESCRIPTION_LENGTH): array
Gets summaries on a particular machine for a set of documents by their url, or by a group of 5-tuples of the form (machine, key, index, generation, offset). This may be used in the single queue_server setting, or it may be called indirectly by a particular machine's CrawlController as part of fulfilling a network-based getCrawlItems request. $lookups contains items which are to be grouped (because they came from the same url or from a site with the same cache), so this function aggregates their descriptions.
string | $lookups | things whose summaries we are trying to look up |
array | $exclude_fields | an array of fields which might be in the crawlItem but which should be excluded from the result. This will make the result smaller and so hopefully faster to transmit |
array | $format_words | words which should be highlighted in search snippets returned |
integer | $description_length | length of snippets to be returned for each search result |
array of summary data for the matching documents
lookupSummaryOffsetGeneration(string $url_or_key,string $index_name = "",boolean $is_key = false): array
Determines the offset into the summaries WebArchiveBundle and generation of the provided url (or hash_url) so that the info:url (info:base64_hash_url) summary can be retrieved. This assumes of course that the info:url meta word has been stored.
string | $url_or_key | either info:base64_hash_url or just a url to lookup |
string | $index_name | index into which to do the lookup |
boolean | $is_key | whether the string is info:base64_hash_url or just a url |
(offset, generation) into the web archive bundle
clearQuerySavePoint(integer $save_timestamp,array $machine_urls = null)
A save point is used to store to disk a sequence of generation-doc-offset pairs of a particular mix query when doing an archive crawl of a crawl mix. This is used so that the mix can remember where it was the next time it is invoked by the web app on the machine in question.
This function deletes the save point associated with a timestamp.
integer | $save_timestamp | timestamp of save point to delete |
array | $machine_urls | machines on which to try to delete savepoint |
execMachines(string $command,array $machine_urls,string $arg = null,integer $num_machines,boolean $send_specs = false): array
This method is invoked by other ParallelModel methods (@see CrawlModel for examples) when they want to have their method performed on an array of other Yioop instances. The results returned can then be aggregated. The invocation sequence is as follows: crawlModelMethodA invokes execMachines with a list of urls of other Yioop instances. execMachines makes REST requests to those instances with the given command and optional arguments. Each request is handled by a CrawlController, which in turn calls crawlModelMethodA on the given Yioop instance, serializes the result, and gives it back to execMachines, which returns it to the original calling function.
string | $command | the ParallelModel method to invoke on the remote Yioop instances |
array | $machine_urls | machines to invoke this command on |
string | $arg | additional arguments to be passed to the remote machine |
integer | $num_machines | the integer to be used in calculating partition |
boolean | $send_specs | whether to send the queue_server, num fetcher info for given machine |
a list of outputs from each machine that was called.
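The fan-out and aggregation pattern described for execMachines can be pictured with the following minimal sketch. It is not Yioop's implementation: the REST requests are simulated by in-process closures, and execMachinesSketch, $handler, and the example urls are hypothetical names introduced only for illustration.

```php
<?php
/**
 * Hypothetical stand-in for execMachines: each "machine" is a closure that
 * plays the role of a remote CrawlController and returns a serialized result.
 */
function execMachinesSketch(string $command, array $machines, $arg = null): array
{
    $outputs = [];
    foreach ($machines as $url => $handler) {
        // In a real deployment this step would be a network request to $url.
        $serialized = $handler($command, $arg);
        $outputs[$url] = unserialize($serialized);
    }
    return $outputs;
}

// Simulated remote side: run the named command and serialize the result.
$handler = function ($command, $arg) {
    return serialize(["command" => $command, "arg" => $arg]);
};
$machines = [
    "http://a.example/" => $handler,
    "http://b.example/" => $handler,
];
$results = execMachinesSketch("getCrawlItems", $machines, "some lookup");
```

The caller receives one unserialized output per machine and is free to merge them, which mirrors the aggregation step mentioned in the description.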
fileGetContents(string $filename,boolean $force_read = false): string
Either a wrapper for file_get_contents, or, if a WebSite object is being used to serve pages, it reads the file in using blocking I/O file_get_contents() and caches it before returning its string contents.
Note this function assumes that only the web server is performing I/O with this file. filemtime() can be used to see if a file on disk has been changed, and then you can use $force_read = true below to force re-reading the file into the cache.
string | $filename | name of file to get contents of |
boolean | $force_read | whether to force the file to be read from persistent storage rather than the cache |
contents of the file given by $filename
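The interaction between the cache and $force_read can be sketched as follows. This is an illustrative stand-in, assuming a simple in-memory array as the cache; FileCacheSketch and its property are hypothetical names and not part of Yioop.

```php
<?php
// Hypothetical sketch of a read-through file cache with a force-read switch.
class FileCacheSketch
{
    /** @var array in-memory copies of file contents keyed by file name */
    private $cache = [];

    public function fileGetContents(string $filename, bool $force_read = false)
    {
        if (!$force_read && isset($this->cache[$filename])) {
            return $this->cache[$filename]; // serve the cached copy
        }
        $data = file_get_contents($filename); // blocking read from disk
        if ($data !== false) {
            $this->cache[$filename] = $data;
        }
        return $data;
    }
}

$path = tempnam(sys_get_temp_dir(), "sketch");
file_put_contents($path, "first");
$files = new FileCacheSketch();
$a = $files->fileGetContents($path);
file_put_contents($path, "second");        // file changes behind the cache
$b = $files->fileGetContents($path);       // cached copy is returned
$c = $files->fileGetContents($path, true); // forced re-read sees the change
unlink($path);
```

Here $b still holds the stale contents, which is why the note above suggests checking filemtime() and passing $force_read = true when another process may have modified the file.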
filePutContents(string $filename,string $data)
Either a wrapper for file_put_contents, or if a WebSite object is being used to serve pages, writes $data to the persistent file with name $filename. Saves a copy in the RAM cache if there is a copy already there.
string | $filename | name of file to write to persistent storage |
string | $data | string of data to store in file |
formatSinglePageResult(array $page,array $words = null,integer $description_length = self::DEFAULT_DESCRIPTION_LENGTH): array
Given a page summary, extracts snippets which are related to a set of search words. For each snippet, bold faces the search terms, and then creates a new summary array.
array | $page | a single search result summary |
array | $words | keywords (typically what was searched on) |
integer | $description_length | length of the description |
$page which has been snippified and bold faced
getSnippets(string $text,array $words,string $description_length): string
Given a string, extracts snippets of text related to a given set of key words. For a given word, a snippet is a window of characters to its left and right that is less than a maximum total number of characters.
There is also a rule that a snippet should avoid ending in the middle of a word.
string | $text | haystack to extract snippet from |
array | $words | keywords to look for in the haystack |
string | $description_length | length of the description desired |
a concatenation of the extracted snippets of each word
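The windowing rule for getSnippets can be illustrated with a minimal sketch. It assumes single-byte text, plain case-insensitive substring matching, spaces as the only word separator, and " ... " as the joiner between snippets; all of these are simplifying assumptions, and getSnippetsSketch is a hypothetical name rather than Yioop's code.

```php
<?php
// Hypothetical sketch: take a window around each keyword, then shrink the
// window so that it does not start or end in the middle of a word.
function getSnippetsSketch(string $text, array $words, int $description_length): string
{
    $half = intdiv($description_length, 2);
    $length = strlen($text);
    $snippets = [];
    foreach ($words as $word) {
        $pos = ($word === "") ? false : stripos($text, $word);
        if ($pos === false) {
            continue; // keyword does not occur in the text
        }
        $word_end = $pos + strlen($word);
        $start = max(0, $pos - $half);
        $end = min($length, $word_end + $half);
        if ($start > 0 && $text[$start - 1] !== " ") {
            // window starts mid-word: move forward to the next word start
            $space = strpos($text, " ", $start);
            $start = ($space !== false && $space < $pos) ? $space + 1 : $pos;
        }
        if ($end < $length && $text[$end] !== " ") {
            // window ends mid-word: move back to the previous word end
            $space = strrpos(substr($text, 0, $end), " ");
            $end = ($space !== false && $space >= $word_end) ? $space : $word_end;
        }
        $snippets[] = substr($text, $start, $end - $start);
    }
    return implode(" ... ", $snippets);
}

$text = "The quick brown fox jumps over the lazy dog";
$one = getSnippetsSketch($text, ["fox"], 14);
$two = getSnippetsSketch($text, ["fox", "dog"], 14);
```

In this sketch $description_length bounds the context taken around each keyword rather than the total output, which is one possible reading of the description above.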
boldKeywords(string $text,array $words): string
Given a string, wraps in bold html tags a set of key words it contains.
string | $text | haystack string to look for the key words |
array | $words | an array of words to bold face |
the resulting string after boldfacing has been applied
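A minimal sketch of the bold-facing step, assuming case-insensitive matching and a single regular-expression pass so that inserted tags are never themselves matched. boldKeywordsSketch is a hypothetical name; Yioop's own matching rules may differ.

```php
<?php
// Hypothetical sketch: wrap every occurrence of any keyword in <b> tags.
function boldKeywordsSketch(string $text, array $words): string
{
    $quoted = [];
    foreach ($words as $word) {
        if ($word !== "") {
            $quoted[] = preg_quote($word, "/"); // escape regex metacharacters
        }
    }
    if (!$quoted) {
        return $text; // nothing to highlight
    }
    $pattern = "/(" . implode("|", $quoted) . ")/i";
    return preg_replace($pattern, '<b>$1</b>', $text);
}

$bolded = boldKeywordsSketch("Open source search engine", ["search", "open"]);
```

Using one alternation pattern instead of one replacement per word avoids a later keyword such as "b" matching the tags added for an earlier keyword.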
isSingleLocalhost(array $machine_urls,string $index_timestamp = -1): boolean
Used to determine whether or not an action involves just one yioop instance on the current local machine.
array | $machine_urls | urls of yioop instances to which the action applies |
string | $index_timestamp | if timestamp exists checks if the index has declared itself to be a no network index. |
whether it involves a single local yioop instance (true) or not (false)
searchArrayToWhereOrderClauses(array $search_array,array $any_fields = array('status')): array
Creates the WHERE and ORDER BY clauses for a query of a Yioop table such as USERS, ROLE, GROUP, which have associated search web forms. Searches are case insensitive.
array | $search_array | each element of this is a quadruple name of a field, what comparison to perform, a value to check, and an order (ascending/descending) to sort by |
array | $any_fields | these fields if present in search array but with value "-1" will be skipped as part of the where clause but will be used for order by clause |
string for where clause, string for order by clause
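How a list of quadruples might be turned into the two clauses can be sketched as follows. The comparison names ("=" and "CONTAINS"), the order values ("ASC" and "DESC"), and the quote-doubling escape are assumptions made for this example only; whereOrderSketch is a hypothetical function, not Yioop's implementation.

```php
<?php
// Hypothetical sketch: build case-insensitive WHERE and ORDER BY clauses
// from (field, comparison, value, order) quadruples.
function whereOrderSketch(array $search_array, array $any_fields = ['status']): array
{
    $where_parts = [];
    $order_parts = [];
    foreach ($search_array as list($field, $comparison, $value, $order)) {
        // An "any" field with value "-1" is left out of WHERE but still sorted.
        $skip_where = in_array($field, $any_fields) && $value === "-1";
        if (!$skip_where && $value !== "") {
            $escaped = str_replace("'", "''", strtolower($value));
            if ($comparison === "CONTAINS") {
                $where_parts[] = "LOWER($field) LIKE '%$escaped%'";
            } else {
                $where_parts[] = "LOWER($field) = '$escaped'";
            }
        }
        if ($order === "ASC" || $order === "DESC") {
            $order_parts[] = "$field $order";
        }
    }
    $where = $where_parts ? "WHERE " . implode(" AND ", $where_parts) : "";
    $order_by = $order_parts ? "ORDER BY " . implode(", ", $order_parts) : "";
    return [$where, $order_by];
}

list($where, $order_by) = whereOrderSketch([
    ["user_name", "CONTAINS", "Bob", "ASC"],
    ["status", "=", "-1", "DESC"],
]);
```

Lower-casing both the column and the value is one common way to get case-insensitive matching; production code would normally bind values through the database layer rather than splice them into the string.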
getRows(integer $limit,integer $num,\seekquarry\yioop\models\int& $total,array $search_array = array(),array $args = null): array
Gets a range of rows which match the provided search criteria from the provided table
integer | $limit | starting row from the potential results to return |
integer | $num | number of rows after start row to return |
\seekquarry\yioop\models\int& | $total | gets set with the total number of rows that can be returned by the given database query |
array | $search_array | each element of this is a quadruple name of a field, what comparison to perform, a value to check, and an order (ascending/descending) to sort by |
array | $args | additional values which may be used to get rows (what these are will typically depend on the subclass implementation) |
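The roles of $limit, $num, and the by-reference $total can be shown with a small sketch that pages over an in-memory array instead of a database table. getRowsSketch and its $all_rows parameter are hypothetical and stand in for the underlying query.

```php
<?php
// Hypothetical sketch: return $num rows starting at $limit and report the
// total number of matching rows through the by-reference $total argument.
function getRowsSketch(array $all_rows, int $limit, int $num, &$total): array
{
    $total = count($all_rows);
    return array_slice($all_rows, $limit, $num);
}

$total = 0;
$rows = getRowsSketch(["a", "b", "c", "d", "e"], 1, 2, $total);
```

After the call, $rows holds the requested window and $total holds the size of the full result set, which is what a pager needs to compute the number of pages.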
selectCallback(mixed $args = null): string
Controls which columns, and the names of those columns, from the tables underlying the given model should be returned from a getRows call.
This defaults to *, but in general will be overridden in subclasses of Model
mixed | $args | any additional arguments which should be used to determine the columns |
a comma separated list of columns suitable for a SQL query
fromCallback(mixed $args = null): string
Controls which tables, and the names of the tables, underlie the given model and should be used in a getRows call. This defaults to the single table whose name is whatever is before Model in the name of the model. For example, by default on FooModel this method would return "FOO". If a different behavior is needed, this can be overridden in subclasses of Model.
mixed | $args | any additional arguments which should be used to determine these tables |
a comma separated list of tables suitable for a SQL query
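The default naming rule (FooModel maps to "FOO") can be sketched like this. ModelSketch is a hypothetical base class used only to demonstrate the rule; the way Yioop derives the name internally may differ.

```php
<?php
// Hypothetical sketch: derive the default table name from the class name by
// dropping a trailing "Model" and upper-casing what remains.
class ModelSketch
{
    public function fromCallback($args = null)
    {
        $short_name = (new ReflectionClass($this))->getShortName();
        return strtoupper(preg_replace('/Model$/', '', $short_name));
    }
}

class FooModel extends ModelSketch
{
}

class BarModel extends ModelSketch
{
    // A subclass can override the default, for example to join two tables.
    public function fromCallback($args = null)
    {
        return "BAR B, BAZ Z";
    }
}

$default_tables = (new FooModel())->fromCallback();
$custom_tables = (new BarModel())->fromCallback();
```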
whereCallback(mixed $args = null): string
Controls the WHERE clause of the SQL query that underlies the given model and should be used in a getRows call.
This defaults to an empty WHERE clause.
mixed | $args | additional arguments that might be used to construct the WHERE clause. |
a SQL WHERE clause
rowCallback(array $row,mixed $args): array
Called after a row is retrieved by getRows from the database to perform some manipulation that would be useful for this model.
For example, in CrawlModel, after a row representing a crawl mix has been gotten, this is used to perform an additional query to marshal its components. By default this method just returns this row unchanged.
array | $row | row as retrieved from database query |
mixed | $args | additional arguments that might be used by this callback |
$row after callback manipulation
postQueryCallback(array $rows): array
Called after getRows has retrieved all the rows that it would retrieve, but before they are returned, to give one last place where they could be further manipulated. For example, in MachineModel this callback is used to make parallel network calls to get the status of each machine returned by getRows. The default for this method is to leave the rows that would be returned unchanged.
array | $rows | that have been calculated so far by getRows |
$rows after this final manipulation
rewriteMixQuery(string $query,object $mix): string
Rewrites a mix query so that it maps directly to a query about crawls
string | $query | the original query before a rewrite |
object | $mix | a mix object saying how the mix is built out of crawls |
a rewritten query in terms of crawls
getPhrasePageResults(string $input_phrase,integer $low,integer $results_per_page = \seekquarry\yioop\configs\NUM_RESULTS_PER_PAGE,boolean $format = true,\seekquarry\yioop\models\SearchfiltersModel $filter = null,boolean $use_cache_if_allowed = true,integer $raw,array $queue_servers = array(),boolean $guess_semantics = true,integer $save_timestamp): array
Given a query phrase, returns formatted document summaries of the documents that match the phrase.
string | $input_phrase | the phrase to try to match |
integer | $low | return results beginning with the $low document |
integer | $results_per_page | how many results to return |
boolean | $format | whether to highlight in the returned summaries the matched text |
\seekquarry\yioop\models\SearchfiltersModel | $filter | Model responsible for keeping track of edited and deleted search results |
boolean | $use_cache_if_allowed | if true and USE_CACHE is true then an attempt will be made to look up the results in the file cache. Otherwise, items will be recomputed and then potentially restored in cache |
integer | $raw | ($raw == 0) normal grouping, ($raw == 1) no grouping done on data also no summaries returned (only lookup info), $raw > 1 return summaries but no grouping |
array | $queue_servers | a list of urls of yioop machines which might be used during lookup |
boolean | $guess_semantics | whether to do query rewriting before lookup |
integer | $save_timestamp | if this timestamp is nonzero, then save iterate position, so can resume on future queries that make use of the timestamp |
an array of summary data
parseWordStructConjunctiveQuery(\seekquarry\yioop\models\string& $phrase): array
Parses from a string phrase representing a conjunctive query, a struct consisting of the word keys searched for, the allowed and disallowed phrases, the weight that should be put on these query results, and which archive to use.
\seekquarry\yioop\models\string& | $phrase | string to extract struct from, if the phrase semantics is guessed or an if condition is processed the value of phrase will be altered. (Helps for feeding to network queries) |
struct representing the conjunctive query
extractMetaWordInfo(string $phrase): array
Given a query string, this method extracts meta words, which of these are "materialized" (i.e., should be encoded as part of word ids), disallowed phrases, the query string after meta words removed and ampersand substitution applied, the query string with meta words but ampersand substitution applied, the index and the weights found as part of the query string. Finally, it extracts the locale_tag for the query.
string | $phrase | the query string |
array containing the items listed above in the description of this method
guessSemantics(string $phrase): string
Ideally, this function tries to guess from the query what the user is looking for. For now, it only does simple things, such as detecting when a query term is a url and rewriting it to the appropriate meta word.
string | $phrase | input query to guess semantics of |
a phrase that more closely matches the intentions of the query.
beginMatch(string $phrase,string $start_with,string $new_prefix,string $suffix = "",string $not_contains = array(),string $lang_tag = "en-US"): string
Matches terms (non white-char strings) in the language $lang_tag in $phrase that begin with $start_with and don't contain $not_contains, replaces $start_with with $new_prefix and adds $suffix to the end.
string | $phrase | string to look for terms in |
string | $start_with | what we're looking to see if term begins with |
string | $new_prefix | what to change $start_with to |
string | $suffix | what to tack on to the end of the term if there is a match |
string | $not_contains | string match is not allowed to contain |
string | $lang_tag | what language the phrase must be in for the rule to apply |
$phrase after modifications have been made
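The prefix rewriting rule for beginMatch can be sketched as follows. The sketch leaves out the $lang_tag check, treats $not_contains as an array of forbidden substrings (matching the array() default in the signature), and uses a made-up "www." to "site:www." rule as its example; beginMatchSketch is a hypothetical name.

```php
<?php
// Hypothetical sketch: rewrite whitespace-separated terms that begin with
// $start_with, unless they contain one of the forbidden substrings.
function beginMatchSketch(string $phrase, string $start_with,
    string $new_prefix, string $suffix = "", array $not_contains = []): string
{
    $terms = preg_split('/\s+/', trim($phrase));
    foreach ($terms as $i => $term) {
        if (strncmp($term, $start_with, strlen($start_with)) !== 0) {
            continue; // term does not begin with $start_with
        }
        foreach ($not_contains as $forbidden) {
            if ($forbidden !== "" && strpos($term, $forbidden) !== false) {
                continue 2; // term contains a forbidden substring
            }
        }
        $rest = (string) substr($term, strlen($start_with));
        $terms[$i] = $new_prefix . $rest . $suffix;
    }
    return implode(" ", $terms);
}

$rewritten = beginMatchSketch("www.example.com news", "www.", "site:www.");
$skipped = beginMatchSketch("www.example.com/page", "www.", "site:www.", "",
    ["/"]);
```

endMatch follows the same pattern with the roles reversed: the test is applied to the end of each term, $end_with is replaced by $new_suffix, and $prefix is added to the front.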
endMatch(string $phrase,string $end_with,string $prefix,string $new_suffix = "",string $not_contains = array(),string $lang_tag = "en-US"): string
Matches terms (non white-char strings) in the language $lang_tag in $phrase that end with $end_with and don't contain $not_contains, replaces $end_with with $new_suffix (if not empty) and adds $prefix to the beginning.
string | $phrase | string to look for terms in |
string | $end_with | what we're looking to see if term ends with |
string | $prefix | what to tack on to the start if there is a match |
string | $new_suffix | what to change $end_with to |
string | $not_contains | string match is not allowed to contain |
string | $lang_tag | what language the phrase must be in for the rule to apply |
$phrase after modifications have been made
getSummariesByHash(array $word_structs,integer $limit,integer $num,\seekquarry\yioop\models\SearchfiltersModel $filter,boolean $use_cache_if_allowed = true,integer $raw,array $queue_servers = array(),string $original_query = "",string $save_timestamp_name = "",array $format_words = null): array
Gets doc summaries of documents containing given words and meeting the additional provided criteria
array | $word_structs | an array of word_structs. Here a word_struct is an associative array with at least the following fields: KEYS -- an array of word keys; QUOTE_POSITIONS -- an array of positions of words that appeared in quotes (so need to be matched exactly); DISALLOW_PHRASES -- an array of words the document must not contain; WEIGHT -- a weight to multiply scores returned from this iterator by; INDEX_NAME -- an index timestamp to get results from |
integer | $limit | number of first document in order to return |
integer | $num | number of documents to return summaries of |
\seekquarry\yioop\models\SearchfiltersModel | $filter | Model responsible for keeping track of edited and deleted search results |
boolean | $use_cache_if_allowed | if true and USE_CACHE is true then an attempt will be made to look up the results in the file cache. Otherwise, items will be recomputed and then potentially restored in cache |
integer | $raw | ($raw == 0) normal grouping, ($raw > 0) no grouping done on data. if ($raw == 1) no lookups of summaries done |
array | $queue_servers | a list of urls of yioop machines which might be used during lookup |
string | $original_query | if set, the original query that corresponds to $word_structs |
string | $save_timestamp_name | if this timestamp is not empty, then save iterate position, so can resume on future queries that make use of the timestamp. If used, then $limit is ignored and the next $num docs after the previous iterate position of $save_timestamp_name are returned. |
array | $format_words | words which should be highlighted in search snippets returned |
document summaries
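The shape of a single word_struct, as listed for the $word_structs parameter, can be pictured with the sketch below. Only the field names come from the description; every value is a made-up placeholder (real word keys are not plain words, and the timestamp is invented).

```php
<?php
// Hypothetical word_struct: field names are from the parameter description,
// all values are illustrative placeholders only.
$word_struct = [
    "KEYS" => ["key-for-open", "key-for-source"], // placeholder word keys
    "QUOTE_POSITIONS" => [0, 1],       // positions of words that were quoted
    "DISALLOW_PHRASES" => ["closed"],  // words the document must not contain
    "WEIGHT" => 1,                     // multiplier applied to scores
    "INDEX_NAME" => "1700000000",      // placeholder index timestamp
];
$word_structs = [$word_struct];
```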
getSummariesFromOffsets(\seekquarry\yioop\models\array& $pages,\seekquarry\yioop\models\array& $queue_servers,integer $raw,boolean $groups_with_docs,boolean $with_question_answer_info,array $format_words = null,integer $description_length = self::DEFAULT_DESCRIPTION_LENGTH): array
Used to look up summary info for the pages provided (using their self::SUMMARY_OFFSET field). If any of the looked-up summaries are HTTP Location redirect pages, then these are looked up in turn.
This method handles robot meta tags which might forbid indexing.
\seekquarry\yioop\models\array& | $pages | of page data without text summaries |
\seekquarry\yioop\models\array& | $queue_servers | array of queue server to find data on |
integer | $raw | only lookup locations if 0 |
boolean | $groups_with_docs | whether to return only groups that contain at least one doc as opposed to groups with only links |
boolean | $with_question_answer_info | whether question answer info in summaries needs to be returned |
array | $format_words | words which should be highlighted in search snippets returned |
integer | $description_length | length of snippets to be returned for each search result |
pages with summaries added
getQueryIterator(array $word_structs,\seekquarry\yioop\models\SearchfiltersModel $filter,integer $raw,integer $to_retrieve,array $queue_servers = array(),string $original_query = "",string $save_timestamp_name = ""): \seekquarry\yioop\models\&object
Using the supplied $word_structs, constructs an iterator for getting results to a query
array | $word_structs | an array of word_structs. Here a word_struct is an associative array with at least the following fields: KEYS -- an array of word keys; QUOTE_POSITIONS -- an array of positions of words that appeared in quotes (so need to be matched exactly); DISALLOW_PHRASES -- an array of words the document must not contain; WEIGHT -- a weight to multiply scores returned from this iterator by; INDEX_NAME -- an index timestamp to get results from |
\seekquarry\yioop\models\SearchfiltersModel | $filter | Model responsible for keeping track of edited and deleted search results |
integer | $raw | ($raw == 0) normal grouping, ($raw == 1) no grouping done on data also no summaries returned (only lookup info), $raw > 1 return summaries but no grouping |
integer | $to_retrieve | number of items to retrieve from the current location in the iterator |
array | $queue_servers | a list of urls of yioop machines which might be used during lookup |
string | $original_query | if set, the original query that corresponds to $word_structs |
string | $save_timestamp_name | if this timestamp is non empty, then when making iterator get sub-iterators to advance to gen doc_offset stored with respect to save_timestamp if exists. |
an iterator for iterating through results to the query