Constants

MIN_DESCRIPTION_LENGTH

MIN_DESCRIPTION_LENGTH

The minimum length of a description before we stop appending additional link doc summaries

SNIPPET_TITLE_LENGTH

SNIPPET_TITLE_LENGTH

MAX_SNIPPET_TITLE_LENGTH

MAX_SNIPPET_TITLE_LENGTH

SNIPPET_LENGTH_LEFT

SNIPPET_LENGTH_LEFT

SNIPPET_LENGTH_RIGHT

SNIPPET_LENGTH_RIGHT

MIN_SNIPPET_LENGTH

MIN_SNIPPET_LENGTH

DEFAULT_DESCRIPTION_LENGTH

DEFAULT_DESCRIPTION_LENGTH

Default maximum character length of a search summary

INFO_HASH_LEN

INFO_HASH_LEN

Length of info hash record phrase

Properties

$index_name

$index_name :string

Stores the name of the current index archive to use to get search results from

Type

string

$current_machine

$current_machine :integer

If known, the id of the queue_server this belongs to

Type

integer

$db

$db :object

Reference to a DatasourceManager

Type

object

$db_name

$db_name :string

Name of the search engine database

Type

string

$private_db

$private_db :object

Reference to a private DatasourceManager

Type

object

$private_db_name

$private_db_name :string

Name of the private search engine database

Type

string

$edited_page_summaries

$edited_page_summaries :array

Associative array of page summaries which might be used to override default page summaries if set.

Type

array

$any_fields

$any_fields :array

These fields, if present in $search_array (used by getRows()) but with value "-1", will be skipped as part of the WHERE clause but will still be used for the ORDER BY clause

Type

array

$search_table_column_map

$search_table_column_map :array

Associations of the form name of field for web forms => database column names/abbreviations

Type

array

$web_site

$web_site :object

Reference to a WebSite object in use to serve pages (if any)

Type

object

$cache

$cache :object

Cache object to be used if we are doing caching

Type

object

$additional_meta_words

$additional_meta_words :array

An associative array of additional meta words and the max description length of results if such a meta word is used. This array is typically set in index.php

Type

array

$query_info

$query_info :array

Used to hold query statistics about the current query

Type

array

$programming_language_map

$programming_language_map :string

Used to hold the file extensions that identify the programming language of a source code file

Type

string

$program_indicator

$program_indicator :string

An indicator used to mark source code files

Type

string

Methods

__construct()

__construct(string  $db_name = \seekquarry\yioop\configs\DB_NAME,boolean  $connect = true)

Sets up the database manager that will be used and name of the search engine database

Parameters

string $db_name

the name of the database for the search engine

boolean $connect

whether to connect to the database by default after making the datasource class

getCrawlItem()

getCrawlItem(string  $url,array  $machine_urls = null,string  $index_name = ""): array

Get a summary of a document by the generation it is in and its offset into the corresponding WebArchive.

Parameters

string $url

url of the summary we are trying to look up

array $machine_urls

an array of urls of yioop queue servers

string $index_name

timestamp of the index to do the lookup in

Returns

array —

summary data of the matching document

getCrawlItems()

getCrawlItems(string  $lookups,array  $machine_urls = null,array  $exclude_fields = array(),array  $format_words = null,integer  $description_length = self::DEFAULT_DESCRIPTION_LENGTH): array

Gets summaries for a set of documents by their url, or by group of 5-tuples of the form (machine, key, index, generation, offset).

Parameters

string $lookups

things whose summaries we are trying to look up

array $machine_urls

an array of urls of yioop queue servers

array $exclude_fields

an array of fields which might be in the crawlItem but which should be excluded from the result. This will make the result smaller and so hopefully faster to transmit

array $format_words

words which should be highlighted in search snippets returned

integer $description_length

length of snippets to be returned for each search result

Returns

array —

of summary data for the matching documents

networkGetCrawlItems()

networkGetCrawlItems(string  $lookups,array  $machine_urls,array  $exclude_fields = array(),array  $format_words = null,integer  $description_length = self::DEFAULT_DESCRIPTION_LENGTH): array

In a multiple queue server setting, gets summaries for a set of documents by their url, or by group of 5-tuples of the form (machine, key, index, generation, offset). This makes an execMachines call to make a network request to the CrawlControllers on each machine, which in turn call getCrawlItems (and thence nonNetworkGetCrawlItems) on each machine. The results are then sent back to networkGetCrawlItems and aggregated.

Parameters

string $lookups

things whose summaries we are trying to look up

array $machine_urls

an array of urls of yioop queue servers

array $exclude_fields

an array of fields which might be in the crawlItem but which should be excluded from the result. This will make the result smaller and so hopefully faster to transmit

array $format_words

words which should be highlighted in search snippets returned

integer $description_length

length of snippets to be returned for each search result

Returns

array —

of summary data for the matching documents

nonNetworkGetCrawlItems()

nonNetworkGetCrawlItems(string  $lookups,array  $exclude_fields = array(),array  $format_words = null,integer  $description_length = self::DEFAULT_DESCRIPTION_LENGTH): array

Gets summaries on a particular machine for a set of documents by their url, or by group of 5-tuples of the form (machine, key, index, generation, offset). This may be used in either the single queue_server setting or it may be called indirectly by a particular machine's CrawlController as part of fulfilling a network-based getCrawlItems request. $lookups contains items which are to be grouped (as they came from the same url or site with the same cache), so this function aggregates their descriptions.

Parameters

string $lookups

things whose summaries we are trying to look up

array $exclude_fields

an array of fields which might be in the crawlItem but which should be excluded from the result. This will make the result smaller and so hopefully faster to transmit

array $format_words

words which should be highlighted in search snippets returned

integer $description_length

length of snippets to be returned for each search result

Returns

array —

of summary data for the matching documents

lookupSummaryOffsetGeneration()

lookupSummaryOffsetGeneration(string  $url_or_key,string  $index_name = "",boolean  $is_key = false): array

Determines the offset into the summaries WebArchiveBundle and generation of the provided url (or hash_url) so that the info:url (info:base64_hash_url) summary can be retrieved. This assumes of course that the info:url meta word has been stored.

Parameters

string $url_or_key

either info:base64_hash_url or just a url to lookup

string $index_name

index into which to do the lookup

boolean $is_key

whether the string is info:base64_hash_url or just a url

Returns

array —

(offset, generation) into the web archive bundle

clearQuerySavePoint()

clearQuerySavePoint(integer  $save_timestamp,array  $machine_urls = null)

A save point is used to store to disk a sequence of generation-doc-offset pairs of a particular mix query when doing an archive crawl of a crawl mix. This is used so that the mix can remember where it was the next time it is invoked by the web app on the machine in question.

This function deletes such a save point associated with a timestamp.

Parameters

integer $save_timestamp

timestamp of save point to delete

array $machine_urls

machines on which to try to delete savepoint

execMachines()

execMachines(string  $command,array  $machine_urls,string  $arg = null,integer  $num_machines,boolean  $send_specs = false): array

This method is invoked by other ParallelModel methods (see CrawlModel for examples) when they want to have their method performed on an array of other Yioop instances. The results returned can then be aggregated. The invocation sequence is as follows: crawlModelMethodA invokes execMachines with a list of urls of other Yioop instances. execMachines makes REST requests to those instances for the given command and optional arguments. Each request is handled by a CrawlController, which in turn calls crawlModelMethodA on the given Yioop instance, serializes the result, and gives it back to execMachines, which returns it to the original calling function.

Parameters

string $command

the ParallelModel method to invoke on the remote Yioop instances

array $machine_urls

machines to invoke this command on

string $arg

additional arguments to be passed to the remote machine

integer $num_machines

the integer to be used in calculating partition

boolean $send_specs

whether to send the queue_server, num fetcher info for given machine

Returns

array —

a list of outputs from each machine that was called.
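
The round trip described for execMachines() can be sketched roughly as follows. This is only an illustration under loud assumptions: the REST requests are replaced by local closures standing in for remote Yioop instances, and the function name execMachinesSketch, the command name hypotheticalCount, and the "URL" and "PAGE" array keys are all hypothetical rather than taken from Yioop.

```php
<?php
/**
 * Hypothetical stand-in for execMachines(). Each entry of $machines maps
 * a machine url to a closure that plays the role of the remote
 * CrawlController; no network request is actually made.
 */
function execMachinesSketch(string $command, array $machines,
    $arg = null): array
{
    $outputs = [];
    foreach ($machines as $url => $handler) {
        // "Remote" side: run the command and serialize its result
        $response = serialize($handler($command, $arg));
        // "Local" side: collect the serialized response for the caller
        $outputs[] = ["URL" => $url, "PAGE" => $response];
    }
    return $outputs;
}

// The calling model method aggregates the per-machine results
$machines = [
    "http://machine-a.example/" => fn($command, $arg) => ["count" => 2],
    "http://machine-b.example/" => fn($command, $arg) => ["count" => 3],
];
$total = 0;
foreach (execMachinesSketch("hypotheticalCount", $machines) as $output) {
    $total += unserialize($output["PAGE"])["count"];
}
echo $total, "\n"; // prints 5
```

The caller, not the fan-out helper, decides how to combine the outputs, which matches the description above that results "can then be aggregated".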

fileGetContents()

fileGetContents(string  $filename,boolean  $force_read = false): string

Either a wrapper for file_get_contents, or if a WebSite object is being used to serve pages, it reads the file in using blocking I/O file_get_contents() and caches it before returning its string contents.

Note this function assumes that only the web server is performing I/O with this file. filemtime() can be used to see if a file on disk has been changed, and then you can use $force_read = true below to force re-reading the file into the cache.

Parameters

string $filename

name of file to get contents of

boolean $force_read

whether to force the file to be read from persistent storage rather than the cache

Returns

string —

contents of the file given by $filename

filePutContents()

filePutContents(string  $filename,string  $data)

Either a wrapper for file_put_contents, or if a WebSite object is being used to serve pages, writes $data to the persistent file with name $filename. Saves a copy in the RAM cache if there is a copy already there.

Parameters

string $filename

name of file to write to persistent storage

string $data

string of data to store in file
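
The caching behavior described for fileGetContents() and filePutContents() might look roughly like the sketch below. It is an assumption-laden illustration: the class name FileCacheSketch and its private $cache array are hypothetical, and the real methods delegate to a WebSite object whose interface is not shown in this page.

```php
<?php
/**
 * Hypothetical in-memory file cache; not the WebSite object's actual API.
 */
class FileCacheSketch
{
    /** @var array map from filename to cached contents (assumed layout) */
    private array $cache = [];

    public function fileGetContents(string $filename,
        bool $force_read = false)
    {
        // Serve from RAM unless the caller forces a re-read from disk
        if (!$force_read && isset($this->cache[$filename])) {
            return $this->cache[$filename];
        }
        $contents = file_get_contents($filename);
        if ($contents !== false) {
            $this->cache[$filename] = $contents;
        }
        return $contents;
    }

    public function filePutContents(string $filename, string $data)
    {
        $num_bytes = file_put_contents($filename, $data);
        // Only refresh the RAM copy if one is already there
        if ($num_bytes !== false && isset($this->cache[$filename])) {
            $this->cache[$filename] = $data;
        }
        return $num_bytes;
    }
}
```

Because a cached copy is returned without touching the disk, a change made by another process stays invisible until the caller passes $force_read = true, which is why the note about filemtime() matters.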

createIfNecessaryDirectory()

createIfNecessaryDirectory(string  $directory): integer

Creates a directory and sets it to world permission if it doesn't already exist

Parameters

string $directory

name of directory to create

Returns

integer —

-1 on failure, 0 if already existed, 1 if created
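
As a minimal sketch of the three return codes, assuming that "world permission" means mode 0777 (the source does not say which mode is used) and using a hypothetical function name:

```php
<?php
/**
 * Hypothetical stand-in for createIfNecessaryDirectory().
 * Returns -1 on failure, 0 if the directory already existed, 1 if created.
 */
function createIfNecessaryDirectorySketch(string $directory): int
{
    if (is_dir($directory)) {
        return 0;
    }
    // @ suppresses the warning mkdir() emits when it fails
    if (!@mkdir($directory)) {
        return -1;
    }
    // Assumption: world permission is taken to mean 0777
    chmod($directory, 0777);
    return 1;
}
```

Callers should compare against the specific codes, since both 0 and -1 are falsy or negative in ways that a plain truthiness check would conflate.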

formatSinglePageResult()

formatSinglePageResult(array  $page,array  $words = null,integer  $description_length = self::DEFAULT_DESCRIPTION_LENGTH): array

Given a page summary, extracts snippets which are related to a set of search words. For each snippet, bold faces the search terms, and then creates a new summary array.

Parameters

array $page

a single search result summary

array $words

keywords (typically what was searched on)

integer $description_length

length of the description

Returns

array —

$page which has been snippified and bold faced

getSnippets()

getSnippets(string  $text,array  $words,string  $description_length): string

Given a string, extracts snippets of text related to a given set of key words. For a given word, a snippet is a window of characters to its left and right that is less than a maximum total number of characters.

There is also a rule that a snippet should avoid ending in the middle of a word.

Parameters

string $text

haystack to extract snippet from

array $words

keywords to look for in the haystack

string $description_length

length of the description desired

Returns

string —

a concatenation of the extracted snippets of each word
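
The window-and-word-boundary idea behind getSnippets() can be sketched as below. This is not Yioop's algorithm: the function name, the $left and $right window sizes, the " ... " separator, and the byte-based (not multibyte-aware) string handling are all assumptions made for illustration.

```php
<?php
/**
 * Hypothetical window-based snippet extraction. For each keyword found,
 * takes up to $left bytes before and $right bytes after it, then shrinks
 * the window so that it does not start or end inside a word.
 */
function snippetSketch(string $text, array $words, int $max_length,
    int $left = 20, int $right = 40): string
{
    $snippets = [];
    $used = 0;
    $len = strlen($text);
    foreach ($words as $word) {
        $pos = ($word === "") ? false : stripos($text, $word);
        if ($pos === false) {
            continue;
        }
        $word_end = $pos + strlen($word);
        $start = max(0, $pos - $left);
        $end = min($len, $word_end + $right);
        // If the window starts mid-word, skip ahead to the next word
        if ($start > 0 && $text[$start - 1] !== " ") {
            $space = strpos($text, " ", $start);
            if ($space !== false && $space < $pos) {
                $start = $space + 1;
            }
        }
        // If the window ends mid-word, back up to the previous space
        if ($end < $len && $text[$end] !== " ") {
            $space = strrpos(substr($text, 0, $end), " ");
            if ($space !== false && $space >= $word_end) {
                $end = $space;
            }
        }
        $snippet = substr($text, $start, $end - $start);
        // Stop once the total length budget would be exceeded
        if ($used + strlen($snippet) > $max_length) {
            break;
        }
        $snippets[] = $snippet;
        $used += strlen($snippet);
    }
    return implode(" ... ", $snippets);
}
```

When no space is available on one side of the keyword, the sketch keeps the partial word rather than dropping the match, so the boundary rule is best effort.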

boldKeywords()

boldKeywords(string  $text,array  $words): string

Given a string, wraps in bold html tags a set of key words it contains.

Parameters

string $text

haystack string to look for the key words

array $words

an array of words to bold face

Returns

string —

the resulting string after boldfacing has been applied
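
One way to bold-face keywords is a single case-insensitive regular expression pass, sketched below. The function name and the longest-first ordering are assumptions; the actual boldKeywords() may handle overlaps, locales, or existing markup differently.

```php
<?php
/**
 * Hypothetical stand-in for boldKeywords(): wraps each case-insensitive
 * occurrence of a keyword in <b> tags.
 */
function boldKeywordsSketch(string $text, array $words): string
{
    // Try longer keywords first so "searching" wins over "search"
    usort($words, fn($a, $b) => strlen($b) <=> strlen($a));
    $quoted = [];
    foreach ($words as $word) {
        if ($word !== "") {
            // Escape characters that are special in a regex
            $quoted[] = preg_quote($word, '/');
        }
    }
    if (empty($quoted)) {
        return $text;
    }
    // A single pass means inserted <b> tags are never matched again
    $pattern = '/(' . implode('|', $quoted) . ')/iu';
    return preg_replace($pattern, '<b>$1</b>', $text);
}

echo boldKeywordsSketch("Yioop is a PHP search engine", ["search", "php"]);
```

The original casing of the matched text is preserved because the replacement reuses the captured group rather than the keyword itself.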

getDbmsList()

getDbmsList(): array

Gets a list of all DBMS that work with the search engine

Returns

array —

Names of available data sources

loginDbms()

loginDbms(string  $dbms): boolean

Returns whether the provided dbms needs a login and password or not (sqlite or sqlite3)

Parameters

string $dbms

the name of a database management system

Returns

boolean —

true if needs a login and password; false otherwise

isSingleLocalhost()

isSingleLocalhost(array  $machine_urls,string  $index_timestamp = -1): boolean

Used to determine if an action involves just one yioop instance on the current local machine or not

Parameters

array $machine_urls

urls of yioop instances to which the action applies

string $index_timestamp

if the timestamp exists, checks whether the index has declared itself to be a no-network index.

Returns

boolean —

whether it involves a single local yioop instance (true) or not (false)

translateDb()

translateDb(string  $string_id,string  $locale_tag): mixed

Used to get the translation of a string_id stored in the database to the given locale.

Parameters

string $string_id

id to translate

string $locale_tag

to translate to

Returns

mixed —

the translation if found, $string_id otherwise

getUserId()

getUserId(string  $username): string

Gets the user_id associated with a given username. (This is in the base class because it is used as an internal method in both the signin and user models.)

Parameters

string $username

the username to look up

Returns

string —

the corresponding userid

searchArrayToWhereOrderClauses()

searchArrayToWhereOrderClauses(array  $search_array,array  $any_fields = array('status')): array

Creates the WHERE and ORDER BY clauses for a query of a Yioop table such as USERS, ROLE, GROUP, which have associated search web forms. Searches are case insensitive.

Parameters

array $search_array

each element of this is a quadruple: the name of a field, what comparison to perform, a value to check, and an order (ascending/descending) to sort by

array $any_fields

these fields if present in search array but with value "-1" will be skipped as part of the where clause but will be used for order by clause

Returns

array —

string for where clause, string for order by clause
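
A rough sketch of how quadruples could be turned into WHERE and ORDER BY strings is given below. Everything concrete in it is an assumption: the function name, the comparison names "=" and "CONTAINS", the "ASC" and "DESC" order values, the use of LOWER() for case insensitivity, and the extra third return element holding bound parameter values (the documented method returns only the two clause strings).

```php
<?php
/**
 * Hypothetical stand-in for searchArrayToWhereOrderClauses().
 * Each element of $search_array is [field, comparison, value, order].
 */
function whereOrderSketch(array $search_array,
    array $any_fields = ['status']): array
{
    $where_parts = [];
    $order_parts = [];
    $params = [];
    foreach ($search_array as [$field, $comparison, $value, $order]) {
        // Only accept simple identifiers as column names
        if (!preg_match('/^\w+$/', $field)) {
            continue;
        }
        // An "any" field with value "-1" is left out of the WHERE clause
        $skip = in_array($field, $any_fields) && (string)$value === "-1";
        if (!$skip && $comparison === "=") {
            $where_parts[] = "LOWER($field) = LOWER(?)";
            $params[] = $value;
        } else if (!$skip && $comparison === "CONTAINS") {
            $where_parts[] = "LOWER($field) LIKE LOWER(?)";
            $params[] = "%" . $value . "%";
        }
        // Skipped fields can still contribute to the ordering
        if (in_array($order, ["ASC", "DESC"])) {
            $order_parts[] = "$field $order";
        }
    }
    $where = $where_parts ? "WHERE " . implode(" AND ", $where_parts) : "";
    $order_by = $order_parts
        ? "ORDER BY " . implode(", ", $order_parts) : "";
    return [$where, $order_by, $params];
}
```

For example, a search on user_name containing "bob" together with a status of "-1" would filter only on user_name while still sorting by both columns.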

getRows()

getRows(integer  $limit,integer  $num,integer  &$total,array  $search_array = array(),array  $args = null): array

Gets a range of rows which match the provided search criteria from the provided table

Parameters

integer $limit

starting row from the potential results to return

integer $num

number of rows after start row to return

integer &$total

gets set with the total number of rows that can be returned by the given database query

array $search_array

each element of this is a quadruple: the name of a field, what comparison to perform, a value to check, and an order (ascending/descending) to sort by

array $args

additional values which may be used to get rows (what these are will typically depend on the subclass implementation)

Returns

array

selectCallback()

selectCallback(mixed  $args = null): string

Controls which columns, and the names of those columns, from the tables underlying the given model should be returned from a getRows call.

This defaults to *, but in general will be overridden in subclasses of Model.

Parameters

mixed $args

any additional arguments which should be used to determine the columns

Returns

string —

a comma separated list of columns suitable for a SQL query

fromCallback()

fromCallback(mixed  $args = null): string

Controls which tables, and the names of the tables, underlie the given model and should be used in a getRows call. This defaults to the single table whose name is whatever is before Model in the name of the model. For example, by default on FooModel this method would return "FOO". If a different behavior is needed, this can be overridden in subclasses of Model.

Parameters

mixed $args

any additional arguments which should be used to determine these tables

Returns

string —

a comma separated list of tables suitable for a SQL query

whereCallback()

whereCallback(mixed  $args = null): string

Controls the WHERE clause of the SQL query that underlies the given model and should be used in a getRows call.

This defaults to an empty WHERE clause.

Parameters

mixed $args

additional arguments that might be used to construct the WHERE clause.

Returns

string —

a SQL WHERE clause

rowCallback()

rowCallback(array  $row,mixed  $args): array

Called after a row is retrieved by getRows from the database to perform some manipulation that would be useful for this model.

For example, in CrawlModel, after a row representing a crawl mix has been gotten, this is used to perform an additional query to marshal its components. By default this method just returns this row unchanged.

Parameters

array $row

row as retrieved from database query

mixed $args

additional arguments that might be used by this callback

Returns

array —

$row after callback manipulation

postQueryCallback()

postQueryCallback(array  $rows): array

Called after getRows has retrieved all the rows that it would retrieve but before they are returned, to give one last place where they could be further manipulated. For example, in MachineModel this callback is used to make parallel network calls to get the status of each machine returned by getRows. The default for this method is to leave the rows that would be returned unchanged.

Parameters

array $rows

that have been calculated so far by getRows

Returns

array —

$rows after this final manipulation
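
The callbacks above suggest a template-method arrangement, which the following sketch illustrates without a database. The classes ModelSketch and FooModel, the methods buildQuery() and finishRows(), and the assumption that whereCallback() returns conditions without the WHERE keyword are all hypothetical; the real getRows() also handles limits, totals, and search arrays.

```php
<?php
/** Hypothetical base class showing how the callbacks could compose. */
class ModelSketch
{
    public function selectCallback($args = null) { return "*"; }
    public function fromCallback($args = null)
    {
        // Default table: class name minus namespace and "Model", upper-cased
        $parts = explode("\\", static::class);
        return strtoupper(preg_replace('/Model$/', '', end($parts)));
    }
    public function whereCallback($args = null) { return ""; }
    public function rowCallback($row, $args) { return $row; }
    public function postQueryCallback($rows) { return $rows; }

    /** Assembles the SQL text from the three clause callbacks */
    public function buildQuery($args = null)
    {
        $sql = "SELECT " . $this->selectCallback($args) .
            " FROM " . $this->fromCallback($args);
        $where = $this->whereCallback($args);
        return ($where === "") ? $sql : "$sql WHERE $where";
    }
    /** Applies the per-row callback, then the final post-query callback */
    public function finishRows(array $rows, $args = null)
    {
        foreach ($rows as $i => $row) {
            $rows[$i] = $this->rowCallback($row, $args);
        }
        return $this->postQueryCallback($rows);
    }
}

/** A subclass customizes behavior only by overriding callbacks */
class FooModel extends ModelSketch
{
    public function selectCallback($args = null) { return "ID, NAME"; }
    public function whereCallback($args = null) { return "ID > 0"; }
    public function rowCallback($row, $args)
    {
        $row["NAME"] = ucfirst($row["NAME"]);
        return $row;
    }
}

echo (new FooModel())->buildQuery(), "\n";
```

Here FooModel inherits the default fromCallback(), so its query reads from the table FOO, in line with the FooModel example given for fromCallback().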

indexExists()

indexExists(integer  $index_time_stamp): boolean

Returns whether there is an index with the provided timestamp

Parameters

integer $index_time_stamp

timestamp of the index to check if in cache

Returns

boolean —

whether it exists or not

rewriteMixQuery()

rewriteMixQuery(string  $query,object  $mix): string

Rewrites a mix query so that it maps directly to a query about crawls

Parameters

string $query

the original query before a rewrite

object $mix

a mix object saying how the mix is built out of crawls

Returns

string —

a rewritten query in terms of crawls

getPhrasePageResults()

getPhrasePageResults(string  $input_phrase,integer  $low,integer  $results_per_page = \seekquarry\yioop\configs\NUM_RESULTS_PER_PAGE,boolean  $format = true,\seekquarry\yioop\models\SearchfiltersModel  $filter = null,boolean  $use_cache_if_allowed = true,integer  $raw,array  $queue_servers = array(),boolean  $guess_semantics = true,integer  $save_timestamp): array

Given a query phrase, returns formatted document summaries of the documents that match the phrase.

Parameters

string $input_phrase

the phrase to try to match

integer $low

return results beginning with the $low document

integer $results_per_page

how many results to return

boolean $format

whether to highlight in the returned summaries the matched text

\seekquarry\yioop\models\SearchfiltersModel $filter

Model responsible for keeping track of edited and deleted search results

boolean $use_cache_if_allowed

if true and USE_CACHE is true then an attempt will be made to look up the results in the file cache. Otherwise, items will be recomputed and then potentially stored again in the cache

integer $raw

($raw == 0) normal grouping, ($raw == 1) no grouping done on data also no summaries returned (only lookup info), $raw > 1 return summaries but no grouping

array $queue_servers

a list of urls of yioop machines which might be used during lookup

boolean $guess_semantics

whether to do query rewriting before lookup

integer $save_timestamp

if this timestamp is nonzero, then save the iterate position so that future queries that make use of the timestamp can resume from it

Returns

array —

an array of summary data

parseWordStructConjunctiveQuery()

parseWordStructConjunctiveQuery(string  &$phrase): array

Parses from a string phrase representing a conjunctive query, a struct consisting of the word keys searched for, the allowed and disallowed phrases, the weight that should be put on these query results, and which archive to use.

Parameters

string &$phrase

string to extract the struct from. If the phrase semantics are guessed or an if: condition is processed, the value of $phrase will be altered. (Helps for feeding to network queries)

Returns

array —

struct representing the conjunctive query

extractMetaWordInfo()

extractMetaWordInfo(string  $phrase): array

Given a query string, this method extracts meta words, which of these are "materialized" (i.e., should be encoded as part of word ids), disallowed phrases, the query string after meta words are removed and ampersand substitution applied, the query string with meta words kept but ampersand substitution applied, the index, and the weights found as part of the query string. Finally, it extracts the locale_tag for the query.

Parameters

string $phrase

the query string

Returns

array —

containing items listed above in the description of this method

guessSemantics()

guessSemantics(string  $phrase): string

Ideally, this function tries to guess from the query what the user is looking for. For now, we are just doing simple things like detecting when a query term is a url and rewriting it to the appropriate meta word.

Parameters

string $phrase

input query to guess semantics of

Returns

string —

a phrase that more closely matches the intentions of the query.

beginMatch()

beginMatch(string  $phrase,string  $start_with,string  $new_prefix,string  $suffix = "",string  $not_contains = array(),string  $lang_tag = "en-US"): string

Matches terms (non white-char strings) in the language $lang_tag in $phrase that begin with $start_with and don't contain $not_contains, replaces $start_with with $new_prefix and adds $suffix to the end

Parameters

string $phrase

string to look for terms in

string $start_with

what we're looking to see if term begins with

string $new_prefix

what to change $start_with to

string $suffix

what to tack on to the end of the term if there is a match

string $not_contains

string match is not allowed to contain

string $lang_tag

what language the phrase must be in for the rule to apply

Returns

string —

$phrase after modifications have been made
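
The prefix rewrite performed by beginMatch() might be approximated as in the sketch below. It makes several assumptions: the function name is hypothetical, $not_contains is treated as an array of forbidden substrings (the signature types it as a string but defaults it to an array), the $lang_tag check is left out, and runs of whitespace are collapsed to single spaces. The "site:" prefix in the usage line is only an illustrative value.

```php
<?php
/**
 * Hypothetical stand-in for beginMatch(), without the language check.
 */
function beginMatchSketch(string $phrase, string $start_with,
    string $new_prefix, string $suffix = "",
    array $not_contains = []): string
{
    // Terms are maximal runs of non-whitespace characters
    $terms = preg_split('/\s+/', trim($phrase));
    foreach ($terms as $i => $term) {
        if ($start_with === "" || strpos($term, $start_with) !== 0) {
            continue; // term does not begin with $start_with
        }
        foreach ($not_contains as $bad) {
            if ($bad !== "" && strpos($term, $bad) !== false) {
                continue 2; // term contains a forbidden substring
            }
        }
        // Swap the old prefix for the new one and append the suffix
        $terms[$i] = $new_prefix .
            substr($term, strlen($start_with)) . $suffix;
    }
    return implode(" ", $terms);
}

echo beginMatchSketch("www.example.com news", "www.", "site:www."), "\n";
```

An endMatch() counterpart would follow the same shape, testing and replacing the end of each term instead of the beginning.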

endMatch()

endMatch(string  $phrase,string  $end_with,string  $prefix,string  $new_suffix = "",string  $not_contains = array(),string  $lang_tag = "en-US"): string

Matches terms (non white-char strings) in the language $lang_tag in $phrase that end with $end_with and don't contain $not_contains, replaces $end_with with $new_suffix (if not empty) and adds $prefix to the beginning

Parameters

string $phrase

string to look for terms in

string $end_with

what we're looking to see if term ends with

string $prefix

what to tack on to the start if there is a match

string $new_suffix

what to change $end_with to

string $not_contains

string match is not allowed to contain

string $lang_tag

what language the phrase must be in for the rule to apply

Returns

string —

$phrase after modifications have been made

parseIfConditions()

parseIfConditions(string  $phrase): string

Evaluates any if: conditional meta-words in the query string to calculate a new query string.

Parameters

string $phrase

original query string

Returns

string —

query string after if: meta words have been evaluated

getSummariesByHash()

getSummariesByHash(array  $word_structs,integer  $limit,integer  $num,\seekquarry\yioop\models\SearchfiltersModel  $filter,boolean  $use_cache_if_allowed = true,integer  $raw,array  $queue_servers = array(),string  $original_query = "",string  $save_timestamp_name = "",array  $format_words = null): array

Gets doc summaries of documents containing given words and meeting the additional provided criteria

Parameters

array $word_structs

an array of word_structs. Here a word_struct is an associative array with at least the following fields:
KEYS -- an array of word keys
QUOTE_POSITIONS -- an array of positions of words that appeared in quotes (so need to be matched exactly)
DISALLOW_PHRASES -- an array of words the document must not contain
WEIGHT -- a weight to multiply scores returned from this iterator by
INDEX_NAME -- an index timestamp to get results from

integer $limit

number of first document in order to return

integer $num

number of documents to return summaries of

\seekquarry\yioop\models\SearchfiltersModel $filter

Model responsible for keeping track of edited and deleted search results

boolean $use_cache_if_allowed

if true and USE_CACHE is true then an attempt will be made to look up the results in the file cache. Otherwise, items will be recomputed and then potentially stored again in the cache

integer $raw

($raw == 0) normal grouping, ($raw > 0) no grouping done on data. if ($raw == 1) no lookups of summaries done

array $queue_servers

a list of urls of yioop machines which might be used during lookup

string $original_query

if set, the original query that corresponds to $word_structs

string $save_timestamp_name

if this timestamp is not empty, then save iterate position, so can resume on future queries that make use of the timestamp. If used then $limit ignored and get next $num docs after $save_timestamp 's previous iterate position.

array $format_words

words which should be highlighted in search snippets returned

Returns

array —

document summaries
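
To make the shape of a word_struct concrete, here is a small sketch of one such array together with a helper that reports missing fields. The field names come from the $word_structs parameter description; the values, the helper function missingWordStructFields(), and the use of plain string keys are illustrative assumptions.

```php
<?php
// A hypothetical word_struct; every value here is made up for illustration
$word_struct = [
    "KEYS" => ["hypothetical_word_key_1", "hypothetical_word_key_2"],
    "QUOTE_POSITIONS" => [],        // no quoted words in this query
    "DISALLOW_PHRASES" => ["spam"], // documents must not contain these
    "WEIGHT" => 1.0,                // multiplier applied to result scores
    "INDEX_NAME" => "1700000000",   // index timestamp (made-up value)
];

/**
 * Hypothetical helper: lists the documented fields a word_struct lacks.
 */
function missingWordStructFields(array $word_struct): array
{
    $required = ["KEYS", "QUOTE_POSITIONS", "DISALLOW_PHRASES",
        "WEIGHT", "INDEX_NAME"];
    return array_values(array_diff($required, array_keys($word_struct)));
}

var_dump(missingWordStructFields($word_struct) === []);
```

Since $word_structs is an array of such structs, a caller would pass something like [$word_struct] rather than the struct itself.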

getSummariesFromOffsets()

getSummariesFromOffsets(array  &$pages,array  &$queue_servers,integer  $raw,boolean  $groups_with_docs,boolean  $with_question_answer_info,array  $format_words = null,integer  $description_length = self::DEFAULT_DESCRIPTION_LENGTH): array

Used to look up summary info for the pages provided (using their self::SUMMARY_OFFSET field). If any of the looked-up summaries are HTTP Location redirect pages, then looks these up in turn.

This method handles robot meta tags which might forbid indexing.

Parameters

array &$pages

of page data without text summaries

array &$queue_servers

array of queue servers to find data on

integer $raw

only lookup locations if 0

boolean $groups_with_docs

whether to return only groups that contain at least one doc as opposed to groups with only links

boolean $with_question_answer_info

whether question answer info in summaries needs to be returned

array $format_words

words which should be highlighted in search snippets returned

integer $description_length

length of snippets to be returned for each search result

Returns

array —

pages with summaries added

getQueryIterator()

getQueryIterator(array  $word_structs,\seekquarry\yioop\models\SearchfiltersModel  $filter,integer  $raw,integer  $to_retrieve,array  $queue_servers = array(),string  $original_query = "",string  $save_timestamp_name = ""): object

Using the supplied $word_structs, constructs an iterator for getting results to a query

Parameters

array $word_structs

an array of word_structs. Here a word_struct is an associative array with at least the following fields:
KEYS -- an array of word keys
QUOTE_POSITIONS -- an array of positions of words that appeared in quotes (so need to be matched exactly)
DISALLOW_PHRASES -- an array of words the document must not contain
WEIGHT -- a weight to multiply scores returned from this iterator by
INDEX_NAME -- an index timestamp to get results from

\seekquarry\yioop\models\SearchfiltersModel $filter

Model responsible for keeping track of edited and deleted search results

integer $raw

($raw == 0) normal grouping, ($raw == 1) no grouping done on data also no summaries returned (only lookup info), $raw > 1 return summaries but no grouping

integer $to_retrieve

number of items to retrieve from the current location in the iterator

array $queue_servers

a list of urls of yioop machines which might be used during lookup

string $original_query

if set, the original query that corresponds to $word_structs

string $save_timestamp_name

if this timestamp is non empty, then when making iterator get sub-iterators to advance to gen doc_offset stored with respect to save_timestamp if exists.

Returns

object —

an iterator for iterating through results to the query