MIN_DESCRIPTION_LENGTH
The minimum length of a description before we stop appending additional link doc summaries.
This class is used to handle getting/setting crawl parameters; CRUD operations on current crawls; starting, stopping, and checking the status of crawls; getting cache files out of crawls; determining the default index to use; marshalling/unmarshalling crawl mixes; and handling data from suggest-a-url forms.
__construct(string $db_name = \seekquarry\yioop\configs\DB_NAME, boolean $connect = true)
Sets up the database manager that will be used and name of the search engine database
string | $db_name | the name of the database for the search engine |
boolean | $connect | whether to connect to the database by default after making the datasource class |
getCrawlItem(string $url, array $machine_urls = null, string $index_name = "") : array
Get a summary of a document by the generation it is in and its offset into the corresponding WebArchive.
string | $url | of summary we are trying to look-up |
array | $machine_urls | an array of urls of yioop queue servers |
string | $index_name | timestamp of the index to do the lookup in |
summary data of the matching document
getCrawlItems(string $lookups, array $machine_urls = null, array $exclude_fields = array(), array $format_words = null, integer $description_length = self::DEFAULT_DESCRIPTION_LENGTH) : array
Gets summaries for a set of documents by their url, or by a group of 5-tuples of the form (machine, key, index, generation, offset).
string | $lookups | things whose summaries we are trying to look up |
array | $machine_urls | an array of urls of yioop queue servers |
array | $exclude_fields | an array of fields which might be in the crawlItem but which should be excluded from the result. This makes the result smaller and so hopefully faster to transmit |
array | $format_words | words which should be highlighted in search snippets returned |
integer | $description_length | length of snippets to be returned for each search result |
of summary data for the matching documents
networkGetCrawlItems(string $lookups, array $machine_urls, array $exclude_fields = array(), array $format_words = null, integer $description_length = self::DEFAULT_DESCRIPTION_LENGTH) : array
In a multiple queue server setting, gets summaries for a set of documents by their url, or by a group of 5-tuples of the form (machine, key, index, generation, offset). This makes an execMachines call to send a network request to the CrawlController on each machine, which in turn calls getCrawlItems (and thence nonNetworkGetCrawlItems) on that machine. The results are then sent back to networkGetCrawlItems and aggregated.
string | $lookups | things whose summaries we are trying to look up |
array | $machine_urls | an array of urls of yioop queue servers |
array | $exclude_fields | an array of fields which might be in the crawlItem but which should be excluded from the result. This makes the result smaller and so hopefully faster to transmit |
array | $format_words | words which should be highlighted in search snippets returned |
integer | $description_length | length of snippets to be returned for each search result |
of summary data for the matching documents
nonNetworkGetCrawlItems(string $lookups, array $exclude_fields = array(), array $format_words = null, integer $description_length = self::DEFAULT_DESCRIPTION_LENGTH) : array
Gets summaries on a particular machine for a set of documents by their url, or by a group of 5-tuples of the form (machine, key, index, generation, offset). This may be used in the single queue_server setting, or it may be called indirectly by a particular machine's CrawlController as part of fulfilling a network-based getCrawlItems request. $lookups contains items which are to be grouped (as they came from the same url or a site with the same cache), so this function aggregates their descriptions.
string | $lookups | things whose summaries we are trying to look up |
array | $exclude_fields | an array of fields which might be in the crawlItem but which should be excluded from the result. This makes the result smaller and so hopefully faster to transmit |
array | $format_words | words which should be highlighted in search snippets returned |
integer | $description_length | length of snippets to be returned for each search result |
of summary data for the matching documents
lookupSummaryOffsetGeneration(string $url_or_key, string $index_name = "", boolean $is_key = false) : array
Determines the offset into the summaries WebArchiveBundle and generation of the provided url (or hash_url) so that the info:url (info:base64_hash_url) summary can be retrieved. This assumes of course that the info:url meta word has been stored.
string | $url_or_key | either info:base64_hash_url or just a url to lookup |
string | $index_name | index into which to do the lookup |
boolean | $is_key | whether the string is info:base64_hash_url or just a url |
(offset, generation) into the web archive bundle
clearQuerySavePoint(integer $save_timestamp, array $machine_urls = null)
A save point is used to store to disk a sequence of generation-doc-offset pairs of a particular mix query when doing an archive crawl of a crawl mix. It is used so that the mix can remember where it was the next time it is invoked by the web app on the machine in question.
This function deletes such a save point associated with a timestamp
integer | $save_timestamp | timestamp of save point to delete |
array | $machine_urls | machines on which to try to delete savepoint |
execMachines(string $command, array $machine_urls, string $arg = null, integer $num_machines, boolean $send_specs = false) : array
This method is invoked by other ParallelModel methods (@see CrawlModel for examples) when they want to have their method performed on an array of other Yioop instances; the results returned can then be aggregated. The invocation sequence is: crawlModelMethodA invokes execMachines with a list of urls of other Yioop instances. execMachines makes REST requests to those instances with the given command and optional arguments. Each request is handled by a CrawlController, which in turn calls crawlModelMethodA on the given Yioop instance, serializes the result, and gives it back to execMachines, and thence back to the originally calling function.
string | $command | the ParallelModel method to invoke on the remote Yioop instances |
array | $machine_urls | machines to invoke this command on |
string | $arg | additional arguments to be passed to the remote machine |
integer | $num_machines | the integer to be used in calculating partition |
boolean | $send_specs | whether to send the queue_server, num fetcher info for given machine |
a list of outputs from each machine that was called.
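The $num_machines parameter suggests that lookups are assigned to machines by a partition calculation. A minimal sketch of such a partition function, under the assumption (hypothetical, not confirmed by this documentation) that a key is hashed and reduced modulo the machine count:

```php
<?php
// Hypothetical sketch: map a lookup key to one of $num_machines
// queue servers by hashing it and taking the result mod the machine
// count. Yioop's actual partition calculation may differ in detail.
function partitionForKey(string $key, int $num_machines): int
{
    // crc32() returns a non-negative integer on 64-bit PHP, so the
    // modulus lands in the range [0, $num_machines - 1]
    return crc32($key) % $num_machines;
}
```

The same key always maps to the same machine, which is what lets independent callers agree on where a given item lives without coordination.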
fileGetContents(string $filename, boolean $force_read = false) : string
Either a wrapper for file_get_contents, or, if a WebSite object is being used to serve pages, reads the file in with a blocking file_get_contents() call and caches the result before returning its string contents.
Note this function assumes that only the web server is performing I/O on this file. filemtime() can be used to see if the file on disk has changed; if so, pass $force_read = true to force re-reading the file into the cache
string | $filename | name of file to get contents of |
boolean | $force_read | whether to force the file to be read from persistent storage rather than the cache |
contents of the file given by $filename
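The read-through caching behavior can be illustrated with a minimal self-contained sketch (a simplified stand-in; Yioop's actual WebSite implementation differs):

```php
<?php
// Simplified sketch of a RAM-cached file reader. The FileCache class
// name and structure are illustrative, not Yioop's actual code.
class FileCache
{
    /** in-memory map from filename to cached contents */
    private $cache = [];

    public function fileGetContents(string $filename,
        bool $force_read = false): string
    {
        // Serve from the RAM cache unless a fresh read is forced
        if (!$force_read && isset($this->cache[$filename])) {
            return $this->cache[$filename];
        }
        // Blocking read from persistent storage, then cache the result
        $contents = file_get_contents($filename);
        $this->cache[$filename] = $contents;
        return $contents;
    }
}
```

This shows why the documentation warns about external writers: once cached, a changed file on disk is invisible until the caller forces a re-read.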
filePutContents(string $filename, string $data)
Either a wrapper for file_put_contents, or if a WebSite object is being used to serve pages, writes $data to the persistent file with name $filename. Saves a copy in the RAM cache if there is a copy already there.
string | $filename | name of file to write to persistent storages |
string | $data | string of data to store in file |
formatSinglePageResult(array $page, array $words = null, integer $description_length = self::DEFAULT_DESCRIPTION_LENGTH) : array
Given a page summary, extracts snippets which are related to a set of search words. For each snippet, bold faces the search terms, and then creates a new summary array.
array | $page | a single search result summary |
array | $words | keywords (typically what was searched on) |
integer | $description_length | length of the description |
$page which has been snippified and bold faced
getSnippets(string $text, array $words, string $description_length) : string
Given a string, extracts snippets of text related to a given set of key words. For a given word, a snippet is a window of characters to its left and right that is less than a maximum total number of characters.
There is also a rule that a snippet should avoid ending in the middle of a word
string | $text | haystack to extract snippet from |
array | $words | keywords to look for in the haystack |
string | $description_length | length of the description desired |
a concatenation of the extracted snippets of each word
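The windowing rule, including the "don't end mid-word" constraint, can be sketched for a single keyword as follows (an illustrative simplification, not Yioop's actual getSnippets code):

```php
<?php
// Illustrative sketch: extract a character window around one keyword,
// trimming back to a word boundary so the snippet does not end
// mid-word. Hypothetical helper, simplified from the real logic.
function snippetForWord(string $text, string $word, int $max_len = 60): string
{
    $pos = stripos($text, $word);
    if ($pos === false) {
        return "";
    }
    // center the window on the keyword occurrence
    $half = (int)(($max_len - strlen($word)) / 2);
    $start = max(0, $pos - $half);
    $snippet = substr($text, $start, $max_len);
    // if the window was cut short of the text's end, trim back to the
    // last space so we don't stop in the middle of a word
    if ($start + $max_len < strlen($text)) {
        $cut = strrpos($snippet, " ");
        if ($cut !== false && $cut > strlen($word)) {
            $snippet = substr($snippet, 0, $cut);
        }
    }
    return $snippet;
}
```

The real method repeats this per keyword and concatenates the resulting windows.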
boldKeywords(string $text, array $words) : string
Given a string, wraps in bold html tags a set of key words it contains.
string | $text | haystack string to look for the key words |
array | $words | an array of words to bold face |
the resulting string after boldfacing has been applied
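A minimal version of such keyword bolding, assuming simple case-insensitive matching with no handling of overlapping or nested matches (a sketch, not Yioop's implementation):

```php
<?php
// Sketch: wrap each case-insensitive occurrence of each keyword in
// <b> tags, preserving the original casing via a capture group.
function boldKeywordsSketch(string $text, array $words): string
{
    foreach ($words as $word) {
        $text = preg_replace('/(' . preg_quote($word, '/') . ')/i',
            '<b>$1</b>', $text);
    }
    return $text;
}
```

preg_quote() escapes regex metacharacters in the keyword, so user-supplied search terms cannot alter the pattern.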
isSingleLocalhost(array $machine_urls, string $index_timestamp = -1) : boolean
Used to determine if an action involves just one yioop instance on the current local machine or not
array | $machine_urls | urls of yioop instances to which the action applies |
string | $index_timestamp | if timestamp exists checks if the index has declared itself to be a no network index. |
whether it involves a single local yioop instance (true) or not (false)
searchArrayToWhereOrderClauses(array $search_array, array $any_fields = array('status')) : array
Creates the WHERE and ORDER BY clauses for a query of a Yioop table such as USERS, ROLE, GROUP, which have associated search web forms. Searches are case insensitive
array | $search_array | each element of this is a quadruple name of a field, what comparison to perform, a value to check, and an order (ascending/descending) to sort by |
array | $any_fields | these fields if present in search array but with value "-1" will be skipped as part of the where clause but will be used for order by clause |
string for where clause, string for order by clause
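The quadruple format and the two returned clauses can be illustrated with a simplified sketch (naive quoting for readability only; real code must escape values through the database manager):

```php
<?php
// Sketch: turn an array of (field, comparison, value, direction)
// quadruples into a WHERE clause and an ORDER BY clause. Hypothetical
// helper; values are quoted naively, unlike production code.
function searchArrayToClausesSketch(array $search_array): array
{
    $where_parts = [];
    $order_parts = [];
    foreach ($search_array as [$field, $comparison, $value, $direction]) {
        if ($value !== "") {
            if ($comparison === "CONTAINS") {
                // case-insensitive substring match
                $where_parts[] = "LOWER($field) LIKE LOWER('%$value%')";
            } else {
                $where_parts[] = "$field $comparison '$value'";
            }
        }
        if (in_array($direction, ["ASC", "DESC"])) {
            $order_parts[] = "$field $direction";
        }
    }
    $where = $where_parts ? " WHERE " . implode(" AND ", $where_parts) : "";
    $order = $order_parts ? " ORDER BY " . implode(", ", $order_parts) : "";
    return [$where, $order];
}
```

An empty value contributes nothing to the WHERE clause but can still contribute a sort direction, mirroring the $any_fields behavior described above.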
getRows(integer $limit, integer $num, \seekquarry\yioop\models\int& $total, array $search_array = array(), array $args = null) : array
Gets a range of rows which match the provided search criteria from the provided table
integer | $limit | starting row from the potential results to return |
integer | $num | number of rows after start row to return |
\seekquarry\yioop\models\int& | $total | gets set with the total number of rows that can be returned by the given database query |
array | $search_array | each element of this is a quadruple name of a field, what comparison to perform, a value to check, and an order (ascending/descending) to sort by |
array | $args | additional values which may be used to get rows (what these are will typically depend on the subclass implementation) |
selectCallback(mixed $args = null) : string
Controls which columns, and the names of those columns, from the tables underlying the given model should be returned by a getRows call.
This defaults to *, but in general will be overridden in subclasses of Model
mixed | $args | any additional arguments which should be used to determine the columns |
a comma separated list of columns suitable for a SQL query
whereCallback(mixed $args = null) : string
Controls the WHERE clause of the SQL query that underlies the given model and should be used in a getRows call.
This defaults to an empty WHERE clause.
mixed | $args | additional arguments that might be used to construct the WHERE clause. |
a SQL WHERE clause
rowCallback(array $row, mixed $args) : array
{@inheritDoc}
array | $row | row as retrieved from database query |
mixed | $args | additional arguments that might be used by this callback. In this case, should be a boolean flag that says whether or not to add information about the components of the crawl mix |
$row after callback manipulation
postQueryCallback(array $rows) : array
Called after getRows has retrieved all the rows that it would retrieve but before they are returned to give one last place where they could be further manipulated. For example, in MachineModel this callback is used to make parallel network calls to get the status of each machine returned by getRows. The default for this method is to leave the rows that would be returned unchanged
array | $rows | that have been calculated so far by getRows |
$rows after this final manipulation
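Taken together, selectCallback, whereCallback, rowCallback, and postQueryCallback form a template-method pattern: getRows drives the query and hands each stage to an overridable hook. A stripped-down sketch of that pipeline (hypothetical; the real Model class adds the actual SQL plumbing):

```php
<?php
// Hypothetical, stripped-down illustration of the callback pipeline
// that getRows drives. Yioop's Model class wires these hooks into a
// real database query; here we run them over pre-fetched rows.
class SketchModel
{
    /** columns to select; subclasses override (defaults to "*") */
    public function selectCallback($args = null): string
    {
        return "*";
    }
    /** WHERE clause; subclasses override (defaults to none) */
    public function whereCallback($args = null): string
    {
        return "";
    }
    /** per-row hook: massage each row as it is retrieved */
    public function rowCallback(array $row, $args): array
    {
        return $row;
    }
    /** final hook over the whole result set before it is returned */
    public function postQueryCallback(array $rows): array
    {
        return $rows;
    }
    /** drives the pipeline over already-fetched $raw_rows */
    public function getRows(array $raw_rows, $args = null): array
    {
        $rows = [];
        foreach ($raw_rows as $row) {
            $rows[] = $this->rowCallback($row, $args);
        }
        return $this->postQueryCallback($rows);
    }
}
```

A subclass customizes behavior by overriding only the hooks it needs; for example, MachineModel's postQueryCallback enriches each row with live status information.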
getCacheFile(string $machine, string $machine_uri, integer $partition, integer $offset, string $crawl_time, integer $instance_num = false) : array
Gets the cached version of a web page from the machine on which it was fetched.
Complete cached versions of web pages live on a fetcher machine for pre-version 2.0 indexes. For these versions, the queue server machine typically only maintains summaries. This method makes a REST request of a fetcher machine for a cached page and gets the results back.
string | $machine | the ip address or domain name of the machine the cached page lives on |
string | $machine_uri | the path from document root on $machine where the yioop scripts live |
integer | $partition | the partition in the WebArchiveBundle the page is in |
integer | $offset | the offset in bytes into the WebArchive partition in the WebArchiveBundle at which the cached page lives. |
string | $crawl_time | the timestamp of the crawl the cache page is from |
integer | $instance_num | which fetcher instance for the particular fetcher crawled the page (if more than one), false otherwise |
page data of the cached page
setCurrentIndexDatabaseName($timestamp)
Sets the IndexArchive that will be used for search results
$timestamp | the timestamp of the index archive. The timestamp is when the crawl was started. Currently, the timestamp appears as substring of the index archives directory name |
getDeltaFileInfo(string $dir, integer $timestamp, array $excludes) : array
Returns all the files in $dir or its subdirectories with modified times more recent than $timestamp. Files which have in their path or name a string in the $excludes array will be excluded
string | $dir | a directory to traverse |
integer | $timestamp | used to check modified times against |
array | $excludes | an array of path substrings to exclude |
of file structs consisting of name, modified time and size.
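A self-contained approximation of this traversal using SPL iterators (illustrative only; Yioop's implementation differs):

```php
<?php
// Sketch: collect files under $dir modified after $timestamp, skipping
// any file whose path contains a substring in $excludes. Returns
// structs of name, modified time, and size, as the method above does.
function deltaFileInfoSketch(string $dir, int $timestamp,
    array $excludes): array
{
    $out = [];
    $iter = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir,
            FilesystemIterator::SKIP_DOTS));
    foreach ($iter as $file) {
        $path = $file->getPathname();
        // drop any file whose path matches an exclude substring
        foreach ($excludes as $exclude) {
            if (strpos($path, $exclude) !== false) {
                continue 2;
            }
        }
        if ($file->isFile() && $file->getMTime() > $timestamp) {
            $out[] = ["name" => $path,
                "modified" => $file->getMTime(),
                "size" => $file->getSize()];
        }
    }
    return $out;
}
```

Passing a timestamp of 0 returns everything not excluded, which is a convenient way to take an initial snapshot before doing incremental syncs.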
getMixList(integer $user_id, boolean $with_components = false) : array
Gets a list of all mixes of available crawls
integer | $user_id | user that we are getting a list of mixes for. We have disabled mix sharing, so for now this is all mixes |
boolean | $with_components | if false then don't load the factors that make up the crawl mix, just load the name of the mixes and their timestamps; otherwise, if true loads everything |
list of available crawls
getCrawlMix(string $timestamp, boolean $just_components = false) : array
Retrieves the weighting component of the requested crawl mix
string | $timestamp | of the requested crawl mix |
boolean | $just_components | says whether to find the mix name or just the components array. |
the crawls and their weights that make up the requested crawl mix.
isMixOwner(string $timestamp, string $user_id) : boolean
Returns whether there is a mix with the given $timestamp that $user_id owns. Currently, mix ownership is ignored and this is set to always return true.
string | $timestamp | to see if exists |
string | $user_id | id of would be owner |
true if owner; false otherwise
getSeedInfo(boolean $use_default = false) : array
Returns the initial sites that a new crawl will start with along with crawl parameters such as crawl order, allowed and disallowed crawl sites
boolean | $use_default | whether or not to use the Yioop! default crawl.ini file rather than the one created by the user. |
the first sites to crawl during the next crawl, along with the restrict_by_url, allowed, and disallowed_sites settings
getCrawlSeedInfo(string $timestamp, array $machine_urls = null) : array
Returns the crawl parameters that were used during a given crawl
string | $timestamp | timestamp of the crawl to load the crawl parameters of |
array | $machine_urls | an array of urls of yioop queue servers |
the first sites to crawl during the next crawl, along with the restrict_by_url, allowed, and disallowed_sites settings
setCrawlSeedInfo(string $timestamp, array $new_info, array $machine_urls = null)
Changes the crawl parameters of an existing crawl (this can be done while crawling). Not all fields are allowed to be updated
string | $timestamp | timestamp of the crawl to change |
array | $new_info | the new parameters |
array | $machine_urls | an array of urls of yioop queue servers |
appendSuggestSites(string $url) : string
Add new distinct urls to those already saved in the suggest_url_file. If the supplied url is not new, or the file size exceeds MAX_SUGGEST_URL_FILE_SIZE, then it is not added.
string | $url | to add |
true if the url was added or already existed in the file; false otherwise
getInfoTimestamp(integer $timestamp, array $machine_urls = null) : array
Get a description associated with a Web Crawl or Crawl Mix
integer | $timestamp | of crawl or mix in question |
array | $machine_urls | an array of urls of yioop queue servers |
associative array containing item DESCRIPTION
sendStartCrawlMessage(array $crawl_params, array $seed_info = null, array $machine_urls = null, integer $num_fetchers)
Used to send a message to the queue servers to start a crawl
array | $crawl_params | has info like the time of the crawl, whether starting a new crawl or resuming an old one, etc. |
array | $seed_info | what urls to crawl, etc as from the crawl.ini file |
array | $machine_urls | an array of urls of yioop queue servers |
integer | $num_fetchers | number of fetchers on machine to start. This parameter and $channel are used to start the daemons running on the machines if they aren't already running |
startQueueServerFetchers(integer $channel, integer $num_fetchers) : boolean
Used to start QueueServers and Fetchers on current machine when it is detected that someone tried to start a crawl but hadn't started any queue servers or fetchers.
integer | $channel | channel of crawl to start |
integer | $num_fetchers | the number of fetchers on the current machine |
whether any processes were started
getCrawlList(boolean $return_arc_bundles = false, boolean $return_recrawls = false, array $machine_urls = null, boolean $cache = false) : array
Gets a list of all index archives of crawls that have been conducted
boolean | $return_arc_bundles | whether index bundles used for indexing arc or other archive bundles should be included in the list |
boolean | $return_recrawls | whether index archive bundles generated as a result of recrawling should be included in the result |
array | $machine_urls | an array of urls of yioop queue servers |
boolean | $cache | whether to try to get/set the data to a cache file |
available IndexArchiveBundle directories and their meta information. This meta information includes the time of the crawl, its description, the number of pages downloaded, and the number of partitions used in storing the inverted index
aggregateCrawlList(array $list_strings, string $data_field = null) : array
When @see getCrawlList() is used in a multi-queue server setting, this method is used to integrate the crawl lists received from the different machines
array | $list_strings | serialized crawl list data from different queue servers |
string | $data_field | field of $list_strings to use for data |
list of crawls and their meta data
crawlStalled(array $machine_urls = null) : boolean
Determines if the length of time since any of the fetchers has spoken with any of the queue servers has exceeded CRAWL_TIMEOUT. If so, typically the caller of this method would do something such as officially stop the crawl.
array | $machine_urls | an array of urls of yioop queue servers |
whether the current crawl is stalled or not
aggregateStalled(array $stall_statuses, string $data_field = null) : array
When @see crawlStalled() is used in a multi-queue server setting, this method is used to integrate the stalled information received from the different machines
array | $stall_statuses | contains web encoded serialized data, one field of which has the boolean data concerning stalled status |
string | $data_field | field of $stall_statuses to use for data; if null, then each element of $stall_statuses is a web encoded serialized boolean |
crawlStatus(array $machine_urls = null) : array
Returns data about current crawl such as DESCRIPTION, TIMESTAMP, peak memory of various processes, most recent fetcher, most recent urls, urls seen, urls visited, etc.
array | $machine_urls | an array of urls of yioop queue servers on which the crawl is being conducted |
associative array of the said data
aggregateStatuses(array $status_strings, string $data_field = null) : array
When @see crawlStatus() is used in a multi-queue server setting, this method is used to integrate the status information received from the different machines
array | $status_strings | serialized status data received from the different queue servers |
string | $data_field | field of $status_strings to use for data |
associative array of DESCRIPTION, TIMESTAMP, peak memory of various processes, most recent fetcher, most recent urls, urls seen, urls visited, etc.
combinedCrawlInfo(array $machine_urls = null, boolean $use_cache = false) : array
This method is used to reduce the number of network requests needed by the crawlStatus method of admin_controller. It returns an array containing the results of @see crawlStalled, @see crawlStatus, and @see getCrawlList.
array | $machine_urls | an array of urls of yioop queue servers |
boolean | $use_cache | whether to try to use a cached version of the crawl info or to always recompute it. |
containing three components one for each of the three kinds of results listed above
injectUrlsCurrentCrawl(string $timestamp, array $inject_urls, array $machine_urls = null)
Add the provided urls to the schedule directory of URLs that will be crawled
string | $timestamp | Unix timestamp of crawl to add to schedule of |
array | $inject_urls | urls to be added to the schedule of the active crawl |
array | $machine_urls | an array of urls of yioop queue servers |
countWords(array $words, array $machine_urls = null) : array
Computes for each word in an array of words a count of the total number of times it occurs in this crawl model's default index.
array | $words | words to find the counts for |
array | $machine_urls | machines to invoke this command on |
associative array of word => counts
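In the multi-machine case, the per-server counts have to be merged into one associative array of word => counts. A minimal sketch of that aggregation step (hypothetical helper, not Yioop's code):

```php
<?php
// Sketch: merge per-machine word count arrays by summing the count
// for each word across all queue servers. Hypothetical helper.
function aggregateWordCounts(array $machine_results): array
{
    $totals = [];
    foreach ($machine_results as $counts) {
        foreach ($counts as $word => $count) {
            // sum counts, treating a missing word as zero
            $totals[$word] = ($totals[$word] ?? 0) + $count;
        }
    }
    return $totals;
}
```

Each queue server only sees its own shard of the index, so per-word totals are only meaningful after this summation.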