MIN_DESCRIPTION_LENGTH
The minimum length of a description before we stop appending additional link doc summaries.
This class is used to handle getting/setting crawl parameters; CRUD operations on current crawls; starting, stopping, and checking the status of crawls; getting cache files out of crawls; determining the default index to use; marshalling/unmarshalling crawl mixes; and handling data from suggest-a-url forms.
__construct(string $db_name = \seekquarry\yioop\configs\DB_NAME, boolean $connect = true)
Sets up the database manager that will be used and name of the search engine database
string | $db_name | the name of the database for the search engine |
boolean | $connect | whether to connect to the database by default after making the datasource class |
getCrawlItem(string $url, array $machine_urls = null, string $index_name = "") : array
Get a summary of a document by the generation it is in and its offset into the corresponding WebArchive.
string | $url | of summary we are trying to look-up |
array | $machine_urls | an array of urls of yioop queue servers |
string | $index_name | timestamp of the index to do the lookup in |
summary data of the matching document
getCrawlItems(string $lookups, array $machine_urls = null, array $exclude_fields = array(), array $format_words = null, integer $description_length = self::DEFAULT_DESCRIPTION_LENGTH) : array
Gets summaries for a set of documents by their url, or by a group of 5-tuples of the form (machine, key, index, generation, offset).
string | $lookups | things whose summaries we are trying to look up |
array | $machine_urls | an array of urls of yioop queue servers |
array | $exclude_fields | an array of fields which might be in the crawlItem but which should be excluded from the result. This makes the result smaller and so hopefully faster to transmit |
array | $format_words | words which should be highlighted in search snippets returned |
integer | $description_length | length of snippets to be returned for each search result |
of summary data for the matching documents
networkGetCrawlItems(string $lookups, array $machine_urls, array $exclude_fields = array(), array $format_words = null, integer $description_length = self::DEFAULT_DESCRIPTION_LENGTH) : array
In a multiple queue server setting, gets summaries for a set of documents by their url, or by a group of 5-tuples of the form (machine, key, index, generation, offset). This makes an execMachines call to send a network request to the CrawlController on each machine, which in turn calls getCrawlItems (and thence nonNetworkGetCrawlItems) on that machine. The results are then sent back to networkGetCrawlItems and aggregated.
string | $lookups | things whose summaries we are trying to look up |
array | $machine_urls | an array of urls of yioop queue servers |
array | $exclude_fields | an array of fields which might be in the crawlItem but which should be excluded from the result. This makes the result smaller and so hopefully faster to transmit |
array | $format_words | words which should be highlighted in search snippets returned |
integer | $description_length | length of snippets to be returned for each search result |
of summary data for the matching documents
nonNetworkGetCrawlItems(string $lookups, array $exclude_fields = array(), array $format_words = null, integer $description_length = self::DEFAULT_DESCRIPTION_LENGTH) : array
Gets summaries on a particular machine for a set of documents by their url, or by a group of 5-tuples of the form (machine, key, index, generation, offset). This may be used in the single queue_server setting, or it may be called indirectly by a particular machine's CrawlController as part of fulfilling a network-based getCrawlItems request. $lookups contains items which are to be grouped (as they came from the same url or a site with the same cache), so this function aggregates their descriptions.
string | $lookups | things whose summaries we are trying to look up |
array | $exclude_fields | an array of fields which might be in the crawlItem but which should be excluded from the result. This makes the result smaller and so hopefully faster to transmit |
array | $format_words | words which should be highlighted in search snippets returned |
integer | $description_length | length of snippets to be returned for each search result |
of summary data for the matching documents
lookupSummaryOffsetGeneration(string $url_or_key, string $index_name = "", boolean $is_key = false) : array
Determines the offset into the summaries WebArchiveBundle and generation of the provided url (or hash_url) so that the info:url (info:base64_hash_url) summary can be retrieved. This assumes of course that the info:url meta word has been stored.
string | $url_or_key | either info:base64_hash_url or just a url to lookup |
string | $index_name | index into which to do the lookup |
boolean | $is_key | whether the string is info:base64_hash_url or just a url |
(offset, generation) into the web archive bundle
clearQuerySavePoint(integer $save_timestamp, array $machine_urls = null)
A save point is used to store to disk a sequence of generation-doc-offset pairs of a particular mix query when doing an archive crawl of a crawl mix. It is used so that the mix can remember where it was the next time it is invoked by the web app on the machine in question.
This function deletes such a save point associated with a timestamp
integer | $save_timestamp | timestamp of save point to delete |
array | $machine_urls | machines on which to try to delete savepoint |
execMachines(string $command, array $machine_urls, string $arg = null, integer $num_machines, boolean $send_specs = false) : array
This method is invoked by other ParallelModel methods (@see CrawlModel for examples) when they want to have their method performed on an array of other Yioop instances; the results returned can then be aggregated. The invocation sequence is: crawlModelMethodA invokes execMachines with a list of urls of other Yioop instances. execMachines makes REST requests to those instances with the given command and optional arguments. Each request is handled by a CrawlController, which in turn calls crawlModelMethodA on the given Yioop instance, serializes the result, and gives it back to execMachines, and thence back to the originally calling function.
string | $command | the ParallelModel method to invoke on the remote Yioop instances |
array | $machine_urls | machines to invoke this command on |
string | $arg | additional arguments to be passed to the remote machine |
integer | $num_machines | the integer to be used in calculating partition |
boolean | $send_specs | whether to send the queue_server, num fetcher info for given machine |
a list of outputs from each machine that was called.
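The $num_machines parameter suggests that lookups are assigned to machines by a partition calculation. A minimal sketch of such a partition function, under the assumption (hypothetical, not confirmed by this documentation) that a key is hashed and reduced modulo the machine count:

```php
<?php
// Hypothetical sketch: map a lookup key to one of $num_machines
// queue servers by hashing it and taking the result mod the machine
// count. Yioop's actual partition calculation may differ in detail.
function partitionForKey(string $key, int $num_machines): int
{
    // crc32() returns a non-negative integer on 64-bit PHP, so the
    // modulus lands in the range [0, $num_machines - 1]
    return crc32($key) % $num_machines;
}
```

The same key always maps to the same machine, which is what lets independent callers agree on where a given item lives without coordination.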
fileGetContents(string $filename, boolean $force_read = false) : string
Either a wrapper for file_get_contents, or, if a WebSite object is being used to serve pages, reads the file in with a blocking file_get_contents() call and caches the result before returning its string contents.
Note this function assumes that only the web server is performing I/O on this file. filemtime() can be used to see if the file on disk has changed; if so, pass $force_read = true to force re-reading the file into the cache
string | $filename | name of file to get contents of |
boolean | $force_read | whether to force the file to be read from persistent storage rather than the cache |
contents of the file given by $filename
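The read-through caching behavior can be illustrated with a minimal self-contained sketch (a simplified stand-in; Yioop's actual WebSite implementation differs):

```php
<?php
// Simplified sketch of a RAM-cached file reader. The FileCache class
// name and structure are illustrative, not Yioop's actual code.
class FileCache
{
    /** in-memory map from filename to cached contents */
    private $cache = [];

    public function fileGetContents(string $filename,
        bool $force_read = false): string
    {
        // Serve from the RAM cache unless a fresh read is forced
        if (!$force_read && isset($this->cache[$filename])) {
            return $this->cache[$filename];
        }
        // Blocking read from persistent storage, then cache the result
        $contents = file_get_contents($filename);
        $this->cache[$filename] = $contents;
        return $contents;
    }
}
```

This shows why the documentation warns about external writers: once cached, a changed file on disk is invisible until the caller forces a re-read.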
filePutContents(string $filename, string $data)
Either a wrapper for file_put_contents, or if a WebSite object is being used to serve pages, writes $data to the persistent file with name $filename. Saves a copy in the RAM cache if there is a copy already there.
string | $filename | name of file to write to persistent storages |
string | $data | string of data to store in file |
formatSinglePageResult(array $page, array $words = null, integer $description_length = self::DEFAULT_DESCRIPTION_LENGTH) : array
Given a page summary, extracts snippets which are related to a set of search words. For each snippet, bold faces the search terms, and then creates a new summary array.
array | $page | a single search result summary |
array | $words | keywords (typically what was searched on) |
integer | $description_length | length of the description |
$page which has been snippified and bold faced
getSnippets(string $text, array $words, string $description_length) : string
Given a string, extracts snippets of text related to a given set of key words. For a given word, a snippet is a window of characters to its left and right that is less than a maximum total number of characters.
There is also a rule that a snippet should avoid ending in the middle of a word
string | $text | haystack to extract snippet from |
array | $words | keywords to look for in the haystack |
string | $description_length | length of the description desired |
a concatenation of the extracted snippets of each word
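The windowing rule, including the "don't end mid-word" constraint, can be sketched for a single keyword as follows (an illustrative simplification, not Yioop's actual getSnippets code):

```php
<?php
// Illustrative sketch: extract a character window around one keyword,
// trimming back to a word boundary so the snippet does not end
// mid-word. Hypothetical helper, simplified from the real logic.
function snippetForWord(string $text, string $word, int $max_len = 60): string
{
    $pos = stripos($text, $word);
    if ($pos === false) {
        return "";
    }
    // center the window on the keyword occurrence
    $half = (int)(($max_len - strlen($word)) / 2);
    $start = max(0, $pos - $half);
    $snippet = substr($text, $start, $max_len);
    // if the window was cut short of the text's end, trim back to the
    // last space so we don't stop in the middle of a word
    if ($start + $max_len < strlen($text)) {
        $cut = strrpos($snippet, " ");
        if ($cut !== false && $cut > strlen($word)) {
            $snippet = substr($snippet, 0, $cut);
        }
    }
    return $snippet;
}
```

The real method repeats this per keyword and concatenates the resulting windows.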
boldKeywords(string $text, array $words) : string
Given a string, wraps in bold html tags a set of key words it contains.
string | $text | haystack string to look for the key words |
array | $words | an array of words to bold face |
the resulting string after boldfacing has been applied
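A minimal version of such keyword bolding, assuming simple case-insensitive matching with no handling of overlapping or nested matches (a sketch, not Yioop's implementation):

```php
<?php
// Sketch: wrap each case-insensitive occurrence of each keyword in
// <b> tags, preserving the original casing via a capture group.
function boldKeywordsSketch(string $text, array $words): string
{
    foreach ($words as $word) {
        $text = preg_replace('/(' . preg_quote($word, '/') . ')/i',
            '<b>$1</b>', $text);
    }
    return $text;
}
```

preg_quote() escapes regex metacharacters in the keyword, so user-supplied search terms cannot alter the pattern.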
isSingleLocalhost(array $machine_urls, string $index_timestamp = -1) : boolean
Used to determine if an action involves just one yioop instance on the current local machine or not
array | $machine_urls | urls of yioop instances to which the action applies |
string | $index_timestamp | if timestamp exists checks if the index has declared itself to be a no network index. |
whether it involves a single local yioop instance (true) or not (false)
searchArrayToWhereOrderClauses(array $search_array, array $any_fields = array('status')) : array
Creates the WHERE and ORDER BY clauses for a query of a Yioop table such as USERS, ROLE, GROUP, which have associated search web forms. Searches are case insensitive
array | $search_array | each element of this is a quadruple name of a field, what comparison to perform, a value to check, and an order (ascending/descending) to sort by |
array | $any_fields | these fields if present in search array but with value "-1" will be skipped as part of the where clause but will be used for order by clause |
string for where clause, string for order by clause
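The quadruple format and the two returned clauses can be illustrated with a simplified sketch (naive quoting for readability only; real code must escape values through the database manager):

```php
<?php
// Sketch: turn an array of (field, comparison, value, direction)
// quadruples into a WHERE clause and an ORDER BY clause. Hypothetical
// helper; values are quoted naively, unlike production code.
function searchArrayToClausesSketch(array $search_array): array
{
    $where_parts = [];
    $order_parts = [];
    foreach ($search_array as [$field, $comparison, $value, $direction]) {
        if ($value !== "") {
            if ($comparison === "CONTAINS") {
                // case-insensitive substring match
                $where_parts[] = "LOWER($field) LIKE LOWER('%$value%')";
            } else {
                $where_parts[] = "$field $comparison '$value'";
            }
        }
        if (in_array($direction, ["ASC", "DESC"])) {
            $order_parts[] = "$field $direction";
        }
    }
    $where = $where_parts ? " WHERE " . implode(" AND ", $where_parts) : "";
    $order = $order_parts ? " ORDER BY " . implode(", ", $order_parts) : "";
    return [$where, $order];
}
```

An empty value contributes nothing to the WHERE clause but can still contribute a sort direction, mirroring the $any_fields behavior described above.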
getRows(integer $limit, integer $num, \seekquarry\yioop\models\int& $total, array $search_array = array(), array $args = null) : array
Gets a range of rows which match the provided search criteria from the provided table
integer | $limit | starting row from the potential results to return |
integer | $num | number of rows after start row to return |
\seekquarry\yioop\models\int& | $total | gets set with the total number of rows that can be returned by the given database query |
array | $search_array | each element of this is a quadruple name of a field, what comparison to perform, a value to check, and an order (ascending/descending) to sort by |
array | $args | additional values which may be used to get rows (what these are will typically depend on the subclass implementation) |
selectCallback(mixed $args = null) : string
Controls which columns, and the names of those columns, from the tables underlying the given model should be returned by a getRows call.
This defaults to *, but in general will be overridden in subclasses of Model
mixed | $args | any additional arguments which should be used to determine the columns |
a comma separated list of columns suitable for a SQL query
whereCallback(mixed $args = null) : string
Controls the WHERE clause of the SQL query that underlies the given model and should be used in a getRows call.
This defaults to an empty WHERE clause.
mixed | $args | additional arguments that might be used to construct the WHERE clause. |
a SQL WHERE clause
rowCallback(array $row, mixed $args) : array
{@inheritDoc}
array | $row | row as retrieved from database query |
mixed | $args | additional arguments that might be used by this callback. In this case, should be a boolean flag that says whether or not to add information about the components of the crawl mix |
$row after callback manipulation
postQueryCallback(array $rows) : array
Called after getRows has retrieved all the rows that it would retrieve but before they are returned to give one last place where they could be further manipulated. For example, in MachineModel this callback is used to make parallel network calls to get the status of each machine returned by getRows. The default for this method is to leave the rows that would be returned unchanged
array | $rows | that have been calculated so far by getRows |
$rows after this final manipulation
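Taken together, selectCallback, whereCallback, rowCallback, and postQueryCallback form a template-method pattern: getRows drives the query and hands each stage to an overridable hook. A stripped-down sketch of that pipeline (hypothetical; the real Model class adds the actual SQL plumbing):

```php
<?php
// Hypothetical, stripped-down illustration of the callback pipeline
// that getRows drives. Yioop's Model class wires these hooks into a
// real database query; here we run them over pre-fetched rows.
class SketchModel
{
    /** columns to select; subclasses override (defaults to "*") */
    public function selectCallback($args = null): string
    {
        return "*";
    }
    /** WHERE clause; subclasses override (defaults to none) */
    public function whereCallback($args = null): string
    {
        return "";
    }
    /** per-row hook: massage each row as it is retrieved */
    public function rowCallback(array $row, $args): array
    {
        return $row;
    }
    /** final hook over the whole result set before it is returned */
    public function postQueryCallback(array $rows): array
    {
        return $rows;
    }
    /** drives the pipeline over already-fetched $raw_rows */
    public function getRows(array $raw_rows, $args = null): array
    {
        $rows = [];
        foreach ($raw_rows as $row) {
            $rows[] = $this->rowCallback($row, $args);
        }
        return $this->postQueryCallback($rows);
    }
}
```

A subclass customizes behavior by overriding only the hooks it needs; for example, MachineModel's postQueryCallback enriches each row with live status information.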
getCacheFile(string $machine, string $machine_uri, integer $partition, integer $offset, string $crawl_time, integer $instance_num = false) : array
Gets the cached version of a web page from the machine on which it was fetched.
Complete cached versions of web pages live on a fetcher machine for pre-version 2.0 indexes. For these versions, the queue server machine typically only maintains summaries. This method makes a REST request of a fetcher machine for a cached page and gets the results back.
string | $machine | the ip address or domain name of the machine the cached page lives on |
string | $machine_uri | the path from document root on $machine where the yioop scripts live |
integer | $partition | the partition in the WebArchiveBundle the page is in |
integer | $offset | the offset in bytes into the WebArchive partition in the WebArchiveBundle at which the cached page lives. |
string | $crawl_time | the timestamp of the crawl the cache page is from |
integer | $instance_num | which fetcher instance for the particular fetcher crawled the page (if more than one), false otherwise |
page data of the cached page
setCurrentIndexDatabaseName($timestamp)
Sets the IndexArchive that will be used for search results
$timestamp | the timestamp of the index archive. The timestamp is when the crawl was started. Currently, the timestamp appears as substring of the index archives directory name |
getDeltaFileInfo(string $dir, integer $timestamp, array $excludes) : array
Returns all the files in $dir or its subdirectories with modified times more recent than $timestamp. Files which have in their path or name a string in the $excludes array will be excluded
string | $dir | a directory to traverse |
integer | $timestamp | used to check modified times against |
array | $excludes | an array of path substrings to exclude |
of file structs consisting of name, modified time and size.
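A self-contained approximation of this traversal using SPL iterators (illustrative only; Yioop's implementation differs):

```php
<?php
// Sketch: collect files under $dir modified after $timestamp, skipping
// any file whose path contains a substring in $excludes. Returns
// structs of name, modified time, and size, as the method above does.
function deltaFileInfoSketch(string $dir, int $timestamp,
    array $excludes): array
{
    $out = [];
    $iter = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir,
            FilesystemIterator::SKIP_DOTS));
    foreach ($iter as $file) {
        $path = $file->getPathname();
        // drop any file whose path matches an exclude substring
        foreach ($excludes as $exclude) {
            if (strpos($path, $exclude) !== false) {
                continue 2;
            }
        }
        if ($file->isFile() && $file->getMTime() > $timestamp) {
            $out[] = ["name" => $path,
                "modified" => $file->getMTime(),
                "size" => $file->getSize()];
        }
    }
    return $out;
}
```

Passing a timestamp of 0 returns everything not excluded, which is a convenient way to take an initial snapshot before doing incremental syncs.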
getMixList(integer $user_id, boolean $with_components = false) : array
Gets a list of all mixes of available crawls
integer | $user_id | user that we are getting a list of mixes for. We have disabled mix sharing, so for now this is all mixes |
boolean | $with_components | if false then don't load the factors that make up the crawl mix, just load the name of the mixes and their timestamps; otherwise, if true loads everything |
list of available crawls
getCrawlMix(string $timestamp, boolean $just_components = false) : array
Retrieves the weighting component of the requested crawl mix
string | $timestamp | of the requested crawl mix |
boolean | $just_components | says whether to find the mix name or just the components array. |
the crawls and their weights that make up the requested crawl mix.
isMixOwner(string $timestamp, string $user_id) : boolean
Returns whether there is a mix with the given $timestamp that $user_id owns. Currently, mix ownership is ignored and this is set to always return true.
string | $timestamp | to see if exists |
string | $user_id | id of would be owner |
true if owner; false otherwise
getSeedInfo(boolean $use_default = false) : array
Returns the initial sites that a new crawl will start with along with crawl parameters such as crawl order, allowed and disallowed crawl sites
boolean | $use_default | whether or not to use the Yioop! default crawl.ini file rather than the one created by the user. |
the first sites to crawl during the next crawl, along with the restrict_by_url, allowed, and disallowed_sites settings
getCrawlSeedInfo(string $timestamp, array $machine_urls = null) : array
Returns the crawl parameters that were used during a given crawl
string | $timestamp | timestamp of the crawl to load the crawl parameters of |
array | $machine_urls | an array of urls of yioop queue servers |
the first sites to crawl during the next crawl, along with the restrict_by_url, allowed, and disallowed_sites settings
setCrawlSeedInfo(string $timestamp, array $new_info, array $machine_urls = null)
Changes the crawl parameters of an existing crawl (this can be done while crawling). Not all fields are allowed to be updated
string | $timestamp | timestamp of the crawl to change |
array | $new_info | the new parameters |
array | $machine_urls | an array of urls of yioop queue servers |
appendSuggestSites(string $url) : string
Add new distinct urls to those already saved in the suggest_url_file. If the supplied url is not new, or the file size exceeds MAX_SUGGEST_URL_FILE_SIZE, then it is not added.
string | $url | to add |
true if the url was added or already existed in the file; false otherwise
getInfoTimestamp(integer $timestamp, array $machine_urls = null) : array
Get a description associated with a Web Crawl or Crawl Mix
integer | $timestamp | of crawl or mix in question |
array | $machine_urls | an array of urls of yioop queue servers |
associative array containing item DESCRIPTION
sendStartCrawlMessage(array $crawl_params, array $seed_info = null, array $machine_urls = null, integer $num_fetchers)
Used to send a message to the queue servers to start a crawl
array | $crawl_params | has info like the time of the crawl, whether starting a new crawl or resuming an old one, etc. |
array | $seed_info | what urls to crawl, etc as from the crawl.ini file |
array | $machine_urls | an array of urls of yioop queue servers |
integer | $num_fetchers | number of fetchers on machine to start. This parameter and $channel are used to start the daemons running on the machines if they aren't already running |
startQueueServerFetchers(integer $channel, integer $num_fetchers) : boolean
Used to start QueueServers and Fetchers on current machine when it is detected that someone tried to start a crawl but hadn't started any queue servers or fetchers.
integer | $channel | channel of crawl to start |
integer | $num_fetchers | the number of fetchers on the current machine |
whether any processes were started
getCrawlList(boolean $return_arc_bundles = false, boolean $return_recrawls = false, array $machine_urls = null, boolean $cache = false) : array
Gets a list of all index archives of crawls that have been conducted
boolean | $return_arc_bundles | whether index bundles used for indexing arc or other archive bundles should be included in the list |
boolean | $return_recrawls | whether index archive bundles generated as a result of recrawling should be included in the result |
array | $machine_urls | an array of urls of yioop queue servers |
boolean | $cache | whether to try to get/set the data to a cache file |
available IndexArchiveBundle directories and their meta information. This meta information includes the time of the crawl, its description, the number of pages downloaded, and the number of partitions used in storing the inverted index
aggregateCrawlList(array $list_strings, string $data_field = null) : array
When @see getCrawlList() is used in a multi-queue server setting, this method is used to integrate the crawl lists received from the different machines
array | $list_strings | serialized crawl list data from different queue servers |
string | $data_field | field of $list_strings to use for data |
list of crawls and their meta data
crawlStalled(array $machine_urls = null) : boolean
Determines if the length of time since any of the fetchers has spoken with any of the queue servers has exceeded CRAWL_TIMEOUT. If so, typically the caller of this method would do something such as officially stop the crawl.
array | $machine_urls | an array of urls of yioop queue servers |
whether the current crawl is stalled or not
aggregateStalled(array $stall_statuses, string $data_field = null) : array
When @see crawlStalled() is used in a multi-queue server setting, this method is used to integrate the stalled information received from the different machines
array | $stall_statuses | contains web encoded serialized data, one field of which has the boolean data concerning stalled status |
string | $data_field | field of $stall_statuses to use for data; if null, then each element of $stall_statuses is a web encoded serialized boolean |
crawlStatus(array $machine_urls = null) : array
Returns data about current crawl such as DESCRIPTION, TIMESTAMP, peak memory of various processes, most recent fetcher, most recent urls, urls seen, urls visited, etc.
array | $machine_urls | an array of urls of yioop queue servers on which the crawl is being conducted |
associative array of the said data
aggregateStatuses(array $status_strings, string $data_field = null) : array
When @see crawlStatus() is used in a multi-queue server setting, this method is used to integrate the status information received from the different machines
array | $status_strings | serialized status data received from the different queue servers |
string | $data_field | field of $status_strings to use for data |
associative array of DESCRIPTION, TIMESTAMP, peak memory of various processes, most recent fetcher, most recent urls, urls seen, urls visited, etc.
combinedCrawlInfo(array $machine_urls = null, boolean $use_cache = false) : array
This method is used to reduce the number of network requests needed by the crawlStatus method of admin_controller. It returns an array containing the results of @see crawlStalled, @see crawlStatus, and @see getCrawlList.
array | $machine_urls | an array of urls of yioop queue servers |
boolean | $use_cache | whether to try to use a cached version of the crawl info or to always recompute it. |
containing three components one for each of the three kinds of results listed above
injectUrlsCurrentCrawl(string $timestamp, array $inject_urls, array $machine_urls = null)
Add the provided urls to the schedule directory of URLs that will be crawled
string | $timestamp | Unix timestamp of crawl to add to schedule of |
array | $inject_urls | urls to be added to the schedule of the active crawl |
array | $machine_urls | an array of urls of yioop queue servers |
countWords(array $words, array $machine_urls = null) : array
Computes for each word in an array of words a count of the total number of times it occurs in this crawl model's default index.
array | $words | words to find the counts for |
array | $machine_urls | machines to invoke this command on |
associative array of word => counts
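In the multi-machine case, the per-server counts have to be merged into one associative array of word => counts. A minimal sketch of that aggregation step (hypothetical helper, not Yioop's code):

```php
<?php
// Sketch: merge per-machine word count arrays by summing the count
// for each word across all queue servers. Hypothetical helper.
function aggregateWordCounts(array $machine_results): array
{
    $totals = [];
    foreach ($machine_results as $counts) {
        foreach ($counts as $word => $count) {
            // sum counts, treating a missing word as zero
            $totals[$word] = ($totals[$word] ?? 0) + $count;
        }
    }
    return $totals;
}
```

Each queue server only sees its own shard of the index, so per-word totals are only meaningful after this summation.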