\seekquarry\yioop\executables\QueueServer

Command line program responsible for managing Yioop crawls.

It maintains a queue of urls that are going to be scheduled to be seen. It also keeps track of what has been seen and robots.txt info. Its last responsibility is to create and populate the IndexArchiveBundle that is used by the search front end.

Summary

Methods
Properties
Constants
__construct()
start()
loop()
checkBothProcessesRunning()
checkProcessRunning()
checkRepeatingCrawlSwap()
processCrawlData()
isAScheduler()
isAIndexer()
isOnlyIndexer()
isOnlyScheduler()
writeArchiveCrawlInfo()
processRecrawlRobotUrls()
processRecrawlRobotArchive()
getDataArchiveFileData()
processRecrawlDataArchive()
handleAdminMessages()
stopCrawl()
writeAdminMessage()
dumpQueueToSchedules()
shutdownDictionary()
runPostProcessingPlugins()
indexSave()
startCrawl()
initializeIndexBundle()
updateDisallowedQuotaSites()
initializeWebQueue()
clearWebQueue()
checkUpdateCrawlParameters()
deleteOrphanedBundles()
join()
processDataFile()
updateMostRecentFetcher()
processIndexData()
processIndexArchive()
constrainIndexerMemoryUsage()
processRobotUrls()
processRobotArchive()
processEtagExpires()
processEtagExpiresArchive()
deleteRobotData()
processQueueUrls()
processDataArchive()
dumpBigScheduleToSmall()
writeCrawlStatus()
calculateScheduleMetaInfo()
produceFetchBatch()
getEarliestSlot()
cullNoncrawlableSites()
allowedToCrawlSite()
disallowedToCrawlSite()
withinQuota()
$db
$allowed_sites
$disallowed_sites
$allow_disallow_cache_time
$quota_clear_time
$quota_sites
$quota_sites_keys
$channel
$repeat_type
$sleep_start
$sleep_duration
$crawl_order
$max_depth
$robot_txt
$summarizer_option
$page_range_request
$max_description_len
$page_recrawl_frequency
$crawl_type
$crawl_index
$restrict_sites_by_url
$indexed_file_types
$all_file_types
$cache_pages
$page_rules
$web_queue
$index_archive
$crawl_time
$waiting_hosts
$most_recent_fetcher
$last_index_save_time
$index_dirty
$archive_modified_time
$indexing_plugins
$indexing_plugins_data
$hourly_crawl_data
$server_type
$server_name
$process_name
$debug
$info_parameter_map
No constants found

Properties

$db

$db : object

Reference to a database object. Used since it has directory manipulation functions.

Type

object

$allowed_sites

$allowed_sites : array

Web-sites that the crawler can crawl. If used, ONLY these will be crawled.

Type

array

$disallowed_sites

$disallowed_sites : array

Web-sites that the crawler must not crawl

Type

array

$allow_disallow_cache_time

$allow_disallow_cache_time : integer

Microtime used to look up the cached $allowed_sites and $disallowed_sites filtering data structures

Type

integer

$quota_clear_time

$quota_clear_time : integer

Timestamp of the last time download-from-site quotas were cleared

Type

integer

$quota_sites

$quota_sites : array

Web-sites that have an hourly crawl quota

Type

array

$quota_sites_keys

$quota_sites_keys : array

Cache of array_keys of $quota_sites

Type

array

$channel

$channel : integer

Channel that queue server listens to messages for

Type

integer

$repeat_type

$repeat_type : integer

Controls whether a repeating crawl is being done (negative means no) and, if so, its frequency in seconds

Type

integer

$sleep_start

$sleep_start : string

If a crawl quiescent period is being used with the crawl, then this stores the time of day at which that period starts

Type

string

$sleep_duration

$sleep_duration : string

If a crawl quiescent period is being used with the crawl, then this property will be positive and indicate the number of seconds duration for the quiescent period.

Type

string

$crawl_order

$crawl_order : string

Constant saying the method used to order the priority queue for the crawl

Type

string

$max_depth

$max_depth : string

Constant saying the depth from the seed sites to which the crawl can go

Type

string

$robot_txt

$robot_txt : integer

One of a fixed set of values which are used to control to what extent Yioop follows robots.txt files: ALWAYS_FOLLOW_ROBOTS, ALLOW_LANDING_ROBOTS, IGNORE_ROBOTS

Type

integer

$summarizer_option

$summarizer_option : string

Stores the name of the summarizer used for crawling.

Possible values are Basic and Centroid

Type

string

$page_range_request

$page_range_request : integer

Maximum number of bytes to download of a webpage

Type

integer

$max_description_len

$max_description_len : integer

Max number of chars to extract for description from a page to index.

Only words in the description are indexed.

Type

integer

$page_recrawl_frequency

$page_recrawl_frequency : integer

Number of days between resets of the page url filter. If nonpositive, the filter is never reset.

Type

integer

$crawl_type

$crawl_type : string

Indicates the kind of crawl being performed: self::WEB_CRAWL indicates a new crawl of the web; self::ARCHIVE_CRAWL indicates a crawl of an existing web archive

Type

string

$crawl_index

$crawl_index : string

If the crawl_type is self::ARCHIVE_CRAWL, then crawl_index is the timestamp of the existing archive to crawl

Type

string

$restrict_sites_by_url

$restrict_sites_by_url : boolean

Says whether the $allowed_sites array is being used or not

Type

boolean

$indexed_file_types

$indexed_file_types : array

List of file extensions supported for the crawl

Type

array

$all_file_types

$all_file_types : array

List of all known file extensions including those not used for crawl

Type

array

$cache_pages

$cache_pages : boolean

Used in schedules to tell the fetcher whether or not to cache pages

Type

boolean

$page_rules

$page_rules : array

Used to add page rules, to be applied to downloaded pages, to schedules that the fetcher will use (and hence apply the page rules)

Type

array

$web_queue

$web_queue : object

Holds the WebQueueBundle for the crawl. This bundle encapsulates the priority queue of urls that specifies what to crawl next

Type

object

$index_archive

$index_archive : object

Holds the IndexArchiveBundle for the current crawl. This encapsulates the inverted index word-->documents for the crawls as well as document summaries of each document.

Type

object

$crawl_time

$crawl_time : integer

The timestamp of the current active crawl

Type

integer

$waiting_hosts

$waiting_hosts : array

This is a list of hosts whose robots.txt file had a Crawl-delay directive and which we have produced a schedule with urls for, but we have not heard back from the fetcher who was processing those urls. Hosts on this list will not be scheduled for more downloads until the fetcher with earlier urls has gotten back to the queue server.

Type

array

$most_recent_fetcher

$most_recent_fetcher : string

IP address as a string of the fetcher that most recently spoke with the queue server.

Type

string

$last_index_save_time

$last_index_save_time : integer

Last time index was saved to disk

Type

integer

$index_dirty

$index_dirty : integer

Flag for whether the index has data to be written to disk

Type

integer

$archive_modified_time

$archive_modified_time : integer

This keeps track of the time the current archive info was last modified. This way the queue server knows if the user has changed the crawl parameters during the crawl.

Type

integer

$indexing_plugins

$indexing_plugins : array

This is a list of indexing_plugins which might do post processing after the crawl. Each plugin's postProcessing function is called if the plugin was selected on the crawl options page.

Type

array

$indexing_plugins_data

$indexing_plugins_data : array

This is an array of crawl parameters for indexing_plugins which might do post processing after the crawl.

Type

array

$hourly_crawl_data

$hourly_crawl_data : array

This is a list of hourly (timestamp, number_of_urls_crawled) statistics

Type

array

$server_type

$server_type : mixed

Used to say what kind of queue server this is (one of BOTH, INDEXER, SCHEDULER)

Type

mixed

$server_name

$server_name : string

String used to describe this kind of queue server (Indexer, Scheduler, etc.) in the log files.

Type

string

$process_name

$process_name : string

String used for naming log files and for naming the processes which run related to the queue server

Type

string

$debug

$debug : string

Holds the value of a debug message that might have been sent from the command line during the current execution of loop().

Type

string

$info_parameter_map

$info_parameter_map : array

A mapping between class field names and parameters which might be sent to a queue server via an info associative array. A minimal illustrative sketch of applying such a map appears after this entry.

Type

array
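
For illustration, a minimal sketch of how such a field-name to info-key mapping might be applied; the map entries, info keys, and class name below are hypothetical placeholders, not the actual contents of $info_parameter_map:

<?php
// Hypothetical example: the map keys and info keys are placeholders,
// not the actual contents of QueueServer::$info_parameter_map.
class ServerFieldsSketch
{
    public $crawl_order;
    public $max_depth;
    public $info_parameter_map = [
        "crawl_order" => "CRAWL_ORDER",
        "max_depth" => "MAX_DEPTH",
    ];
    public function setFromInfo(array $info): void
    {
        foreach ($this->info_parameter_map as $field => $parameter) {
            if (isset($info[$parameter])) {
                $this->$field = $info[$parameter];
            }
        }
    }
}
$server = new ServerFieldsSketch();
$server->setFromInfo(["CRAWL_ORDER" => "PAGE_IMPORTANCE", "MAX_DEPTH" => 5]);
echo $server->crawl_order . "\n"; // PAGE_IMPORTANCE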

Methods

__construct()

__construct() 

Creates a Queue Server Daemon

start()

start() 

This is the function that should be called to get the queue server to start. Calls init to handle the command line arguments then enters the queue server's main loop

loop()

loop() 

Main runtime loop of the queue server.

Loops until a stop message is received; checks for start, stop, and resume crawl messages; and deletes any WebQueueBundle for which an IndexArchiveBundle does not exist. Processes crawl data via processCrawlData().

checkBothProcessesRunning()

checkBothProcessesRunning(array  $info) 

Checks to make sure both the indexer process and the scheduler process are running and, if not, restarts the stopped process

Parameters

array $info

information about queue server state used to determine if a crawl is active.

checkProcessRunning()

checkProcessRunning(string  $process, array  $info) 

Checks to make sure the given process (either Indexer or Scheduler) is running, and if not, restart it.

Parameters

string $process

should be either self::INDEXER or self::SCHEDULER

array $info

information about queue server state used to determine if a crawl is active.

checkRepeatingCrawlSwap()

checkRepeatingCrawlSwap() : boolean

Checks, for a repeating crawl, whether it is time to swap between the active and search crawls.

Returns

boolean —

true if the time to swap has come

processCrawlData()

processCrawlData(boolean  $blocking = false) 

Main body of queue server loop where indexing, scheduling, robot file processing is done.

Parameters

boolean $blocking

this method might be called by the indexer subcomponent when a merge tier phase is ongoing, to allow for other processing to occur. If so, we don't want a regress where the indexer calls this code, which calls the indexer, and so on. If the blocking flag is set, then the indexer subcomponent won't be called.
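
A minimal sketch of the kind of re-entry guard such a blocking flag enables; the function names here are illustrative stand-ins, not the actual call chain:

<?php
// Illustrative only: a blocking flag prevents the indexer from being
// re-entered while it is already doing a long merge.
function processCrawlDataSketch(bool $blocking = false): void
{
    if (!$blocking) {
        // safe to invoke the indexer subcomponent
        processIndexDataSketch();
    }
    // scheduling and robot processing can still happen either way
    processRobotDataSketch();
}
function processIndexDataSketch(): void { echo "indexing\n"; }
function processRobotDataSketch(): void { echo "robot processing\n"; }
processCrawlDataSketch(true); // called from within a merge: skips indexer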

isAScheduler()

isAScheduler() : boolean

Used to check if the current queue server process is acting as a url scheduler for fetchers

Returns

boolean —

whether it is or not

isAIndexer()

isAIndexer() : boolean

Used to check if the current queue server process is acting as an indexer of data coming from fetchers

Returns

boolean —

whether it is or not

isOnlyIndexer()

isOnlyIndexer() : boolean

Used to check if the current queue server process is acting only as an indexer of data coming from fetchers (and not some other activity like scheduler as well)

Returns

boolean —

whether it is or not

isOnlyScheduler()

isOnlyScheduler() : boolean

Used to check if the current queue server process is acting only as a scheduler of urls for fetchers (and not some other activity like indexer as well)

Returns

boolean —

whether it is or not

writeArchiveCrawlInfo()

writeArchiveCrawlInfo() 

Used to write info about the current recrawl to file as well as to process any recrawl data files received

processRecrawlRobotUrls()

processRecrawlRobotUrls() 

Even during a recrawl the fetcher may send robot data to the queue server. This function prints a log message and calls another function to delete this useless robot file.

processRecrawlRobotArchive()

processRecrawlRobotArchive(string  $file) 

Even during a recrawl the fetcher may send robot data to the queue server. This function deletes the passed robot file.

Parameters

string $file

robot file to delete

getDataArchiveFileData()

getDataArchiveFileData(string  $file) : array

Used to get a data archive file (either during a normal crawl or a recrawl). After uncompressing this file (which comes via the web server through fetch_controller, from the fetcher), it sets which fetcher sent it and also returns the sites contained in it. A rough decompress-and-unserialize sketch appears after this entry's Returns section.

Parameters

string $file

name of archive data file

Returns

array —

sites contained in the file from the fetcher
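
A rough sketch of the decompress-and-unserialize step, using gzuncompress and unserialize as assumed stand-ins; the actual wire format Yioop uses may differ:

<?php
// Hedged sketch: read an archive data file, decompress it, and recover
// the sites array. The real format and fields may differ.
function getDataArchiveFileDataSketch(string $file): array
{
    $compressed = file_get_contents($file);
    if ($compressed === false) {
        return [];
    }
    $decompressed = @gzuncompress($compressed);
    $sites = ($decompressed !== false) ? unserialize($decompressed) : false;
    return is_array($sites) ? $sites : [];
}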

processRecrawlDataArchive()

processRecrawlDataArchive(string  $file) 

Processes fetcher data file information during a recrawl

Parameters

string $file

a file which contains the info to process

handleAdminMessages()

handleAdminMessages(array  $info) : array

Handles messages passed via files to the QueueServer.

These files are typically written by CrawlDaemon::init() when QueueServer is run using command-line arguments

Parameters

array $info

associative array with info about current state of queue server

Returns

array —

an updated version of $info reflecting changes that occurred during the handling of the admin message files.

stopCrawl()

stopCrawl() 

Used to stop the currently running crawl gracefully so that it can be restarted. This involves writing the queue's contents back to schedules, making the crawl's dictionary all the same tier, and running any indexing_plugins.

writeAdminMessage()

writeAdminMessage(string  $message) 

Used to write an admin crawl status message during a start or stop crawl.

Parameters

string $message

message to write into crawl_status.txt; this will show up in the web crawl status element.

dumpQueueToSchedules()

dumpQueueToSchedules(boolean  $for_reschedule = false) 

When a crawl is being shutdown, this function is called to write the contents of the web queue bundle back to schedules. This allows crawls to be resumed without losing urls. This function can also be called if the queue gets clogged to reschedule its contents for a later time.

Parameters

boolean $for_reschedule

whether the call is to reschedule the urls to be crawled at a later time, as opposed to saving the urls because the crawl is being halted.

shutdownDictionary()

shutdownDictionary() 

During crawl shutdown, this function is called to do a final save and merge of the crawl dictionary, so that it is ready to serve queries.

runPostProcessingPlugins()

runPostProcessingPlugins() 

During crawl shutdown this is called to run any post processing plugins

indexSave()

indexSave() 

Saves the index_archive and, in particular, its current shard to disk

startCrawl()

startCrawl(array  $info) 

Begins crawling based on the time, order, and restricted-site info in $info. Setting up a crawl involves creating a queue bundle and an index archive bundle.

Parameters

array $info

parameters for the crawl

initializeIndexBundle()

initializeIndexBundle(array  $info = array(), array  $try_to_set_from_old_index = null) 

Function used to set up an indexer's IndexArchiveBundle or DoubleIndexBundle according to the current crawl parameters or the values stored in an existing bundle.

Parameters

array $info

if initializing a new crawl this should contain the crawl parameters

array $try_to_set_from_old_index

parameters of the crawl to try to set from values already stored in archive info; other parameters are assumed to have been updated since.

updateDisallowedQuotaSites()

updateDisallowedQuotaSites() 

This is called whenever the crawl options are modified. It parses, from the disallowed sites, those sites of the format site#quota, where quota is the number of urls allowed to be downloaded in an hour from the site. These sites are then deleted from $disallowed_sites and added to $this->quota_sites. An entry in $this->quota_sites has the format: $quota_site => [$quota, $num_urls_downloaded_this_hr].
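
A minimal sketch of the site#quota parsing described above; the seed urls and variable handling are illustrative:

<?php
// Illustrative parse of disallowed sites of the form site#quota.
$disallowed_sites = ["https://example.com/#100", "https://other.test/"];
$quota_sites = [];
foreach ($disallowed_sites as $key => $site) {
    $hash_pos = strrpos($site, "#");
    if ($hash_pos !== false && ctype_digit(substr($site, $hash_pos + 1))) {
        $quota = intval(substr($site, $hash_pos + 1));
        $quota_site = substr($site, 0, $hash_pos);
        // entry format: $quota_site => [$quota, $num_urls_downloaded_this_hr]
        $quota_sites[$quota_site] = [$quota, 0];
        unset($disallowed_sites[$key]);
    }
}
print_r($quota_sites); // ["https://example.com/" => [100, 0]]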

initializeWebQueue()

initializeWebQueue() 

This method sets up a WebQueueBundle according to the current crawl order so that it can receive urls and prioritize them.

clearWebQueue()

clearWebQueue() 

Deletes all the urls from the web queue; does not affect filters.

checkUpdateCrawlParameters()

checkUpdateCrawlParameters() 

Checks to see if the parameters by which the active crawl is being conducted have been modified since the last time the values were put into queue server field variables. If so, updates the fields to their new values.

deleteOrphanedBundles()

deleteOrphanedBundles() 

Deletes all the queue bundles and schedules that don't have an associated index bundle, as this means the crawl has been deleted.

join()

join() 

This is a callback method that IndexArchiveBundle will periodically call when it processes a method that takes a long time. This allows, for instance, continued processing of index data while, say, a dictionary merge is being performed.

processDataFile()

processDataFile(string  $base_dir, string  $callback_method, boolean  $blocking = false) 

Generic function used to process Data, Index, and Robot info schedules. Finds the first file in the directory of schedules of the given type and calls the appropriate callback method for that type. A minimal sketch of this pattern appears after the parameter descriptions below.

Parameters

string $base_dir

directory of schedules

string $callback_method

what method should be called to handle a schedule

boolean $blocking

this method might be called by the indexer subcomponent when a merge tier phase is ongoing, to allow for other processing to occur. If so, we don't want a regress where the indexer calls this code, which calls the indexer, and so on. If the blocking flag is set, then the indexer subcomponent won't be called.
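
A minimal sketch of this find-oldest-file-and-dispatch pattern; the directory layout, file extension, and the assumption that names sort oldest-first are all illustrative:

<?php
// Hedged sketch: pick the first (oldest) schedule file in a directory and
// hand it to a callback; the real method also passes the blocking flag.
function processDataFileSketch(string $base_dir, callable $callback): void
{
    $files = glob($base_dir . "/*.txt");
    if (empty($files)) {
        return;
    }
    sort($files); // assume file names sort so that the oldest comes first
    $callback($files[0]);
}
processDataFileSketch("/tmp/schedules", function ($file) {
    echo "would process $file\n";
});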

updateMostRecentFetcher()

updateMostRecentFetcher() 

Determines the most recent fetcher that has spoken with the web server of this queue server and stores the result in the field variable $most_recent_fetcher

processIndexData()

processIndexData(boolean  $blocking) 

Sets up the directory to look for a file of unprocessed index archive data from fetchers then calls the function processDataFile to process the oldest file found

Parameters

boolean $blocking

this method might be called by the indexer subcomponent when a merge tier phase is ongoing, to allow for other processing to occur. If so, we don't want a regress where the indexer calls this code, which calls the indexer, and so on. If the blocking flag is set, then the indexer subcomponent won't be called.

processIndexArchive()

processIndexArchive(string  $file, boolean  $blocking) 

Adds the summary and index data in $file to summary bundle and word index

Parameters

string $file

containing web page summaries and a mini-inverted index for their content

boolean $blocking

this method might be called by the indexer subcomponent when a merge tier phase is ongoing, to allow for other processing to occur. If so, we don't want a regress where the indexer calls this code, which calls the indexer, and so on. If the blocking flag is set, then the indexer subcomponent won't be called.

constrainIndexerMemoryUsage()

constrainIndexerMemoryUsage() 

Tries to prevent the Indexer from crashing due to excessive memory use.

If the Indexer is using more than C\MEMORY_FILL_FACTOR of its allowed memory, it tries to free memory by saving the index bundle to disk, freeing it, and then reloading it.
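
A rough sketch of such a memory check in plain PHP; the value of MEMORY_FILL_FACTOR and the save/free/reload step are stand-ins, not Yioop's actual constants or methods:

<?php
// Hedged sketch of a memory guard: compare current usage against a
// fraction of the configured limit and trigger a save/free/reload cycle.
const MEMORY_FILL_FACTOR = 0.75; // assumed value, not Yioop's constant

function memoryLimitBytes(): int
{
    $limit = ini_get("memory_limit"); // e.g. "128M", or "-1" for unlimited
    $value = (int) $limit;
    if ($value <= 0) {
        return PHP_INT_MAX;
    }
    $multipliers = ["K" => 1 << 10, "M" => 1 << 20, "G" => 1 << 30];
    $unit = strtoupper(substr($limit, -1));
    return $value * ($multipliers[$unit] ?? 1);
}
if (memory_get_usage() > MEMORY_FILL_FACTOR * memoryLimitBytes()) {
    // in the real method: save the index bundle to disk, free it, reload it
    echo "memory high: would save, free, and reload the index bundle\n";
}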

processRobotUrls()

processRobotUrls() 

Checks how old the oldest robot data is and dumps it if older than a threshold, then sets up the path to the robot schedule directory and tries to process a file of robots.txt robot paths data from there

processRobotArchive()

processRobotArchive(string  $file) 

Reads in $file of robot data, adding host-paths to the disallowed robot filter and setting the delay in the delay filter for crawl-delayed hosts

Parameters

string $file

file of robot data to read; it is removed after processing

processEtagExpires()

processEtagExpires() 

Process cache page validation data files sent by Fetcher

processEtagExpiresArchive()

processEtagExpiresArchive(string  $file) 

Processes a cache page validation data file. Extracts key-value pairs from the file and inserts into the B-Tree used for storing cache page validation data.

Parameters

string $file

is the cache page validation data file written by Fetchers.

deleteRobotData()

deleteRobotData() 

Deletes all robot information stored by the QueueServer.

This function is called roughly every CACHE_ROBOT_TXT_TIME. It forces the crawler to redownload robots.txt files before hosts can continue to be crawled. This ensures the cached robots.txt file is never too old. Thus, if someone changes it to allow or disallow the crawler, the change will be noticed reasonably promptly.

processQueueUrls()

processQueueUrls() 

Checks for a new crawl file or schedule data for the current crawl and, if such a file exists, processes its contents, adding the relevant urls to the priority queue

processDataArchive()

processDataArchive(string  $file) 

Processes a file of to-crawl urls, adding to or adjusting the weight in the PriorityQueue of those urls which have not been seen. Also updates the queue with seen url info.

Parameters

string $file

containing serialized to crawl and seen url info

dumpBigScheduleToSmall()

dumpBigScheduleToSmall(array  &$sites) 

Used to split a large schedule of to-crawl sites into small ones (which are written to disk) that can be handled by processDataArchive. A minimal sketch of the splitting idea follows the parameter list below.

It is possible that a large schedule file is created if someone pastes more than MAX_FETCH_SIZE many urls into the initial seed sites of a crawl in the UI.

Parameters

array& $sites

array containing to crawl data
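
A minimal sketch of the splitting idea, assuming a MAX_FETCH_SIZE chunk size and a serialize-to-disk step; the constant's value and the file naming are invented for illustration:

<?php
// Illustrative split of a big to-crawl list into MAX_FETCH_SIZE chunks,
// each written to its own small schedule file.
const MAX_FETCH_SIZE = 5000; // assumed value

function dumpBigScheduleToSmallSketch(array &$sites, string $dir): void
{
    foreach (array_chunk($sites, MAX_FETCH_SIZE) as $i => $chunk) {
        // hypothetical file name; real schedules use Yioop's own scheme
        file_put_contents("$dir/small_schedule_$i.txt", serialize($chunk));
    }
    $sites = []; // big schedule handled, free it
}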

writeCrawlStatus()

writeCrawlStatus(array  $sites) 

Writes status information about the current crawl so that the webserver app can use it for its display.

Parameters

array $sites

contains the most recently crawled sites

calculateScheduleMetaInfo()

calculateScheduleMetaInfo(integer  $schedule_time) : string

Used to create and encode a string of meta info for a fetcher schedule.

Parameters

integer $schedule_time

timestamp of the schedule

Returns

string —

base64 encoded meta info

produceFetchBatch()

produceFetchBatch() 

Produces a schedule.txt file of url data for a fetcher to crawl next.

The hard part of scheduling is to make sure that the overall crawl process obeys robots.txt files. This involves checking that a url is in an allowed path for its host, and it also involves making sure the Crawl-delay directive is respected. The first fetcher that contacts the server requesting data to crawl will get the schedule.txt produced by produceFetchBatch(), at which point it will be unlinked (these latter things are handled in FetchController).

getEarliestSlot()

getEarliestSlot(integer  $index, array  &$arr) : integer

Gets the first unfilled schedule slot after $index in $arr

A schedule of sites for a fetcher to crawl consists of MAX_FETCH_SIZE many slots, each of which could eventually hold url information. This function is used to schedule slots for crawl-delayed hosts. A minimal sketch appears after the Returns section below.

Parameters

integer $index

location to begin searching for an empty slot

array& $arr

list of slots to look in

Returns

integer —

index of first available slot
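
A minimal sketch of the slot search; using -1 to mark an empty slot is an assumption made for illustration:

<?php
// Hedged sketch: scan forward from $index for the first unfilled slot.
function getEarliestSlotSketch(int $index, array &$arr): int
{
    $count = count($arr);
    for ($i = $index + 1; $i < $count; $i++) {
        if ($arr[$i] === -1) { // -1 stands in for "slot still empty"
            return $i;
        }
    }
    return $count; // no free slot found after $index
}
$slots = [-1, "url-a", -1, -1];
echo getEarliestSlotSketch(1, $slots) . "\n"; // 2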

cullNoncrawlableSites()

cullNoncrawlableSites() 

Used to remove from the queue urls that are no longer crawlable because the allowed and disallowed sites have changed.

allowedToCrawlSite()

allowedToCrawlSite(string  $url) : boolean

Checks if a url belongs to the list of sites that are allowed to be crawled and that its file type is crawlable. A rough sketch of such a check appears after the Returns section below.

Parameters

string $url

url to check

Returns

boolean —

whether it is allowed to be crawled or not
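
A rough sketch of this kind of check, treating allowed sites as simple url prefixes and indexed file types as an extension whitelist; Yioop's actual matching is richer than a plain prefix test:

<?php
// Hedged sketch: prefix-match the url against allowed sites and check
// its extension against the indexed file types.
function allowedToCrawlSiteSketch(string $url, array $allowed_sites,
    array $indexed_file_types): bool
{
    $extension = strtolower(pathinfo(parse_url($url, PHP_URL_PATH) ?? "",
        PATHINFO_EXTENSION));
    if ($extension !== "" && !in_array($extension, $indexed_file_types)) {
        return false;
    }
    foreach ($allowed_sites as $site) {
        if (strpos($url, $site) === 0) {
            return true;
        }
    }
    return false;
}
var_dump(allowedToCrawlSiteSketch("https://example.com/a.html",
    ["https://example.com/"], ["html", "php"])); // bool(true)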

disallowedToCrawlSite()

disallowedToCrawlSite(string  $url) : boolean

Checks if url belongs to a list of sites that aren't supposed to be crawled

Parameters

string $url

url to check

Returns

boolean —

whether it shouldn't be crawled

withinQuota()

withinQuota(string  $url, integer  $bump_count = 1) : boolean

Checks if the $url is from a site which has an hourly quota to download.

If so, it bumps the quota count and returns true; returns false otherwise. This method also resets the quota counts every hour. A minimal sketch of this bookkeeping appears after the Returns section below.

Parameters

string $url

to check if within quota

integer $bump_count

how much to bump quota count if url is from a site with a quota

Returns

boolean —

whether $url exceeds the hourly quota of the site it is from
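
A minimal sketch of the quota bookkeeping, assuming the $quota_sites entry format described under updateDisallowedQuotaSites() and an hourly reset window; the return convention and prefix matching here are illustrative:

<?php
// Hedged sketch of an hourly per-site quota check and bump.
$quota_sites = ["https://example.com/" => [100, 99]]; // [quota, used this hour]
$quota_clear_time = time();

function withinQuotaSketch(string $url, array &$quota_sites,
    int &$quota_clear_time, int $bump_count = 1): bool
{
    if (time() - $quota_clear_time >= 3600) { // assumed one-hour window
        foreach ($quota_sites as $site => $counts) {
            $quota_sites[$site][1] = 0;
        }
        $quota_clear_time = time();
    }
    foreach ($quota_sites as $site => [$quota, $used]) {
        if (strpos($url, $site) === 0) {
            if ($used + $bump_count > $quota) {
                return false; // would exceed this hour's quota
            }
            $quota_sites[$site][1] += $bump_count;
            return true;
        }
    }
    return true; // url is not from a quota-limited site
}
var_dump(withinQuotaSketch("https://example.com/page", $quota_sites,
    $quota_clear_time)); // bool(true), the 100th url of this hour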