$db : object
Reference to a database object. Used since it has directory manipulation functions.
Command line program responsible for managing Yioop crawls.
It maintains a queue of urls that are going to be scheduled to be seen. It also keeps track of what has been seen and robots.txt info. Its last responsibility is to create and populate the IndexArchiveBundle that is used by the search front end.
$waiting_hosts : array
This is a list of hosts whose robots.txt files have a Crawl-delay directive and for which we have produced a schedule of urls, but from which we have not yet heard back from the fetcher processing those urls. Hosts on this list will not be scheduled for more downloads until the fetcher with the earlier urls has reported back to the queue server.
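Yioop itself is written in PHP; the bookkeeping described above can be sketched in Python roughly as follows (only the name `waiting_hosts` comes from the source, the helper functions are hypothetical):

```python
# Hosts with a Crawl-delay whose scheduled urls are still out with a fetcher.
waiting_hosts = {}  # host -> timestamp of the schedule the fetcher is working on

def mark_host_waiting(host, schedule_time):
    """Called when a schedule containing urls for a crawl-delayed host
    is handed to a fetcher."""
    waiting_hosts[host] = schedule_time

def host_schedulable(host):
    """A crawl-delayed host gets no new urls until its fetcher reports back."""
    return host not in waiting_hosts

def fetcher_reported_back(schedule_time):
    """Release every host that was waiting on the given schedule."""
    done = [h for h, t in waiting_hosts.items() if t == schedule_time]
    for h in done:
        del waiting_hosts[h]
```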
checkProcessRunning(string $process, array $info)
Checks to make sure the given process (either Indexer or Scheduler) is running, and if not, restarts it.
string | $process | should be either self::INDEXER or self::SCHEDULER |
array | $info | information about queue server state used to determine if a crawl is active. |
processCrawlData(boolean $blocking = false)
Main body of queue server loop where indexing, scheduling, robot file processing is done.
boolean | $blocking | this method might be called by the indexer subcomponent while a merge tier phase is ongoing, to allow other processing to occur. In that case we don't want an infinite regress where the indexer calls this code, which calls the indexer, and so on; if the blocking flag is set, the indexer subcomponent won't be called. |
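The purpose of the flag amounts to a reentrancy guard, sketched below in Python (illustrative only, not Yioop's PHP code; all function names here are hypothetical):

```python
calls = []  # records the order in which subcomponents run

def process_robot_urls():
    calls.append("robots")

def run_index_subcomponent():
    calls.append("index")
    # During a long merge the indexer re-enters the loop, but with
    # blocking=True so the loop will not call back into the indexer.
    process_crawl_data(blocking=True)

def process_crawl_data(blocking=False):
    process_robot_urls()
    if not blocking:  # only the top-level loop may invoke the indexer
        run_index_subcomponent()
```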
getDataArchiveFileData(string $file) : array
Used to get a data archive file (either during a normal crawl or a recrawl). After uncompressing this file (which comes via the web server through fetch_controller, from the fetcher), it sets which fetcher sent it and also returns the sites contained in it.
string | $file | name of archive data file |
sites contained in the file from the fetcher
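As a rough illustration of the round trip (Yioop's actual archive format and compression scheme differ; this Python sketch assumes a gzipped JSON payload with hypothetical `fetcher_id` and `sites` keys):

```python
import gzip
import json

def get_data_archive_file_data(path):
    """Inflate a fetcher's data archive file and return which fetcher
    sent it along with the page summaries it contains."""
    with gzip.open(path, "rt") as f:
        payload = json.load(f)
    return payload["fetcher_id"], payload["sites"]
```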
handleAdminMessages(array $info) : array
Handles messages passed via files to the QueueServer.
These files are typically written by CrawlDaemon::init() when the QueueServer is run with command-line arguments.
array | $info | associative array with info about current state of queue server |
an updated version of $info reflecting changes that occurred while handling the admin message files.
dumpQueueToSchedules(boolean $for_reschedule = false)
When a crawl is being shut down, this function is called to write the contents of the web queue bundle back to schedules. This allows crawls to be resumed without losing urls. This function can also be called if the queue gets clogged, to reschedule its contents for a later time.
boolean | $for_reschedule | if the call was to reschedule the urls to be crawled at a later time as opposed to being used to save the urls because the crawl is being halted. |
initializeIndexBundle(array $info = array(), array $try_to_set_from_old_index = null)
Function used to set up an indexer's IndexArchiveBundle or DoubleIndexBundle according to the current crawl parameters or the values stored in an existing bundle.
array | $info | if initializing a new crawl this should contain the crawl parameters |
array | $try_to_set_from_old_index | parameters of the crawl to try to set from values already stored in archive info, other parameters are assumed to have been updated since. |
updateDisallowedQuotaSites()
This is called whenever the crawl options are modified, to parse from the disallowed sites those sites of the format site#quota, where quota is the number of urls allowed to be downloaded in an hour from the site. These sites are then deleted from disallowed_sites and added to $this->quota_sites. An entry in $this->quota_sites has the format: $quota_site => [$quota, $num_urls_downloaded_this_hr]
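The site#quota parsing can be sketched as follows (an illustrative Python version; Yioop's implementation is PHP):

```python
def update_disallowed_quota_sites(disallowed_sites):
    """Split site#quota entries out of the disallowed-sites list.
    Entries with a quota move to quota_sites with entry format
    site -> [hourly_quota, num_urls_downloaded_this_hr]."""
    remaining, quota_sites = [], {}
    for entry in disallowed_sites:
        if "#" in entry:
            site, quota = entry.rsplit("#", 1)
            if quota.isdigit():
                quota_sites[site] = [int(quota), 0]
                continue
        remaining.append(entry)  # ordinary disallowed site, no quota
    return remaining, quota_sites
```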
processDataFile(string $base_dir, string $callback_method, boolean $blocking = false)
Generic function used to process Data, Index, and Robot info schedules. Finds the first file in the directory of schedules of the given type, and calls the appropriate callback method for that type.
string | $base_dir | directory of schedules |
string | $callback_method | what method should be called to handle a schedule |
boolean | $blocking | this method might be called by the indexer subcomponent while a merge tier phase is ongoing, to allow other processing to occur. In that case we don't want an infinite regress where the indexer calls this code, which calls the indexer, and so on; if the blocking flag is set, the indexer subcomponent won't be called. |
processIndexData(boolean $blocking)
Sets up the directory to look for a file of unprocessed index archive data from fetchers, then calls processDataFile to process the oldest file found.
boolean | $blocking | this method might be called by the indexer subcomponent while a merge tier phase is ongoing, to allow other processing to occur. In that case we don't want an infinite regress where the indexer calls this code, which calls the indexer, and so on; if the blocking flag is set, the indexer subcomponent won't be called. |
processIndexArchive(string $file, boolean $blocking)
Adds the summary and index data in $file to the summary bundle and word index.
string | $file | containing web pages summaries and a mini-inverted index for their content |
boolean | $blocking | this method might be called by the indexer subcomponent while a merge tier phase is ongoing, to allow other processing to occur. In that case we don't want an infinite regress where the indexer calls this code, which calls the indexer, and so on; if the blocking flag is set, the indexer subcomponent won't be called. |
processEtagExpiresArchive(string $file)
Processes a cache page validation data file. Extracts key-value pairs from the file and inserts into the B-Tree used for storing cache page validation data.
string | $file | is the cache page validation data file written by Fetchers. |
deleteRobotData()
Deletes all robot information stored by the QueueServer.
This function is called roughly every CACHE_ROBOT_TXT_TIME. It forces the crawler to redownload robots.txt files before hosts can continue to be crawled. This ensures the cached robots.txt data is never too old, so if someone changes a robots.txt file to allow or disallow the crawler, the change will be noticed reasonably promptly.
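The invalidation policy amounts to a simple age check, sketched below in Python; the value given for CACHE_ROBOT_TXT_TIME is illustrative, not Yioop's actual setting:

```python
import time

CACHE_ROBOT_TXT_TIME = 24 * 3600  # illustrative value: one day, in seconds

def maybe_delete_robot_data(last_delete_time, now=None):
    """Return True when the stored robot data is stale and should be
    dropped wholesale, forcing fresh robots.txt downloads."""
    now = time.time() if now is None else now
    return now - last_delete_time > CACHE_ROBOT_TXT_TIME
```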
dumpBigScheduleToSmall(array& $sites)
Used to split a large schedule of to-crawl sites into smaller ones (which are written to disk) that can be handled by processDataArchive
It is possible that a large schedule file is created if someone pastes more than MAX_FETCH_SIZE many urls into the initial seed sites of a crawl in the UI.
array& | $sites | array containing to-crawl data |
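The splitting itself is plain chunking, sketched here in Python (MAX_FETCH_SIZE's value is illustrative, and Yioop also writes each chunk to a schedule file on disk, which this sketch omits):

```python
MAX_FETCH_SIZE = 5000  # illustrative; Yioop defines its own constant

def dump_big_schedule_to_small(sites):
    """Split an oversized to-crawl list into schedules of at most
    MAX_FETCH_SIZE urls each."""
    return [sites[i:i + MAX_FETCH_SIZE]
            for i in range(0, len(sites), MAX_FETCH_SIZE)]
```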
produceFetchBatch()
Produces a schedule.txt file of url data for a fetcher to crawl next.
The hard part of scheduling is to make sure that the overall crawl process obeys robots.txt files. This involves checking that each url is in an allowed path for its host, and it also involves making sure the Crawl-delay directive is respected. The first fetcher that contacts the server requesting data to crawl will get the schedule.txt produced by produceFetchBatch(), at which point it will be unlinked (these latter things are controlled in FetchController).
getEarliestSlot(integer $index, array& $arr) : integer
Gets the first unfilled schedule slot after $index in $arr
A schedule of sites for a fetcher to crawl consists of MAX_FETCH_SIZE many slots, each of which could eventually hold url information. This function is used to schedule slots for crawl-delayed hosts.
integer | $index | location to begin searching for an empty slot |
array& | $arr | list of slots to look in |
index of first available slot
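A minimal Python sketch of the slot search (Yioop's version is PHP; here an empty slot is represented by `None`, which is an assumption of the sketch):

```python
def get_earliest_slot(index, arr):
    """Return the index of the first unfilled slot after `index` in `arr`,
    or len(arr) as a sentinel when no slot is available."""
    for i in range(index + 1, len(arr)):
        if arr[i] is None:
            return i
    return len(arr)
```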
withinQuota(string $url, integer $bump_count = 1) : boolean
Checks if the $url is from a site which has an hourly quota to download.
If so, it bumps the quota count and returns true; otherwise it returns false. This method also resets the quota queue every hour.
string | $url | to check if within quota |
integer | $bump_count | how much to bump quota count if url is from a site with a quota |
whether $url is within the hourly quota of the site it is from
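The check-and-bump logic can be sketched in Python as follows (the hourly reset is omitted; `quota_sites` uses the entry format described for $this->quota_sites, and prefix matching of urls to sites is an assumption of the sketch):

```python
def within_quota(url, quota_sites, bump_count=1):
    """If url's site has an hourly quota, bump its count by bump_count and
    return True while the quota is respected, False once it would be
    exceeded. Urls from sites with no quota are always allowed."""
    for site, counts in quota_sites.items():
        if url.startswith(site):
            quota, used = counts
            if used + bump_count > quota:
                return False
            counts[1] = used + bump_count
            return True
    return True  # no quota applies to this url
```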