\seekquarry\yioop\executables\Fetcher

This class is responsible for fetching web pages for the SeekQuarry/Yioop search engine

Fetcher periodically queries the queue server asking for web pages to fetch. It gets at most MAX_FETCH_SIZE many web pages from the queue_server in one go. It then fetches these pages. Pages are fetched in batches of NUM_MULTI_CURL_PAGES many pages. After each SEEN_URLS_BEFORE_UPDATE_SCHEDULER many downloaded pages (not including robot pages), the fetcher sends summaries back to the machine on which the queue_server lives. It does this by making a request of the web server on that machine and POSTs the data to the Yioop web app. This data is handled by the FetchController class. The summary data can include up to four things: (1) robots.txt data, (2) summaries of each web page downloaded in the batch, (3) a list of future urls to add to the to-crawl queue, and (4) a partial inverted index saying, for each word that occurred in the current batch of SEEN_URLS_BEFORE_UPDATE_SCHEDULER documents, which documents it occurred in. The inverted index also associates several scores to each word-document pair. More information on these scores can be found in the documentation for \seekquarry\yioop\executables\buildMiniInvertedIndex()
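
To make these four components concrete, the data POSTed to FetchController in one update has roughly the following shape. The key names below are illustrative placeholders, not Yioop's actual constants:

    <?php
    // Hypothetical shape of one update sent to the queue server's web app.
    // Key names are illustrative only.
    $update = [
        'robot_data'     => ['example.com' => ['disallow' => ['/private/']]],
        'seen_summaries' => [
            ['url' => 'https://example.com/', 'title' => 'Example',
             'description' => 'Summary text extracted from the page ...'],
        ],
        'to_crawl'       => ['https://example.com/about', 'https://example.org/'],
        'mini_inverted_index' => [
            'example' => ['doc_id_1' => ['score' => 2]],
        ],
    ];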

Summary

Methods
Properties
Constants
__construct()
pageProcessor()
start()
loop()
downloadPagesWebCrawl()
downloadPagesArchiveCrawl()
deleteOldCrawls()
checkCrawlTime()
checkScheduler()
checkArchiveScheduler()
exceedMemoryThreshold()
selectCurrentServerAndUpdateIfNeeded()
setCrawlParamsFromArray()
getFetchSites()
reschedulePages()
processFetchPages()
getPageThumbs()
cullNoncrawlableSites()
allowedToCrawlSite()
disallowedToCrawlSite()
pruneLinks()
copySiteFields()
processSubdocs()
updateFoundSites()
addToCrawlSites()
updateScheduler()
compressAndUnsetSeenUrls()
uploadCrawlData()
buildMiniInvertedIndex()
$db
$name_server
$queue_servers
$current_server
$page_processors
$plugin_processors
$plugin_hash
$restrict_sites_by_url
$indexed_file_types
$all_file_types
$allowed_sites
$disallowed_sites
$allow_disallow_cache_time
$page_rule_parser
$web_archive
$crawl_time
$check_crawl_time
$num_multi_curl
$slow_start_mode
$to_crawl
$to_crawl_again
$found_sites
$schedule_time
$sum_seen_site_description_length
$sum_seen_title_length
$sum_seen_site_link_length
$num_seen_sites
$channel
$crawl_order
$max_depth
$summarizer_option
$crawl_type
$arc_type
$arc_dir
$archive_iterator
$recrawl_check_scheduler
$crawl_index
$cache_pages
$fetcher_num
$page_range_request
$max_description_len
$hosts_with_errors
$no_process_links
$post_max_size
$minimum_fetch_loop_time
$active_classifiers
$scrapers
$active_rankers
$total_git_urls
$all_git_urls
$programming_language_extension
$tor_proxy
$proxy_servers
$debug
DEFAULT_POST_MAX_SIZE
REPOSITORY_GIT
GIT_URL_CONTINUE
INDICATOR_NONE
HEX_NULL_CHARACTER

Constants

DEFAULT_POST_MAX_SIZE

DEFAULT_POST_MAX_SIZE

Before receiving any data from a queue server's web app, this is the default assumed post_max_size in bytes

REPOSITORY_GIT

REPOSITORY_GIT

constant indicating Git repository

GIT_URL_CONTINUE

GIT_URL_CONTINUE

constant indicating Git repository

INDICATOR_NONE

INDICATOR_NONE

An indicator to tell that no actions are to be taken

HEX_NULL_CHARACTER

HEX_NULL_CHARACTER

An indicator to represent the next position after the access code in a Git tree object

Properties

$db

$db : object

Reference to a database object. Used since it has directory manipulation functions

Type

object

$name_server

$name_server : array

Urls or IP addresses of the web server used to administer this instance of Yioop. Used to figure out the available queue_servers to contact for crawling data

Type

array

$queue_servers

$queue_servers : array

Array of Urls or IP addresses of the queue_servers to get sites to crawl from

Type

array

$current_server

$current_server : integer

Index into $queue_servers of the server to get the schedule from (or the last one we got a schedule from)

Type

integer

$page_processors

$page_processors : array

An associative array of (mimetype => name of processor class to handle) pairs.

Type

array

$plugin_processors

$plugin_processors : array

An associative array of (page processor => array of indexing plugin name associated with the page processor). It is used to determine after a page is processed which plugins' pageProcessing($page, $url) method should be called

Type

array

$plugin_hash

$plugin_hash : string

Hash used to keep track of whether $plugin_processors info needs to be changed

Type

string

$restrict_sites_by_url

$restrict_sites_by_url : boolean

Says whether the $allowed_sites array is being used or not

Type

boolean

$indexed_file_types

$indexed_file_types : array

List of file extensions supported for the crawl

Type

array

$all_file_types

$all_file_types : array

List of all known file extensions including those not used for crawl

Type

array

$allowed_sites

$allowed_sites : array

Web-sites that crawler can crawl. If used, ONLY these will be crawled

Type

array

$disallowed_sites

$disallowed_sites : array

Web-sites that the crawler must not crawl

Type

array

$allow_disallow_cache_time

$allow_disallow_cache_time : integer

Microtime used to look up cached $allowed_sites and $disallowed_sites filtering data structures

Type

integer

$page_rule_parser

$page_rule_parser : array

Holds the parsed page rules which will be applied to document summaries before finally storing and indexing them

Type

array

$web_archive

$web_archive : object

WebArchiveBundle used to store complete web pages and auxiliary data

Type

object

$crawl_time

$crawl_time : integer

Timestamp of the current crawl

Type

integer

$check_crawl_time

$check_crawl_time : integer

The last time the name server was checked for a crawl time

Type

integer

$num_multi_curl

$num_multi_curl : integer

For a web crawl only: the number of web pages to download in one go. The constant SLOW_START can be used to reduce this number from C\NUM_MULTI_CURL_PAGES for the first hour of a crawl

Type

integer

$slow_start_mode

$slow_start_mode : boolean

Used to check if we are currently operating in slow start mode

Type

boolean

$to_crawl

$to_crawl : array

Contains the list of web pages to crawl from a queue_server

Type

array

$to_crawl_again

$to_crawl_again : array

Contains the list of web pages to crawl that failed on first attempt (we give them one more try before bailing on them)

Type

array

$found_sites

$found_sites : array

Summary information for visited sites that the fetcher hasn't sent to a queue_server yet

Type

array

$schedule_time

$schedule_time : integer

Timestamp from a queue_server of the current schedule of sites to download. This is sent back to the server once this schedule is completed to help the queue server implement crawl-delay if needed.

Type

integer

$sum_seen_site_description_length

$sum_seen_site_description_length : integer

The sum of the number of words of all the page descriptions for the current crawl. This is used in computing document statistics.

Type

integer

$sum_seen_title_length

$sum_seen_title_length : integer

The sum of the number of words of all the page titles for the current crawl. This is used in computing document statistics.

Type

integer

$sum_seen_site_link_length

$sum_seen_site_link_length : integer

The sum of the number of words in all the page links for the current crawl. This is used in computing document statistics.

Type

integer

$num_seen_sites

$num_seen_sites : integer

Number of sites crawled in the current crawl

Type

integer

$channel

$channel : integer

Channel that the queue server listens to for messages

Type

integer

$crawl_order

$crawl_order : string

Stores the name of the ordering used to crawl pages. This is used in a switch/case when computing weights of urls to be crawled before sending these new urls back to a queue_server.

Type

string

$max_depth

$max_depth : integer

Maximum depth to which the fetcher should extract new urls found relative to the seed urls

Type

integer

$summarizer_option

$summarizer_option : string

Stores the name of the summarizer used for crawling.

Possible values are self::BASIC, self::GRAPH_BASED_SUMMARIZER, self::CENTROID_SUMMARIZER and self::CENTROID_WEIGHTED_SUMMARIZER

Type

string

$crawl_type

$crawl_type : string

Indicates the kind of crawl being performed: self::WEB_CRAWL indicates a new crawl of the web; self::ARCHIVE_CRAWL indicates a crawl of an existing web archive

Type

string

$arc_type

$arc_type : string

For an archive crawl, holds the name of the type of archive being iterated over (this is the class name of the iterator, without the word 'Iterator')

Type

string

$arc_dir

$arc_dir : string

For a non-web archive crawl, holds the path to the directory that contains the archive files and their description (web archives have a different structure and are already distributed across machines and fetchers)

Type

string

$archive_iterator

$archive_iterator : object

If a web archive crawl (i.e., a re-crawl) is active, then this field holds the iterator object used to iterate over the archive

Type

object

$recrawl_check_scheduler

$recrawl_check_scheduler : boolean

Keeps track of whether during the recrawl we should notify a queue_server scheduler about our progress in mini-indexing documents in the archive

Type

boolean

$crawl_index

$crawl_index : string

If the crawl_type is self::ARCHIVE_CRAWL, then crawl_index is the timestamp of the existing archive to crawl

Type

string

$cache_pages

$cache_pages : boolean

Whether to cache pages or just the summaries

Type

boolean

$fetcher_num

$fetcher_num : string

Which fetcher instance we are (if the fetcher is run as a job and there is more than one)

Type

string

$page_range_request

$page_range_request : integer

Maximum number of bytes to download of a webpage

Type

integer

$max_description_len

$max_description_len : integer

Max number of chars to extract for description from a page to index.

Only words in the description are indexed.

Type

integer

$hosts_with_errors

$hosts_with_errors : array

An array to keep track of hosts which have had a lot of http errors

Type

array

$no_process_links

$no_process_links : boolean

When processing recrawl data, this says to assume the data has already had its links extracted into a field, so this doesn't have to be done in a separate step

Type

boolean

$post_max_size

$post_max_size : integer

Maximum number of bytes which can be uploaded to the current queue server's web app in one go

Type

integer

$minimum_fetch_loop_time

$minimum_fetch_loop_time : integer

Fetcher must wait at least this long between multi-curl requests.

The value below is dynamically determined but is at least as large as MINIMUM_FETCH_LOOP_TIME

Type

integer

$active_classifiers

$active_classifiers : array

Contains which classifiers to use for the current crawl. Classifiers can be used to label web documents with a meta word if the classifier's threshold is met

Type

array

$scrapers

$scrapers : array

Contains an array of scrapers used to extract the important content from particular kinds of HTML pages, for example, pages generated by a particular content management system.

Type

array

$active_rankers

$active_rankers : array

Contains which classifiers are being used to rank web documents for the current crawl. The score that a classifier gives to a document is used for ranking purposes

Type

array

$total_git_urls

$total_git_urls : integer

To keep track of the total number of Git internal urls

Type

integer

$all_git_urls

$all_git_urls : array

To store all the internal git urls fetched

Type

array

$programming_language_extension

$programming_language_extension : array

To map programming languages with their extensions

Type

array

$tor_proxy

$tor_proxy : string

If this is not null and a .onion url is detected, then this url will be used as a proxy server to download the .onion url

Type

string

$proxy_servers

$proxy_servers : array

An array of proxy servers to use rather than downloading web pages directly from the current machine. If it is the empty array, then we just download directly from the current machine

Type

array

$debug

$debug : string

Holds the value of a debug message that might have been sent from the command line during the current execution of loop()

Type

string

Methods

__construct()

__construct() 

Sets up the field variables so that crawling can begin

pageProcessor()

pageProcessor(string  $type) : object

Return the fetcher's copy of a page processor for the given mimetype.

Parameters

string $type

the mimetype we want a processor for

Returns

object —

a page processor for that mimetype, or false if that mimetype can't be handled
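
The lookup this method performs can be sketched as follows, assuming a $page_processors map of the form described for that property below. The helper name and example class string are illustrative, not Yioop's API:

    <?php
    // Illustrative only: return the processor class registered for a
    // mimetype, or false if that mimetype can't be handled.
    function lookupProcessorClass(array $page_processors, string $type)
    {
        return $page_processors[$type] ?? false;
    }

    $processors = ['text/html' => 'HtmlProcessor'];
    var_dump(lookupProcessorClass($processors, 'image/png')); // bool(false)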

start()

start() 

This is the function that should be called to get the fetcher to start fetching. Calls init to handle the command-line arguments then enters the fetcher's main loop

loop()

loop() 

Main loop for the fetcher.

Checks for a stop message, and checks with the queue server whether the crawl has changed and whether there are new pages to crawl. The loop gets a group of next pages to crawl if there are pages left to crawl (otherwise it sleeps 5 seconds). It downloads these pages, deduplicates them, and updates the found site info with the result before looping again.

downloadPagesWebCrawl()

downloadPagesWebCrawl() : array

Gets a list of urls from the current fetch batch provided by the queue server, then downloads these pages. Finally, reschedules, if possible, pages that did not successfully get downloaded.

Returns

array —

an associative array of web pages and meta data fetched from the internet

downloadPagesArchiveCrawl()

downloadPagesArchiveCrawl() : array

Extracts NUM_MULTI_CURL_PAGES from the current Archive Bundle that is being recrawled.

Returns

array —

an associative array of web pages and meta data from the archive bundle being iterated over

deleteOldCrawls()

deleteOldCrawls(array  $still_active_crawls) 

Deletes any crawl web archive bundles not in the provided array of crawls

Parameters

array $still_active_crawls

those crawls which should not be deleted, so all others will be deleted

checkCrawlTime()

checkCrawlTime() : boolean

Makes a request of the name server machine to get the timestamp of the currently running crawl to see if it changed

If the timestamp has changed, save the rest of the current fetch batch, then load any existing fetch from the new crawl; otherwise, set the crawl to empty. Also handles deleting old crawls on this fetcher machine based on a list of current crawls on the name server.

Returns

boolean —

true if loaded a fetch batch due to time change

checkScheduler()

checkScheduler() : mixed

Get status, current crawl, crawl order, and new site information from the queue_server.

Returns

mixed —

array or bool. If we are doing a web crawl and we still have pages to crawl, then true; if the scheduler page fails to download, then false; otherwise, returns an array of info from the scheduler.

checkArchiveScheduler()

checkArchiveScheduler() : array

During an archive crawl this method is used to get from the name server a collection of pages to process. The fetcher will later process these and send summaries to various queue_servers.

Returns

array —

containing archive page data

exceedMemoryThreshold()

exceedMemoryThreshold() : boolean

Function to check if memory for this fetcher instance is getting low relative to what the system will allow.

Returns

boolean —

whether available memory is getting low
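
A minimal sketch of this kind of check in plain PHP, assuming a hypothetical threshold fraction (Yioop's actual threshold and bookkeeping may differ):

    <?php
    // Illustrative only: report memory as low when current usage exceeds
    // a fraction of PHP's memory_limit setting.
    function memoryGettingLow(float $fraction = 0.7): bool
    {
        $limit = ini_get('memory_limit'); // e.g. "128M", or "-1" for unlimited
        if ($limit === false || $limit == -1) {
            return false;
        }
        // Convert shorthand like 128M or 1G to bytes.
        $units = ['K' => 1024, 'M' => 1024 ** 2, 'G' => 1024 ** 3];
        $suffix = strtoupper(substr($limit, -1));
        $bytes = isset($units[$suffix]) ? ((int)$limit) * $units[$suffix]
            : (int)$limit;
        return memory_get_usage() > $fraction * $bytes;
    }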

selectCurrentServerAndUpdateIfNeeded()

selectCurrentServerAndUpdateIfNeeded(boolean  $at_least_once) 

At least once, and while memory is low, picks a server at random and sends any fetcher data we have to it.

Parameters

boolean $at_least_once

whether to send the site info to at least one queue server, or to send only if memory is above threshold

setCrawlParamsFromArray()

setCrawlParamsFromArray(array&  $info)

Sets parameters for fetching based on provided info struct ($info typically would come from the queue server)

Parameters

array& $info

struct with info about the kind of crawl, timestamp of index, crawl order, etc.

getFetchSites()

getFetchSites() : array

Prepare an array of up to NUM_MULTI_CURL_PAGES' worth of sites to be downloaded in one go using the to_crawl array. Delete these sites from the to_crawl array.

Returns

array —

sites which are ready to be downloaded

reschedulePages()

reschedulePages(array&  $site_pages) : array

Sorts out pages for which no content was downloaded so that they can be scheduled to be crawled again.

Parameters

array& $site_pages

pages to sort

Returns

array —

an array consisting of two arrays: downloaded pages and not-downloaded pages
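
The partition this method produces can be pictured with the following sketch; the nonempty-content test is an assumption made for illustration, not Yioop's actual criterion for a failed download:

    <?php
    // Illustrative only: split pages into downloaded and not-downloaded.
    function partitionByDownloaded(array $site_pages): array
    {
        $downloaded = [];
        $not_downloaded = [];
        foreach ($site_pages as $site) {
            // Assume a page counts as downloaded if it has nonempty content.
            if (!empty($site['PAGE'])) {
                $downloaded[] = $site;
            } else {
                $not_downloaded[] = $site;
            }
        }
        return [$downloaded, $not_downloaded];
    }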

processFetchPages()

processFetchPages(array  $site_pages) : array

Processes an array of downloaded web pages with the appropriate page processor.

Summary data is extracted from each non-robots.txt file in the array. Disallowed paths and crawl-delays are extracted from robots.txt files.

Parameters

array $site_pages

a collection of web pages to process

Returns

array —

summary data extracted from these pages

getPageThumbs()

getPageThumbs(array&  $sites)

Adds thumbs for websites with a self::THUMB_URL field by downloading the linked-to images and making thumbs from them.

Parameters

array& $sites

associative array of web site information to add thumbs to. At least one site in the array should have a self::THUMB_URL field that we want the thumb of

cullNoncrawlableSites()

cullNoncrawlableSites() 

Used to remove from the to_crawl urls those that are no longer crawlable because the allowed and disallowed sites have changed.

allowedToCrawlSite()

allowedToCrawlSite(string  $url) : boolean

Checks if url belongs to a list of sites that are allowed to be crawled and that the file type is crawlable

Parameters

string $url

url to check

Returns

boolean —

whether the url is allowed to be crawled or not

disallowedToCrawlSite()

disallowedToCrawlSite(string  $url) : boolean

Checks if url belongs to a list of sites that aren't supposed to be crawled

Parameters

string $url

url to check

Returns

boolean —

whether it shouldn't be crawled

pruneLinks()

pruneLinks(array&  $doc_info, string  $field = \seekquarry\yioop\library\CrawlConstants::LINKS, integer  $member_cache_time)

Page processors are allowed to extract up to MAX_LINKS_TO_EXTRACT links. This method attempts to cull from the doc_info struct the best MAX_LINKS_PER_PAGE of them. Currently, this is done by first removing links whose filetype or site the crawler is forbidden from crawling.

Then a crude estimate of the information contained in each link's text, strlen(gzip(text)), is used to pick the best remaining links (see the sketch after the parameter list below).

Parameters

array& $doc_info

an array with a CrawlConstants::LINKS subarray. This subarray in turn contains url => text pairs.

string $field

field for links default is CrawlConstants::LINKS

integer $member_cache_time

says how long allowed and disallowed url info should be cached by urlMemberSiteArray
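
The strlen(gzip(text)) heuristic mentioned above can be sketched on its own (this is an illustration of the idea, not Yioop's actual code):

    <?php
    // Illustrative only: keep the links whose anchor text compresses to the
    // most bytes, a crude proxy for how much information each link carries.
    function pruneLinksByCompressedLength(array $links, int $max_links): array
    {
        $scores = [];
        foreach ($links as $url => $text) {
            $scores[$url] = strlen(gzcompress((string)$text));
        }
        arsort($scores); // highest score first
        $kept = array_slice(array_keys($scores), 0, $max_links);
        return array_intersect_key($links, array_flip($kept));
    }

    $links = [
        'https://example.com/a' => 'a',
        'https://example.com/b' => 'A much longer, more descriptive anchor text',
    ];
    $best = pruneLinksByCompressedLength($links, 1); // keeps example.com/b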

copySiteFields()

copySiteFields(integer  $i, array  $site, array&  $summarized_site_pages, array&  $stored_site_pages)

Copies fields from the array of site data to the $i indexed element of the $summarized_site_pages and $stored_site_pages arrays

Parameters

integer $i

index to copy to

array $site

web page info to copy

array& $summarized_site_pages

array of summaries of web pages

array& $stored_site_pages

array of cache info of web pages

processSubdocs()

processSubdocs(int&  $i, array  $site, array&  $summarized_site_pages, array&  $stored_site_pages)

The pageProcessing method of an IndexingPlugin generates a self::SUBDOCS array of additional "micro-documents" that might have been in the page. This method adds these documents to the $summarized_site_pages and $stored_site_pages arrays constructed during the execution of processFetchPages()

Parameters

int& $i

index to begin adding subdocs at

array $site

web page that subdocs were from and from which some subdoc summary info is copied

array& $summarized_site_pages

array of summaries of web pages

array& $stored_site_pages

array of cache info of web pages

updateFoundSites()

updateFoundSites(array  $sites, boolean  $force_send = false) 

Updates the $this->found_sites array with data from the most recently downloaded sites. This means updating the following subarrays: self::ROBOT_PATHS and self::TO_CRAWL. It checks if there are still more urls to crawl or if self::SEEN_URLS has grown larger than SEEN_URLS_BEFORE_UPDATE_SCHEDULER. If so, a mini inverted index is built and the queue server is called with the data.

Parameters

array $sites

site data to use for the update

boolean $force_send

whether to force send data back to queue_server or rely on usual thresholds before sending

addToCrawlSites()

addToCrawlSites(array  $link_urls, integer  $old_weight_pair, string  $site_hash, string  $old_url, integer  $num_common, boolean  $from_sitemap = false) 

Used to add a set of links from a web page to the array of sites which need to be crawled.

Parameters

array $link_urls

an array of urls to be crawled

integer $old_weight_pair

the weight and depth of the web page the links came from (the high 3 bytes hold the former, the low byte the latter; see the sketch after this parameter list)

string $site_hash

a hash of the web_page on which the link was found, for use in deduplication

string $old_url

url of page where links came from

integer $num_common

number of company level domains in common between $link_urls and $old_url

boolean $from_sitemap

whether the links are coming from a sitemap
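
The weight/depth packing used for $old_weight_pair can be illustrated with a small sketch; the helper names are hypothetical and the encoding Yioop uses may differ in detail:

    <?php
    // Illustrative only: pack a weight into the high 3 bytes and a depth
    // into the low byte of a single integer, and unpack it again.
    function packWeightDepth(int $weight, int $depth): int
    {
        return ($weight << 8) | ($depth & 0xFF);
    }
    function unpackWeightDepth(int $pair): array
    {
        return ['weight' => $pair >> 8, 'depth' => $pair & 0xFF];
    }

    $pair = packWeightDepth(300, 4);
    print_r(unpackWeightDepth($pair)); // weight 300, depth 4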

updateScheduler()

updateScheduler() 

Updates the queue_server about sites that have been crawled.

This method is called if there are currently no more sites to crawl or if SEEN_URLS_BEFORE_UPDATE_SCHEDULER many pages have been processed. It creates an inverted index of the non-robot pages crawled, then compresses and does a POST request to send the page summary data, robot data, to-crawl url data, and inverted index back to the server. In the event that the server doesn't acknowledge, it loops and tries again after a delay until the POST is successful. At this point, memory for this data is freed.
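
The retry-until-acknowledged behavior can be sketched as follows, using plain PHP streams purely as an illustration (not Yioop's actual transport code):

    <?php
    // Illustrative only: keep re-POSTing the data until the web app replies.
    function postUntilAcknowledged(string $url, array $post_data,
        int $delay_seconds = 5): void
    {
        do {
            $context = stream_context_create(['http' => [
                'method' => 'POST',
                'header' => 'Content-Type: application/x-www-form-urlencoded',
                'content' => http_build_query($post_data),
            ]]);
            $response = @file_get_contents($url, false, $context);
            $acknowledged = ($response !== false);
            if (!$acknowledged) {
                sleep($delay_seconds); // wait before trying again
            }
        } while (!$acknowledged);
    }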

compressAndUnsetSeenUrls()

compressAndUnsetSeenUrls() : string

Computes a string of compressed urls from the seen urls and extracted links destined for the current queue server. Then unsets these values from $this->found_sites

Returns

string —

of compressed urls
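
One common PHP way to produce a compressed, POST-safe string of urls like the one described, shown only as an illustration (Yioop's exact encoding may differ):

    <?php
    // Illustrative only: serialize, compress, then encode for a POST body.
    $urls = ['https://example.com/', 'https://example.org/page'];
    $compressed = base64_encode(gzcompress(serialize($urls)));

    // The receiving side reverses the steps.
    $restored = unserialize(gzuncompress(base64_decode($compressed)));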

uploadCrawlData()

uploadCrawlData(string  $queue_server, array  $byte_counts, array  $post_data) 

Sends to-crawl, robot, and index data to the current queue server.

If this data is more than post_max_size, it splits it into chunks which are then reassembled by the queue server web app before being put into the appropriate schedule sub-directory (see the sketch after the parameter list below).

Parameters

string $queue_server

url of the current queue server

array $byte_counts

has four fields: TOTAL, ROBOT, SCHEDULE, INDEX. These give the number of bytes overall for the 'data' field of $post_data and for each of these components.

array $post_data

data to be uploaded to the queue server web app
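
The chunking described above can be sketched as follows; the field names are illustrative and the actual protocol between fetcher and web app differs in detail:

    <?php
    // Illustrative only: split a payload into pieces no larger than
    // post_max_size and tag each piece so the receiver can reassemble them.
    function splitForUpload(string $data, int $post_max_size): array
    {
        $chunks = str_split($data, $post_max_size);
        $pieces = [];
        foreach ($chunks as $i => $chunk) {
            $pieces[] = [
                'part' => $i,
                'num_parts' => count($chunks),
                'data' => $chunk,
            ];
        }
        return $pieces;
    }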

buildMiniInvertedIndex()

buildMiniInvertedIndex() 

Builds an inverted index shard (word --> {docs it appears in}) for the current batch of SEEN_URLS_BEFORE_UPDATE_SCHEDULER many pages.

This inverted index shard is then merged by a queue_server into the inverted index of the current generation of the crawl. The complete inverted index for the whole crawl is built out of these inverted indexes for generations. The point of computing a partial inverted index on the fetcher is to reduce some of the computational burden on the queue server. The resulting mini index computed by buildMiniInvertedIndex() is stored in $this->found_sites[self::INVERTED_INDEX]
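
The general shape of such a shard, stripped of the per word-document scores that Yioop's real shards also store, can be sketched as:

    <?php
    // Illustrative only: build a word => list-of-doc-ids inverted index.
    function buildMiniIndex(array $docs): array
    {
        $index = [];
        foreach ($docs as $doc_id => $text) {
            $words = array_unique(str_word_count(strtolower($text), 1));
            foreach ($words as $word) {
                $index[$word][] = $doc_id;
            }
        }
        return $index;
    }

    $index = buildMiniIndex([
        'doc1' => 'web crawling with yioop',
        'doc2' => 'inverted index of web pages',
    ]);
    // $index['web'] == ['doc1', 'doc2']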