DEFAULT_POST_MAX_SIZE
Before receiving any data from a queue server's web app, this is the default assumed post_max_size in bytes
This class is responsible for fetching web pages for the SeekQuarry/Yioop search engine
Fetcher periodically queries the queue server asking for web pages to fetch. It gets at most MAX_FETCH_SIZE many web pages from the queue_server in one go, then fetches these pages. Pages are fetched in batches of NUM_MULTI_CURL_PAGES many pages. After each SEEN_URLS_BEFORE_UPDATE_SCHEDULER many downloaded pages (not including robot pages), the fetcher sends summaries back to the machine on which the queue_server lives. It does this by making a request of the web server on that machine and POSTing the data to the yioop web app. This data is handled by the FetchController class. The summary data can include up to four things: (1) robot.txt data, (2) summaries of each web page downloaded in the batch, (3) a list of future urls to add to the to-crawl queue, and (4) a partial inverted index saying, for each word that occurred in the current batch of SEEN_URLS_BEFORE_UPDATE_SCHEDULER documents, which documents it occurred in. The inverted index also associates with each word-document pair several scores. More information on these scores can be found in the documentation for \seekquarry\yioop\executables\buildMiniInvertedIndex().
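As a rough illustration of that four-part payload, here is a minimal PHP sketch; the variable names and array shapes are assumptions for illustration, not Yioop's actual wire format:

    <?php
    // Hedged sketch of the four kinds of data the fetcher POSTs back
    // to the yioop web app in one scheduler update.
    $robot_data = ['example.com' => ['/private/']];              // (1) robots.txt info
    $summaries  = [['URL' => 'http://example.com/', 'TITLE' => 'Example']]; // (2)
    $to_crawl   = ['http://example.com/about'];                  // (3) future urls
    $mini_index = ['example' => ['doc1' => ['frequency' => 2]]]; // (4) postings
    $payload = serialize([$robot_data, $summaries, $to_crawl, $mini_index]);
    echo strlen(gzcompress($payload)) . " bytes to POST\n";      // compressed upload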
loop()
Main loop for the fetcher.
Checks for a stop message, and checks with the queue server whether the crawl has changed and whether there are new pages to crawl. The loop gets a group of next pages to crawl if there are pages left to crawl (otherwise, it sleeps 5 seconds). It downloads these pages, deduplicates them, and updates the found site info with the result before looping again.
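A sketch of this control flow, with closures standing in for the real Fetcher methods of the same names:

    <?php
    // Hedged sketch of loop(); the stubs below are stand-ins, not Yioop code.
    $checkScheduler = fn() => ['pages' => ['http://example.com/']];
    $downloadPagesWebCrawl = fn() => [['URL' => 'http://example.com/']];
    $updateFoundSites = function (array $sites) { /* process, maybe upload */ };
    $stop = false;
    while (!$stop) {
        $info = $checkScheduler();         // crawl changed? new pages to crawl?
        if (empty($info['pages'])) {
            sleep(5);                      // nothing to crawl right now
            continue;
        }
        $sites = $downloadPagesWebCrawl(); // fetch a batch of pages
        $updateFoundSites($sites);         // dedup, record, update queue server
        $stop = true;                      // real loop runs until a stop message
    }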
downloadPagesWebCrawl() : array
Gets a list of urls from the current fetch batch provided by the queue server, then downloads these pages. Finally, reschedules, if possible, pages that did not successfully download.
an associative array of web pages and meta data fetched from the internet
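Since pages are fetched NUM_MULTI_CURL_PAGES at a time, a batch download presumably uses PHP's curl_multi interface. A minimal self-contained sketch (Yioop's real download code handles far more, such as timeouts, user agents, and robot checks):

    <?php
    // Sketch: download a batch of urls in parallel with curl_multi.
    $urls = ['https://example.com/', 'https://example.org/'];
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }
    do {
        $status = curl_multi_exec($mh, $active);  // drive all transfers
        if ($active) {
            curl_multi_select($mh);               // wait for socket activity
        }
    } while ($active && $status == CURLM_OK);
    $pages = [];
    foreach ($handles as $url => $ch) {
        $pages[$url] = curl_multi_getcontent($ch); // body, or null on failure
        curl_multi_remove_handle($mh, $ch);
    }
    curl_multi_close($mh);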
checkCrawlTime() : boolean
Makes a request of the name server machine to get the timestamp of the currently running crawl to see if it changed
If the timestamp has changed save the rest of the current fetch batch, then load any existing fetch from the new crawl; otherwise, set the crawl to empty. Also, handles deleting old crawls on this fetcher machine based on a list of current crawls on the name server.
true if a fetch batch was loaded due to a time change
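Illustratively, the timestamp comparison amounts to something like the following, with made-up values:

    <?php
    // Sketch only: compare the crawl time we are fetching for against
    // the one reported by the name server.
    $our_crawl_time = 1700000000;          // timestamp of the crawl we serve
    $name_server_crawl_time = 1700000600;  // value fetched from the name server
    if ($name_server_crawl_time != $our_crawl_time) {
        // save the rest of the current fetch batch, then load any existing
        // fetch batch for the new crawl (or set the crawl to empty)
        $our_crawl_time = $name_server_crawl_time;
    }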
checkScheduler() : mixed
Get status, current crawl, crawl order, and new site information from the queue_server.
array or bool. If we are doing a web crawl and still have pages to crawl, then true; if the scheduler page fails to download, then false; otherwise, an array of info from the scheduler.
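A sketch of consuming this mixed return convention; checkSchedulerStub() is a hypothetical stand-in for the real method:

    <?php
    // Sketch: the three cases a caller of checkScheduler() must handle.
    function checkSchedulerStub() { return ['STATUS' => 'OK', 'SITES' => []]; }
    $info = checkSchedulerStub();
    if ($info === true) {
        // web crawl with pages still left in the current fetch batch
    } elseif ($info === false) {
        // the scheduler page failed to download; try again later
    } else {
        // an array of info from the scheduler (status, crawl order, sites)
    }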
selectCurrentServerAndUpdateIfNeeded(boolean $at_least_once)
At least once, and while available memory is low, picks a queue server at random and sends any fetcher data we have to it.
boolean | $at_least_once | whether to send the site info to at least one queue server, or to send only if memory usage is above a threshold |
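The at-least-once-plus-memory-gate behavior maps naturally onto a do-while loop. A sketch, where the server list, upload body, and 100MB threshold are all assumptions:

    <?php
    // Sketch: upload once unconditionally, then keep uploading while
    // memory usage stays above an assumed high-water mark.
    $servers = ['http://qs1.example.com/', 'http://qs2.example.com/'];
    $threshold = 100 * 1024 * 1024;   // assumed threshold in bytes
    do {
        $server = $servers[random_int(0, count($servers) - 1)];
        // ... send any pending fetcher data to $server ...
    } while (memory_get_usage() > $threshold);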
setCrawlParamsFromArray(array& $info)
Sets parameters for fetching based on provided info struct ($info typically would come from the queue server)
array& | $info | struct with info about the kind of crawl, timestamp of index, crawl order, etc. |
reschedulePages(array& $site_pages) : array
Sorts out pages for which no content was downloaded so that they can be scheduled to be crawled again.
array& | $site_pages | pages to sort |
an array consisting of two arrays: downloaded pages and not-downloaded pages.
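A minimal sketch of this partitioning, assuming a page counts as downloaded when it has non-empty content (field names are illustrative):

    <?php
    // Sketch: split a batch into downloaded and not-downloaded pages.
    $site_pages = [
        ['URL' => 'http://example.com/a', 'PAGE' => '<html>...</html>'],
        ['URL' => 'http://example.com/b', 'PAGE' => ''],  // download failed
    ];
    $downloaded = [];
    $not_downloaded = [];
    foreach ($site_pages as $page) {
        if (!empty($page['PAGE'])) {
            $downloaded[] = $page;
        } else {
            $not_downloaded[] = $page;   // candidate for rescheduling
        }
    }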
processFetchPages(array $site_pages) : array
Processes an array of downloaded web pages with the appropriate page processor.
Summary data is extracted from each non-robots.txt file in the array. Disallowed paths and crawl-delays are extracted from robots.txt files.
array | $site_pages | a collection of web pages to process |
summary data extracted from these pages
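A sketch of dispatching pages to processors by type; this processor map is hypothetical, though choosing a page processor per mime type matches the description above:

    <?php
    // Sketch: pick a page processor by type; robots.txt files would be
    // handled separately as described above.
    $processors = [
        'text/html'  => fn(string $page) => ['TITLE' => 'title parsed from html'],
        'text/plain' => fn(string $page) => ['TITLE' => 'first line of text'],
    ];
    $site_pages = [
        ['URL' => 'http://example.com/', 'TYPE' => 'text/html',
         'PAGE' => '<html>...</html>'],
    ];
    $summaries = [];
    foreach ($site_pages as $page) {
        if (isset($processors[$page['TYPE']])) {
            $summary = $processors[$page['TYPE']]($page['PAGE']);
            $summary['URL'] = $page['URL'];  // keep provenance with the summary
            $summaries[] = $summary;
        }
    }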
getPageThumbs(array& $sites)
Adds thumbs for websites with a self::THUMB_URL field by downloading the linked-to images and making thumbs from them.
array& | $sites | associative array of web site information to add thumbs for. At least one site in the array should have a self::THUMB_URL field that we want the thumb of |
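A sketch of making a thumb with PHP's GD extension, assuming the image bytes at the self::THUMB_URL have already been downloaded; the size and jpeg encoding are assumptions:

    <?php
    // Sketch: build a 100x100 thumb from downloaded image bytes using GD.
    // The URL is a placeholder; real code would reuse already-fetched bytes.
    $image_bytes = file_get_contents('http://www.example.com/logo.png');
    $img = ($image_bytes !== false) ? imagecreatefromstring($image_bytes) : false;
    if ($img !== false && ($thumb = imagescale($img, 100, 100)) !== false) {
        ob_start();
        imagejpeg($thumb);              // encode the thumb as jpeg
        $thumb_bytes = ob_get_clean();  // store alongside the site summary
    }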
pruneLinks(array& $doc_info, string $field = \seekquarry\yioop\library\CrawlConstants::LINKS, integer $member_cache_time)
Page processors are allowed to extract up to MAX_LINKS_TO_EXTRACT links. This method attempts to cull from the doc_info struct the best MAX_LINKS_PER_PAGE of them. Currently, this is done by first removing links of filetypes or to sites the crawler is forbidden from crawling.
Then a crude estimate of the information contained in each link's text, strlen(gzip(text)), is used to pick the best remaining links; see the sketch after the parameter list below.
array& | $doc_info | an array with a CrawlConstants::LINKS subarray; this subarray in turn contains url => text pairs |
string | $field | field for links default is CrawlConstants::LINKS |
integer | $member_cache_time | says how long allowed and disallowed url info should be cached by urlMemberSiteArray |
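A sketch of the strlen(gzip(text)) heuristic: anchor text that compresses to more bytes is treated as carrying more information, so repetitive boilerplate text scores low relative to its raw length. The constants and link data here are made up:

    <?php
    // Sketch: rank links by the compressed length of their anchor text.
    $max_links_per_page = 2;   // stand-in for MAX_LINKS_PER_PAGE
    $links = [
        'http://example.com/a' => 'click here',
        'http://example.com/b' => 'Annual report on ocean temperatures',
        'http://example.com/c' => 'click here click here click here',
    ];
    $scores = [];
    foreach ($links as $url => $text) {
        $scores[$url] = strlen(gzcompress($text)); // crude information estimate
    }
    arsort($scores);                               // highest scores first
    $kept_urls = array_slice(array_keys($scores), 0, $max_links_per_page);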
copySiteFields(integer $i, array $site, array& $summarized_site_pages, array& $stored_site_pages)
Copies fields from the array of site data to the $i indexed element of the $summarized_site_pages and $stored_site_pages arrays
integer | $i | index to copy to |
array | $site | web page info to copy |
array& | $summarized_site_pages | array of summaries of web pages |
array& | $stored_site_pages | array of cache info of web pages |
processSubdocs(int& $i, array $site, array& $summarized_site_pages, array& $stored_site_pages)
The pageProcessing method of an IndexingPlugin generates a self::SUBDOCS array of additional "micro-documents" that might have been in the page. This method adds these documents to the $summarized_site_pages and $stored_site_pages arrays constructed during the execution of processFetchPages(); see the sketch after the parameter list below.
int& | $i | index to begin adding subdocs at |
array | $site | web page that subdocs were from and from which some subdoc summary info is copied |
array& | $summarized_site_pages | array of summaries of web pages |
array& | $stored_site_pages | array of cache info of web pages |
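A sketch of appending subdoc micro-documents, with illustrative keys; in the real method $i is passed by reference so the caller's index continues past the added subdocs:

    <?php
    // Sketch: flatten a plugin's SUBDOCS into the summary array, copying
    // some summary info (here just the URL) from the containing page.
    $site = ['URL' => 'http://example.com/', 'SUBDOCS' => [
        ['TITLE' => 'embedded recipe'],
        ['TITLE' => 'embedded review'],
    ]];
    $summarized_site_pages = [['URL' => $site['URL'], 'TITLE' => 'Example']];
    $i = count($summarized_site_pages);
    foreach ($site['SUBDOCS'] as $subdoc) {
        $summarized_site_pages[$i] = $subdoc + ['URL' => $site['URL']];
        $i++;   // advance shared index past each added micro-document
    }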
updateFoundSites(array $sites, boolean $force_send = false)
Updates the $this->found_sites array with data from the most recently downloaded sites. This means updating the following sub arrays: self::ROBOT_PATHS and self::TO_CRAWL. It also checks whether there are no more urls to crawl or whether self::SEEN_URLS has grown larger than SEEN_URLS_BEFORE_UPDATE_SCHEDULER. If so, a mini index is built and the queue server is called with the data (see the sketch after the parameter list below).
array | $sites | site data to use for the update |
boolean | $force_send | whether to force send data back to queue_server or rely on usual thresholds before sending |
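A sketch of the trigger conditions, where 400 is an assumed stand-in for SEEN_URLS_BEFORE_UPDATE_SCHEDULER, not Yioop's actual value:

    <?php
    // Sketch: decide whether to build a mini index and call the queue server.
    $seen_urls_before_update_scheduler = 400;   // assumed constant value
    $found_sites = ['SEEN_URLS' => array_fill(0, 401, 'page'), 'TO_CRAWL' => []];
    $force_send = false;
    if ($force_send || empty($found_sites['TO_CRAWL'])
        || count($found_sites['SEEN_URLS']) > $seen_urls_before_update_scheduler) {
        // buildMiniInvertedIndex(); then updateScheduler() sends the data
    }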
addToCrawlSites(array $link_urls, integer $old_weight_pair, string $site_hash, string $old_url, integer $num_common, boolean $from_sitemap = false)
Used to add a set of links from a web page to the array of sites which need to be crawled.
array | $link_urls | an array of urls to be crawled |
integer | $old_weight_pair | the weight and depth of the web page the links came from (high 3 bytes for the former, low byte for the latter; see the sketch after this parameter list) |
string | $site_hash | a hash of the web_page on which the link was found, for use in deduplication |
string | $old_url | url of page where links came from |
integer | $num_common | number of company level domains in common between $link_urls and $old_url |
boolean | $from_sitemap | whether the links are coming from a sitemap |
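The packed $old_weight_pair encoding can be illustrated with shifts and masks:

    <?php
    // Sketch of the weight/depth packing described for $old_weight_pair:
    // weight in the high 3 bytes, crawl depth in the low byte.
    $weight = 1500;
    $depth  = 4;
    $old_weight_pair = ($weight << 8) | ($depth & 0xFF);  // pack
    $unpacked_weight = $old_weight_pair >> 8;             // 1500 again
    $unpacked_depth  = $old_weight_pair & 0xFF;           // 4 again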
updateScheduler()
Updates the queue_server about sites that have been crawled.
This method is called if there are currently no more sites to crawl or if SEEN_URLS_BEFORE_UPDATE_SCHEDULER many pages have been processed. It creates an inverted index of the non-robot pages crawled, then compresses and does a post request to send the page summary data, robot data, to-crawl url data, and inverted index back to the server. In the event that the server doesn't acknowledge, it loops and tries again after a delay until the post is successful; at that point, memory for this data is freed. A sketch of the retry loop follows.
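    <?php
    // Sketch of retrying until the queue server acknowledges; the post
    // closure and its boolean return convention are assumptions.
    $postToQueueServer = function (string $data): bool {
        // ... POST $data to the queue server web app ...
        return (bool) random_int(0, 1);   // pretend acknowledgment
    };
    $data = gzcompress(serialize(['summaries' => [], 'to_crawl' => []]));
    while (!$postToQueueServer($data)) {
        sleep(5);                         // delay, then try the post again
    }
    unset($data);                         // memory freed only after success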
uploadCrawlData(string $queue_server, array $byte_counts, array $post_data)
Sends to-crawl, robot, and index data to the current queue server.
If this data is more than post_max_size, it splits it into chunks which are then reassembled by the queue server web app before being put into the appropriate schedule sub-directory; a sketch of this chunking follows the parameter list below.
string | $queue_server | url of the current queue server |
array | $byte_counts | has four fields: TOTAL, ROBOT, SCHEDULE, INDEX. These give the number of bytes overall for the 'data' field of $post_data and for each of these components. |
array | $post_data | data to be uploaded to the queue server web app |
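A sketch of the chunking idea, with illustrative field names for the reassembly metadata (they are not FetchController's actual parameters):

    <?php
    // Sketch: split upload data into post_max_size-bounded chunks.
    $post_max_size = 2 * 1024 * 1024;                   // example 2MB limit
    $data = str_repeat('x', 5 * 1024 * 1024);           // 5MB to upload
    $chunks = str_split($data, $post_max_size - 4096);  // leave room for headers
    foreach ($chunks as $num => $chunk) {
        $post_data = [
            'part' => $chunk,
            'num'  => $num,                             // position for reassembly
            'last' => ($num == count($chunks) - 1) ? 1 : 0,
        ];
        // ... POST $post_data to the queue server web app ...
    }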
buildMiniInvertedIndex()
Builds an inverted index shard (word --> {docs it appears in}) for the current batch of SEEN_URLS_BEFORE_UPDATE_SCHEDULER many pages.
This inverted index shard is then merged by a queue_server into the inverted index of the current generation of the crawl. The complete inverted index for the whole crawl is built out of these inverted indexes for generations. The point of computing a partial inverted index on the fetcher is to reduce some of the computational burden on the queue server. The resulting mini index computed by buildMiniInvertedIndex() is stored in $this->found_sites[self::INVERTED_INDEX]
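A minimal sketch of such a word --> documents index, scoring by bare term frequency (a simplification of the several scores mentioned above):

    <?php
    // Sketch: build a tiny inverted index mapping each word to the
    // documents it appears in, with a frequency score per word-doc pair.
    $docs = [
        'doc1' => 'the quick brown fox',
        'doc2' => 'the lazy dog',
    ];
    $inverted_index = [];
    foreach ($docs as $doc_id => $text) {
        foreach (preg_split('/\s+/', strtolower($text)) as $word) {
            $inverted_index[$word][$doc_id] =
                ($inverted_index[$word][$doc_id] ?? 0) + 1;  // frequency score
        }
    }
    // $inverted_index['the'] == ['doc1' => 1, 'doc2' => 1]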