$local_ip_cache
$local_ip_cache : array
A small cache of DNS-to-IP address lookups for machines that are part of this Yioop cluster
Code used to manage HTTP or Gopher requests for one or more URLs
getPages(array $sites, boolean $timer = false, integer $page_range_request = \seekquarry\yioop\configs\PAGE_RANGE_REQUEST, string $temp_dir = "", string $key = \seekquarry\yioop\library\CrawlConstants::URL, string $value = \seekquarry\yioop\library\CrawlConstants::PAGE, boolean $minimal = false, array $post_data = null, boolean $follow = false, string $tor_proxy = "", array $proxy_servers = array()) : array
Makes multi_curl requests for an array of sites with URLs or onion URLs
array | $sites | an array containing URLs of pages to request |
boolean | $timer | flag; true means print timing statistics to the log |
integer | $page_range_request | maximum number of bytes to download per page; 0 means download the whole page |
string | $temp_dir | folder in which to store temporary IP header info |
string | $key | the component of $sites[$i] that has the value of a URL to get; defaults to URL |
string | $value | component of $sites[$i] in which to store the page that was downloaded |
boolean | $minimal | if true, do a faster request of pages by skipping steps such as extracting the HTTP headers sent |
array | $post_data | data to be POST'd to each site |
boolean | $follow | whether or not to follow redirects |
string | $tor_proxy | URL of a proxy that knows how to download .onion URLs |
array | $proxy_servers | if not [], then an array of proxy servers to use rather than downloading web pages directly from the current machine |
an updated array with the contents of those pages
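The call pattern above can be sketched with PHP's curl_multi API. This is a simplified stand-in, not Yioop's implementation: the function name fetchAll is hypothetical, and the plain 'url'/'page' array keys play the role of the CrawlConstants::URL and CrawlConstants::PAGE keys documented above.

```php
<?php
// Simplified sketch of the multi_curl pattern behind getPages():
// request every URL in $sites in parallel, then store each downloaded
// body back into the array under the 'page' key.
function fetchAll(array $sites, int $timeout = 10): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach ($sites as $i => $site) {
        $ch = curl_init($site['url']);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($multi, $ch);
        $handles[$i] = $ch;
    }
    // Drive all the transfers until none are still active
    do {
        $status = curl_multi_exec($multi, $active);
        if ($active) {
            curl_multi_select($multi);
        }
    } while ($active && $status == CURLM_OK);
    foreach ($handles as $i => $ch) {
        $sites[$i]['page'] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);
    return $sites;
}
```

The real method also handles page-range limits, proxies, and onion URLs, which this sketch omits.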
prepareUrlHeaders(string $url, array $proxy_servers = array(), string $temp_dir = "") : array
Curl requests are typically made using a cached IP address which, when available, is stored after ### at the end of the URL. To make this work, an HTTP Host: header containing the host from the original URL is added to the curl request. The job of this function is to perform this replacement.
string | $url | site to download, potentially with an IP address at the end after ### |
array | $proxy_servers | if not empty, an array of proxy servers to crawl through |
string | $temp_dir | folder in which to store temporary IP header info |
3-tuple (original URL, URL with replacement, HTTP header array)
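The url###ip convention described above can be illustrated with plain string handling. This sketch is not Yioop's actual code and the helper name splitUrlIp is hypothetical; it splits off a cached IP stored after ### and builds the Host: header that lets curl contact the IP while still requesting the right virtual host:

```php
<?php
// Illustrative only: split a URL of the form "url###ip" into the
// original URL, the cached IP, and a Host: header for the curl request.
function splitUrlIp(string $url): array
{
    $parts = explode("###", $url, 2);
    $orig_url = $parts[0];
    $ip = $parts[1] ?? "";
    $headers = [];
    if ($ip != "") {
        // Keep the original host in a Host: header so the web server
        // can route the request even though curl connects to the IP
        $headers[] = "Host: " . parse_url($orig_url, PHP_URL_HOST);
    }
    return [$orig_url, $ip, $headers];
}

list($url, $ip, $headers) = splitUrlIp(
    "http://example.com/page###93.184.216.34");
// $url is "http://example.com/page", $ip is "93.184.216.34",
// $headers is ["Host: example.com"]
```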
computePageHash(string &$page) : string
Computes a hash of a string containing page data for use in deduplication of pages with similar content
string & | $page | reference to web page data |
8 byte hash to identify page contents
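The documentation above only promises an 8-byte fingerprint; one simple way to produce such a hash (an assumption for illustration, not necessarily the scheme computePageHash() uses) is to truncate a raw md5 digest:

```php
<?php
// Illustrative only: derive an 8-byte fingerprint of page data for
// duplicate detection by truncating md5's 16 raw bytes to 8.
function eightBytePageHash(string &$page): string
{
    return substr(md5($page, true), 0, 8);
}

$page = "<html><body>some page</body></html>";
$hash = eightBytePageHash($page);
// strlen($hash) is 8, and identical pages always hash identically
```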
parseHeaderPage(string $header_and_page, string $value = \seekquarry\yioop\library\CrawlConstants::PAGE) : array
Splits an HTTP response document into the HTTP headers sent and the web page returned. Parses useful information out of the header and returns an array containing these two parts together with the parsed info.
string | $header_and_page | string of downloaded data |
string | $value | field in which to store the page portion of the response |
info array consisting of the header and the page of the HTTP response, together with the server, server version, operating system, encoding, and date information parsed from the header
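The basic header/page split that parseHeaderPage() performs can be sketched as follows. This is a minimal stand-in (the function name splitHeaderPage and the 'HEADER'/'PAGE' keys are illustrative) that divides a raw HTTP response at the blank line separating headers from the body:

```php
<?php
// Minimal sketch: split a raw HTTP response into headers and body at
// the first blank line (\r\n\r\n), the step parseHeaderPage() does
// before extracting server, encoding, and date information.
function splitHeaderPage(string $header_and_page): array
{
    $pos = strpos($header_and_page, "\r\n\r\n");
    if ($pos === false) {
        // No header/body separator found; treat everything as page data
        return ["HEADER" => "", "PAGE" => $header_and_page];
    }
    return [
        "HEADER" => substr($header_and_page, 0, $pos),
        "PAGE" => substr($header_and_page, $pos + 4),
    ];
}

$raw = "HTTP/1.1 200 OK\r\nServer: Apache\r\n\r\n<html>hi</html>";
$info = splitHeaderPage($raw);
// $info["PAGE"] is "<html>hi</html>"
```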
getPage(string $site, array $post_data = null, boolean $check_for_errors = false, string $user_password = null, $timeout = \seekquarry\yioop\configs\SINGLE_PAGE_TIMEOUT) : string
Makes a curl request for the provided URL
string | $site | URL of the page to request |
array | $post_data | any data to be POST'd to the URL |
boolean | $check_for_errors | whether or not to check the response for the words NOTICE, WARNING, or FATAL, which might indicate an error on the server |
string | $user_password | username:password to use for the connection if needed (optional) |
$timeout | how long to wait for the page download to complete |
the contents of what the curl request fetched
checkResponseForErrors(string $response)
Given the results of a getPage() call, checks whether or not the response contains the words NOTICE, WARNING, or FATAL, which might indicate an error on the server. If it does, the $response string is sent to the crawlLog.
string | $response | getPage response in which to check for errors |
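The scan checkResponseForErrors() describes can be sketched as a simple keyword search. The helper name responseHasErrors is hypothetical, and the case-insensitive match is an assumption made for illustration:

```php
<?php
// Illustrative only: scan a getPage() response for PHP error keywords
// that suggest something went wrong on the server side.
function responseHasErrors(string $response): bool
{
    foreach (["NOTICE", "WARNING", "FATAL"] as $word) {
        if (stripos($response, $word) !== false) {
            return true; // a caller could forward $response to a log here
        }
    }
    return false;
}

// responseHasErrors("<b>Warning</b>: Division by zero") is true
```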