\seekquarry\yioop\library\FetchUrl

Code used to manage HTTP or Gopher requests to one or more URLs

Summary

Methods
Properties
Constants
getPages()
prepareUrlHeaders()
computePageHash()
parseHeaderPage()
getCurlIp()
getPage()
checkResponseForErrors()
$local_ip_cache
No constants found

Properties

$local_ip_cache

$local_ip_cache : array

A small cache of DNS-to-IP address lookups for machines that are part of this Yioop cluster

Type

array

Methods

getPages()

getPages(array  $sites, boolean  $timer = false, integer  $page_range_request = \seekquarry\yioop\configs\PAGE_RANGE_REQUEST, string  $temp_dir = "", string  $key = \seekquarry\yioop\library\CrawlConstants::URL, string  $value = \seekquarry\yioop\library\CrawlConstants::PAGE, boolean  $minimal = false, array  $post_data = null, boolean  $follow = false, string  $tor_proxy = "", array  $proxy_servers = array()) : array

Make multi-curl requests for an array of sites with URLs or onion URLs

Parameters

array $sites

an array containing urls of pages to request

boolean $timer

flag, true means print timing statistics to log

integer $page_range_request

maximum number of bytes to download per page; 0 means download all

string $temp_dir

folder to store temporary ip header info

string $key

the component of $sites[$i] that has the value of a URL to get; defaults to URL

string $value

component of $sites[$i] in which to store the page that was gotten

boolean $minimal

if true, do a faster request of pages by skipping work such as extracting the HTTP headers sent

array $post_data

data to be POST'd to each site

boolean $follow

whether to follow redirects or not

string $tor_proxy

url of a proxy that knows how to download .onion urls

array $proxy_servers

if not [], then an array of proxy servers to use rather than directly downloading web pages from the current machine

Returns

array —

an updated array with the contents of those pages
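A hypothetical usage sketch of getPages follows. The literal "url" and "page" keys are stand-ins for the actual CrawlConstants::URL and CrawlConstants::PAGE key values; actually running the fetch requires the Yioop library, so the call itself is shown commented out.

```php
<?php
// Hypothetical usage sketch: getPages expects an array of site arrays keyed
// by CrawlConstants::URL and writes each result under CrawlConstants::PAGE.
// The literal "url"/"page" keys below are illustrative stand-ins for those
// constant values.
$sites = [
    ["url" => "https://www.example.com/"],
    ["url" => "https://www.example.org/"],
];
// Requires the Yioop library, so shown commented out:
// $sites = FetchUrl::getPages($sites, true);
// After the call, $sites[$i]["page"] would hold the downloaded contents.
```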

prepareUrlHeaders()

prepareUrlHeaders(string  $url, array  $proxy_servers = array(), string  $temp_dir = "") : array

Curl requests are typically done using cached IP data, which is stored after ### at the end of a URL when possible. To make this work, the cached IP is substituted into the URL and an http Host: header with the original host name is added to the headers for the curl request. The job of this function is to do this replacement.

Parameters

string $url

site to download, with its IP address potentially at the end after ###

array $proxy_servers

if not empty, an array of proxy servers to crawl through

string $temp_dir

folder to store temporary ip header info

Returns

array —

3-tuple (orig url, url with replacement, http header array)
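The ### convention described above can be sketched as follows. This is a minimal illustration, not Yioop's actual implementation; the function name splitUrlAndIp is hypothetical.

```php
<?php
// Minimal sketch (not Yioop's actual code) of the ### convention that
// prepareUrlHeaders handles: a cached IP address may be appended to a URL
// after ###. We strip it off, substitute it into the URL's host position,
// and emit a Host: header carrying the original host name.
function splitUrlAndIp(string $url): array
{
    $parts = explode("###", $url, 2);
    $orig_url = $parts[0];
    $ip = $parts[1] ?? "";
    $headers = [];
    $new_url = $orig_url;
    if ($ip != "") {
        $host = parse_url($orig_url, PHP_URL_HOST);
        // Replace the host with the cached IP; the Host: header tells the
        // web server which site was originally requested.
        $new_url = str_replace($host, $ip, $orig_url);
        $headers[] = "Host: " . $host;
    }
    return [$orig_url, $new_url, $headers];
}
```

For example, splitUrlAndIp("http://example.com/page###93.184.216.34") yields the original URL, the URL with the IP substituted in, and a one-element header array.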

computePageHash()

computePageHash(string  &$page) : string

Computes a hash of a string containing page data for use in deduplication of pages with similar content

Parameters

string &$page

reference to web page data

Returns

string —

8 byte hash to identify page contents
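The idea of an 8-byte dedup key can be sketched as below. This is illustrative only; Yioop's actual hash computation may use a different digest, and the function name shortPageHash is hypothetical.

```php
<?php
// Illustrative sketch only: derive a compact 8-byte identifier from page
// data, in the spirit of computePageHash. Yioop's real hash may differ.
function shortPageHash(string &$page): string
{
    // md5 with $raw_output = true gives 16 raw bytes; keep the first 8
    // as a short key for detecting pages with identical content.
    return substr(md5($page, true), 0, 8);
}
```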

parseHeaderPage()

parseHeaderPage(string  $header_and_page, string  $value = \seekquarry\yioop\library\CrawlConstants::PAGE) : array

Splits an http response document into the http headers sent and the web page returned. Parses useful information out of the header and returns an array of these two parts together with that information.

Parameters

string $header_and_page

string of downloaded data

string $value

field in which to store the page portion of the response

Returns

array —

info array consisting of the header and page of an http response, as well as the server, server version, operating system, encoding, and date information parsed from the header.
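The header/page split can be sketched as follows. This is a simplified stand-in, assuming the standard blank-line separator between HTTP headers and body; the function and field names are hypothetical, not Yioop's.

```php
<?php
// Simplified sketch of splitting an HTTP response into header and page
// parts, in the spirit of parseHeaderPage. Field names are hypothetical.
function splitHeaderAndPage(string $header_and_page): array
{
    // Headers and body are separated by a blank line (\r\n\r\n).
    $pos = strpos($header_and_page, "\r\n\r\n");
    if ($pos === false) {
        return ["HEADER" => "", "PAGE" => $header_and_page];
    }
    $header = substr($header_and_page, 0, $pos);
    $page = substr($header_and_page, $pos + 4);
    $info = ["HEADER" => $header, "PAGE" => $page];
    // Parse one useful field out of the header, e.g. the Server: line.
    if (preg_match('/^Server:\s*(.+)$/mi', $header, $match)) {
        $info["SERVER"] = trim($match[1]);
    }
    return $info;
}
```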

getCurlIp()

getCurlIp(string  $header) : string

Computes the IP address from an http request/response header

Parameters

string $header

contains a complete transcript of the HTTP request/response

Returns

string —

IPv4 address as a string of dot separated quads.
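Extracting an IPv4 address from a transcript can be sketched with a regular expression, as below. This is an illustration of the technique, not Yioop's actual code; the function name extractIpv4 is hypothetical.

```php
<?php
// Hedged sketch: pull the first IPv4 address out of an HTTP transcript,
// similar in spirit to getCurlIp. Curl transcripts typically contain lines
// such as "* Trying 93.184.216.34:80...".
function extractIpv4(string $header): string
{
    // Match four dot-separated groups of 1-3 digits.
    if (preg_match('/\b(\d{1,3}(?:\.\d{1,3}){3})\b/', $header, $match)) {
        return $match[1];
    }
    return "";
}
```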

getPage()

getPage(string  $site, array  $post_data = null, boolean  $check_for_errors = false, string  $user_password = null,   $timeout = \seekquarry\yioop\configs\SINGLE_PAGE_TIMEOUT) : string

Make a curl request for the provided url

Parameters

string $site

url of page to request

array $post_data

any data to be POST'd to the URL

boolean $check_for_errors

whether or not to check the response for the words NOTICE, WARNING, FATAL, which might indicate an error on the server

string $user_password

username:password to use for connection if needed (optional)

$timeout

how long to wait for page download to complete

Returns

string —

the contents of what the curl request fetched

checkResponseForErrors()

checkResponseForErrors(string  $response) 

Given the results of a getPage call, checks whether the response contains the words NOTICE, WARNING, or FATAL, which might indicate an error on the server. If it does, the $response string is sent to the crawlLog

Parameters

string $response

getPage response in which to check for errors
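The error scan described above can be sketched as a simple word match. This is illustrative only (the real method also logs via crawlLog); the function name responseHasErrors is hypothetical.

```php
<?php
// Sketch of the scan checkResponseForErrors performs: look for the words
// NOTICE, WARNING, or FATAL in a response body. The real method would then
// send the offending $response to the crawl log rather than just report.
function responseHasErrors(string $response): bool
{
    return preg_match('/\b(NOTICE|WARNING|FATAL)\b/', $response) === 1;
}
```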