\seekquarry\yioop\library\FetchUrl

Code used to manage HTTP or Gopher requests to one or more URLs

Summary

Methods
Properties
Constants
getPages()
prepareUrlHeaders()
computePageHash()
parseHeaderPage()
getCurlIp()
getPage()
checkResponseForErrors()
$local_ip_cache
No constants found

Properties

$local_ip_cache

$local_ip_cache : array

A small cache of DNS-to-IP address lookups for machines that are part of this Yioop cluster

Type

array

Methods

getPages()

getPages(array  $sites, boolean  $timer = false, integer  $page_range_request = \seekquarry\yioop\configs\PAGE_RANGE_REQUEST, string  $temp_dir = "", string  $key = \seekquarry\yioop\library\CrawlConstants::URL, string  $value = \seekquarry\yioop\library\CrawlConstants::PAGE, boolean  $minimal = false, array  $post_data = null, boolean  $follow = false, string  $tor_proxy = "", array  $proxy_servers = array()) : array

Make multi-curl requests for an array of sites with URLs or onion URLs

Parameters

array $sites

an array containing urls of pages to request

boolean $timer

flag, true means print timing statistics to log

integer $page_range_request

maximum number of bytes to download per page; 0 means download all

string $temp_dir

folder to store temporary ip header info

string $key

the component of $sites[$i] that has the value of a URL to get; defaults to URL

string $value

component of $sites[$i] in which to store the page that was gotten

boolean $minimal

if true, do a faster request of pages by skipping work such as extracting the HTTP headers sent

array $post_data

data to be POST'd to each site

boolean $follow

whether to follow redirects or not

string $tor_proxy

url of a proxy that knows how to download .onion urls

array $proxy_servers

if not [], then an array of proxy servers to use rather than directly downloading web pages from the current machine

Returns

array —

an updated array with the contents of those pages
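A hypothetical usage sketch of getPages follows. The literal "url" and "page" keys are stand-ins for the actual CrawlConstants::URL and CrawlConstants::PAGE key values; actually running the fetch requires the Yioop library, so the call itself is shown commented out.

```php
<?php
// Hypothetical usage sketch: getPages expects an array of site arrays keyed
// by CrawlConstants::URL and writes each result under CrawlConstants::PAGE.
// The literal "url"/"page" keys below are illustrative stand-ins for those
// constant values.
$sites = [
    ["url" => "https://www.example.com/"],
    ["url" => "https://www.example.org/"],
];
// Requires the Yioop library, so shown commented out:
// $sites = FetchUrl::getPages($sites, true);
// After the call, $sites[$i]["page"] would hold the downloaded contents.
```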

prepareUrlHeaders()

prepareUrlHeaders(string  $url, array  $proxy_servers = array(), string  $temp_dir = "") : array

Curl requests are typically done using cached IP data, which is stored after ### at the end of a URL when possible. To make this work, the cached IP is substituted into the URL and an http Host: header with the original host name is added to the headers for the curl request. The job of this function is to do this replacement.

Parameters

string $url

site to download, with its IP address potentially at the end after ###

array $proxy_servers

if not empty, an array of proxy servers to crawl through

string $temp_dir

folder to store temporary ip header info

Returns

array —

3-tuple (orig url, url with replacement, http header array)
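The ### convention described above can be sketched as follows. This is a minimal illustration, not Yioop's actual implementation; the function name splitUrlAndIp is hypothetical.

```php
<?php
// Minimal sketch (not Yioop's actual code) of the ### convention that
// prepareUrlHeaders handles: a cached IP address may be appended to a URL
// after ###. We strip it off, substitute it into the URL's host position,
// and emit a Host: header carrying the original host name.
function splitUrlAndIp(string $url): array
{
    $parts = explode("###", $url, 2);
    $orig_url = $parts[0];
    $ip = $parts[1] ?? "";
    $headers = [];
    $new_url = $orig_url;
    if ($ip != "") {
        $host = parse_url($orig_url, PHP_URL_HOST);
        // Replace the host with the cached IP; the Host: header tells the
        // web server which site was originally requested.
        $new_url = str_replace($host, $ip, $orig_url);
        $headers[] = "Host: " . $host;
    }
    return [$orig_url, $new_url, $headers];
}
```

For example, splitUrlAndIp("http://example.com/page###93.184.216.34") yields the original URL, the URL with the IP substituted in, and a one-element header array.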

computePageHash()

computePageHash(string  &$page) : string

Computes a hash of a string containing page data for use in deduplication of pages with similar content

Parameters

string &$page

reference to web page data

Returns

string —

8 byte hash to identify page contents
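The idea of an 8-byte dedup key can be sketched as below. This is illustrative only; Yioop's actual hash computation may use a different digest, and the function name shortPageHash is hypothetical.

```php
<?php
// Illustrative sketch only: derive a compact 8-byte identifier from page
// data, in the spirit of computePageHash. Yioop's real hash may differ.
function shortPageHash(string &$page): string
{
    // md5 with $raw_output = true gives 16 raw bytes; keep the first 8
    // as a short key for detecting pages with identical content.
    return substr(md5($page, true), 0, 8);
}
```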

parseHeaderPage()

parseHeaderPage(string  $header_and_page, string  $value = \seekquarry\yioop\library\CrawlConstants::PAGE) : array

Splits an http response document into the http headers sent and the web page returned. Parses useful information out of the header and returns an array of these two parts together with that information.

Parameters

string $header_and_page

string of downloaded data

string $value

field in which to store the page portion of the response

Returns

array —

info array consisting of the header and page of an http response, as well as the server, server version, operating system, encoding, and date information parsed from the header.
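The header/page split can be sketched as follows. This is a simplified stand-in, assuming the standard blank-line separator between HTTP headers and body; the function and field names are hypothetical, not Yioop's.

```php
<?php
// Simplified sketch of splitting an HTTP response into header and page
// parts, in the spirit of parseHeaderPage. Field names are hypothetical.
function splitHeaderAndPage(string $header_and_page): array
{
    // Headers and body are separated by a blank line (\r\n\r\n).
    $pos = strpos($header_and_page, "\r\n\r\n");
    if ($pos === false) {
        return ["HEADER" => "", "PAGE" => $header_and_page];
    }
    $header = substr($header_and_page, 0, $pos);
    $page = substr($header_and_page, $pos + 4);
    $info = ["HEADER" => $header, "PAGE" => $page];
    // Parse one useful field out of the header, e.g. the Server: line.
    if (preg_match('/^Server:\s*(.+)$/mi', $header, $match)) {
        $info["SERVER"] = trim($match[1]);
    }
    return $info;
}
```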

getCurlIp()

getCurlIp(string  $header) : string

Computes the IP address from an http request/response header

Parameters

string $header

contains a complete transcript of the HTTP request/response

Returns

string —

IPv4 address as a string of dot separated quads.
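Extracting an IPv4 address from a transcript can be sketched with a regular expression, as below. This is an illustration of the technique, not Yioop's actual code; the function name extractIpv4 is hypothetical.

```php
<?php
// Hedged sketch: pull the first IPv4 address out of an HTTP transcript,
// similar in spirit to getCurlIp. Curl transcripts typically contain lines
// such as "* Trying 93.184.216.34:80...".
function extractIpv4(string $header): string
{
    // Match four dot-separated groups of 1-3 digits.
    if (preg_match('/\b(\d{1,3}(?:\.\d{1,3}){3})\b/', $header, $match)) {
        return $match[1];
    }
    return "";
}
```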

getPage()

getPage(string  $site, array  $post_data = null, boolean  $check_for_errors = false, string  $user_password = null,   $timeout = \seekquarry\yioop\configs\SINGLE_PAGE_TIMEOUT) : string

Make a curl request for the provided url

Parameters

string $site

url of page to request

array $post_data

any data to be POST'd to the URL

boolean $check_for_errors

whether or not to check the response for the words NOTICE, WARNING, FATAL, which might indicate an error on the server

string $user_password

username:password to use for connection if needed (optional)

$timeout

how long to wait for page download to complete

Returns

string —

the contents of what the curl request fetched

checkResponseForErrors()

checkResponseForErrors(string  $response) 

Given the results of a getPage call, checks whether the response contains the words NOTICE, WARNING, or FATAL, which might indicate an error on the server. If it does, the $response string is sent to the crawlLog

Parameters

string $response

getPage response in which to check for errors
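The error scan described above can be sketched as a simple word match. This is illustrative only (the real method also logs via crawlLog); the function name responseHasErrors is hypothetical.

```php
<?php
// Sketch of the scan checkResponseForErrors performs: look for the words
// NOTICE, WARNING, or FATAL in a response body. The real method would then
// send the offending $response to the crawl log rather than just report.
function responseHasErrors(string $response): bool
{
    return preg_match('/\b(NOTICE|WARNING|FATAL)\b/', $response) === 1;
}
```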