Constants

max_url_archive_offset

max_url_archive_offset

The largest offset for the url WebArchive before we rebuild it.

Entries are never deleted from the url WebArchive even if they are deleted from the priority queue. So when we pass this value we make a new WebArchive containing only those urls which are still in the queue.

HASH_KEY_SIZE

HASH_KEY_SIZE

Number of bytes in a hash table key

HASH_VALUE_SIZE

HASH_VALUE_SIZE

4 bytes offset, 4 bytes index, 4 bytes flags
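
For concreteness, a minimal sketch of packing such a 12-byte value in PHP; the real class may use a different byte order or its own packing helpers:

    <?php
    // Illustrative only: pack an (offset, index, flags) triple into the
    // 4 + 4 + 4 byte layout described above.
    $offset = 123456;   // offset into the url WebArchive
    $index = 42;        // position in the priority queue
    $flags = 0;         // e.g. NO_FLAGS
    $value = pack("NNN", $offset, $index, $flags);             // 12-byte value
    list(, $offset2, $index2, $flags2) = unpack("N3", $value); // recover fields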

IP_SIZE

IP_SIZE

Length of an IPv6 IP address (IPv4 addresses are padded)

NO_FLAGS

NO_FLAGS

Url type flag

ROBOT

ROBOT

Url type flag

SCHEDULABLE

SCHEDULABLE

Url type flag

INT_SIZE

INT_SIZE

Size of int

NOTIFY_BUFFER_SIZE

NOTIFY_BUFFER_SIZE

Size of notify buffer

Properties

$dir_name

$dir_name : string

The folder name of this WebQueueBundle

Type

string

$filter_size

$filter_size : integer

Number of items that can be stored in a partition of the page exists filter bundle

Type

integer

$num_urls_ram

$num_urls_ram : integer

number of entries the priority queue used by this web queue bundle can store

Type

integer

$min_or_max

$min_or_max : integer

Whether polling the first element of the priority queue returns the smallest- or largest-weighted element. This is set to a constant specified in PriorityQueue.

Type

integer

$to_crawl_queue

$to_crawl_queue : object

the PriorityQueue used by this WebQueueBundle

Type

object

$to_crawl_table

$to_crawl_table : object

the HashTable used by this WebQueueBundle

Type

object

$hash_rebuild_count

$hash_rebuild_count : integer

Current count of the number of non-read operations performed on the WebQueueBundle's hash table since the last time it was rebuilt.

Type

integer

$max_hash_ops_before_rebuild

$max_hash_ops_before_rebuild : integer

Number of non-read operations on the hash table before it needs to be rebuilt.

Type

integer

$to_crawl_archive

$to_crawl_archive : object

WebArchive used to store urls that are to be crawled

Type

object

$url_exists_filter_bundle

$url_exists_filter_bundle : object

BloomFilter used to keep track of which urls we've already seen

Type

object

$got_robottxt_filter

$got_robottxt_filter : object

BloomFilter used to store the hosts whose robots.txt files we have already downloaded

Type

object

$dns_table

$dns_table : object

Host-IP table used for DNS look-up; its data comes from robots.txt processing and is deleted with the same frequency

Type

object

$robot_table

$robot_table : object

HashTable used to store offsets into WebArchive that stores robot paths

Type

object

$robot_archive

$robot_archive : object

WebArchive used to store paths coming from robots.txt files

Type

object

$crawl_delay_filter

$crawl_delay_filter : object

BloomFilter used to keep track of crawl delay in seconds for a given host

Type

object

$etag_btree

$etag_btree : object

Holds the B-Tree used for saving etag and expires http data

Type

object

$notify_buffer

$notify_buffer : array

Associative array of (hash_url => index) pairs saying where a hash_url has moved in the priority queue. These moves are buffered and later re-stored in the hash table when notifyFlush is called.

Type

array

Methods

__construct()

__construct(string  $dir_name, integer  $filter_size, integer  $num_urls_ram, string  $min_or_max) 

Makes a WebQueueBundle with the provided parameters

Parameters

string $dir_name

folder name used by this WebQueueBundle

integer $filter_size

size of each partition in the page exists BloomFilterBundle

integer $num_urls_ram

number of entries in ram for the priority queue

string $min_or_max

whether the priority queue maintains the heap property with respect to the least or the largest weight
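
A minimal construction sketch, assuming the class lives in the seekquarry\yioop\library namespace and that PriorityQueue exposes a constant (called PriorityQueue::MAX below, a hypothetical name) selecting largest-weight-first behavior:

    <?php
    use seekquarry\yioop\library\WebQueueBundle;  // assumed namespace
    use seekquarry\yioop\library\PriorityQueue;

    // Folder, filter size, and queue capacity below are illustrative values.
    $queue_bundle = new WebQueueBundle("/tmp/test_queue_bundle",
        200000,               // items per partition of the page-exists filter
        100000,               // entries the in-ram priority queue can hold
        PriorityQueue::MAX);  // hypothetical constant; see PriorityQueue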

addUrlsQueue()

addUrlsQueue(array  $url_pairs) 

Adds an array of (url, weight) pairs to the WebQueueBundle.

Parameters

array $url_pairs

a list of pairs to add
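
An illustrative call, continuing the $queue_bundle from the construction sketch above and assuming each pair has the form [url, weight]:

    <?php
    // Higher weight means the url is polled earlier when the queue is in max mode.
    $url_pairs = [
        ["https://www.example.com/", 1.0],
        ["https://www.example.com/about.html", 0.5],
    ];
    $queue_bundle->addUrlsQueue($url_pairs);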

containsUrlQueue()

containsUrlQueue(string  $url) : boolean

Checks whether the url queue already contains the given url

Parameters

string $url

what to check

Returns

boolean —

whether it is contained in the queue yet or not

adjustQueueWeight()

adjustQueueWeight(string  $url, float  $delta, boolean  $flush = true) 

Adjusts the weight of the given url in the priority queue by amount delta

In a page importance crawl, a given web page casts its votes on what to crawl next by splitting its crawl money amongst its child links. This means a mechanism for periodically adjusting the weights of elements in the priority queue is necessary. This function serves that purpose.

Parameters

string $url

url whose weight in queue we want to adjust

float $delta

change in weight (usually positive).

boolean $flush
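
A hypothetical sketch of the crawl-money idea: a fetched page splits its weight evenly among its extracted child links, adjusting entries that are already queued and adding the rest:

    <?php
    // Hypothetical: split a parent's crawl money evenly among its child links.
    $parent_weight = 1.0;
    $child_links = ["https://www.example.com/a", "https://www.example.com/b"];
    $delta = $parent_weight / count($child_links);
    foreach ($child_links as $link) {
        if ($queue_bundle->containsUrlQueue($link)) {
            $queue_bundle->adjustQueueWeight($link, $delta);
        } else {
            $queue_bundle->addUrlsQueue([[$link, $delta]]);
        }
    }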

setQueueFlag()

setQueueFlag(string  $url, integer  $flag) 

Sets the flag which provides additional information about the kind of url, for a url already stored in the queue. For instance, it might say whether it is a robots.txt url, whether the url has already passed the robots.txt test, or whether it has a crawl-delay.

Parameters

string $url

url whose flag in the queue we want to set

integer $flag

should be one of self::ROBOT, self::NO_FLAGS, self::SCHEDULABLE or self::SCHEDULABLE + crawl_delay
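
Illustrative flag settings based on the description above; exactly how a crawl delay is folded into SCHEDULABLE should be checked against the source:

    <?php
    $queue_bundle->setQueueFlag("https://www.example.com/robots.txt",
        WebQueueBundle::ROBOT);
    $queue_bundle->setQueueFlag("https://www.example.com/",
        WebQueueBundle::SCHEDULABLE + 10);  // schedulable, with a 10 second delay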

removeQueue()

removeQueue(string  $url, boolean  $isHash = false) 

Removes a url from the priority queue.

This method would typically be called during a crawl after the given url is scheduled to be crawled. It only deletes the item from the bundle's priority queue and hash table -- not from the web archive.

Parameters

string $url

the url or hash of url to delete

boolean $isHash

flag to say whether or not $url is already the hash of a url

peekQueue()

peekQueue(integer  $i = 1, resource  $fh = null) : mixed

Gets the url and weight of the ith entry in the priority queue

Parameters

integer $i

entry to look up

resource $fh

a file handle to the WebArchive for urls

Returns

mixed —

false on error, otherwise the ordered 4-tuple in an array
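
A sketch of draining the front of the queue; the assumption that the returned tuple begins with (url, weight) is illustrative and should be checked against the source:

    <?php
    // Passing an open file handle avoids reopening the url WebArchive per peek.
    $fh = $queue_bundle->openUrlArchive();
    while (($tuple = $queue_bundle->peekQueue(1, $fh)) !== false) {
        list($url, $weight) = $tuple;  // assumed first two components
        // ... decide whether to schedule $url for fetching ...
        $queue_bundle->removeQueue($url);
    }
    $queue_bundle->closeUrlArchive($fh);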

printContents()

printContents() 

Pretty prints the contents of the queue bundle in order

getContents()

getContents() : array

Gets the contents of the queue bundle as an array of ordered url, weight, flag triples

Returns

array —

a list of ordered url, weight, flag triples

normalize()

normalize(integer  $new_total = \seekquarry\yioop\configs\NUM_URLS_QUEUE_RAM) 

Makes the weight sum of the to-crawl priority queue sum to $new_total

Parameters

integer $new_total

amount weights should sum to. All weights will be scaled by the same factor.

openUrlArchive()

openUrlArchive(string  $mode = "r") : resource

Opens the url WebArchive associated with this queue bundle in the given read/write mode

Parameters

string $mode

the read/write mode to open the archive with

Returns

resource —

a file handle to the WebArchive file

closeUrlArchive()

closeUrlArchive(resource  $fh) 

Closes a file handle to the url WebArchive

Parameters

resource $fh

a valid handle to the url WebArchive file

addSeenUrlFilter()

addSeenUrlFilter(string  $url) 

Adds the supplied url to the url_exists_filter_bundle

Parameters

string $url

url to add

differenceSeenUrls()

differenceSeenUrls(array&  $url_array, array  $field_names = null) 

Removes all url objects from $url_array which have been seen

Parameters

array& $url_array

objects to check whether they have been seen

array $field_names

an array of components of a url_array element which contain a url to check if seen
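
A sketch assuming each element of $url_array is an associative array whose url is stored under a field such as CrawlConstants::URL (the field name and the constants' namespace are assumptions here):

    <?php
    use seekquarry\yioop\library\CrawlConstants;  // assumed location

    $sites = [
        [CrawlConstants::URL => "https://www.example.com/"],
        [CrawlConstants::URL => "https://www.example.com/new-page.html"],
    ];
    // Already-seen elements are removed from $sites in place.
    $queue_bundle->differenceSeenUrls($sites, [CrawlConstants::URL]);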

addGotRobotTxtFilter()

addGotRobotTxtFilter(string  $host) 

Adds the supplied $host to the got_robottxt_filter

Parameters

string $host

host to add

containsGotRobotTxt()

containsGotRobotTxt(string  $host) : boolean

Checks if we have a fresh copy of robots.txt info for $host

Parameters

string $host

host to check

Returns

boolean —

whether we do or not

addRobotPaths()

addRobotPaths(string  $host, array  $paths) 

Adds all the paths for a host to the Robots Web Archive.

Parameters

string $host

host name that the paths are to be added for.

array $paths

an array with two keys CrawlConstants::ALLOWED_SITES and CrawlConstants::DISALLOWED_SITES. For each key one has an array of paths
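
An illustrative call using the two keys named above (the constants' namespace is an assumption):

    <?php
    use seekquarry\yioop\library\CrawlConstants;  // assumed location

    // Record the robots.txt rules for a host, then note the file was processed.
    $queue_bundle->addRobotPaths("www.example.com", [
        CrawlConstants::ALLOWED_SITES => ["/public/"],
        CrawlConstants::DISALLOWED_SITES => ["/private/", "/tmp/"],
    ]);
    $queue_bundle->addGotRobotTxtFilter("www.example.com");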

checkRobotOkay()

checkRobotOkay(string  $url) : boolean

Checks if the given $url is allowed to be crawled based on stored robots.txt info.

Parameters

string $url

to check

Returns

boolean —

whether it was allowed or not

getRobotTxtAge()

getRobotTxtAge() : integer

Gets the timestamp of the oldest robot data still stored in the queue bundle

Returns

integer —

a Unix timestamp

addDNSCache()

addDNSCache(string  $host, string  $ip_address) 

Adds an entry to the web queue bundle's DNS cache

Parameters

string $host

hostname to add to DNS Lookup table

string $ip_address

in presentation format (not as int) to add to table

dnsLookup()

dnsLookup(string  $host) : \seekquarry\yioop\library\value

Looks up an entry in the web queue bundle's DNS cache

Parameters

string $host

hostname to look up in the DNS table

Returns

\seekquarry\yioop\library\value
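
An illustrative round trip through the DNS cache:

    <?php
    // Cache the IP learned while fetching robots.txt, then read it back later.
    $queue_bundle->addDNSCache("www.example.com", "93.184.216.34");
    $ip = $queue_bundle->dnsLookup("www.example.com");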

getUrlFilterAge()

getUrlFilterAge() : integer

Gets the timestamp of the oldest url filter data still stored in the queue bundle

Returns

integer —

a Unix timestamp

setCrawlDelay()

setCrawlDelay(string  $host, integer  $value) 

Sets the Crawl-delay of $host to the passed $value in seconds

Parameters

string $host

a host to set the Crawl-delay for

integer $value

a delay in seconds up to 255

getCrawlDelay()

getCrawlDelay(string  $host) : integer

Gets the Crawl-delay of $host from the crawl delay bloom filter

Parameters

string $host

site to check for a Crawl-delay

Returns

integer —

the crawl-delay in seconds or -1 if $host has no delay
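
An illustrative round trip through the crawl-delay store:

    <?php
    // Remember a host's Crawl-delay (seconds, at most 255) and read it back;
    // -1 means no delay has been recorded for the host.
    $queue_bundle->setCrawlDelay("www.example.com", 10);
    $delay = $queue_bundle->getCrawlDelay("www.example.com");
    if ($delay < 0) {
        $delay = 0;
    }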

constructHashTable()

constructHashTable(string  $name, integer  $num_values) : object

Mainly, a Factory style wrapper around the HashTable's constructor.

However, this function also sets up a rebuild frequency. It is used as part of the process of keeping the to-crawl table from having too many entries.

Parameters

string $name

filename to store the hash table persistently

integer $num_values

size of the HashTable's array

Returns

object —

the newly built hash table

lookupHashTable()

lookupHashTable(string  $key, integer  $return_probe_value = \seekquarry\yioop\library\HashTable::RETURN_VALUE) : mixed

Looks up $key in the to-crawl hash table

Parameters

string $key

the thing to look up

integer $return_probe_value

one of self::ALWAYS_RETURN_PROBE, self::RETURN_PROBE_ON_KEY_FOUND, self::RETURN_VALUE, or self::BOTH. Here value means the value associated with the key and probe is either the location in the array where the key was found or the first location in the array where it was determined the key could not be found.

Returns

mixed —

a string if the value is being returned; otherwise, false if the key is not found

deleteHashTable()

deleteHashTable(string  $key, integer  $probe = false) 

Removes an entry from the to-crawl hash table

Parameters

string $key

usually a hash of a url

integer $probe

if the location in the hash table is already known to be $probe then this variable can be used to save a lookup

insertHashTable()

insertHashTable(string  $key, string  $value, integer  $probe = false) : boolean

Inserts the $key, $value pair into this web queue's to crawl table

Parameters

string $key

intended to be a hash of a url

string $value

intended to be an offset into a web archive for urls together with an index into the priority queue

integer $probe

if the location in the hash table is already known to be $probe then this variable can be used to save a lookup

Returns

boolean —

whether the insert was a success or not
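
A hedged sketch tying HASH_KEY_SIZE and HASH_VALUE_SIZE to these methods; the hashing and packing shown are stand-ins, not necessarily what the class itself uses:

    <?php
    // Illustrative: md5 stands in for producing a fixed-length key (length
    // chosen arbitrarily; see HASH_KEY_SIZE) and pack builds the 12-byte
    // (offset, index, flags) value described under HASH_VALUE_SIZE.
    $key = substr(md5("https://www.example.com/", true), 0, 8);
    $value = pack("NNN", 123456 /* offset */, 7 /* queue index */, 0 /* flags */);
    if ($queue_bundle->insertHashTable($key, $value)) {
        $stored = $queue_bundle->lookupHashTable($key);  // the packed value string
    }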

rebuildHashTable()

rebuildHashTable() 

Makes a new HashTable without deleted rows

The hash table in Yioop is implemented using open addressing, i.e., we store key-value pairs in the table itself and, if there is a collision, we look for the next available slot. Two codes are used to indicate available space in the table: one marks a slot as empty and never used, the other as empty but previously used and deleted. Two codes are needed so that if an item B was inserted which hashed to the same value as an existing item A, and so was stored in the next empty slot, then deleting A still leaves B findable during lookup. The problem is that as the table gets reused a lot, it tends to fill up with deleted entries, making lookup times more and more linear in the hash table size. Rebuilding the table mitigates this problem, and by choosing the rebuild frequency appropriately, the amortized cost of rebuilding is only O(1).
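
The two-marker idea can be illustrated with a tiny, self-contained linear-probing table; this is not the class's HashTable, just a sketch of why a deleted slot must stay distinguishable from a never-used one:

    <?php
    // Toy linear-probing table illustrating NEVER_USED vs DELETED markers.
    class ToyOpenAddressTable
    {
        const NEVER_USED = 0;  // empty, never used
        const DELETED = 1;     // empty, but previously used (tombstone)
        private $slots;
        private $size;
        public function __construct($size = 8)
        {
            $this->size = $size;
            $this->slots = array_fill(0, $size, self::NEVER_USED);
        }
        public function insert($key, $value)
        {
            // A tombstone may be reused for insertion (toy: assumes each key
            // is inserted at most once).
            for ($i = 0; $i < $this->size; $i++) {
                $j = (abs(crc32($key)) + $i) % $this->size;
                if (!is_array($this->slots[$j])) {
                    $this->slots[$j] = [$key, $value];
                    return true;
                }
            }
            return false;  // table full
        }
        public function lookup($key)
        {
            for ($i = 0; $i < $this->size; $i++) {
                $j = (abs(crc32($key)) + $i) % $this->size;
                if ($this->slots[$j] === self::NEVER_USED) {
                    return false;  // truly empty slot ends the probe sequence
                }
                // DELETED slots are stepped over, so items stored past them
                // remain findable.
                if (is_array($this->slots[$j]) && $this->slots[$j][0] === $key) {
                    return $this->slots[$j][1];
                }
            }
            return false;
        }
        public function delete($key)
        {
            for ($i = 0; $i < $this->size; $i++) {
                $j = (abs(crc32($key)) + $i) % $this->size;
                if ($this->slots[$j] === self::NEVER_USED) {
                    return;
                }
                if (is_array($this->slots[$j]) && $this->slots[$j][0] === $key) {
                    $this->slots[$j] = self::DELETED;  // tombstone, not NEVER_USED
                    return;
                }
            }
        }
    }

With this sketch, inserting A, then a colliding B, then deleting A still leaves B findable, because delete() writes a tombstone; if it wrote NEVER_USED instead, a lookup of B would stop at that slot and report B missing. As deletions accumulate, probe sequences lengthen, which is exactly the degradation the periodic rebuild avoids.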

rebuildUrlTable()

rebuildUrlTable() 

Since offsets are integers, even if the queue is kept relatively small, periodically we will need to rebuild the archive for storing urls.

emptyRobotData()

emptyRobotData() : string

Delete the Bloom filters used to store robots.txt file info.

Then construct empty new ones. This is called roughly once a day so that robots files will be reloaded and so the policies used won't be too old.

Returns

string —

a message describing what happened during the empty process

emptyDNSCache()

emptyDNSCache() : string

Delete the Hash table used to store DNS lookup info.

Then construct an empty new one. This is called roughly once a day, at the same time as the robot data is emptied.

Returns

string —

a message describing what happened during the empty process

emptyUrlFilter()

emptyUrlFilter() 

Empty the crawled url filter for this web queue bundle; resets the timestamp of the last time this filter was emptied.

notify()

notify(integer  $index, array  $data) 

Callback which is called when an item in the priority queue changes position. The position is updated in the hash table.

The priority queue stores (hash of url, weight). The hash table stores (hash of url, web_archive offset to url, index into the priority queue). This method actually buffers changes in $this->notify_buffer; they are applied when notifyFlush is called or when the buffer reaches self::NOTIFY_BUFFER_SIZE.

Parameters

integer $index

new index in priority queue

array $data

(hash url, weight)

notifyFlush()

notifyFlush() 

Used to flush to the hash table the changes in hash_url indexes caused by adjusting weights in the bundle's priority queue.
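
In normal operation notify() is invoked by the priority queue itself, so client code mostly just needs notifyFlush. A hedged sketch (the meaning of adjustQueueWeight's $flush parameter is assumed here, not documented above):

    <?php
    // Make many weight adjustments, assuming $flush = false defers the hash
    // table updates to the notify buffer, then apply them all at once.
    $weight_changes = ["https://www.example.com/a" => 0.25,
        "https://www.example.com/b" => 0.75];  // illustrative data
    foreach ($weight_changes as $url => $delta) {
        $queue_bundle->adjustQueueWeight($url, $delta, false);
    }
    $queue_bundle->notifyFlush();  // write buffered (hash_url => index) moves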