\seekquarry\yioop\library\archive_bundle_iterators\ArchiveBundleIterator

Abstract class used to model iterating over documents indexed in a WebArchiveBundle or a set of such bundles.

Summary

Public methods: saveCheckpoint(), restoreCheckpoint(), seekPage(), weight(), nextPages(), reset()
Public properties: $iterate_timestamp, $result_timestamp, $end_of_iterator, $result_dir
Constants: none

No protected or private methods or properties.

Properties

$iterate_timestamp

$iterate_timestamp : integer

Timestamp of the archive that is being iterated over

Type

integer

$result_timestamp

$result_timestamp : integer

Timestamp of the archive that is being used to store results in

Type

integer

$end_of_iterator

$end_of_iterator : boolean

Whether or not the iterator still has more documents

Type

boolean

$result_dir

$result_dir : string

The path to the directory where the iteration status is stored.

Type

string

Methods

saveCheckpoint()

saveCheckpoint(array $info = array())

Stores the current progress in the file iterate_status.txt in the result directory, so that a new instance of the iterator can be constructed and return the next set of pages without reprocessing all of the pages that came before. Each iterator should call saveCheckpoint after extracting a batch of pages.

Parameters

array $info

any extra info a subclass wants to save

restoreCheckpoint()

restoreCheckpoint() : array

Restores the internal state from the file iterate_status.txt in the result directory, so that the next call to nextPages picks up just after the last checkpoint. Each iterator should call restoreCheckpoint at the end of its constructor, after the instance members have been initialized.

Returns

array —

the data serialized when saveCheckpoint was called
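
The checkpoint round trip described for saveCheckpoint and restoreCheckpoint can be sketched as follows. This is a minimal, self-contained illustration and not the Yioop implementation. Only the method names, the $result_dir and $end_of_iterator properties, and the file name iterate_status.txt come from this page. The class name CheckpointSketch, the $position field, and the use of PHP's serialize()/unserialize() as the file format are assumptions made for the example.

```php
<?php
// Hypothetical stand-in for an archive iterator; not part of Yioop.
class CheckpointSketch
{
    public $result_dir;
    public $end_of_iterator = false;
    public $position = 0; // assumed progress marker for this sketch

    public function __construct($result_dir)
    {
        $this->result_dir = $result_dir;
        // Restore at the end of the constructor, after the instance
        // members have been initialized.
        $this->restoreCheckpoint();
    }

    public function saveCheckpoint($info = [])
    {
        // Merge the iterator's own state with any extra subclass data.
        $info['end_of_iterator'] = $this->end_of_iterator;
        $info['position'] = $this->position;
        file_put_contents($this->result_dir . "/iterate_status.txt",
            serialize($info));
    }

    public function restoreCheckpoint()
    {
        $file = $this->result_dir . "/iterate_status.txt";
        if (!file_exists($file)) {
            return []; // nothing saved yet; start from the beginning
        }
        $info = unserialize(file_get_contents($file));
        $this->end_of_iterator = $info['end_of_iterator'];
        $this->position = $info['position'];
        return $info;
    }
}

// Usage: a second instance resumes where the first one stopped.
$dir = sys_get_temp_dir() . "/checkpoint_sketch_" . uniqid();
mkdir($dir);
$first = new CheckpointSketch($dir);
$first->position = 50; // pretend a batch of 50 pages was extracted
$first->saveCheckpoint(['note' => 'extra subclass data']);
$second = new CheckpointSketch($dir);
echo $second->position, "\n";
```

In this sketch the second instance starts at position 50 without touching the earlier pages, and the array returned by restoreCheckpoint still carries the extra 'note' entry that was passed to saveCheckpoint.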

seekPage()

seekPage($limit)

Advances the iterator to page number $limit with as little additional processing as possible.

Parameters

$limit

page to advance to
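
One simple way to seek cheaply is to rewind and then skip pages with processing turned off. The sketch below assumes that strategy. It also assumes that $limit counts the pages to skip, so that the next page returned has zero-based index $limit. The SeekSketch class, its in-memory page list, and the upper-casing "processing" step are hypothetical and are not Yioop code.

```php
<?php
// Hypothetical in-memory iterator used only to illustrate seeking.
class SeekSketch
{
    public $end_of_iterator = false;
    private $pages;
    private $position = 0;

    public function __construct(array $pages)
    {
        $this->pages = $pages;
    }

    public function reset()
    {
        $this->position = 0;
        $this->end_of_iterator = false;
    }

    public function nextPages($num, $no_process = false)
    {
        $batch = array_slice($this->pages, $this->position, $num);
        $this->position += count($batch);
        $this->end_of_iterator = $this->position >= count($this->pages);
        if ($no_process) {
            return $batch; // raw records, no per-page work
        }
        // Stand-in for real page processing (assumed for the example).
        return array_map(function ($page) {
            $page['title'] = strtoupper($page['title']);
            return $page;
        }, $batch);
    }

    public function seekPage($limit)
    {
        // Assumed strategy: rewind, then skip $limit pages unprocessed.
        $this->reset();
        $this->nextPages($limit, true);
    }
}

$iterator = new SeekSketch([
    ['title' => 'a'], ['title' => 'b'], ['title' => 'c'], ['title' => 'd'],
]);
$iterator->seekPage(2);
$pages = $iterator->nextPages(1);
echo $pages[0]['title'], "\n";
```

A concrete iterator that can jump directly to an offset in its archive would avoid reading the skipped pages at all; the skip-with-$no_process approach is only the fallback shown here.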

weight()

weight($site) : mixed

Estimates the importance of the site according to the weighting of the particular archive iterator

Parameters

$site

an associative array containing info about a web page

Returns

mixed —

a 4-bit number, or false if the iterator doesn't use the default ranking method

nextPages()

nextPages(integer $num, boolean $no_process = false) : array

Gets the next $num documents from the iterator

Parameters

integer $num

number of docs to get

boolean $no_process

do not do any processing on page data

Returns

array —

associative arrays for $num pages
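
A typical consumer reads batches with nextPages until $end_of_iterator becomes true. The sketch below illustrates that loop with a hypothetical array-backed class, BatchSketch, which only mimics the documented method and property names. The 'text' and 'word_count' keys and the word-counting step are assumptions for the example, not fields defined by Yioop.

```php
<?php
// Hypothetical array-backed iterator; not part of Yioop.
class BatchSketch
{
    public $end_of_iterator = false;
    private $docs;
    private $offset = 0;

    public function __construct(array $docs)
    {
        $this->docs = $docs;
        $this->end_of_iterator = count($docs) === 0;
    }

    public function nextPages($num, $no_process = false)
    {
        // The final batch may hold fewer than $num documents.
        $batch = array_slice($this->docs, $this->offset, $num);
        $this->offset += count($batch);
        $this->end_of_iterator = $this->offset >= count($this->docs);
        if ($no_process) {
            return $batch; // skip all per-page work
        }
        foreach ($batch as &$doc) {
            // Assumed processing step for illustration only.
            $doc['word_count'] = str_word_count($doc['text']);
        }
        unset($doc);
        return $batch;
    }
}

$iterator = new BatchSketch([
    ['text' => 'one two'], ['text' => 'three'], ['text' => 'four five six'],
]);
$batch_sizes = [];
while (!$iterator->end_of_iterator) {
    $pages = $iterator->nextPages(2);
    $batch_sizes[] = count($pages);
}
echo implode(",", $batch_sizes), "\n";
```

In a real iterator, the body of this loop is also the natural place to call saveCheckpoint after each batch.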

reset()

reset() 

Resets the iterator to the start of the archive bundle