\seekquarry\yioop\library\archive_bundle_iterators\WebArchiveBundleIterator

Class used to model iterating over the documents indexed in a WebArchiveBundle. This is typically done in order to re-index those documents.

Summary

Methods: saveCheckpoint(), restoreCheckpoint(), seekPage(), weight(), nextPages(), reset(), getArchiveName(), __construct()

Properties: $iterate_timestamp, $result_timestamp, $end_of_iterator, $result_dir, $num_partitions, $partition, $partition_index, $current_partition_num, $overall_index, $count, $archive, $fetcher_prefix

Constants: none

All methods and properties listed are public. The class has no protected or private methods or properties.

Properties

$iterate_timestamp

$iterate_timestamp : integer

Timestamp of the archive that is being iterated over

Type

integer

$result_timestamp

$result_timestamp : integer

Timestamp of the archive in which results are stored

Type

integer

$end_of_iterator

$end_of_iterator : boolean

Whether or not the iterator has reached the end of the archive bundle, that is, whether it has no more documents to return

Type

boolean

$result_dir

$result_dir : string

The path to the directory where the iteration status is stored.

Type

string

$num_partitions

$num_partitions : integer

Number of web archive objects in this web archive bundle

Type

integer

$partition

$partition : integer

The current web archive in the bundle that is being iterated over

Type

integer

$partition_index

$partition_index : integer

The item within the current partition to be returned next

Type

integer

$current_partition_num

$current_partition_num : integer

Index of web archive in the web archive bundle that the iterator is currently getting results from

Type

integer

$overall_index

$overall_index : integer

Current position of the iterator within the bundle, between 0 and $this->count

Type

integer

$count

$count : integer

Number of documents in the web archive bundle being iterated over

Type

integer

$archive

$archive : object

The web archive bundle being iterated over

Type

object

$fetcher_prefix

$fetcher_prefix : string

The fetcher prefix associated with this archive.

Type

string

Methods

saveCheckpoint()

saveCheckpoint(array  $info = array()) 

Saves the current state so that a new instantiation can pick up just after the last batch of pages extracted.

Parameters

array $info

data needed to restore the iterator's position in the archive.

restoreCheckpoint()

restoreCheckpoint() : array

Restores state from a previous instantiation, after the last batch of pages extracted.

Returns

array —

the data serialized when saveCheckpoint was called
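
The round trip between saveCheckpoint() and restoreCheckpoint() can be illustrated with a minimal sketch. This is not the library's implementation: the class name CheckpointSketch, the file name iterate_status.txt, and the choice of which fields to store are assumptions made for illustration only.

```php
<?php
// Hypothetical stand-in, not the Yioop class: it only illustrates the
// save/restore round trip described above.
class CheckpointSketch
{
    /** @var string directory where the iteration status is stored */
    public $result_dir;
    /** @var int position of the iterator within the bundle */
    public $overall_index = 0;
    /** @var bool true once no documents remain */
    public $end_of_iterator = false;

    public function __construct(string $result_dir)
    {
        $this->result_dir = $result_dir;
    }

    // Assumed file name; the real status file may be named differently.
    private function statusFile(): string
    {
        return $this->result_dir . "/iterate_status.txt";
    }

    public function saveCheckpoint(array $info = []): void
    {
        // Record iterator fields alongside any caller-supplied data.
        $info['overall_index'] = $this->overall_index;
        $info['end_of_iterator'] = $this->end_of_iterator;
        file_put_contents($this->statusFile(), serialize($info));
    }

    public function restoreCheckpoint(): array
    {
        if (!is_file($this->statusFile())) {
            return []; // nothing saved yet, so start from the beginning
        }
        $info = unserialize(file_get_contents($this->statusFile()));
        $this->overall_index = $info['overall_index'];
        $this->end_of_iterator = $info['end_of_iterator'];
        return $info;
    }
}
```

A second instance constructed with the same directory and asked to restoreCheckpoint() would resume at the saved position, which mirrors the "new instantiation" wording above.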

seekPage()

seekPage(  $limit) 

Advances the iterator to page number $limit, with as little additional processing as possible

Parameters

$limit

page to advance to
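
One way to seek with little processing is to skip whole partitions using their page counts rather than reading each page. The sketch below shows that idea only; locatePage() and its $partition_counts argument are hypothetical and not part of the class, and the actual method may work differently.

```php
<?php
/**
 * Hypothetical helper: maps an overall page index to a partition number and
 * an offset inside that partition, given the page count of each partition.
 */
function locatePage(array $partition_counts, int $limit): array
{
    $partition = 0;
    $remaining = $limit;
    // Skip whole partitions without reading any of their pages.
    while ($partition < count($partition_counts)
        && $remaining >= $partition_counts[$partition]) {
        $remaining -= $partition_counts[$partition];
        $partition++;
    }
    // Running past the last partition means there is nothing left to read.
    $end = $partition >= count($partition_counts);
    return [
        'partition' => $partition,
        'partition_index' => $end ? 0 : $remaining,
        'end_of_iterator' => $end,
    ];
}

// Three partitions holding 100, 50 and 75 pages; seek to page 120.
// This lands in partition 1 at partition_index 20.
$where = locatePage([100, 50, 75], 120);
```

The returned keys correspond to the $partition, $partition_index and $end_of_iterator properties documented above.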

weight()

weight(  $site) : boolean

Estimates the importance of the site according to the weighting of the particular archive iterator

Parameters

$site

an associative array containing info about a web page

Returns

boolean —

false; files are assumed to have been crawled roughly in order of page importance, so the default estimate of doc rank is used

nextPages()

nextPages(integer  $num, boolean  $no_process = false) : array

Gets the next $num many docs from the iterator

Parameters

integer $num

number of docs to get

boolean $no_process

this flag is inherited from base class but does not do anything in this case

Returns

array —

associative arrays for $num pages
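
A typical caller reads pages in batches until the iterator reports that it is finished. The sketch below shows that loop against a hypothetical in-memory stand-in (ArrayIteratorSketch), since the real class needs an archive on disk. The 'url' key is also an assumption; the actual field names in each page array may differ.

```php
<?php
// Hypothetical in-memory stand-in exposing the two members used by drain().
class ArrayIteratorSketch
{
    public $end_of_iterator = false;
    private $pages;
    private $overall_index = 0;

    public function __construct(array $pages)
    {
        $this->pages = $pages;
        $this->end_of_iterator = count($pages) === 0;
    }

    public function nextPages(int $num, bool $no_process = false): array
    {
        // Return at most $num pages and advance the position accordingly.
        $batch = array_slice($this->pages, $this->overall_index, $num);
        $this->overall_index += count($batch);
        $this->end_of_iterator = $this->overall_index >= count($this->pages);
        return $batch;
    }
}

// Reads every page in batches of $batch_size and collects the URLs seen.
function drain($iterator, int $batch_size): array
{
    $seen = [];
    while (!$iterator->end_of_iterator) {
        foreach ($iterator->nextPages($batch_size) as $site) {
            $seen[] = $site['url'];
        }
    }
    return $seen;
}
```

Note that the final batch may hold fewer than $num pages, so callers should not rely on the batch size being exact.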

reset()

reset() 

Resets the iterator to the start of the archive bundle

getArchiveName()

getArchiveName(string  $timestamp) : string

Returns the path to an archive given its timestamp.

Parameters

string $timestamp

the archive timestamp

Returns

string —

the path to the archive, based on the fetcher prefix used when this iterator was constructed

__construct()

__construct(string  $prefix, string  $iterate_timestamp, string  $result_timestamp) 

Creates a web archive iterator with the given parameters.

Parameters

string $prefix

fetcher number this bundle is associated with

string $iterate_timestamp

timestamp of the web archive bundle whose pages are to be iterated over

string $result_timestamp

timestamp of the web archive bundle in which results are stored