\seekquarry\yioop\library\archive_bundle_iterators\MixArchiveBundleIterator

Used to do an archive crawl based on the results of a crawl mix.

The query terms for this crawl mix will have site:any raw 1 appended to them.

Summary

Methods

saveCheckpoint()
restoreCheckpoint()
seekPage()
weight()
nextPages()
reset()
__construct()
getArchiveName()

Properties

$iterate_timestamp
$result_timestamp
$end_of_iterator
$result_dir
$mix_timestamp
$limit

Constants

No constants found

No protected or private methods, properties, or constants found

Properties

$iterate_timestamp

$iterate_timestamp : integer

Timestamp of the archive that is being iterated over

Type

integer

$result_timestamp

$result_timestamp : integer

Used to hold the timestamp of the index archive bundle in which output results are stored

Type

integer

$end_of_iterator

$end_of_iterator : boolean

Whether or not the end of the iterator has been reached (no more documents remain)

Type

boolean

$result_dir

$result_dir : string

The path to the directory where the iteration status is stored.

Type

string

$mix_timestamp

$mix_timestamp : integer

Used to hold the timestamp of the crawl mix being iterated over

Type

integer

$limit

$limit : integer

Count of how far into the crawl mix the iteration has gone.

Type

integer

Methods

saveCheckpoint()

saveCheckpoint(array  $info = array()) 

Saves the current state so that a new instantiation can pick up just after the last batch of pages extracted.

Parameters

array $info

data needed to restore where we are in the process of iterating through the archive. By default, the fields LIMIT and END_OF_ITERATOR are saved

restoreCheckpoint()

restoreCheckpoint() : array

Restores the state saved by a previous instantiation, so that iteration resumes just after the last batch of pages extracted.

Returns

array —

the data serialized when saveCheckpoint was called
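
A minimal sketch of using the two checkpoint methods together; $iterator is assumed to be a MixArchiveBundleIterator constructed over the same crawl mix and result timestamps in both runs, and the exact keys of the returned array are not guaranteed here.

    // At the end of a run, persist where iteration stopped
    // (by default the LIMIT and END_OF_ITERATOR fields).
    $iterator->saveCheckpoint();

    // In a later run, after constructing a fresh iterator over the same
    // mix/result timestamps, resume from the saved position.
    $info = $iterator->restoreCheckpoint();
    // $info holds whatever saveCheckpoint() serialized; key names may differ.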

seekPage()

seekPage(  $limit) 

Advances the iterator to the $limit page, with as little additional processing as possible

Parameters

$limit

page to advance to
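
For instance, a resumed crawl might skip pages that an earlier run already handled (sketch only; the page count and the $iterator variable are illustrative):

    // Skip past the first 200 pages with as little processing as possible.
    $iterator->seekPage(200);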

weight()

weight(  $site) : boolean

Estimates the importance of the site according to the weighting of the particular archive iterator

Parameters

$site

an associative array containing info about a web page

Returns

boolean —

false; we assume files were crawled roughly in order of page importance, so the default estimate of doc rank is used

nextPages()

nextPages(integer  $num, boolean  $no_process = false) : array

Gets the next $num many docs from the iterator

Parameters

integer $num

number of docs to get

boolean $no_process

this flag is inherited from the base class but does not do anything in this case

Returns

array —

associative arrays for $num pages
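
A sketch of a typical fetch loop; the batch size and variable names are illustrative rather than taken from the Yioop source:

    // $iterator is assumed to be an already constructed MixArchiveBundleIterator.
    while (!$iterator->end_of_iterator) {
        $pages = $iterator->nextPages(50);
        foreach ($pages as $page) {
            // each $page is an associative array of info about one document
        }
    }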

reset()

reset() 

Resets the iterator to the start of the archive bundle

__construct()

__construct(string  $mix_timestamp, string  $result_timestamp) 

Creates a web archive iterator with the given parameters.

Parameters

string $mix_timestamp

timestamp of the crawl mix whose pages are to be iterated over

string $result_timestamp

timestamp of the web archive bundle that results are being stored in
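
A construction sketch; the namespace is the one shown at the top of this page, while the two timestamp values are hypothetical placeholders for an existing crawl mix and an existing result bundle:

    use seekquarry\yioop\library\archive_bundle_iterators\MixArchiveBundleIterator;

    $mix_timestamp = "1378012003";    // hypothetical crawl mix timestamp
    $result_timestamp = "1378098765"; // hypothetical result bundle timestamp
    $iterator = new MixArchiveBundleIterator($mix_timestamp, $result_timestamp);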

getArchiveName()

getArchiveName(integer  $timestamp) 

Gets the filename of the file that records information about the current archive iterator (such as whether the end of the iterator has been reached)

Parameters

integer $timestamp

timestamp of the current archive crawl
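
For example (sketch only; the timestamp is a hypothetical value and $iterator an existing instance):

    // Filename of the file recording this iterator's status, such as
    // whether the end of the iterator has been reached.
    $status_file = $iterator->getArchiveName(1378012003);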