$iterate_timestamp
$iterate_timestamp : integer
Timestamp of the archive that is being iterated over
Used to do an archive crawl based on the results of a crawl mix. The query terms for this crawl mix will have site:any raw 1 appended to them.
saveCheckpoint(array $info = array())
Saves the current state so that a new instantiation can pick up just after the last batch of pages was extracted.
array | $info | data needed to restore where we are in the process of iterating through the archive. By default, the fields LIMIT and END_OF_ITERATOR are saved |
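As a sketch, a checkpoint call might pass an $info array carrying the two fields named above. The LIMIT and END_OF_ITERATOR keys come from this documentation; the values and the $iterator variable are illustrative placeholders, not taken from the source.

```php
// Illustrative checkpoint data for saveCheckpoint(). The LIMIT and
// END_OF_ITERATOR field names are documented above; the values and the
// $iterator object are hypothetical.
$info = array(
    "LIMIT" => 200,             // e.g., offset of the next page to extract
    "END_OF_ITERATOR" => false  // whether iteration has finished
);
$iterator->saveCheckpoint($info);
```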
weight( $site) : boolean
Estimates the importance of the site according to the weighting of the particular archive iterator
$site | an associative array containing info about a web page |
false. We assume files were crawled roughly in order of page importance, so the default estimate of document rank is used
nextPages(integer $num, boolean $no_process = false) : array
Gets the next $num docs from the iterator
integer | $num | number of docs to get |
boolean | $no_process | this flag is inherited from the base class but has no effect in this case |
an array of associative arrays, one for each of the next $num pages
__construct(string $mix_timestamp, string $result_timestamp)
Creates a web archive iterator with the given parameters.
string | $mix_timestamp | timestamp of the crawl mix to iterate over the pages of |
string | $result_timestamp | timestamp of the web archive bundle that the results are being stored in |
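A hypothetical usage sketch tying the pieces together: the class name MixArchiveBundleIterator and the timestamp values are assumptions for illustration; the constructor signature, nextPages(), and saveCheckpoint() are documented above.

```php
// Hypothetical usage sketch. The class name and timestamp values are
// placeholders; the methods called are those documented above.
$mix_timestamp = "1369754947";    // crawl mix to iterate over the pages of
$result_timestamp = "1369754948"; // web archive bundle results are stored in
$iterator = new MixArchiveBundleIterator($mix_timestamp, $result_timestamp);

// Get the next 100 docs as associative arrays of page info.
$pages = $iterator->nextPages(100);

// Persist iteration state (by default, LIMIT and END_OF_ITERATOR) so a
// new instantiation can resume just after this batch.
$iterator->saveCheckpoint();
```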