$iterate_timestamp
$iterate_timestamp : integer
Timestamp of the archive that is being iterated over
Used to iterate through the records that result from an SQL query to a database
saveCheckpoint(array $info = array())
Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.
array | $info | any extra info a subclass wants to save |
restoreCheckpoint() : array
Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Each iterator should make a call to restoreCheckpoint at the end of the constructor method after the instance members have been initialized.
the data serialized when saveCheckpoint was called
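The save/restore round trip described above can be sketched as follows. This is a minimal illustration, not Yioop's implementation: the class name CheckpointSketch, the $position member, and the use of serialize() for the file contents are assumptions made for the example. Only the file name iterate_status.txt, the result dir, and the call order (restore at the end of the constructor, save after each batch) come from the descriptions above.

```php
<?php
// Hypothetical stand-in for an archive iterator; not Yioop's actual class.
class CheckpointSketch
{
    public $result_dir;
    public $position = 0; // hypothetical progress marker

    public function __construct($result_dir)
    {
        $this->result_dir = $result_dir;
        // Restore last, after the instance members have been initialized.
        $this->restoreCheckpoint();
    }

    // Writes the progress marker plus any extra subclass info to disk.
    public function saveCheckpoint(array $info = array())
    {
        $info['position'] = $this->position;
        file_put_contents($this->result_dir . "/iterate_status.txt",
            serialize($info));
    }

    // Reloads the saved state; returns the data that was serialized,
    // or an empty array when no usable checkpoint exists.
    public function restoreCheckpoint()
    {
        $file = $this->result_dir . "/iterate_status.txt";
        if (!file_exists($file)) {
            return array();
        }
        $info = unserialize(file_get_contents($file));
        if (!is_array($info)) {
            return array();
        }
        if (isset($info['position'])) {
            $this->position = $info['position'];
        }
        return $info;
    }
}

// Usage: a second instance resumes where the first one stopped.
$dir = sys_get_temp_dir() . "/checkpoint_sketch_" . uniqid();
mkdir($dir);
$first = new CheckpointSketch($dir);
$first->position = 40; // pretend 40 pages were extracted
$first->saveCheckpoint(array('note' => 'extra subclass info'));
$second = new CheckpointSketch($dir);
$restored = $second->restoreCheckpoint();
```

After this runs, $second->position holds the value saved by $first, and $restored contains both the position and the extra 'note' entry passed to saveCheckpoint.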
weight( $site) : boolean
Estimates the importance of the site according to the weighting of the particular archive iterator
$site | an associative array containing info about a web page |
false. Arc files are assumed to have been crawled according to OPIC, so the default doc_depth is used to estimate page importance.
nextPages(integer $num, boolean $no_process = false) : array
Gets at most the next $num docs from the iterator. It might return fewer than $num documents if the partition changes or the end of the bundle is reached.
integer | $num | number of docs to get |
boolean | $no_process | do not do any processing on page data |
associative arrays for $num pages
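A caller typically drains an iterator in batches. The sketch below shows that calling pattern with a hypothetical in-memory class, RowIteratorSketch, standing in for the database iterator; a real one would fetch rows from an SQL query. The 'processed' field and the assumption that an empty array signals the end of the data are illustrative only and are not confirmed by the documentation above.

```php
<?php
// Hypothetical in-memory iterator used only to illustrate the calling
// pattern of nextPages; it is not part of Yioop.
class RowIteratorSketch
{
    private $rows;
    private $offset = 0;

    public function __construct(array $rows)
    {
        $this->rows = $rows;
    }

    // Returns at most $num rows as associative arrays, and fewer when
    // the rows run out.
    public function nextPages($num, $no_process = false)
    {
        $batch = array_slice($this->rows, $this->offset, $num);
        $this->offset += count($batch);
        if ($no_process) {
            return $batch; // raw rows, no processing applied
        }
        // Assumed stand-in for page processing: mark each row.
        foreach ($batch as $i => $row) {
            $batch[$i]['processed'] = true;
        }
        return $batch;
    }
}

$rows = array();
for ($i = 1; $i <= 5; $i++) {
    $rows[] = array('id' => $i, 'text' => "record $i");
}
$iterator = new RowIteratorSketch($rows);
$batch_sizes = array();
// Assumption: an empty array means there is nothing left to read.
while (($pages = $iterator->nextPages(2)) !== array()) {
    $batch_sizes[] = count($pages);
    // A real iterator would call saveCheckpoint() here, after each batch.
}
```

With five rows and $num set to 2, the loop sees batches of 2, 2, and 1 rows, which is the "fewer than $num" case at the end of the data.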
__construct(string $iterate_timestamp, string $iterate_dir, string $result_timestamp, string $result_dir)
Creates a database archive iterator with the given parameters. This kind of iterator is used to cycle through the results of an SQL query to a database, so that the results might be indexed by Yioop.
string | $iterate_timestamp | timestamp of the arc archive bundle whose pages are iterated over |
string | $iterate_dir | folder of files to iterate over |
string | $result_timestamp | timestamp of the arc archive bundle results are being stored in |
string | $result_dir | where to write last position checkpoints to |