BUFFER_SIZE
How many bytes at a time should be read from the current archive file into the buffer file. 8192 = BZip2BlockIterator::BLOCK_SIZE
Used to iterate through the records of a collection of text or compressed text-oriented records
MAX_RECORD_SIZE
Estimate of the maximum size of a record stored in a text archive. Data in archives is split into chunks of buffer size plus two record sizes. This provides a two-record overlap between successive chunks, which in turn ensures that records crossing the basic chunk boundary of BUFFER_SIZE will still be processed.
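The overlap arithmetic described above can be sketched as follows. This is only an illustration under stated assumptions: the two constants use small made-up values, and chunkWindows is a hypothetical helper, not part of the library.

```php
<?php
// Hypothetical sizes for illustration only; the real constants live on the iterator class.
const SKETCH_BUFFER_SIZE = 32;
const SKETCH_MAX_RECORD_SIZE = 8;

/**
 * Returns [offset, length] windows covering $total_bytes bytes. Each window
 * starts SKETCH_BUFFER_SIZE bytes after the previous one but is up to
 * SKETCH_BUFFER_SIZE + 2 * SKETCH_MAX_RECORD_SIZE bytes long, so successive
 * windows overlap by two record sizes.
 */
function chunkWindows(int $total_bytes): array
{
    $window = SKETCH_BUFFER_SIZE + 2 * SKETCH_MAX_RECORD_SIZE;
    $windows = [];
    for ($offset = 0; $offset < $total_bytes; $offset += SKETCH_BUFFER_SIZE) {
        // The last windows are clipped to the end of the data.
        $windows[] = [$offset, min($window, $total_bytes - $offset)];
    }
    return $windows;
}

$windows = chunkWindows(100);
```

With these sizes, a record that starts shortly before byte 32 and runs past it is still fully contained in the first window, which is the purpose of the overlap.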
saveCheckpoint(array $info = array())
Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.
array | $info | any extra info a subclass wants to save |
restoreCheckpoint() : array
Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Each iterator should make a call to restoreCheckpoint at the end of the constructor method after the instance members have been initialized.
the data serialized when saveCheckpoint was called
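A minimal sketch of the checkpoint round trip described above, assuming the state is a plain array written with serialize(). The function names and the state keys (current_partition_num, current_offset) are hypothetical stand-ins, not the library's own members.

```php
<?php
// Hypothetical stand-in for saving progress; not the library's code.
function saveCheckpointSketch(string $result_dir, array $state, array $info = []): void
{
    // Extra subclass info is merged with the iterator's own state.
    $data = array_merge($info, $state);
    file_put_contents($result_dir . "/iterate_status.txt", serialize($data));
}

// Hypothetical stand-in for restoring progress in a fresh instance.
function restoreCheckpointSketch(string $result_dir): array
{
    $file = $result_dir . "/iterate_status.txt";
    if (!file_exists($file)) {
        return []; // no checkpoint yet: start from the beginning
    }
    $data = unserialize(file_get_contents($file));
    return is_array($data) ? $data : [];
}

$dir = sys_get_temp_dir() . "/checkpoint_sketch_" . uniqid();
mkdir($dir);
saveCheckpointSketch(
    $dir,
    ["current_partition_num" => 3, "current_offset" => 4096],
    ["extra" => "subclass data"]
);
$restored = restoreCheckpointSketch($dir);
```

Saving after each batch keeps the amount of work repeated after a restart to at most one batch.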
weight( $site) : boolean
Estimates the importance of the site according to the weighting of the particular archive iterator
$site | an associative array containing info about a web page |
false; arc files are assumed to have been crawled according to OPIC, so the default doc_depth is used to estimate page importance
nextPages(integer $num, boolean $no_process = false) : array
Gets at most the next $num docs from the iterator. It may return fewer than $num documents if the partition changes or the end of the bundle is reached.
integer | $num | number of docs to get |
boolean | $no_process | if true, return just an array of the page strings found, without any additional metadata |
associative arrays for $num pages
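The "at most $num" behavior can be illustrated with a small in-memory model. This is a sketch under assumptions: partitions are modeled as plain arrays, and nextPagesSketch with its by-reference position arguments is hypothetical, not the method's real implementation.

```php
<?php
// Hypothetical in-memory model: each inner array is one partition of docs.
function nextPagesSketch(array $partitions, int &$partition, int &$offset, int $num): array
{
    if ($partition >= count($partitions)) {
        return []; // end of the bundle
    }
    $pages = array_slice($partitions[$partition], $offset, $num);
    $offset += count($pages);
    if ($offset >= count($partitions[$partition])) {
        // Partition exhausted: move on without topping up the batch from the next one.
        $partition++;
        $offset = 0;
    }
    return $pages;
}

$partitions = [["a", "b", "c"], ["d", "e"]];
$partition = 0;
$offset = 0;
$first = nextPagesSketch($partitions, $partition, $offset, 2);
$second = nextPagesSketch($partitions, $partition, $offset, 2);
$third = nextPagesSketch($partitions, $partition, $offset, 2);
$fourth = nextPagesSketch($partitions, $partition, $offset, 2);
```

Callers should therefore treat a short batch as normal and rely on an empty result, not a short one, to detect the end of the bundle in this model.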
__construct(string $iterate_timestamp, string $iterate_dir, string $result_timestamp, string $result_dir, array $ini = array())
Creates a text archive iterator with the given parameters.
string | $iterate_timestamp | timestamp of the arc archive bundle to iterate over the pages of |
string | $iterate_dir | folder of files to iterate over. If this iterator is used in a fetcher and the data is on a name server, set this to false |
string | $result_timestamp | timestamp of the arc archive bundle results are being stored in |
string | $result_dir | where to write last position checkpoints to |
array | $ini | describes start_ and end_delimiter, file_extension, encoding, and compression method used for pages in this archive |
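For orientation, an $ini array might look roughly like the sketch below. The key names follow the wording of the parameter description but are assumptions, as are all of the values; check the archive's own configuration for the exact spelling the library expects.

```php
<?php
// Hypothetical $ini contents; key names and values are assumptions for illustration.
$ini = [
    "start_delimiter" => "<page>",   // marks where a record begins
    "end_delimiter" => "</page>",    // marks where a record ends
    "file_extension" => "xml",       // which files in the folder to read
    "encoding" => "UTF-8",           // character encoding of the pages
    "compression" => "plain",        // compression method of the archive files
];

// Report any of the described settings that were left out.
$expected = ["start_delimiter", "end_delimiter", "file_extension", "encoding", "compression"];
$missing = array_values(array_diff($expected, array_keys($ini)));
```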
nextChunk() : array
Called to get the next chunk of BUFFER_SIZE + 2 * MAX_RECORD_SIZE bytes of data from the text archive. This data is returned unprocessed in self::ARC_DATA together with ini and header information about the archive. This method is typically called in the name server setting from FetchController.
with contents as described above
updatePartition(array &$info)
Helper function for nextChunk to advance the partition if we are at the end of the current archive file
array & | $info | a struct with data about the current chunk. The start partition flag in it will be updated |
updateBuffer(string $buffer = "", boolean $return_string = false) : boolean
If reading from a gzbuffer file goes off the end of the current buffer, reads in the next block from archive file.
string | $buffer | |
boolean | $return_string |
whether successfully read in next block or not
saveCheckPoint(array $info = array())
Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.
array | $info | any extra info a subclass wants to save |
restoreCheckPoint() : array
Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. The text archive bundle iterator takes the unserialized data from the last checkpoint and calls the compression-specific restore checkpoint to further set up the iterator according to the given compression scheme.
the data serialized when saveCheckpoint was called
getNextTagsData(array $tags) : array
Used to extract the data between the open and close tags of the first tag found amongst the array of tags $tags. After the operation, $this->buffer holds the contents that follow the close tag.
array | $tags | array of tagnames to look for |
of two elements: the first element is a string consisting of start tag contents close tag of first tag found, the second has the name of the tag amongst $tags found
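The extraction described above can be sketched on a plain string buffer. This is a simplified stand-in, not the method itself: nextTagsDataSketch is a hypothetical name, the buffer is passed by reference instead of living in $this->buffer, and tag matching is a plain prefix search.

```php
<?php
/**
 * Hypothetical stand-in: finds whichever tag in $tags opens first in $buffer,
 * returns [text from its open tag through its close tag, tag name], and
 * leaves $buffer holding whatever follows the close tag.
 * Simplification: "<page" would also match a longer name such as "<pages".
 */
function nextTagsDataSketch(string &$buffer, array $tags): ?array
{
    $best_pos = false;
    $best_tag = null;
    foreach ($tags as $tag) {
        $pos = strpos($buffer, "<" . $tag);
        if ($pos !== false && ($best_pos === false || $pos < $best_pos)) {
            $best_pos = $pos;
            $best_tag = $tag;
        }
    }
    if ($best_tag === null) {
        return null; // none of the tags occur in the buffer
    }
    $close = "</" . $best_tag . ">";
    $end = strpos($buffer, $close, $best_pos);
    if ($end === false) {
        return null; // record is not complete in this buffer
    }
    $end += strlen($close);
    $data = substr($buffer, $best_pos, $end - $best_pos);
    $buffer = substr($buffer, $end);
    return [$data, $best_tag];
}

$buffer = "junk<page><title>A</title></page><redirect>B</redirect>tail";
$result = nextTagsDataSketch($buffer, ["redirect", "page"]);
```

Calling the function again on the same buffer would return the redirect record next, since the buffer now begins just after the first close tag.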