BUFFER_SIZE
Number of bytes read at a time from the current archive file into the buffer file. 8192 = BZip2BlockIterator::BLOCK_SIZE
Used to iterate through the records of a collection of arc files stored in a WebArchiveBundle folder. Arc is the file format of the Internet Archive: http://www.archive.org/web/researcher/ArcFileFormat.php. Iteration is for the purpose of making an index of these records.
MAX_RECORD_SIZE
Estimate of the maximum size of a record stored in a text archive. Data in archives is split into chunks of BUFFER_SIZE plus two record sizes (2 * MAX_RECORD_SIZE). This provides a two-record overlap between successive chunks and ensures that records that cross the basic chunk boundary of BUFFER_SIZE are still processed.
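The overlap scheme described for BUFFER_SIZE and MAX_RECORD_SIZE can be illustrated with a minimal sketch. It is written in Python rather than PHP, the function name `overlapping_chunks` is hypothetical, and it assumes that successive chunks start BUFFER_SIZE bytes apart; it is not the iterator's actual implementation.

```python
def overlapping_chunks(data: bytes, buffer_size: int, max_record_size: int):
    """Yield (offset, chunk) pairs. Each chunk holds buffer_size bytes plus
    2 * max_record_size extra bytes that overlap the start of the next chunk."""
    chunk_len = buffer_size + 2 * max_record_size
    offset = 0
    while offset < len(data):
        yield offset, data[offset:offset + chunk_len]
        # Assumption: advance by the basic chunk size only, so a record that
        # straddles the boundary still appears whole in one of the chunks.
        offset += buffer_size


# Tiny illustrative sizes; the real constants are far larger.
data = bytes(range(20))
chunks = list(overlapping_chunks(data, buffer_size=8, max_record_size=2))
```

With these toy sizes, each chunk shares its last four bytes (two record sizes) with the start of the following chunk.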
__construct(string $iterate_timestamp, string $iterate_dir, string $result_timestamp, string $result_dir)
Creates an arc archive iterator with the given parameters.
string | $iterate_timestamp | timestamp of the arc archive bundle to iterate over the pages of |
string | $iterate_dir | folder of files to iterate over |
string | $result_timestamp | timestamp of the arc archive bundle results are being stored in |
string | $result_dir | where to write last position checkpoints to |
nextChunk() : array
Called to get the next chunk of BUFFER_SIZE + 2 * MAX_RECORD_SIZE bytes of data from the text archive. This data is returned unprocessed in self::ARC_DATA, together with ini and header information about the archive. This method is typically called in the name server setting from FetchController.
with contents as described above
updatePartition(array &$info)
Helper function for nextChunk to advance the partition if we are at the end of the current archive file
array& | $info | a struct with data about the current chunk; its start partition flag will be updated |
updateBuffer(string $buffer = "", boolean $return_string = false) : boolean
If reading from a gzbuffer file goes off the end of the current buffer, reads in the next block from the archive file.
string | $buffer | |
boolean | $return_string |
whether successfully read in next block or not
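The refill behavior described for updateBuffer can be sketched as follows. This is a Python illustration under stated assumptions, not the method's actual logic: the function `update_buffer`, the `BLOCK_SIZE` value, and the returned (buffer, ok) pair are all hypothetical, and the role of `$return_string` is not modeled because the source does not describe it.

```python
import gzip
import io

BLOCK_SIZE = 8192  # hypothetical block size chosen for this sketch


def update_buffer(archive, buffer: bytes, block_size: int = BLOCK_SIZE):
    """Append the next decompressed block from archive to buffer.

    Returns (new_buffer, ok); ok is False once the archive is exhausted.
    """
    block = archive.read(block_size)
    if not block:
        return buffer, False
    return buffer + block, True


# Build a small in-memory gzip archive to read from.
raw = b"record-one\nrecord-two\n" * 1000
archive = gzip.GzipFile(fileobj=io.BytesIO(gzip.compress(raw)), mode="rb")

buffer = b""
ok = True
while ok:
    buffer, ok = update_buffer(archive, buffer)
```

A real iterator would refill only when a read runs past the end of the buffer, rather than draining the whole file as this loop does.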
saveCheckPoint(array $info = array())
Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.
array | $info | any extra info a subclass wants to save |
restoreCheckPoint() : array
Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Text archive bundle iterator takes the unserialized data from the last checkpoint and calls the compression-specific restore checkpoint method to further set up the iterator according to the given compression scheme.
the data serialized when saveCheckpoint was called
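The save and restore cycle around iterate_status.txt can be sketched as below. This is a Python illustration only: the function names, the state keys (`current_partition`, `current_offset`), and the use of JSON in place of PHP serialization are assumptions made for the example, not the library's format.

```python
import json
import os
import tempfile

STATUS_FILE = "iterate_status.txt"  # file name taken from the description above


def save_checkpoint(result_dir, state, info=None):
    """Write the iterator state, plus any extra subclass info, to the status file."""
    data = dict(state)
    data.update(info or {})
    with open(os.path.join(result_dir, STATUS_FILE), "w") as fh:
        json.dump(data, fh)


def restore_checkpoint(result_dir):
    """Return the saved state, or an empty dict if no checkpoint exists yet."""
    path = os.path.join(result_dir, STATUS_FILE)
    if not os.path.exists(path):
        return {}
    with open(path) as fh:
        return json.load(fh)


# Hypothetical progress fields, for illustration only.
result_dir = tempfile.mkdtemp()
save_checkpoint(result_dir,
                {"current_partition": 3, "current_offset": 16384},
                {"note": "batch 7"})
restored = restore_checkpoint(result_dir)
```

A new iterator instance pointed at the same result dir can then resume from the restored values instead of reprocessing earlier pages.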
getNextTagsData(array $tags) : array
Extracts the data between the open and close tags of the first tag found amongst the array of tags $tags. After the operation, $this->buffer holds the contents that follow the close tag.
array | $tags | array of tagnames to look for |
of two elements: the first element is a string consisting of the start tag, contents, and close tag of the first tag found; the second is the name of the tag amongst $tags that was found
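The tag extraction described for getNextTagsData can be sketched as follows. This Python version is an illustration under assumptions: the function name and the convention of returning the remaining buffer (instead of mutating `$this->buffer`) are hypothetical, and the simple prefix match on `<tag` would also match longer tag names that start with the same letters.

```python
def get_next_tags_data(buffer: str, tags):
    """Find the earliest opening tag from tags in buffer.

    Returns ((element_text, tag_name), remaining_buffer), or
    ((None, None), buffer) if no complete element is found.
    """
    best_pos, best_tag = -1, None
    for tag in tags:
        pos = buffer.find("<" + tag)
        if pos != -1 and (best_pos == -1 or pos < best_pos):
            best_pos, best_tag = pos, tag
    if best_tag is None:
        return (None, None), buffer
    close = "</" + best_tag + ">"
    end = buffer.find(close, best_pos)
    if end == -1:
        # The close tag is not in the buffer yet; leave the buffer untouched.
        return (None, None), buffer
    end += len(close)
    return (buffer[best_pos:end], best_tag), buffer[end:]


buffer = "junk<page><title>A</title></page><page>B</page>"
result, rest = get_next_tags_data(buffer, ["title", "page"])
```

Here `page` is chosen because its opening tag occurs before the first `title` tag, and `rest` holds what follows the matching close tag.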
saveCheckpoint(array $info = array())
Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.
array | $info | any extra info a subclass wants to save |
restoreCheckpoint() : array
Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Each iterator should make a call to restoreCheckpoint at the end of the constructor method after the instance members have been initialized.
the data serialized when saveCheckpoint was called