BUFFER_SIZE
Number of bytes to read at a time from the current archive file into the buffer file. 8192 = BZip2BlockIterator::BLOCK_SIZE
Used to iterate through a collection of .xml.bz2 MediaWiki files stored in a WebArchiveBundle folder. These MediaWiki files contain the kinds of documents used by Wikipedia. Iteration is for the purpose of making an index of these records.
MAX_RECORD_SIZE
Estimate of the maximum size of a record stored in a text archive. Data in archives is split into chunks of buffer size plus two record sizes. This provides a two-record overlap between successive chunks, which in turn ensures that records that cross the basic chunk boundary of BUFFER_SIZE will still be processed.
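The overlap arithmetic described above can be sketched as follows. This is an illustration only, not the class's code: the function chunkRanges is hypothetical, and the value of MAX_RECORD_SIZE is an assumed placeholder, since only BUFFER_SIZE = 8192 is given in this documentation.

```php
<?php
// BUFFER_SIZE comes from the documentation above.
const BUFFER_SIZE = 8192;
// Assumed placeholder value, chosen only to make the arithmetic easy to follow.
const MAX_RECORD_SIZE = 1000;

/**
 * Hypothetical helper: returns the [start, end) byte offsets of each chunk
 * for an archive of $file_size bytes. Chunk starts advance by BUFFER_SIZE,
 * but each chunk spans BUFFER_SIZE + 2 * MAX_RECORD_SIZE bytes, so
 * neighbouring chunks overlap by two record sizes.
 */
function chunkRanges(int $file_size): array
{
    $ranges = [];
    for ($start = 0; $start < $file_size; $start += BUFFER_SIZE) {
        $end = min($start + BUFFER_SIZE + 2 * MAX_RECORD_SIZE, $file_size);
        $ranges[] = [$start, $end];
    }
    return $ranges;
}

$ranges = chunkRanges(20000);
// Bytes shared by the first and second chunk.
$overlap = $ranges[0][1] - $ranges[1][0];
```

With these assumed values, a record that begins just before offset 8192 and runs past it is still wholly contained in the first chunk, because that chunk extends two record sizes beyond the boundary.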
__construct(string $iterate_timestamp, string $iterate_dir, string $result_timestamp, string $result_dir)
Creates a media wiki archive iterator with the given parameters.
string | $iterate_timestamp | timestamp of the arc archive bundle to iterate over the pages of |
string | $iterate_dir | folder of files to iterate over |
string | $result_timestamp | timestamp of the arc archive bundle results are being stored in |
string | $result_dir | where to write last position checkpoints to |
nextChunk() : array
Called to get the next chunk of BUFFER_SIZE + 2 * MAX_RECORD_SIZE bytes of data from the text archive. This data is returned unprocessed in self::ARC_DATA together with ini and header information about the archive. This method is typically called in the name server setting from FetchController.
with contents as described above
updatePartition(array &$info)
Helper function for nextChunk to advance the partition if we are at the end of the current archive file.
array& | $info | a struct with data about the current chunk. Its start partition flag will be updated |
updateBuffer(string $buffer = "", boolean $return_string = false) : boolean
If reading from a gzbuffer file goes off the end of the current buffer, reads in the next block from archive file.
string | $buffer | |
boolean | $return_string |
whether successfully read in next block or not
saveCheckPoint(array $info = array())
Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.
array | $info | any extra info a subclass wants to save |
restoreCheckPoint() : array
Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. We also set up our regex substitutions again.
the data serialized when saveCheckpoint was called
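A minimal sketch of the save and restore round trip, assuming the state is stored with PHP's serialize() and read back with unserialize(). The function names, the array keys, and the temporary directory are hypothetical and are not taken from the class; only the file name iterate_status.txt comes from the description above.

```php
<?php
/**
 * Hypothetical stand-in for saveCheckpoint: writes $state to
 * iterate_status.txt inside $result_dir.
 */
function saveCheckpointSketch(string $result_dir, array $state): void
{
    file_put_contents($result_dir . "/iterate_status.txt", serialize($state));
}

/**
 * Hypothetical stand-in for restoreCheckpoint: returns the saved state,
 * or an empty array if no readable checkpoint exists.
 */
function restoreCheckpointSketch(string $result_dir): array
{
    $file = $result_dir . "/iterate_status.txt";
    if (!file_exists($file)) {
        return [];
    }
    $state = unserialize(file_get_contents($file));
    return is_array($state) ? $state : [];
}

// Usage: the keys below are illustrative only.
$dir = sys_get_temp_dir() . "/checkpoint_sketch_" . uniqid();
mkdir($dir);
saveCheckpointSketch($dir, ["partition" => 3, "offset" => 16384]);
$restored = restoreCheckpointSketch($dir);
```

A freshly constructed iterator could call the restore step at the end of its constructor and resume from the saved offset rather than re-reading earlier pages.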
getNextTagsData(array $tags) : array
Used to extract the data between the open and close tags of the first tag found amongst the array of tags $tags. After the operation, $this->buffer holds the contents that follow the close tag.
array | $tags | array of tagnames to look for |
of two elements: the first element is a string consisting of start tag contents close tag of first tag found, the second has the name of the tag amongst $tags found
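The behaviour described for getNextTagsData might look roughly like the following stand-alone sketch. It is not the class's implementation: the function name is hypothetical, the buffer is passed by reference instead of living in $this->buffer, the null return for "nothing found" is an assumption, and the plain strpos matching is a simplification (for example, searching for "page" would also match a "pages" tag).

```php
<?php
/**
 * Hypothetical sketch: finds whichever tag in $tags opens earliest in
 * $buffer, returns [text from open tag through close tag, tag name], and
 * leaves $buffer holding whatever follows the close tag.
 * Returns null if no complete tag pair is present (an assumed convention).
 */
function nextTagsDataSketch(string &$buffer, array $tags): ?array
{
    $best_pos = false;
    $best_tag = null;
    foreach ($tags as $tag) {
        $pos = strpos($buffer, "<" . $tag);
        if ($pos !== false && ($best_pos === false || $pos < $best_pos)) {
            $best_pos = $pos;
            $best_tag = $tag;
        }
    }
    if ($best_tag === null) {
        return null;
    }
    $close = "</" . $best_tag . ">";
    $end = strpos($buffer, $close, $best_pos);
    if ($end === false) {
        return null; // close tag has not been read into the buffer yet
    }
    $end += strlen($close);
    $data = substr($buffer, $best_pos, $end - $best_pos);
    $buffer = substr($buffer, $end);
    return [$data, $best_tag];
}

$buffer = "junk<siteinfo>a</siteinfo><page>b</page>tail";
$first = nextTagsDataSketch($buffer, ["page", "siteinfo"]);
$after_first = $buffer;
$second = nextTagsDataSketch($buffer, ["page", "siteinfo"]);
```

Calling the function repeatedly walks through the records in buffer order, which is the pattern an iterator would use to pull successive pages out of a chunk.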
saveCheckpoint(array $info = array())
Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.
array | $info | any extra info a subclass wants to save |
restoreCheckpoint() : array
Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Each iterator should make a call to restoreCheckpoint at the end of the constructor method after the instance members have been initialized.
the data serialized when saveCheckpoint was called
getTextContent(object $dom, $path) : string
Gets the text content of the first DOM node satisfying the XPath expression $path in the DOM document $dom.
object | $dom | DOMDocument to get the text from |
$path | xpath expression to find node with text |
text content of the given node if it exists
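One way such a lookup can be written with PHP's DOM extension is sketched below. The function name is hypothetical, and returning an empty string when no node matches is an assumption; the documentation above only says what is returned when the node exists.

```php
<?php
/**
 * Hypothetical sketch: returns the text content of the first node in $dom
 * matching the XPath expression $path, or "" if nothing matches
 * (the empty-string fallback is an assumption).
 */
function textContentSketch(DOMDocument $dom, string $path): string
{
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($path);
    if ($nodes === false || $nodes->length === 0) {
        return "";
    }
    return $nodes->item(0)->textContent;
}

$dom = new DOMDocument();
$dom->loadXML("<page><title>Main Page</title><title>Other</title></page>");
$title = textContentSketch($dom, "/page/title");
$missing = textContentSketch($dom, "/page/missing");
```

Only the first match is used, so a document with several matching nodes still yields a single string.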