BUFFER_SIZE
Number of bytes to read at a time from the current archive file into the buffer file. 8192 = BZip2BlockIterator::BLOCK_SIZE
Used to iterate through the records of a collection of one or more Open Directory RDF files stored in a WebArchiveBundle folder. Open Directory files can be found at http://rdf.dmoz.org/. Iteration is for the purpose of making an index of these records.
MAX_RECORD_SIZE
Estimate of the maximum size of a record stored in a text archive. Data in archives is split into chunks of buffer size plus two record sizes. This provides a two-record overlap between successive chunks, which in turn ensures that records crossing the basic chunk boundary of BUFFER_SIZE will still be processed.
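The chunk arithmetic described above can be sketched as follows. This is a minimal illustration, not the library's code: the function name `chunkWindow` and the value used for MAX_RECORD_SIZE are assumptions (only the BUFFER_SIZE value of 8192 comes from the text), and it assumes successive chunks start BUFFER_SIZE bytes apart.

```php
<?php
// BUFFER_SIZE is stated as 8192 in the text above. The MAX_RECORD_SIZE
// value below is an assumption chosen only to make the arithmetic concrete.
const BUFFER_SIZE = 8192;
const MAX_RECORD_SIZE = 1024;

/**
 * Hypothetical helper: returns the [start, end) byte offsets of the chunk
 * with index $n, clipped to the archive length $archive_len.
 */
function chunkWindow(int $n, int $archive_len): array
{
    // Chunks are assumed to begin every BUFFER_SIZE bytes...
    $start = $n * BUFFER_SIZE;
    // ...and to extend two record sizes past the basic chunk boundary.
    $end = min($start + BUFFER_SIZE + 2 * MAX_RECORD_SIZE, $archive_len);
    return [$start, $end];
}

[$s0, $e0] = chunkWindow(0, 100000);
[$s1, $e1] = chunkWindow(1, 100000);
// Bytes shared by chunk 0 and chunk 1: two record sizes.
$overlap = $e0 - $s1;
```

Under these assumptions a record that begins just before the BUFFER_SIZE boundary of one chunk still fits entirely inside that chunk's extended window.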
__construct(string $iterate_timestamp, string $iterate_dir, string $result_timestamp, string $result_dir)
Creates an Open Directory RDF archive iterator with the given parameters.
string | $iterate_timestamp | timestamp of the arc archive bundle whose pages are to be iterated over |
string | $iterate_dir | folder of files to iterate over |
string | $result_timestamp | timestamp of the arc archive bundle results are being stored in |
string | $result_dir | where to write last position checkpoints to |
weight( $site) : integer
Estimates the importance of the site according to the weighting of the particular archive iterator
$site | an associative array containing info about a web page |
a 4-bit number based on the topic path of the odp entry (@see processTopic @see processExternalPage)
nextChunk() : array
Called to get the next chunk of BUFFER_SIZE + 2 * MAX_RECORD_SIZE bytes of data from the text archive. This data is returned unprocessed in self::ARC_DATA together with ini and header information about the archive. This method is typically called in the name server setting from FetchController.
with contents as described above
updatePartition(array &$info)
Helper function for nextChunk to advance the partition if we are at the end of the current archive file
array & | $info | a struct with data about the current chunk; its start partition flag will be updated |
updateBuffer(string $buffer = "", boolean $return_string = false) : boolean
If reading from a gzbuffer file goes off the end of the current buffer, reads in the next block from archive file.
string | $buffer | |
boolean | $return_string |
whether successfully read in next block or not
saveCheckPoint(array $info = array())
Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.
array | $info | any extra info a subclass wants to save |
restoreCheckPoint() : array
Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Text archive bundle iterator takes the unserialized data from the last checkpoint and calls the compression-specific restore checkpoint to further set up the iterator according to the given compression scheme.
the data serialized when saveCheckpoint was called
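The save and restore cycle described above can be sketched as a round trip through iterate_status.txt. This is only an illustration under stated assumptions: the function names `saveState` and `restoreState`, the state keys, and the behavior when no checkpoint file exists are all hypothetical and not taken from the library.

```php
<?php
/**
 * Hypothetical stand-in for saving a checkpoint: serializes the iterator
 * position, plus any extra subclass info, to iterate_status.txt.
 */
function saveState(string $result_dir, array $state, array $info = []): void
{
    // Core position data takes precedence over extra info on key clashes.
    $data = array_merge($info, $state);
    file_put_contents($result_dir . "/iterate_status.txt", serialize($data));
}

/**
 * Hypothetical stand-in for restoring a checkpoint: returns the data that
 * was serialized, or an empty array if no checkpoint exists (an assumption).
 */
function restoreState(string $result_dir): array
{
    $file = $result_dir . "/iterate_status.txt";
    if (!file_exists($file)) {
        return [];
    }
    $data = unserialize(file_get_contents($file));
    return is_array($data) ? $data : [];
}

// Round trip in a temporary directory; the state keys are illustrative.
$dir = sys_get_temp_dir() . "/checkpoint_demo_" . getmypid();
if (!is_dir($dir)) {
    mkdir($dir);
}
saveState($dir, ['partition' => 2, 'offset' => 8192], ['note' => 'extra']);
$restored = restoreState($dir);
```

A new iterator instance that restores this state can seek to the saved partition and offset instead of reprocessing earlier pages.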
getNextTagsData(array $tags) : array
Extracts the data between the open and close tags of the first tag found among the array of tags $tags. After the operation, $this->buffer holds the contents that follow the close tag.
array | $tags | array of tagnames to look for |
of two elements: the first is a string consisting of the start tag, contents, and close tag of the first tag found; the second is the name of the tag among $tags that was found
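The behavior described for getNextTagsData can be approximated with plain string functions. The sketch below is a hypothetical free-function version (`nextTagsData`) that takes the buffer by reference rather than using $this->buffer; the `[false, false]` result for a missing or incomplete element is an assumption, and the real method may also refill the buffer from the archive.

```php
<?php
/**
 * Hypothetical sketch: finds the earliest occurrence of any tag in $tags,
 * returns [element text, tag name], and leaves in $buffer whatever follows
 * the close tag. Returns [false, false] if no complete element is found.
 */
function nextTagsData(string &$buffer, array $tags): array
{
    $best_pos = false;
    $best_tag = false;
    foreach ($tags as $tag) {
        // Naive match: "<Topic" would also match a tag named "<Topics".
        $pos = strpos($buffer, "<" . $tag);
        if ($pos !== false && ($best_pos === false || $pos < $best_pos)) {
            $best_pos = $pos;
            $best_tag = $tag;
        }
    }
    if ($best_tag === false) {
        return [false, false];
    }
    $close = "</" . $best_tag . ">";
    $end = strpos($buffer, $close, $best_pos);
    if ($end === false) {
        return [false, false]; // element not yet complete in this buffer
    }
    $end += strlen($close);
    $text = substr($buffer, $best_pos, $end - $best_pos);
    $buffer = substr($buffer, $end);
    return [$text, $best_tag];
}

$buffer = 'junk<Topic r:id="Top/Arts"><d:Title>Arts</d:Title></Topic>'
    . '<ExternalPage about="http://example.com/"></ExternalPage>';
[$text, $tag] = nextTagsData($buffer, ["Topic", "ExternalPage"]);
```

Calling the function again on the same buffer would return the ExternalPage element next, which mirrors how repeated calls walk through the records of a chunk.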
saveCheckpoint(array $info = array())
Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.
array | $info | any extra info a subclass wants to save |
restoreCheckpoint() : array
Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Each iterator should make a call to restoreCheckpoint at the end of the constructor method after the instance members have been initialized.
the data serialized when saveCheckpoint was called
getTextContent(object $dom, $path) : string
Gets the text content of the first dom node satisfying the xpath expression $path in the dom document $dom
object | $dom | DOMDocument to get the text from |
$path | xpath expression to find node with text |
text content of the given node if it exists
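A lookup of this kind can be sketched with PHP's DOMXPath class. The function name `textContent` is hypothetical, and returning an empty string when no node matches is an assumption, since the text above only says the content is returned if the node exists.

```php
<?php
/**
 * Hypothetical sketch: returns the text content of the first node in $dom
 * matching the XPath expression $path, or "" if nothing matches.
 */
function textContent(DOMDocument $dom, string $path): string
{
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($path);
    if ($nodes === false || $nodes->length === 0) {
        return "";
    }
    // Only the first matching node is used.
    return $nodes->item(0)->textContent;
}

$dom = new DOMDocument();
$dom->loadXML('<Topic><Title>Arts</Title><Title>Ignored</Title></Topic>');
$title = textContent($dom, "/Topic/Title");
$missing = textContent($dom, "/Topic/Description");
```

The sample document avoids XML namespaces for brevity; real ODP RDF uses prefixed elements, which would need namespace-aware XPath expressions.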
getAttributeValueAll(object $dom, $path, string $attribute) : array
Gets the value of the attribute $attribute for each dom node satisfying the xpath expression $path in the dom document $dom
object | $dom | DOMDocument to get the text from |
$path | xpath expression to find node with text |
string | $attribute | name of the attribute to get the values for |
of values of the given attribute
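Collecting one attribute across all matching nodes can be sketched the same way with DOMXPath. The function name `attributeValues` is hypothetical, and skipping nodes that lack the attribute is an assumption about the intended behavior.

```php
<?php
/**
 * Hypothetical sketch: returns the value of $attribute for every element in
 * $dom matching the XPath expression $path. Elements without the attribute
 * are skipped (an assumption).
 */
function attributeValues(DOMDocument $dom, string $path, string $attribute): array
{
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($path);
    $values = [];
    if ($nodes === false) {
        return $values;
    }
    foreach ($nodes as $node) {
        if ($node instanceof DOMElement && $node->hasAttribute($attribute)) {
            $values[] = $node->getAttribute($attribute);
        }
    }
    return $values;
}

$dom = new DOMDocument();
$dom->loadXML(
    '<Topic><link resource="http://a.example/"/>'
    . '<link resource="http://b.example/"/><link/></Topic>'
);
$values = attributeValues($dom, "/Topic/link", "resource");
```

A single-value variant would return only the first element of this array.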
getAttributeValue(object $dom, $path, string $attribute) : string
Gets the value of the attribute $attribute of the first dom node satisfying the xpath expression $path in the dom document $dom
object | $dom | DOMDocument to get the text from |
$path | xpath expression to find node with text |
string | $attribute | name of the attribute to get the value for |
value of the given attribute
processTopic(object $dom, array &$site)
Computes an HTML page for a Topic tag parsed from the ODP RDF document
object | $dom | document object for one Topic tag |
array & | $site | a reference to an array of header and page info for an html page |
processExternalPage(object $dom, array &$site)
Computes an HTML page for an ExternalPage tag parsed from the ODP RDF document
object | $dom | document object for one ExternalPage tag |
array & | $site | a reference to an array of header and page info for an html page |