seek_quarry

Class: OdpRdfArchiveBundleIterator

Source Location: /lib/archive_bundle_iterators/odp_rdf_bundle_iterator.php

Class Overview

ArchiveBundleIterator
   |
   --TextArchiveBundleIterator
      |
      --OdpRdfArchiveBundleIterator

Used to iterate through the records of a collection of one or more Open Directory RDF files stored in a WebArchiveBundle folder.


Author(s):

  • Chris Pollett

Inherited Methods

Class: TextArchiveBundleIterator

TextArchiveBundleIterator::__construct()
Creates a text archive iterator with the given parameters.
TextArchiveBundleIterator::checkEof()
Checks if this object's archive's current partition is at an end of file
TextArchiveBundleIterator::checkFileHandle()
Checks whether there is a valid handle to this object's archive's current partition
TextArchiveBundleIterator::fileClose()
Wrapper around the fclose function of the particular compression scheme
TextArchiveBundleIterator::fileGets()
Acts as gzgets(), hiding the fact that buffering of the archive_file is being done to a buffer file
TextArchiveBundleIterator::fileOpen()
Wrapper around the fopen function of the particular compression scheme
TextArchiveBundleIterator::fileRead()
Acts as gzread($num_bytes, $archive_file), hiding the fact that buffering of the archive_file is being done to a buffer file
TextArchiveBundleIterator::fileTell()
Returns the current position in the current iterator partition file for the given compression scheme.
TextArchiveBundleIterator::getFileBlock()
Reads and returns the block of data from the current partition
TextArchiveBundleIterator::getNextTagData()
Used to extract data between two tags. After operation $this->buffer has contents after the close tag.
TextArchiveBundleIterator::getNextTagsData()
Used to extract data between two tags for the first tag found amongst the array of tags $tags. After operation $this->buffer has contents after the close tag.
TextArchiveBundleIterator::makeBuffer()
Reads in block $this->buffer_block_num of size self::BUFFER_SIZE from the archive file
TextArchiveBundleIterator::nextChunk()
Called to get the next chunk of BUFFER_SIZE + 2 MAX_RECORD_SIZE bytes
TextArchiveBundleIterator::nextPage()
Gets the next doc from the iterator
TextArchiveBundleIterator::nextPages()
Gets at most the next $num docs from the iterator. It might return fewer than $num documents if the partition changes or the end of the bundle is reached.
TextArchiveBundleIterator::reset()
Resets the iterator to the start of the archive bundle
TextArchiveBundleIterator::restoreCheckPoint()
Restores the internal state from the file iterate_status.txt in the result directory
TextArchiveBundleIterator::saveCheckPoint()
Stores the current progress to the file iterate_status.txt in the result directory
TextArchiveBundleIterator::setIniInfo()
Mutator method for controlling how this text archive iterator behaves. Normally, data on the compression scheme and the start and stop delimiters is read from an ini file; this method reads it from the supplied array.
TextArchiveBundleIterator::updateBuffer()
If reading from a gzbuffer file goes off the end of the current buffer, reads in the next block from archive file.
TextArchiveBundleIterator::updatePartition()
Helper function for nextChunk to advance the partition if we are at the end of the current archive file
TextArchiveBundleIterator::weight()
Estimates the importance of the site according to the weighting of the particular archive iterator

Class: ArchiveBundleIterator

ArchiveBundleIterator::nextPages()
Gets the next $num many docs from the iterator
ArchiveBundleIterator::reset()
Resets the iterator to the start of the archive bundle
ArchiveBundleIterator::restoreCheckpoint()
Restores the internal state from the file iterate_status.txt in the result directory
ArchiveBundleIterator::saveCheckpoint()
Stores the current progress to the file iterate_status.txt in the result directory
ArchiveBundleIterator::seekPage()
Advances the iterator to the $limit page, with as little additional processing as possible
ArchiveBundleIterator::weight()
Estimates the importance of the site according to the weighting of the particular archive iterator

Class Details

[line 53]
Used to iterate through the records of a collection of one or more Open Directory RDF files stored in a WebArchiveBundle folder. Open Directory files can be found at http://rdf.dmoz.org/. Iteration would be for the purpose of making an index of these records.




Tags:

author:  Chris Pollett
see:  WebArchiveBundle




Class Variables

$header =

[line 61]

Associative array containing global properties, such as the base URL, of the currently open ODP RDF file



Type:   array





Class Methods


constructor __construct [line 80]

OdpRdfArchiveBundleIterator __construct( string $iterate_timestamp, string $iterate_dir, string $result_timestamp, string $result_dir)

Creates an open directory rdf archive iterator with the given parameters.



Overrides TextArchiveBundleIterator::__construct() (Creates a text archive iterator with the given parameters.)

Parameters:

string   $iterate_timestamp   timestamp of the archive bundle whose pages are to be iterated over
string   $iterate_dir   folder of files to iterate over
string   $result_timestamp   timestamp of the archive bundle in which results are being stored
string   $result_dir   folder to which last-position checkpoints are written


method computeTopicLinks [line 285]

array computeTopicLinks( string $topic_path)

Computes links for prefix topics of an ODP topic path



Tags:

return:  url => text pairs for each prefix of path


Parameters:

string   $topic_path   to compute links for
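
As an illustration of what "url => text pairs for each prefix of path" could look like, here is a minimal sketch. It is not the class's actual code: the function name, the ODP_BASE_URL constant, and the assumption that topic paths are "/"-delimited are all hypothetical.

```php
<?php
// Hypothetical base URL; the real class may build its links differently.
const ODP_BASE_URL = "http://www.dmoz.org/";

/**
 * Sketch: returns url => text pairs, one per prefix of a "/"-delimited
 * topic path (the delimiter is an assumption).
 */
function computeTopicLinksSketch(string $topic_path): array
{
    $links = [];
    $prefix = "";
    // Drop empty components produced by leading or trailing slashes.
    $parts = array_filter(explode("/", $topic_path), 'strlen');
    foreach ($parts as $part) {
        $prefix .= $part . "/";
        // The link text is the last component of the current prefix.
        $links[ODP_BASE_URL . $prefix] = $part;
    }
    return $links;
}

$links = computeTopicLinksSketch("Top/Arts/Music");
```

Under these assumptions, a three-component path yields three links, ordered from the shortest prefix to the full path.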


method getAttributeValue [line 163]

string getAttributeValue( object $dom, string $path, string $attribute)

Gets the value of the attribute $attribute of the first dom node satisfying the xpath expression $path in the dom document $dom



Tags:

return:  value of the given attribute


Parameters:

object   $dom   DOMDocument to get the attribute value from
string   $path   XPath expression used to find the node
string   $attribute   name of the attribute to get the value for
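
The lookup described above can be sketched with PHP's standard DOMXPath class. This is only an illustration: the function name, the sample XML, and the choice to return an empty string when nothing matches are assumptions, not the documented behaviour.

```php
<?php
/**
 * Sketch: value of $attribute on the first node matching $path.
 * Returning "" when no node matches is an assumption.
 */
function getAttributeValueSketch(DOMDocument $dom, string $path,
    string $attribute): string
{
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($path);
    if ($nodes === false || $nodes->length == 0) {
        return "";
    }
    $node = $nodes->item(0);
    // Only element nodes carry attributes.
    return ($node instanceof DOMElement) ?
        $node->getAttribute($attribute) : "";
}

// Simplified sample record; real ODP RDF uses namespaced names.
$dom = new DOMDocument();
$dom->loadXML('<Topic id="Top/Arts">' .
    '<link resource="http://example.com/a"/>' .
    '<link resource="http://example.com/b"/></Topic>');
$value = getAttributeValueSketch($dom, "/Topic/link", "resource");
```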


method getAttributeValueAll [line 137]

array getAttributeValueAll( object $dom, string $path, string $attribute)

Gets the value of the attribute $attribute for each dom node satisfying the xpath expression $path in the dom document $dom



Tags:

return:  array of values of the given attribute


Parameters:

object   $dom   DOMDocument to get the attribute values from
string   $path   XPath expression used to find the nodes
string   $attribute   name of the attribute to get the values for
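
A comparable sketch for collecting the attribute from every matching node follows. The function name and sample XML are hypothetical, and skipping nodes that lack the attribute is an assumption about the intended behaviour.

```php
<?php
/**
 * Sketch: values of $attribute for every node matching $path.
 * Nodes without the attribute are skipped (an assumption).
 */
function getAttributeValueAllSketch(DOMDocument $dom, string $path,
    string $attribute): array
{
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($path);
    $values = [];
    if ($nodes === false) {
        return $values;
    }
    foreach ($nodes as $node) {
        if ($node instanceof DOMElement &&
            $node->hasAttribute($attribute)) {
            $values[] = $node->getAttribute($attribute);
        }
    }
    return $values;
}

// Simplified sample record; real ODP RDF uses namespaced names.
$dom = new DOMDocument();
$dom->loadXML('<Topic><link resource="http://example.com/a"/>' .
    '<link/><link resource="http://example.com/b"/></Topic>');
$values = getAttributeValueAllSketch($dom, "/Topic/link", "resource");
```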


method getTextContent [line 117]

string getTextContent( object $dom, string $path)

Gets the text content of the first dom node satisfying the xpath expression $path in the dom document $dom



Tags:

return:  text content of the given node if it exists


Parameters:

object   $dom   DOMDocument to get the text from
string   $path   XPath expression used to find the node with text
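
The text lookup can be sketched the same way with DOMXPath. The function name and sample XML are hypothetical, and returning an empty string when the node does not exist is an assumption.

```php
<?php
/**
 * Sketch: text content of the first node matching $path,
 * or "" if there is no such node (an assumption).
 */
function getTextContentSketch(DOMDocument $dom, string $path): string
{
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($path);
    if ($nodes === false || $nodes->length == 0) {
        return "";
    }
    return $nodes->item(0)->textContent;
}

// Simplified sample record; real ODP RDF uses namespaced names.
$dom = new DOMDocument();
$dom->loadXML('<ExternalPage about="http://example.com/">' .
    '<Title>Example Site</Title>' .
    '<Description>A sample entry</Description></ExternalPage>');
$title = getTextContentSketch($dom, "/ExternalPage/Title");
```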


method linksToHtml [line 305]

string linksToHtml( array $links)

Makes an unordered HTML list out of an associative array of url => link_text pairs.



Tags:

return:  string containing the HTML for an unordered list of links


Parameters:

array   $links   url=>link_text pairs
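
One way such a list could be built is sketched below. The function name is hypothetical, and escaping the URL and link text with htmlspecialchars is an addition made here for safety, not something the documentation states.

```php
<?php
/**
 * Sketch: renders url => link_text pairs as an unordered HTML list.
 * The escaping step is an assumption.
 */
function linksToHtmlSketch(array $links): string
{
    $html = "<ul>";
    foreach ($links as $url => $text) {
        $html .= '<li><a href="' . htmlspecialchars($url, ENT_QUOTES) .
            '">' . htmlspecialchars($text, ENT_QUOTES) . '</a></li>';
    }
    return $html . "</ul>";
}

$html = linksToHtmlSketch([
    "http://www.dmoz.org/Top/" => "Top",
    "http://www.dmoz.org/Top/Arts/" => "Arts & Crafts",
]);
```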


method nextPage [line 178]

array nextPage( [bool $no_process = false])

Gets the next doc from the iterator



Tags:

return:  associative array for the doc, or a string if $no_process is true


Overrides TextArchiveBundleIterator::nextPage() (Gets the next doc from the iterator)

Parameters:

bool   $no_process   do not do any processing on page data
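
The documentation does not say how nextPage locates a record. A plausible reading, given the inherited getNextTagsData() description, is that it pulls the next Topic or ExternalPage element out of a text buffer. The sketch below illustrates that idea only; the function name, the return shape, and the simple prefix matching on "<tag" are assumptions.

```php
<?php
/**
 * Sketch: extracts the first occurring element among $tags from
 * $buffer and leaves $buffer holding what follows the close tag.
 * Returns [record, tag] or false if no complete element is found.
 */
function nextTagsDataSketch(string &$buffer, array $tags)
{
    $best_pos = false;
    $best_tag = null;
    foreach ($tags as $tag) {
        $pos = strpos($buffer, "<$tag");
        if ($pos !== false &&
            ($best_pos === false || $pos < $best_pos)) {
            $best_pos = $pos;
            $best_tag = $tag;
        }
    }
    if ($best_tag === null) {
        return false;
    }
    $close = "</$best_tag>";
    $end = strpos($buffer, $close, $best_pos);
    if ($end === false) {
        return false; // element not complete in this buffer
    }
    $end += strlen($close);
    $record = substr($buffer, $best_pos, $end - $best_pos);
    $buffer = substr($buffer, $end);
    return [$record, $best_tag];
}

$buffer = 'junk<ExternalPage about="u"><Title>T</Title></ExternalPage>' .
    '<Topic id="Top"></Topic>rest';
$first = nextTagsDataSketch($buffer, ["Topic", "ExternalPage"]);
$second = nextTagsDataSketch($buffer, ["Topic", "ExternalPage"]);
$third = nextTagsDataSketch($buffer, ["Topic", "ExternalPage"]);
```

In this sketch the unprocessed record string corresponds roughly to what a caller might receive when $no_process is true, with DOM parsing applied only otherwise.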


method processExternalPage [line 258]

void processExternalPage( object $dom, array &$site)

Computes an HTML page for an ExternalPage tag parsed from the ODP RDF document



Parameters:

object   $dom   document object for one ExternalPage tag
array   &$site   a reference to an array of header and page info for an html page


method processTopic [line 220]

void processTopic( object $dom, array &$site)

Computes an HTML page for a Topic tag parsed from the ODP RDF document



Parameters:

object   $dom   document object for one Topic tag
array   &$site   a reference to an array of header and page info for an html page


method weight [line 102]

int weight( array &$site)

Estimates the importance of the site according to the weighting of the particular archive iterator




Tags:

return:  a 4-bit number based on the topic path of the ODP entry (see processTopic and processExternalPage)


Overrides TextArchiveBundleIterator::weight() (Estimates the importance of the site according to the weighting of the particular archive iterator)

Parameters:

array   &$site   an associative array containing info about a web page
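
The documentation only states that the result is a 4-bit number derived from the topic path; it does not give the formula. The sketch below shows one hypothetical scheme, in which shallower topics score higher, purely to illustrate how a path can be reduced to the range 0 to 15. The function name and the depth-based heuristic are assumptions.

```php
<?php
/**
 * Sketch: maps a "/"-delimited topic path to a value in 0..15.
 * The heuristic (15 minus path depth) is hypothetical.
 */
function topicPathWeightSketch(string $topic_path): int
{
    $parts = array_filter(explode("/", $topic_path), 'strlen');
    $depth = count($parts);
    // Clamp at zero, then mask so the result always fits in 4 bits.
    return max(0, 15 - $depth) & 0x0F;
}

$shallow = topicPathWeightSketch("Top/Arts");
$deep = topicPathWeightSketch(str_repeat("a/", 20));
```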



Class Constants

BLOCK_SIZE =  1024

[line 68]

How many bytes to read into buffer from gzip stream in one go
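
To show what reading in fixed-size blocks looks like, the sketch below reads a stream 1024 bytes at a time. It is an illustration only: it uses an in-memory stream and fread as a stand-in, whereas the class would presumably use a gzip handle. The constant and function names are hypothetical.

```php
<?php
// Mirrors the documented value; the name is hypothetical.
const BLOCK_SIZE_SKETCH = 1024;

/**
 * Sketch: reads $handle to the end in blocks of at most $block_size
 * bytes and returns the size of each block read.
 */
function readBlocksSketch($handle,
    int $block_size = BLOCK_SIZE_SKETCH): array
{
    $sizes = [];
    while (!feof($handle)) {
        $block = fread($handle, $block_size);
        if ($block === false || $block === "") {
            break;
        }
        $sizes[] = strlen($block);
    }
    return $sizes;
}

// In-memory stand-in for an archive file.
$handle = fopen("php://memory", "w+");
fwrite($handle, str_repeat("x", 2500));
rewind($handle);
$sizes = readBlocksSketch($handle);
fclose($handle);
```

No block exceeds the block size, and the block sizes sum to the length of the data.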





Documentation generated by phpDocumentor 1.4.3