\seekquarry\yioop\library\archive_bundle_iterators\TextArchiveBundleIterator

Used to iterate through a collection of text or compressed text-oriented records.

Summary

Methods

saveCheckpoint()
restoreCheckpoint()
seekPage()
weight()
nextPages()
reset()
__construct()
setIniInfo()
nextChunk()
updatePartition()
nextPage()
getFileBlock()
fileRead()
fileGets()
updateBuffer()
makeBuffer()
checkFileHandle()
checkEof()
fileOpen()
fileClose()
fileTell()
saveCheckPoint()
restoreCheckPoint()
getNextTagData()
getNextTagsData()

Properties

$iterate_timestamp
$result_timestamp
$end_of_iterator
$result_dir
$iterate_dir
$num_partitions
$current_partition_num
$current_page_num
$current_offset
$partitions
$fh
$buffer
$start_delimiter
$end_delimiter
$status_filename
$buffer_fh
$buffer_block_num
$buffer_filename
$switch_partition_callback_name
$ini

Constants

BUFFER_SIZE
MAX_RECORD_SIZE

No protected or private methods, properties, or constants found.

File: src/library/archive_bundle_iterators/TextArchiveBundleIterator.php
Package: Default
Class hierarchy: \seekquarry\yioop\library\archive_bundle_iterators\ArchiveBundleIterator

\seekquarry\yioop\library\archive_bundle_iterators\TextArchiveBundleIterator
See also: \seekquarry\yioop\library\archive_bundle_iterators\WebArchiveBundle

Constants

BUFFER_SIZE

BUFFER_SIZE

How many bytes at a time should be read from the current archive file into the buffer file. 8192 = BZip2BlockIterator::BLOCK_SIZE

MAX_RECORD_SIZE

MAX_RECORD_SIZE

Estimate of the maximum size of a record stored in a text archive. Data in archives is split into chunks of buffer size plus two record sizes. This provides a two-record overlap between successive chunks, which ensures that records that cross the basic chunk boundary of BUFFER_SIZE will still be processed.
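The overlap described above can be illustrated with a minimal sketch. It assumes chunks start every BUFFER_SIZE bytes and are BUFFER_SIZE plus two MAX_RECORD_SIZE bytes long; the constants `DEMO_BUFFER_SIZE` and `DEMO_MAX_RECORD_SIZE` and the function `overlappingChunks()` are hypothetical and use tiny values for readability, not the values in the class.

```php
<?php
// Hypothetical, tiny sizes for illustration only; these are not the
// values of BUFFER_SIZE and MAX_RECORD_SIZE in the real class.
const DEMO_BUFFER_SIZE = 10;
const DEMO_MAX_RECORD_SIZE = 3;

/**
 * Splits $data into chunks that start every DEMO_BUFFER_SIZE bytes but are
 * DEMO_BUFFER_SIZE + 2 * DEMO_MAX_RECORD_SIZE bytes long, so consecutive
 * chunks overlap by two record sizes.
 */
function overlappingChunks(string $data): array
{
    $chunk_len = DEMO_BUFFER_SIZE + 2 * DEMO_MAX_RECORD_SIZE;
    $chunks = [];
    for ($offset = 0; $offset < strlen($data); $offset += DEMO_BUFFER_SIZE) {
        $chunks[] = substr($data, $offset, $chunk_len);
    }
    return $chunks;
}

// 25 bytes of input give chunks starting at offsets 0, 10, and 20
$chunks = overlappingChunks(str_repeat("abcde", 5));
```

With these sizes, the last six bytes of the first chunk are also the first six bytes of the second, so a record that starts near the end of one chunk is seen whole in the next.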

Properties

$iterate_timestamp

$iterate_timestamp : integer

Timestamp of the archive that is being iterated over

Type

integer

$result_timestamp

$result_timestamp : integer

Timestamp of the archive that is being used to store results in

Type

integer

$end_of_iterator

$end_of_iterator : boolean

Whether or not the iterator still has more documents

Type

boolean

$result_dir

$result_dir : string

The path to the directory where the iteration status is stored.

Type

string

$iterate_dir

$iterate_dir : string

The path to the directory containing the archive partitions to be iterated over.

Type

string

$num_partitions

$num_partitions : integer

The number of arc files in this arc archive bundle

Type

integer

$current_partition_num

$current_partition_num : integer

Counting in glob order for this arc archive bundle directory, the file number of the arc file currently being processed.

Type

integer

$current_page_num

$current_page_num : integer

Current number of pages into the current arc file

Type

integer

$current_offset

$current_offset : integer

Current byte offset into the current arc file

Type

integer

$partitions

$partitions : array

Array of filenames of arc files in this directory (glob order)

Type

array

$fh

$fh : resource

File handle for current archive file

Type

resource

$buffer

$buffer : string

Used to buffer data from the currently opened file

Type

string

$start_delimiter

$start_delimiter : string

Starting delimiters for records

Type

string

$end_delimiter

$end_delimiter : string

Ending delimiters for records

Type

string

$status_filename

$status_filename : string

Name of the file to which this archive iterator's status messages are written

Type

string

$buffer_fh

$buffer_fh : resource

If gzip is being used, a buffer file is also employed to try to reduce the number of calls to gzseek. $buffer_fh is a file handle for the buffer file.

Type

resource

$buffer_block_num

$buffer_block_num : integer

Which block of self::BUFFER_SIZE from the current archive file is stored in the file $this->buffer_filename

Type

integer

$buffer_filename

$buffer_filename : string

Name of a buffer file to be used to reduce gzseek calls in the case where gzip compression is being used

Type

string

$switch_partition_callback_name

$switch_partition_callback_name : string

Name of the function to be called whenever the partition that the iterator is reading changes. The point of the callback is to read meta information at the start of the new partition.

Type

string

$ini

$ini : array

Contains basic parameters of how this iterator works: compression, start and stop delimiters. Typically, this data is read from the arc_description.ini file.

Type

array

Methods

saveCheckpoint()

saveCheckpoint(array  $info = array())

Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.

Parameters

array

$info

any extra info a subclass wants to save

restoreCheckpoint()

restoreCheckpoint() : array

Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Each iterator should make a call to restoreCheckpoint at the end of the constructor method after the instance members have been initialized.

Returns

array —

the data serialized when saveCheckpoint was called
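As a rough illustration of the checkpoint round trip, the following sketch writes an array of state to iterate_status.txt with `serialize()` and reads it back with `unserialize()`. The helper names `demoSaveCheckpoint()` and `demoRestoreCheckpoint()`, the saved keys, and the use of PHP serialization as the file format are assumptions made for this example, not the class's confirmed implementation.

```php
<?php
/**
 * Hypothetical helper, not the Yioop implementation: persists iterator
 * state to iterate_status.txt in $result_dir.
 */
function demoSaveCheckpoint(string $result_dir, array $state): void
{
    file_put_contents($result_dir . "/iterate_status.txt", serialize($state));
}

/**
 * Hypothetical helper: returns the saved state, or an empty array when
 * no usable checkpoint exists.
 */
function demoRestoreCheckpoint(string $result_dir): array
{
    $file = $result_dir . "/iterate_status.txt";
    if (!file_exists($file)) {
        return []; // no checkpoint yet: start from the beginning
    }
    $state = unserialize(file_get_contents($file));
    return is_array($state) ? $state : [];
}

// Round trip in a throwaway directory
$dir = sys_get_temp_dir() . "/demo_checkpoint_" . uniqid();
mkdir($dir);
demoSaveCheckpoint($dir,
    ["current_partition_num" => 2, "current_offset" => 4096]);
$restored = demoRestoreCheckpoint($dir);

// Clean up the throwaway directory
unlink($dir . "/iterate_status.txt");
rmdir($dir);
```

A freshly constructed iterator could use the restored values to reopen the right partition and seek to the saved offset instead of re-reading earlier pages.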

seekPage()

seekPage(  $limit)

Advances the iterator to page $limit, with as little additional processing as possible

Parameters

$limit

page to advance to

weight()

weight(  $site) : boolean

Estimates the importance of the site according to the weighting of the particular archive iterator

Parameters

$site

an associative array containing info about a web page

Returns

boolean —

false; we assume arc files were crawled according to OPIC, and so we use the default doc_depth to estimate page importance

nextPages()

nextPages(integer  $num, boolean  $no_process = false) : array

Gets at most the next $num docs from the iterator. It might return fewer than $num documents if the partition changes or the end of the bundle is reached.

Parameters

integer	$num	number of docs to get
boolean	$no_process	if true, then just return an array of the page strings found, without any additional meta data.

Returns

array —

associative arrays for up to $num pages
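A caller typically drains an iterator in batches until its end flag is set. The sketch below shows that loop against a toy stand-in class; `DemoIterator` is hypothetical and only mimics the documented `nextPages()` and `$end_of_iterator` shape, it is not the real iterator.

```php
<?php
/** Toy stand-in for an archive iterator; only mimics the documented shape. */
class DemoIterator
{
    public $end_of_iterator = false;
    private $pages;
    private $position = 0;

    public function __construct(array $pages)
    {
        $this->pages = $pages;
        $this->end_of_iterator = empty($pages);
    }

    /** Returns at most $num pages and flags the end of the data. */
    public function nextPages(int $num): array
    {
        $batch = array_slice($this->pages, $this->position, $num);
        $this->position += count($batch);
        if ($this->position >= count($this->pages)) {
            $this->end_of_iterator = true;
        }
        return $batch;
    }
}

$iterator = new DemoIterator(["p1", "p2", "p3", "p4", "p5"]);
$batch_sizes = [];
while (!$iterator->end_of_iterator) {
    $pages = $iterator->nextPages(2);
    // A real caller would process $pages here and then save a checkpoint
    $batch_sizes[] = count($pages);
}
```

The final batch is shorter than requested, which is why callers should rely on the end flag and not on the batch size to decide when to stop.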

reset()

reset()

Resets the iterator to the start of the archive bundle

__construct()

__construct(string  $iterate_timestamp, string  $iterate_dir, string  $result_timestamp, string  $result_dir, array  $ini = array())

Creates a text archive iterator with the given parameters.

Parameters

string	$iterate_timestamp	timestamp of the arc archive bundle to iterate over the pages of
string	$iterate_dir	folder of files to iterate over. If this iterator is used in a fetcher and the data is on a name server set this to false
string	$result_timestamp	timestamp of the arc archive bundle results are being stored in
string	$result_dir	where to write last position checkpoints to
array	$ini	describes start_ and end_delimiter, file_extension, encoding, and compression method used for pages in this archive

setIniInfo()

setIniInfo(array  $ini)

Mutator method for controlling how this text archive iterator behaves. Normally, data on compression and on start and stop delimiters is read from an ini file. This method reads it from the supplied array.

Parameters

array

$ini

configuration settings for this archive iterator
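For orientation, a settings array of the kind passed to the constructor or to setIniInfo() might look like the sketch below. The key names follow the constructor's description of $ini; the values and the helper `missingIniKeys()` are invented for illustration and are not taken from the Yioop source.

```php
<?php
// Hypothetical settings: key names follow the constructor description,
// the values are invented for illustration.
$ini = [
    "compression" => "gzip",
    "file_extension" => "gz",
    "encoding" => "UTF-8",
    "start_delimiter" => "<page>",
    "end_delimiter" => "</page>",
];

/** Hypothetical helper: reports which of the documented keys are missing. */
function missingIniKeys(array $ini): array
{
    $expected = ["compression", "file_extension", "encoding",
        "start_delimiter", "end_delimiter"];
    return array_values(array_diff($expected, array_keys($ini)));
}

$missing = missingIniKeys($ini);
```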

nextChunk()

nextChunk() : array

Called to get the next chunk of BUFFER_SIZE + 2 * MAX_RECORD_SIZE bytes of data from the text archive. This data is returned unprocessed in self::ARC_DATA together with ini and header information about the archive. This method is typically called in the name server setting from FetchController.

Returns

array —

with contents as described above

updatePartition()

updatePartition(array  &$info)

Helper function for nextChunk to advance the partition if we are at the end of the current archive file

Parameters

array

$info

a struct with data about the current chunk, passed by reference. The start partition flag will be updated

nextPage()

nextPage(boolean  $no_process = false) : mixed

Gets the next doc from the iterator

Parameters

boolean

$no_process

if true, then just return the page string found, without any additional meta data.

Returns

mixed —

associative array for doc or just string of doc

getFileBlock()

getFileBlock() : mixed

Reads and returns a block of data from the current partition

Returns

mixed —

an uncompressed string from the current partition, null if the iterator is not set up, or false if EOF is reached.

fileRead()

fileRead(integer  $num_bytes) : string

Acts as gzread($archive_file, $num_bytes), hiding the fact that buffering of the archive_file is being done to a buffer file

Parameters

integer

$num_bytes

number of bytes to read from the archive file

Returns

string —

of length up to $num_bytes (less if eof occurs)
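The idea of serving reads from a cached block can be sketched as follows. `DemoBufferedReader` is a hypothetical toy: the "archive" is an in-memory string, the block size is 4 bytes, and a counter stands in for the expensive seek on the compressed file. It illustrates the buffering idea only and is not the class's code.

```php
<?php
/**
 * Toy block-buffered reader (hypothetical, not the Yioop code). Only one
 * block of BLOCK bytes is cached at a time.
 */
class DemoBufferedReader
{
    const BLOCK = 4; // tiny block size for illustration
    public $block_loads = 0; // counts simulated seeks on the archive
    private $source;
    private $position = 0;
    private $block_num = -1;
    private $block = "";

    public function __construct(string $source)
    {
        $this->source = $source;
    }

    /** Returns up to $num_bytes bytes, fewer if the end is reached. */
    public function read(int $num_bytes): string
    {
        $out = "";
        while ($num_bytes > 0 && $this->position < strlen($this->source)) {
            $needed_block = intdiv($this->position, self::BLOCK);
            if ($needed_block !== $this->block_num) {
                // Simulates the expensive seek-and-read on the archive
                $this->block = substr($this->source,
                    $needed_block * self::BLOCK, self::BLOCK);
                $this->block_num = $needed_block;
                $this->block_loads++;
            }
            $in_block = $this->position % self::BLOCK;
            $piece = substr($this->block, $in_block, $num_bytes);
            $out .= $piece;
            $this->position += strlen($piece);
            $num_bytes -= strlen($piece);
        }
        return $out;
    }
}

$reader = new DemoBufferedReader("abcdefghij");
$a = $reader->read(3);
$b = $reader->read(3);
$c = $reader->read(10); // asks for more than remains
```

Three reads touch three blocks, so the simulated archive is accessed once per block no matter how the caller slices its requests.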

fileGets()

fileGets() : string

Acts as gzgets(), hiding the fact that buffering of the archive_file is being done to a buffer file

Returns

string —

from archive file up to next line ending or eof

updateBuffer()

updateBuffer(string  $buffer = "", boolean  $return_string = false) : boolean

If reading from a gzbuffer file goes off the end of the current buffer, reads in the next block from archive file.

Parameters

string	$buffer
boolean	$return_string

Returns

boolean —

whether successfully read in next block or not

makeBuffer()

makeBuffer(string  $buffer = "", boolean  $return_string = false) : mixed

Reads in block $this->buffer_block_num of size self::BUFFER_SIZE from the archive file

Parameters

string	$buffer
boolean	$return_string

Returns

mixed —

whether successfully read in block or not

checkFileHandle()

checkFileHandle() : boolean

Checks whether this object has a valid handle to its archive's current partition

Returns

boolean —

whether it has or not (true if it has)

checkEof()

checkEof() : boolean

Checks if this object's archive's current partition is at an end of file

Returns

boolean —

whether end of file has been reached (true if it has)

fileOpen()

fileOpen(string  $filename, boolean  $make_buffer_if_needed = true)

Wrapper around the fopen function of the particular compression scheme

Parameters

string	$filename	name of file to open
boolean	$make_buffer_if_needed

fileClose()

fileClose()

Wrapper around the fclose function of the particular compression scheme

fileTell()

fileTell() : integer

Returns the current position in the current iterator partition file for the given compression scheme.

Returns

integer —

a position in the file that the iterator is currently processing

saveCheckPoint()

saveCheckPoint(array  $info = array())

Parameters

array

$info

any extra info a subclass wants to save

restoreCheckPoint()

restoreCheckPoint() : array

Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Text archive bundle iterator takes the unserialized data from the last check point and calls the compression specific restore checkpoint to further set up the iterator according to the given compression scheme.

Returns

array —

the data serialized when saveCheckpoint was called

getNextTagData()

getNextTagData(string  $tag) : string

Used to extract data between two tags. After the operation, $this->buffer holds the contents that follow the close tag.

Parameters

string

$tag

tag name to look for

Returns

string —

a string consisting of the start tag, contents, and close tag of name $tag

getNextTagsData()

getNextTagsData(array  $tags) : array

Used to extract data between two tags for the first tag found amongst the array of tags $tags. After the operation, $this->buffer holds the contents that follow the close tag.

Parameters

array

$tags

array of tagnames to look for

Returns

array —

of two elements: the first element is a string consisting of the start tag, contents, and close tag of the first tag found; the second is the name of the tag amongst $tags that was found
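The tag extraction behaviour described for getNextTagData() can be approximated with a stand-alone sketch. The function `demoNextTagData()` is hypothetical: it works on a buffer passed by reference where the real method uses $this->buffer, it does no refilling from the archive, and it matches tags with a naive string search.

```php
<?php
/**
 * Hypothetical stand-alone version: returns "<tag ...>contents</tag>" for
 * the first occurrence of $tag in $buffer and trims $buffer to what follows
 * the close tag. Returns false when no complete element is present.
 * The open tag is matched by prefix, so "<title" would also match "<titles".
 */
function demoNextTagData(string &$buffer, string $tag)
{
    $start = strpos($buffer, "<$tag");
    if ($start === false) {
        return false;
    }
    $close = "</$tag>";
    $end = strpos($buffer, $close, $start);
    if ($end === false) {
        return false;
    }
    $end += strlen($close);
    $data = substr($buffer, $start, $end - $start);
    $buffer = substr($buffer, $end); // keep only what follows the close tag
    return $data;
}

$buffer = "junk<title>Hello</title><text>Body</text>";
$first = demoNextTagData($buffer, "title");
```

After the call, `$first` holds the whole title element and `$buffer` starts at the text element, so a second call with "text" would pick up where the first left off.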