\seekquarry\yioop\library\archive_bundle_iterators\MediaWikiArchiveBundleIterator

Used to iterate through a collection of .xml.bz2 MediaWiki files stored in a WebArchiveBundle folder. These MediaWiki files contain the kinds of documents used by Wikipedia. The purpose of iterating is to build an index of these records.

Summary

Methods

__construct()
setIniInfo()
weight()
reset()
nextChunk()
updatePartition()
nextPages()
nextPage()
getFileBlock()
fileRead()
fileGets()
updateBuffer()
makeBuffer()
checkFileHandle()
checkEof()
fileOpen()
fileClose()
fileTell()
saveCheckPoint()
restoreCheckPoint()
getNextTagData()
getNextTagsData()
saveCheckpoint()
restoreCheckpoint()
seekPage()
readMediaWikiHeader()
initializeSubstitutions()
getTextContent()

Properties

$iterate_dir
$num_partitions
$current_partition_num
$current_page_num
$current_offset
$partitions
$fh
$buffer
$start_delimiter
$end_delimiter
$status_filename
$buffer_fh
$buffer_block_num
$buffer_filename
$switch_partition_callback_name
$ini
$iterate_timestamp
$result_timestamp
$end_of_iterator
$result_dir
$parser

Constants

BUFFER_SIZE
MAX_RECORD_SIZE
WIKI_PAGE_STYLES

File: src/library/archive_bundle_iterators/MediaWikiArchiveBundleIterator.php
Package: Default
Class hierarchy: \seekquarry\yioop\library\archive_bundle_iterators\ArchiveBundleIterator > \seekquarry\yioop\library\archive_bundle_iterators\TextArchiveBundleIterator > \seekquarry\yioop\library\archive_bundle_iterators\MediaWikiArchiveBundleIterator
See also: \seekquarry\yioop\library\archive_bundle_iterators\WebArchiveBundle

Constants

BUFFER_SIZE

BUFFER_SIZE

Number of bytes read at a time from the current archive file into the buffer file. The value 8192 matches BZip2BlockIterator::BLOCK_SIZE.

MAX_RECORD_SIZE

MAX_RECORD_SIZE

Estimate of the maximum size of a record stored in a text archive. Data in archives is split into chunks of BUFFER_SIZE plus two record sizes, which gives a two-record overlap between successive chunks. The overlap ensures that records crossing the basic chunk boundary of BUFFER_SIZE are still processed.
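The chunk arithmetic described above can be sketched as follows. This is only an illustration: the MAX_RECORD_SIZE value used here is an assumed placeholder (the real value is not stated in this page), and chunkAt() is a hypothetical helper, not a method of the iterator.

```php
<?php
// BUFFER_SIZE comes from the constant documented above.
const BUFFER_SIZE = 8192;
// Assumed placeholder; the real MAX_RECORD_SIZE is not given in this page.
const MAX_RECORD_SIZE = 1024;

/**
 * Hypothetical helper: returns the chunk that starts at block $block_num.
 * Each chunk holds BUFFER_SIZE bytes plus an overlap of two record sizes,
 * so a record straddling the block boundary is still seen whole.
 */
function chunkAt(string $data, int $block_num): string
{
    $start = $block_num * BUFFER_SIZE;
    $length = BUFFER_SIZE + 2 * MAX_RECORD_SIZE;
    return substr($data, $start, $length);
}

// Three blocks of dummy data
$data = str_repeat("x", 3 * BUFFER_SIZE);
$first = chunkAt($data, 0); // BUFFER_SIZE bytes plus the overlap
$last = chunkAt($data, 2);  // final block: no data left to overlap into
```

Successive chunks start BUFFER_SIZE bytes apart, so the trailing 2 * MAX_RECORD_SIZE bytes of one chunk are read again at the start of the next.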

WIKI_PAGE_STYLES

WIKI_PAGE_STYLES

Used to define the styles applied to cached wiki pages

Properties

$iterate_dir

$iterate_dir : string

The path to the directory containing the archive partitions to be iterated over.

Type

string

$num_partitions

$num_partitions : integer

The number of arc files in this arc archive bundle

Type

integer

$current_partition_num

$current_partition_num : integer

Counting in glob order for this arc archive bundle directory, the number of the arc file currently being processed.

Type

integer

$current_page_num

$current_page_num : integer

Current number of pages into the current arc file

Type

integer

$current_offset

$current_offset : integer

Current byte offset into the current arc file

Type

integer

$partitions

$partitions : array

Array of filenames of arc files in this directory (glob order)

Type

array

$fh

$fh : resource

File handle for current archive file

Type

resource

$buffer

$buffer : string

Used to buffer data from the currently opened file

Type

string

$start_delimiter

$start_delimiter : string

Starting delimiters for records

Type

string

$end_delimiter

$end_delimiter : string

Ending delimiters for records

Type

string

$status_filename

$status_filename : string

File name to write this archive iterator status messages to

Type

string

$buffer_fh

$buffer_fh : resource

If gzip is being used a buffer file is also employed to try to reduce the number of calls to gzseek. $buffer_fh is a filehandle for the buffer file

Type

resource

$buffer_block_num

$buffer_block_num : integer

Which block of self::BUFFER_SIZE from the current archive file is stored in the file $this->buffer_filename

Type

integer

$buffer_filename

$buffer_filename : string

Name of a buffer file to be used to reduce gzseek calls in the case where gzip compression is being used

Type

string

$switch_partition_callback_name

$switch_partition_callback_name : string

Name of the function to be called whenever the partition that the iterator is reading changes. The point of the callback is to read meta information at the start of the new partition

Type

string

$ini

$ini : array

Contains basic parameters of how this iterator works: compression, start and stop delimiter. Typically, this data is read from the arc_description.ini file

Type

array

$iterate_timestamp

$iterate_timestamp : integer

Timestamp of the archive that is being iterated over

Type

integer

$result_timestamp

$result_timestamp : integer

Timestamp of the archive that is being used to store results in

Type

integer

$end_of_iterator

$end_of_iterator : boolean

Whether or not the iterator still has more documents

Type

boolean

$result_dir

$result_dir : string

The path to the directory where the iteration status is stored.

Type

string

$parser

$parser : object

Used to hold a WikiParser object that will be used for parsing

Type

object

Methods

__construct()

__construct(string  $iterate_timestamp, string  $iterate_dir, string  $result_timestamp, string  $result_dir)

Creates a media wiki archive iterator with the given parameters.

Parameters

string	$iterate_timestamp	timestamp of the arc archive bundle to iterate over the pages of
string	$iterate_dir	folder of files to iterate over
string	$result_timestamp	timestamp of the arc archive bundle results are being stored in
string	$result_dir	where to write last position checkpoints to

setIniInfo()

setIniInfo(array  $ini)

Mutator method for controlling how this text archive iterator behaves. Normally, data on compression and on the start and stop delimiters is read from an ini file. This method reads it from the supplied array instead.

Parameters

array	$ini	configuration settings for this archive iterator

weight()

weight(  $site) : integer

Estimates the importance of the site according to the weighting of the particular archive iterator

Parameters

	$site	an associative array containing info about a web page

Returns

integer —

a 4-bit number based on log_2(size) - 10 for the wiki entry (@see nextPage).
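A minimal sketch of how such a weight could be computed from a page size. The function name sizeWeight(), the use of the floor of the logarithm, and the clamping to the range 0..15 are assumptions made for illustration; the page only states that the result is a 4-bit number derived from log_2 of the size minus 10.

```php
<?php
/**
 * Hypothetical helper: maps a page size in bytes to a 4-bit weight.
 * Uses floor(log2(size)) - 10 and clamps the result to 0..15
 * (both the floor and the clamp are assumed, not taken from the source).
 */
function sizeWeight(int $size_in_bytes): int
{
    // Integer floor(log2) by counting right shifts, avoiding float rounding
    $log = 0;
    while (($size_in_bytes >>= 1) > 0) {
        $log++;
    }
    return max(0, min(15, $log - 10));
}

$small = sizeWeight(100);     // below 1 KiB, clamped up to 0
$medium = sizeWeight(4096);   // log2 is 12, so weight 2
$huge = sizeWeight(1 << 30);  // log2 is 30, clamped down to 15
```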

reset()

reset()

Resets the iterator to the start of the archive bundle

nextChunk()

nextChunk() : array

Called to get the next chunk of BUFFER_SIZE + 2 * MAX_RECORD_SIZE bytes of data from the text archive. This data is returned unprocessed in self::ARC_DATA together with ini and header information about the archive. This method is typically called in the name server setting from FetchController.

Returns

array —

with contents as described above

updatePartition()

updatePartition(array  &$info)

Helper function for nextChunk to advance the partition if we are at the end of the current archive file

Parameters

array	&$info	a struct with data about the current chunk; its start partition flag will be updated

nextPages()

nextPages(integer  $num, boolean  $no_process = false) : array

Gets the next $num many docs from the iterator

Parameters

integer	$num	number of docs to get
boolean	$no_process	do not do any processing on page data

Returns

array —

associative arrays for $num pages
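The batch loop that a caller would typically run over nextPages() can be sketched with a stand-in class. DemoIterator is hypothetical and mimics only the two members used here; the "title" key and the assumption that $end_of_iterator becomes true once the documents are exhausted are illustrative, not taken from this page.

```php
<?php
// Hypothetical stand-in; it only imitates nextPages() and $end_of_iterator.
class DemoIterator
{
    public $end_of_iterator = false;
    private $docs;

    public function __construct(array $docs)
    {
        $this->docs = $docs;
    }

    /** Returns up to $num docs and flags the end once none are left. */
    public function nextPages(int $num): array
    {
        $pages = array_splice($this->docs, 0, $num);
        $this->end_of_iterator = (count($this->docs) === 0);
        return $pages;
    }
}

$iterator = new DemoIterator([
    ["title" => "A"], ["title" => "B"], ["title" => "C"],
]);
$titles = [];
// Pull pages in batches of two until the iterator reports it is done
while (!$iterator->end_of_iterator) {
    foreach ($iterator->nextPages(2) as $page) {
        $titles[] = $page["title"];
    }
}
```

The last batch may contain fewer than $num entries, so callers should iterate over what is returned rather than assume a full batch.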

nextPage()

nextPage(boolean  $no_process = false) : array

Gets the next doc from the iterator

Parameters

boolean	$no_process	do not do any processing on page data

Returns

array —

associative array for doc or string if no_process true

getFileBlock()

getFileBlock() : mixed

Reads and returns a block of data from the current partition

Returns

mixed —

an uncompressed string from the current partition, null if the iterator is not set up, or false if EOF has been reached.

fileRead()

fileRead(integer  $num_bytes) : string

Acts as gzread($archive_file, $num_bytes), hiding the fact that buffering of the archive_file is being done to a buffer file

Parameters

integer	$num_bytes	number of bytes to read from the archive file

Returns

string —

of length up to $num_bytes (less if eof occurs)

fileGets()

fileGets() : string

Acts as gzgets(), hiding the fact that buffering of the archive_file is being done to a buffer file

Returns

string —

from archive file up to next line ending or eof

updateBuffer()

updateBuffer(string  $buffer = "", boolean  $return_string = false) : boolean

If reading from a gzbuffer file goes off the end of the current buffer, reads in the next block from archive file.

Parameters

string	$buffer
boolean	$return_string

Returns

boolean —

whether successfully read in next block or not

makeBuffer()

makeBuffer(string  $buffer = "", boolean  $return_string = false) : mixed

Reads in block $this->buffer_block_num of size self::BUFFER_SIZE from the archive file

Parameters

string	$buffer
boolean	$return_string

Returns

mixed —

whether successfully read in block or not

checkFileHandle()

checkFileHandle() : boolean

Checks whether this object has a valid handle to the current partition of its archive

Returns

boolean —

whether it has a valid handle (true if it has)

checkEof()

checkEof() : boolean

Checks if this object's archive's current partition is at an end of file

Returns

boolean —

whether end of file has been reached (true if it has)

fileOpen()

fileOpen(string  $filename, boolean  $make_buffer_if_needed = true)

Wrapper around the fopen function of the particular compression scheme in use

Parameters

string	$filename	name of file to open
boolean	$make_buffer_if_needed

fileClose()

fileClose()

Wrapper around the fclose function of the particular compression scheme in use

fileTell()

fileTell() : integer

Returns the current position in the current iterator partition file for the given compression scheme.

Returns

integer —

the position in the file that the iterator is currently processing

saveCheckPoint()

saveCheckPoint(array  $info = array())

Stores the current progress to the file iterate_status.txt in the result dir such that a new instance of the iterator could be constructed and return the next set of pages without having to process all of the pages that came before. Each iterator should make a call to saveCheckpoint after extracting a batch of pages.

Parameters

array	$info	any extra info a subclass wants to save

restoreCheckPoint()

restoreCheckPoint() : array

Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. The regex substitutions are also set up again.

Returns

array —

the data serialized when saveCheckpoint was called
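The checkpoint round trip described by saveCheckPoint() and restoreCheckPoint() can be sketched as below. saveProgress() and restoreProgress() are hypothetical stand-ins, and the use of PHP's serialize() as the on-disk format is an assumption; only the file name iterate_status.txt and the idea of saving and restoring serialized state come from this page.

```php
<?php
// Hypothetical stand-in: writes the iterator state to iterate_status.txt.
function saveProgress(string $result_dir, array $info): void
{
    file_put_contents($result_dir . "/iterate_status.txt", serialize($info));
}

// Hypothetical stand-in: reads the state back, or [] if none was saved.
function restoreProgress(string $result_dir): array
{
    $status_file = $result_dir . "/iterate_status.txt";
    if (!file_exists($status_file)) {
        return []; // no checkpoint yet, so start from the beginning
    }
    $info = unserialize(file_get_contents($status_file));
    return is_array($info) ? $info : [];
}

// Use a throwaway directory so the sketch does not touch real data
$result_dir = sys_get_temp_dir() . "/iterator_demo_" . uniqid();
mkdir($result_dir);
saveProgress($result_dir, [
    "current_partition_num" => 2,
    "current_page_num" => 150,
    "current_offset" => 40960,
]);
$state = restoreProgress($result_dir);
```

A fresh iterator instance that reads $state can resume after page 150 of partition 2 instead of reprocessing everything before it.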

getNextTagData()

getNextTagData(string  $tag) : string

Used to extract data between two tags. After operation $this->buffer has contents after the close tag.

Parameters

string	$tag	tag name to look for

Returns

string —

the start tag, its contents, and the close tag of the next element named $tag
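A minimal sketch of this extract-and-advance behaviour on an in-memory buffer. nextTagData() is a hypothetical stand-in: unlike the real method, it never refills the buffer from the archive file, and returning false when no complete element is present is an assumption.

```php
<?php
/**
 * Hypothetical stand-in. Returns the next "<tag ...>contents</tag>"
 * element found in $buffer and leaves $buffer holding whatever follows
 * the close tag. Returns false if no complete element is present.
 */
function nextTagData(string &$buffer, string $tag)
{
    // Require whitespace or ">" after the name so "page" skips "<pages>"
    $pattern = '/<' . preg_quote($tag, '/') . '[\s>]/';
    if (!preg_match($pattern, $buffer, $match, PREG_OFFSET_CAPTURE)) {
        return false;
    }
    $start = $match[0][1];
    $close_tag = "</" . $tag . ">";
    $end = strpos($buffer, $close_tag, $start);
    if ($end === false) {
        return false;
    }
    $end += strlen($close_tag);
    $data = substr($buffer, $start, $end - $start);
    $buffer = substr($buffer, $end); // keep only what follows the close tag
    return $data;
}

$buffer = "<pages><page><title>A</title></page><page><title>B</title></page>";
$first = nextTagData($buffer, "page");
$second = nextTagData($buffer, "page");
$third = nextTagData($buffer, "page");
```

Because the buffer is passed by reference, each call consumes one element, which is what lets repeated calls walk through the pages in order.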

getNextTagsData()

getNextTagsData(array  $tags) : array

Used to extract data between two tags for the first tag found amongst the array of tags $tags. After operation $this->buffer has contents after the close tag.

Parameters

array	$tags	array of tagnames to look for

Returns

array —

of two elements: the first element is a string consisting of start tag contents close tag of first tag found, the second has the name of the tag amongst $tags found
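The multi-tag variant can be sketched in the same way: find whichever of the candidate tags opens first in the buffer and return that element together with its name. nextTagsData() is a hypothetical stand-in that works on an in-memory string only, and returning false when nothing is found is an assumption.

```php
<?php
/**
 * Hypothetical stand-in. Among $tags, finds the tag that opens earliest
 * in $buffer and returns [element string, tag name], leaving $buffer
 * with the contents after the close tag. Returns false if none is found.
 */
function nextTagsData(string &$buffer, array $tags)
{
    $best_pos = null;
    $best_tag = null;
    foreach ($tags as $tag) {
        $pattern = '/<' . preg_quote($tag, '/') . '[\s>]/';
        if (preg_match($pattern, $buffer, $match, PREG_OFFSET_CAPTURE)) {
            $pos = $match[0][1];
            if ($best_pos === null || $pos < $best_pos) {
                $best_pos = $pos; // earliest opening tag seen so far
                $best_tag = $tag;
            }
        }
    }
    if ($best_tag === null) {
        return false;
    }
    $close_tag = "</" . $best_tag . ">";
    $end = strpos($buffer, $close_tag, $best_pos);
    if ($end === false) {
        return false;
    }
    $end += strlen($close_tag);
    $data = substr($buffer, $best_pos, $end - $best_pos);
    $buffer = substr($buffer, $end);
    return [$data, $best_tag];
}

$buffer = "<siteinfo><sitename>Demo</sitename></siteinfo>" .
    "<page><title>A</title></page>";
list($data, $found) = nextTagsData($buffer, ["page", "siteinfo"]);
```

The order of names in $tags does not matter in this sketch; position in the buffer decides which element is returned.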

saveCheckpoint()

saveCheckpoint(array  $info = array())

Parameters

array	$info	any extra info a subclass wants to save

restoreCheckpoint()

restoreCheckpoint() : array

Restores the internal state from the file iterate_status.txt in the result dir such that the next call to nextPages will pick up from just after the last checkpoint. Each iterator should make a call to restoreCheckpoint at the end of the constructor method after the instance members have been initialized.

Returns

array —

the data serialized when saveCheckpoint was called

seekPage()

seekPage(  $limit)

Advances the iterator to page number $limit, with as little additional processing as possible

Parameters

	$limit	page to advance to

readMediaWikiHeader()

readMediaWikiHeader()

Reads the siteinfo tag of the MediaWiki XML file and extracts data that will be used in constructing page summaries.

initializeSubstitutions()

initializeSubstitutions(string  $base_address)

Used to initialize the arrays of match/replacements used to format MediaWiki syntax into HTML (not perfectly, since only regexes are used)

Parameters

string	$base_address	base url for link substitutions
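The match/replacement approach can be sketched with a handful of rules passed to preg_replace(). The rules and the function wikiToHtml() are illustrative assumptions; the iterator's actual substitution list is not shown in this page. The sketch also assumes $base_address contains no "$" or backslash characters, since those would be interpreted inside a replacement string.

```php
<?php
/**
 * Hypothetical helper: converts a few wiki markup constructs to HTML
 * using parallel arrays of regex patterns and replacements.
 */
function wikiToHtml(string $text, string $base_address): string
{
    $substitutions = [
        // '''bold''' must run before ''italic'' so the quotes pair up
        ["/'''(.+?)'''/s", '<b>$1</b>'],
        ["/''(.+?)''/s", '<i>$1</i>'],
        // [[Target|label]] first, then the plain [[Target]] form
        ['/\[\[([^\]|]+)\|([^\]]+)\]\]/',
            '<a href="' . $base_address . '$1">$2</a>'],
        ['/\[\[([^\]|]+)\]\]/',
            '<a href="' . $base_address . '$1">$1</a>'],
    ];
    $matches = array_column($substitutions, 0);
    $replaces = array_column($substitutions, 1);
    // preg_replace applies each pattern in order to the running result
    return preg_replace($matches, $replaces, $text);
}

$html = wikiToHtml("'''PHP''' is a [[Scripting language|language]]", "/wiki/");
```

Rule order matters with this technique: a later pattern sees the output of the earlier ones, which is why the more specific forms are listed first.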

getTextContent()

getTextContent(object  $dom,   $path) : string

Gets the text content of the first dom node satisfying the xpath expression $path in the dom document $dom

Parameters

object	$dom	DOMDocument to get the text from
	$path	xpath expression to find node with text

Returns

string —

text content of the given node if it exists
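A minimal sketch of this lookup using PHP's DOM extension. firstTextContent() is a hypothetical stand-in, and returning an empty string when no node matches is an assumption; the page only says the text is returned if the node exists.

```php
<?php
/**
 * Hypothetical stand-in: returns the text content of the first node
 * matching the XPath expression $path, or "" if nothing matches.
 */
function firstTextContent(DOMDocument $dom, string $path): string
{
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->evaluate($path);
    if ($nodes instanceof DOMNodeList && $nodes->length > 0) {
        return $nodes->item(0)->textContent;
    }
    return "";
}

$dom = new DOMDocument();
$dom->loadXML(
    "<page><title>Main Page</title>" .
    "<revision><text>Hello</text></revision></page>"
);
$title = firstTextContent($dom, "/page/title");
$text = firstTextContent($dom, "/page/revision/text");
$missing = firstTextContent($dom, "/page/comment");
```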