\seekquarry\yioop\library\WebArchiveBundle

A web archive bundle is a collection of web archives which are managed together. It is useful to split data across several archive files rather than store it all in one, both for read efficiency and to keep file sizes from getting too big. In some places 4-byte integers are used to store file offsets, which restricts the size of the files that can be used for web archives.
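To make the size restriction concrete, the sketch below computes the largest offset that fits in 4 bytes and round-trips one offset through PHP's pack() and unpack(). The unsigned big-endian encoding (format code "N") is an assumption made for illustration; the text above only says that 4-byte integers are used in some places. The example also assumes a 64-bit PHP build, where these values are ordinary integers.

```php
<?php
// Largest byte offset that fits in 4 bytes if the value is read as unsigned.
$max_unsigned_offset = 2 ** 32 - 1; // 4294967295, just under 4 GiB

// If the same 4 bytes are read as a signed integer, the limit is about 2 GiB.
$max_signed_offset = 2 ** 31 - 1; // 2147483647

// Round-trip an offset through a 4-byte encoding. The "N" format (unsigned,
// big-endian) is an assumption; the byte order actually used is not stated.
$offset = 3000000000;
$packed = pack("N", $offset);
$unpacked = unpack("N", $packed)[1];

echo strlen($packed), " bytes hold offset ", $unpacked, "\n";
```

Any record that starts beyond the 4-byte limit cannot be addressed, which is one reason to cap the number of documents written to a single archive file.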

Summary

Methods: __construct(), addPages(), setWritePartition(), getPage(), getPartition(), initCountIfNotExists(), addCount(), getArchiveInfo(), setArchiveInfo(), getParamModifiedTime()

Properties: $dir_name, $partition, $count, $write_partition, $description, $compressor, $read_only_archive, $version

Constants: No constants found

Protected: No protected methods found. No protected properties found.

Private: No private methods found. No private properties found.

Properties

$dir_name

$dir_name : string

Folder name to use for this WebArchiveBundle

Type

string

$partition

$partition : array

Used to contain the WebArchive partitions of the bundle

Type

array

$count

$count : integer

Total number of page objects stored by this WebArchiveBundle

Type

integer

$write_partition

$write_partition : integer

The index of the partition to which new documents will be added

Type

integer

$description

$description : string

A short text name for this WebArchiveBundle

Type

string

$compressor

$compressor : object

Compressor object used to compress/uncompress data stored in the bundle

Type

object

$read_only_archive

$read_only_archive : boolean

Controls whether the archive was opened in read only mode

Type

boolean

$version

$version : integer

What version of web archive bundle this is

Type

integer

Methods

__construct()

__construct(string  $dir_name, boolean  $read_only_archive = true, integer  $num_docs_per_partition = \seekquarry\yioop\configs\NUM_DOCS_PER_GENERATION, string  $description = null, string  $compressor = "GzipCompressor") 

Creates a new WebArchiveBundle, or initializes an existing one, with the given characteristics

Parameters

string $dir_name

folder name of the bundle

boolean $read_only_archive

whether to open the archive in a read-only mode suitable for obtaining search results, or in a read-write mode as used during a crawl

integer $num_docs_per_partition

number of documents to store in a partition before the bundle switches to a new web archive

string $description

a short text name/description of this WebArchiveBundle

string $compressor

name of the Compressor class used to compress/uncompress data stored in the bundle

addPages()

addPages(string  $offset_field, array  &$pages) : integer

Adds the array of $pages to the WebArchiveBundle. The pages are stored in the partition given by the current write partition, and the resulting offsets are recorded in the field named by $offset_field.

Parameters

string $offset_field

field used to record offsets after storing

array &$pages

data to store

Returns

integer —

the write_partition the pages were stored in
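The behavior described for addPages() can be illustrated with a small in-memory model. Everything in the sketch below is hypothetical: ToyBundle is not the Yioop class, each partition is modeled as a plain string rather than a WebArchive file, and the rule for moving to a new partition (switch when a batch would exceed the per-partition document limit) is an assumption.

```php
<?php
// Hypothetical in-memory stand-in for a bundle; not Yioop's implementation.
class ToyBundle
{
    public $partitions = [""];   // each partition modeled as one byte string
    public $doc_counts = [0];    // number of pages stored in each partition
    public $write_partition = 0; // index of the partition receiving new pages
    public $num_docs_per_partition;

    public function __construct($num_docs_per_partition)
    {
        $this->num_docs_per_partition = $num_docs_per_partition;
    }

    // Stores $pages in the write partition, records each page's byte offset
    // in $pages[...][$offset_field], and returns the partition index used.
    public function addPages($offset_field, array &$pages)
    {
        $i = $this->write_partition;
        // Assumed rollover rule: open a new partition when this batch would
        // push a non-empty partition past its document limit.
        if ($this->doc_counts[$i] > 0 && $this->doc_counts[$i] +
            count($pages) > $this->num_docs_per_partition) {
            $i = ++$this->write_partition;
            $this->partitions[$i] = "";
            $this->doc_counts[$i] = 0;
        }
        foreach ($pages as $key => $page) {
            $pages[$key][$offset_field] = strlen($this->partitions[$i]);
            $this->partitions[$i] .= serialize($page);
        }
        $this->doc_counts[$i] += count($pages);
        return $i;
    }
}

$bundle = new ToyBundle(3);
$first_batch = [["url" => "a"], ["url" => "b"]];
$second_batch = [["url" => "c"], ["url" => "d"]];
$first_partition = $bundle->addPages("offset", $first_batch);
$second_partition = $bundle->addPages("offset", $second_batch);
```

Because $pages is passed by reference, the caller sees the recorded offsets in its own array after the call, which matches the by-reference parameter in the signature above.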

setWritePartition()

setWritePartition(integer  $i) 

Sets the write partition to the provided value and, if this is not a read-only archive, stores this value persistently in the archive info

Parameters

integer $i

the number of the current write partition

getPage()

getPage(integer  $offset, integer  $partition) : array

Gets a page from WebArchive $partition using the provided byte $offset, reusing an existing file handle if possible.

Parameters

integer $offset

byte offset of page data

integer $partition

which WebArchive to look in

Returns

array —

desired page
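A page is addressed by a pair: which partition to look in, and the byte offset within it. The sketch below shows that kind of lookup on temporary files. The record layout (a 4-byte length followed by a serialized array) and the helper names toyWritePage() and toyGetPage() are assumptions for illustration, not the format or API that WebArchive uses.

```php
<?php
// Hypothetical record layout: 4-byte big-endian length, then serialized page.
function toyWritePage($fh, array $page)
{
    fseek($fh, 0, SEEK_END);
    $offset = ftell($fh); // byte offset where this record starts
    $record = serialize($page);
    fwrite($fh, pack("N", strlen($record)) . $record);
    return $offset;
}

// Seeks to $offset in an already open handle and reads one record back.
function toyGetPage($fh, $offset)
{
    fseek($fh, $offset);
    $length = unpack("N", fread($fh, 4))[1];
    return unserialize(fread($fh, $length));
}

// Two temporary files stand in for two partitions of a bundle.
$partitions = [tmpfile(), tmpfile()];
$first = toyWritePage($partitions[1], ["url" => "https://example.com/a"]);
$second = toyWritePage($partitions[1], ["url" => "https://example.com/b"]);

// Look up the second page by (offset, partition), reusing the open handle.
$page = toyGetPage($partitions[1], $second);
echo $page["url"], "\n";
```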

getPartition()

getPartition(integer  $index, boolean  $fast_construct = true) : object

Gets an object encapsulating the $index-th WebArchive partition in this bundle.

Parameters

integer $index

the number of the partition within this bundle to return

boolean $fast_construct

tells the constructor of the WebArchive to avoid reading in its info block.

Returns

object —

the WebArchive file which was requested

initCountIfNotExists()

initCountIfNotExists(string  $field = "COUNT") 

Creates a new counter to be maintained in the description.txt file if the counter doesn't exist; leaves it unchanged otherwise

Parameters

string $field

field of info struct to add a counter for

addCount()

addCount(integer  $num, string  $field = "COUNT") 

Updates the description file with the current count for the number of items in the WebArchiveBundle. If the $field argument is used, counts of additional properties (say, visited urls versus total urls) can be maintained.

Parameters

integer $num

number of items to add to current count

string $field

field of info struct whose count is to be increased
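The counter behavior of initCountIfNotExists() and addCount() can be sketched on a plain array. The helpers below are hypothetical and keep the info struct in memory only, whereas the documented methods maintain it in the description.txt file; the field name "VISITED_URLS_COUNT" is likewise just an example of an additional counter.

```php
<?php
// Hypothetical helpers working on an in-memory info array.
function toyInitCountIfNotExists(array &$info, $field = "COUNT")
{
    // Create the counter only if it is missing; an existing value is kept.
    if (!isset($info[$field])) {
        $info[$field] = 0;
    }
}

function toyAddCount(array &$info, $num, $field = "COUNT")
{
    toyInitCountIfNotExists($info, $field);
    $info[$field] += $num;
}

$info = ["DESCRIPTION" => "demo crawl"];
toyAddCount($info, 10);                        // default COUNT field
toyAddCount($info, 4, "VISITED_URLS_COUNT");   // example extra counter
toyAddCount($info, 6, "VISITED_URLS_COUNT");
toyInitCountIfNotExists($info, "COUNT");       // leaves COUNT unchanged
```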

getArchiveInfo()

getArchiveInfo(string  $dir_name) : array

Gets information about a WebArchiveBundle out of its description.txt file

Parameters

string $dir_name

folder name of the WebArchiveBundle to get info for

Returns

array —

containing the name (description) of the WebArchiveBundle, the number of items stored in it, and the number of WebArchive file partitions it uses.

setArchiveInfo()

setArchiveInfo(string  $dir_name, array  $info) 

Sets the archive info (DESCRIPTION, COUNT, NUM_DOCS_PER_PARTITION) for this web archive

Parameters

string $dir_name

folder with archive bundle

array $info

struct with above fields

getParamModifiedTime()

getParamModifiedTime(string  $dir_name) 

Returns the last time the archive info of the bundle was modified.

Parameters

string $dir_name

folder with archive bundle
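The three methods getArchiveInfo(), setArchiveInfo(), and getParamModifiedTime() all take a bundle folder and work with its description.txt file. A minimal sketch of that pattern follows. The helper names are hypothetical, and both the serialize() storage format and the use of the file's modification time are assumptions; the documentation above does not state how the info is encoded or how the modified time is obtained.

```php
<?php
// Hypothetical helpers; the on-disk format here is an assumption.
function toySetArchiveInfo($dir_name, array $info)
{
    file_put_contents($dir_name . "/description.txt", serialize($info));
}

function toyGetArchiveInfo($dir_name)
{
    $path = $dir_name . "/description.txt";
    if (!file_exists($path)) {
        return []; // no info stored yet
    }
    return unserialize(file_get_contents($path));
}

function toyGetParamModifiedTime($dir_name)
{
    $path = $dir_name . "/description.txt";
    clearstatcache(); // avoid a stale cached modification time
    return file_exists($path) ? filemtime($path) : false;
}

// Round-trip an info struct through a scratch folder.
$dir_name = sys_get_temp_dir() . "/toy_bundle_" . uniqid();
mkdir($dir_name);
toySetArchiveInfo($dir_name, [
    "DESCRIPTION" => "demo crawl",
    "COUNT" => 0,
    "NUM_DOCS_PER_PARTITION" => 50000,
]);
$info = toyGetArchiveInfo($dir_name);
$modified = toyGetParamModifiedTime($dir_name);

// Clean up the scratch folder.
unlink($dir_name . "/description.txt");
rmdir($dir_name);
```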