$dir_name
$dir_name : string
Folder name to use for this WebArchiveBundle
A web archive bundle is a collection of web archives which are managed together. It is useful to split data across several archive files rather than store it all in one, both for read efficiency and to keep file sizes from getting too big. In some places 4-byte ints are used to store file offsets, which restricts the size of the files that can be used for web archives.
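To make the file-size restriction concrete, here is a minimal sketch of what a 4-byte offset can address. The helpers `packOffset` and `unpackOffset` are hypothetical names, not part of the WebArchiveBundle API, and the sketch assumes an unsigned, big-endian encoding on 64-bit PHP; the library's actual on-disk encoding may differ.

```php
<?php
// Hypothetical helpers for illustration; not part of WebArchiveBundle.
// Assumes 64-bit PHP and an unsigned big-endian layout (an assumption).

/** Packs a file offset into exactly 4 bytes, or throws if it cannot fit. */
function packOffset(int $offset): string
{
    if ($offset < 0 || $offset > 0xFFFFFFFF) {
        throw new RangeException("Offset $offset does not fit in 4 bytes");
    }
    return pack("N", $offset);
}

/** Unpacks a 4-byte big-endian string back into an integer offset. */
function unpackOffset(string $bytes): int
{
    return unpack("N", $bytes)[1];
}

// The largest addressable offset is 2^32 - 1, just under 4 GiB.
$max_offset = 0xFFFFFFFF;
$packed = packOffset($max_offset);
echo strlen($packed), " bytes hold offsets up to ", unpackOffset($packed), "\n";
```

If the offsets were stored as signed values instead, the limit would be roughly half of this. Either way, capping the number of documents per partition keeps each file well inside the addressable range.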
__construct(string $dir_name, boolean $read_only_archive = true, integer $num_docs_per_partition = \seekquarry\yioop\configs\NUM_DOCS_PER_GENERATION, string $description = null, string $compressor = "GzipCompressor")
Creates a new WebArchiveBundle, or initializes an existing one, with the given characteristics
string | $dir_name | folder name of the bundle |
boolean | $read_only_archive | whether to open the archive in a read-only mode suitable for obtaining search results, or in a read-write mode as used during a crawl |
integer | $num_docs_per_partition | number of documents stored in a partition before a new web archive partition is started |
string | $description | a short text name/description of this WebArchiveBundle |
string | $compressor | name of the Compressor class used to compress/uncompress data stored in the bundle |
addPages(string $offset_field, array &$pages) : integer
Adds the array of $pages to the WebArchiveBundle. The pages are stored in the current write partition, and the resulting offsets are recorded in the field given by $offset_field.
string | $offset_field | field used to record offsets after storing |
array& | $pages | data to store |
the write_partition the pages were stored in
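The interplay between the write partition, the per-partition document limit, and the offset field can be sketched with a toy in-memory model. `ToyBundle` and all of its members are hypothetical and do not mirror the real WebArchiveBundle internals; in particular, the rule that a batch is never split across partitions is an assumption made for this sketch.

```php
<?php
// Toy in-memory model for illustration only; ToyBundle is hypothetical.
class ToyBundle
{
    public $partitions = [""];   // raw bytes held by each partition
    public $counts = [0];        // number of documents in each partition
    public $write_partition = 0; // partition currently being written to
    private $num_docs_per_partition;

    public function __construct(int $num_docs_per_partition)
    {
        $this->num_docs_per_partition = $num_docs_per_partition;
    }

    public function addPages(string $offset_field, array &$pages): int
    {
        $current = $this->write_partition;
        // Assumed policy: if the batch would overflow a non-empty
        // partition, start a new partition and store the batch there.
        if ($this->counts[$current] > 0 &&
            $this->counts[$current] + count($pages) >
            $this->num_docs_per_partition) {
            $current = ++$this->write_partition;
            $this->partitions[$current] = "";
            $this->counts[$current] = 0;
        }
        foreach ($pages as $i => $page) {
            // Record where this page starts, then append its bytes.
            $pages[$i][$offset_field] = strlen($this->partitions[$current]);
            $this->partitions[$current] .= serialize($page);
            $this->counts[$current]++;
        }
        return $current;
    }
}

$bundle = new ToyBundle(3);
$first = [["url" => "a"], ["url" => "b"]];
$second = [["url" => "c"], ["url" => "d"]];
$p1 = $bundle->addPages("offset", $first);  // fits in partition 0
$p2 = $bundle->addPages("offset", $second); // would overflow, so rolls over
echo "first batch in partition $p1, second batch in partition $p2\n";
```

Because $pages is passed by reference, the caller sees the offsets written back into each page after the call, which is what later lookups by offset and partition rely on.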
getPage(integer $offset, integer $partition) : array
Gets a page from the WebArchive $partition using the provided byte $offset, reusing an existing file handle if possible.
integer | $offset | byte offset of page data |
integer | $partition | which WebArchive to look in |
desired page
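Looking a page up by byte offset amounts to a seek followed by a read. The sketch below shows the idea on a temporary file. The record layout (a 4-byte length prefix followed by a serialized array) and the function names `appendPage` and `readPageAt` are assumptions for illustration, not the actual WebArchive format.

```php
<?php
// Hypothetical record layout: 4-byte big-endian length, then the
// serialized page. This is not the real WebArchive file format.

/** Appends a page to the open file and returns the offset it starts at. */
function appendPage($fh, array $page): int
{
    fseek($fh, 0, SEEK_END);
    $offset = ftell($fh);
    $data = serialize($page);
    fwrite($fh, pack("N", strlen($data)) . $data);
    return $offset;
}

/** Seeks to $offset and reads back the single page stored there. */
function readPageAt($fh, int $offset): array
{
    fseek($fh, $offset);
    $len = unpack("N", fread($fh, 4))[1];
    return unserialize(fread($fh, $len));
}

$fh = tmpfile(); // one handle is reused for every write and read
$off_a = appendPage($fh, ["url" => "http://example.com/a"]);
$off_b = appendPage($fh, ["url" => "http://example.com/b"]);
$page_b = readPageAt($fh, $off_b);
$page_a = readPageAt($fh, $off_a);
echo $page_b["url"], "\n";
fclose($fh);
```

Reusing one open handle across reads avoids reopening the partition file for every lookup, which matters when many pages are fetched from the same partition.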
getPartition(integer $index, boolean $fast_construct = true) : object
Gets an object encapsulating the $index-th WebArchive partition in this bundle.
integer | $index | the number of the partition within this bundle to return |
boolean | $fast_construct | tells the constructor of the WebArchive to avoid reading in its info block |
the WebArchive file which was requested
addCount(integer $num, string $field = "COUNT")
Updates the description file with the current count for the number of items in the WebArchiveBundle. If the $field argument is used, counts of additional properties (say, visited URLs versus total URLs) can be maintained.
integer | $num | number of items to add to current count |
string | $field | field of the info struct whose count is increased |
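The counting behaviour can be sketched as a read-modify-write on a small info struct. In this sketch the struct is a PHP array persisted with `serialize`; that storage format, the function name `toyAddCount`, and the field name `VISITED_URLS_COUNT` are assumptions for illustration rather than the library's confirmed behaviour.

```php
<?php
// Hypothetical stand-in for addCount; the serialized-array storage format
// is an assumption, not the confirmed layout of the description file.
function toyAddCount(string $file, int $num, string $field = "COUNT"): void
{
    $raw = is_file($file) ? file_get_contents($file) : "";
    $info = ($raw === "" || $raw === false) ? [] : unserialize($raw);
    // A field that has never been counted starts from zero.
    $info[$field] = ($info[$field] ?? 0) + $num;
    file_put_contents($file, serialize($info));
}

$file = tempnam(sys_get_temp_dir(), "desc");
toyAddCount($file, 5);                       // default COUNT field
toyAddCount($file, 2);                       // accumulates into COUNT
toyAddCount($file, 3, "VISITED_URLS_COUNT"); // separate, named counter
$info = unserialize(file_get_contents($file));
unlink($file);
print_r($info);
```

Keeping each counter under its own key lets several statistics share one description file without interfering with the default count.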
getArchiveInfo(string $dir_name) : array
Gets information about a WebArchiveBundle out of its description.txt file
string | $dir_name | folder name of the WebArchiveBundle to get info for |
containing the name (description) of the WebArchiveBundle, the number of items stored in it, and the number of WebArchive file partitions it uses.