$repeat_frequency
$repeat_frequency : integer
How frequently the live and the ongoing archive should be swapped, in seconds
A DoubleIndexBundle encapsulates and provides methods for two IndexArchiveBundles used to store a repeating crawl. One of these bundles is used to handle current search queries, while the other is used to store an ongoing crawl; once the crawl time has been reached, the roles of the two bundles are swapped
$active_archive : \seekquarry\yioop\library\IndexArchiveBundle
The internal IndexArchiveBundle which is active
$active_archive_num : integer
The number of the internal IndexArchiveBundle which is active
__construct(string $dir_name, boolean $read_only_archive = true, string $description = null, integer $num_docs_per_generation = \seekquarry\yioop\configs\NUM_DOCS_PER_GENERATION, integer $repeat_frequency = 3600)
Makes or initializes a DoubleIndexBundle with the provided parameters
string | $dir_name | folder name to store this bundle |
boolean | $read_only_archive | whether to open archive only for reading or reading and writing |
string | $description | a text name/serialized info about this IndexArchiveBundle |
integer | $num_docs_per_generation | the number of pages to be stored in a single shard |
integer | $repeat_frequency | how often the crawl should be redone in seconds (has no effect if $read_only_archive is true) |
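A minimal construction sketch based on the parameters above; the folder name and the use of the `NUM_DOCS_PER_GENERATION` constant mirror the signature, while the path itself is illustrative, not taken from the source:

```php
<?php
// Sketch: open a DoubleIndexBundle for reading and writing, with the
// query/crawl roles of its two IndexArchiveBundles swapped every hour.
use seekquarry\yioop\library\DoubleIndexBundle;
use seekquarry\yioop\configs as C;

$bundle = new DoubleIndexBundle(
    "/path/to/crawls/DoubleIndexDataExample", // $dir_name (illustrative)
    false,                      // $read_only_archive: allow writing
    "my repeating crawl",       // $description
    C\NUM_DOCS_PER_GENERATION,  // docs stored per shard
    3600                        // $repeat_frequency in seconds
);
```

For query-time use one would typically pass `true` for `$read_only_archive`, in which case `$repeat_frequency` has no effect.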
addPages(integer $generation, string $offset_field, array& $pages, integer $visited_urls_count)
Adds the array $pages to the summaries WebArchiveBundle pages of the active IndexArchiveBundle, storing them in the partition $generation and recording the resulting offsets in the field given by $offset_field.
integer | $generation | field used to select partition |
string | $offset_field | field used to record offsets after storing |
array& | $pages | data to store |
integer | $visited_urls_count | number to add to the count of visited urls (visited urls is a smaller number than the total count of objects stored in the index). |
addIndexData(object $index_shard)
Adds the provided mini inverted index data to the active IndexArchiveBundle. Expects initGenerationToAdd to have been called beforehand, so that the generation is correct
object | $index_shard | a mini inverted index of word_key=>doc data to add to this IndexArchiveBundle |
initGenerationToAdd(integer $add_num_docs, object $callback = null, boolean $blocking = false) : integer
Determines, based on its size, whether the index shard should be added to the active generation or whether a new generation should be started.
If a new generation is needed, it is started, the old generation is saved, and the dictionary of the old shard is copied to the bundle's dictionary, with a log-merge performed if needed
integer | $add_num_docs | number of docs in the shard about to be added |
object | $callback | object with join function to be called if process is taking too long |
boolean | $blocking | whether there is an ongoing merge tiers operation occurring, if so don't do anything and return -1 |
the active generation after the check and possible change has been performed
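The expected call order (initGenerationToAdd before addIndexData) might look like the following sketch; `$index_shard` stands for a mini inverted index built elsewhere, and the treatment of the -1 return value is an assumption based on the $blocking description above:

```php
<?php
// Sketch: check that the shard fits the active generation before adding it.
$generation = $bundle->initGenerationToAdd(
    $index_shard->num_docs, // number of docs about to be added
    $queue_server,          // object with a join() method, used if slow
    false                   // $blocking: no ongoing merge-tiers operation
);
if ($generation != -1) {
    // Generation was advanced and saved if necessary; safe to add.
    $bundle->addIndexData($index_shard);
}
```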
addAdvanceGeneration(object $callback = null)
Starts a new generation: the dictionary of the old shard is copied to the bundle's dictionary and a log-merge is performed if needed. This function may be called by initGenerationToAdd, as well as when resuming a crawl, rather than loading the periodically saved index of a too-large shard.
object | $callback | object with join function to be called if process is taking too long |
getCurrentShard(boolean $force_read = false) : object
Returns the shard which is currently being used to read word-document data from the bundle. If one wants to write data to the bundle use getActiveShard() instead. The point of this method is to allow for lazy reading of the file associated with the shard.
boolean | $force_read | whether to force no advance generation and merge dictionary side effects |
the index shard currently being used for reading
getPage(integer $offset, integer $generation = -1) : array
Gets the page out of the summaries WebArchiveBundle with the given offset and generation
integer | $offset | byte offset in partition of desired page |
integer | $generation | which generation WebArchive to look in; defaults to the same number as the current shard |
desired page
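Reading a stored summary back out might look like the following sketch; the offset would normally come from word-document data in the current shard, so the literal value here is purely illustrative:

```php
<?php
// Sketch: fetch a page summary from the query-side WebArchiveBundle.
$shard = $bundle->getCurrentShard(); // lazily loads the shard file
$offset = 0; // illustrative; normally taken from a posting in $shard
$page = $bundle->getPage($offset); // $generation defaults to current shard's
```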
setStartSchedule(string $dir_name, integer $channel)
The start schedule is the first schedule a queue server makes when a crawl is started. To facilitate switching between IndexArchiveBundles when doing a crawl with a DoubleIndexBundle, this start schedule is stored in the DoubleIndexBundle; when the IndexArchiveBundles' roles (query and crawl) are swapped, the DoubleIndexBundle copy is used to start the crawl from the beginning again. This method copies the start schedule from the schedule folder to the DoubleIndexBundle at the start of a crawl, for later use in this swapping
string | $dir_name | folder in the bundle where the schedule should be stored |
integer | $channel | channel that is being used to do the current double index crawl. A typical Yioop instance might have several ongoing crawls, each with a different channel |
getStartSchedule(string $dir_name, integer $channel)
The start schedule is the first schedule a queue server makes when a crawl is started. To facilitate switching between IndexArchiveBundles when doing a crawl with a DoubleIndexBundle, this start schedule is stored in the DoubleIndexBundle. When the IndexArchiveBundles' roles (query and crawl) are swapped, this method copies the start schedule from the DoubleIndexBundle back to the schedule folder to restart the crawl
string | $dir_name | folder in the bundle where the schedule is stored |
integer | $channel | channel that is being used to do the current double index crawl. A typical Yioop instance might have several ongoing crawls, each with a different channel |
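Together, these two methods let a queue server replay the original start schedule each time the bundles' roles swap. A hypothetical sketch (the folder name and channel number are illustrative, and which folder each argument refers to is an assumption based on the parameter descriptions above):

```php
<?php
// Sketch: when the crawl first starts, stash the start schedule
// inside the DoubleIndexBundle for later reuse.
$bundle->setStartSchedule("schedules", 0); // folder in bundle, channel 0
// ... later, when the query and crawl bundles swap roles, copy the
// stored start schedule back out so the crawl restarts from its seeds.
$bundle->getStartSchedule("schedules", 0);
```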
getArchiveInfo(string $dir_name) : array
Gets information about a DoubleIndexBundle out of its status.txt file
string | $dir_name | folder name of the DoubleIndexBundle to get info for |
containing the name (description) of the DoubleIndexBundle, the number of items stored in it, and the number of WebArchive file partitions it uses.
setArchiveInfo(string $dir_name, array $info)
Sets the archive info struct for the index archive and web archive bundles associated with this double index bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl, like seed sites, timestamp, etc.), COUNT (num urls seen + pages stored for the index archive in use for crawling), VISITED_URLS_COUNT (number of pages seen for the index archive in use for crawling), QUERY_COUNT (num urls seen + pages stored for the index archive in use for querying, not crawling), QUERY_VISITED_URLS_COUNT (number of pages seen for the index archive in use for querying, not crawling), NUM_DOCS_PER_PARTITION (how many docs per web archive in the bundle).
string | $dir_name | folder with archive bundle |
array | $info | struct with above fields |
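A sketch of inspecting and updating this struct; treating getArchiveInfo/setArchiveInfo as static is an assumption suggested by their $dir_name parameter, and the field keys are the ones listed above:

```php
<?php
// Sketch: read the status-file-backed info struct, adjust it, write it back.
use seekquarry\yioop\library\DoubleIndexBundle;

$dir_name = "/path/to/crawls/DoubleIndexDataExample"; // illustrative
$info = DoubleIndexBundle::getArchiveInfo($dir_name);
echo $info["COUNT"] . " items stored\n";   // urls seen + pages stored
$info["DESCRIPTION"] = "renamed crawl";    // illustrative modification
DoubleIndexBundle::setArchiveInfo($dir_name, $info);
```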