\seekquarry\yioop\library\DoubleIndexBundle

A DoubleIndexBundle encapsulates and provides methods for two IndexArchiveBundles used to store a repeating crawl. One of these bundles is used to handle current search queries, while the other is used to store an ongoing crawl. Once the crawl time has been reached, the roles of the two bundles are swapped.

Summary

Methods

__construct()
swapActiveBundle()
stopIndexingBundle()
swapTimeReached()
addPages()
addIndexData()
initGenerationToAdd()
addAdvanceGeneration()
addCurrentShardDictionary()
getCurrentShard()
setCurrentShard()
getPage()
forceSave()
countWordKeys()
setStartSchedule()
getStartSchedule()
getArchiveInfo()
setArchiveInfo()
getParamModifiedTime()

Properties

$repeat_frequency
$repeat_time
$swap_count
$active_archive
$active_archive_num
$description
$num_docs_per_generation

Constants

No constants found

All methods and properties listed above are public. No protected or private methods or properties were found.

Properties

$repeat_frequency

$repeat_frequency : integer

How frequently, in seconds, the live and ongoing archives should be swapped

Type

integer

$repeat_time

$repeat_time : integer

Last time live and ongoing archives were switched

Type

integer

$swap_count

$swap_count : integer

The number of times live and ongoing archives have swapped

Type

integer

$description

$description : string

A short text name for this DoubleIndexBundle

Type

string

$num_docs_per_generation

$num_docs_per_generation : integer

Number of docs before a new generation is started for an IndexArchiveBundle in this DoubleIndexBundle

Type

integer

Methods

__construct()

__construct(string  $dir_name, boolean  $read_only_archive = true, string  $description = null, integer  $num_docs_per_generation = \seekquarry\yioop\configs\NUM_DOCS_PER_GENERATION, integer  $repeat_frequency = 3600) 

Makes or initializes a DoubleIndexBundle with the provided parameters

Parameters

string $dir_name

folder name to store this bundle

boolean $read_only_archive

whether to open archive only for reading or reading and writing

string $description

a text name/serialized info about this IndexArchiveBundle

integer $num_docs_per_generation

the number of pages to be stored in a single shard

integer $repeat_frequency

how often the crawl should be redone in seconds (has no effect if $read_only_archive is true)

swapActiveBundle()

swapActiveBundle() 

Switches which of the two bundles is the one to which new index data will be written. Before switching, the old bundle is closed properly.
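The role switch can be pictured as toggling between two slots. The following is a minimal sketch, not Yioop's implementation: it assumes the active bundle is tracked as a 0/1 number (as the $active_archive_num property name suggests) and that a swap also updates $swap_count and $repeat_time. The class RoleToggle, its swap() method, and the $closed log are hypothetical names used only for illustration.

```php
<?php
// Hypothetical stand-in for the bookkeeping around a swap; not part of Yioop.
class RoleToggle
{
    public $active_archive_num = 0; // which of the two bundles receives new index data
    public $swap_count = 0;         // how many swaps have happened so far
    public $repeat_time = 0;        // timestamp of the most recent swap
    public $closed = [];            // bundle numbers that were closed, for illustration

    public function swap(int $now): void
    {
        // Close the bundle that has been receiving data before switching away from it
        $this->closed[] = $this->active_archive_num;
        // Toggle between 0 and 1
        $this->active_archive_num = 1 - $this->active_archive_num;
        $this->swap_count++;
        $this->repeat_time = $now;
    }
}

$toggle = new RoleToggle();
$toggle->swap(1000);
$toggle->swap(4600);
```

After two swaps the same bundle that started as the write target is the write target again, which is the point of a repeating crawl.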

stopIndexingBundle()

stopIndexingBundle() 

Used when a crawl stops to perform final dictionary operations to produce a working stand-alone index.

swapTimeReached()

swapTimeReached() : boolean

Checks whether the amount of time since the roles of the two IndexArchiveBundles in this DoubleIndexBundle were last swapped has exceeded the swap time for this bundle.

Returns

boolean —

true if the swap time has been exceeded
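The check amounts to comparing elapsed time against the swap interval. A minimal sketch follows, assuming the comparison uses $repeat_time (last swap) and $repeat_frequency (interval in seconds) and that "exceeded" means strictly greater than; the function swapTimeReachedSketch and its explicit $now argument are hypothetical, since the real method takes no parameters.

```php
<?php
// Hypothetical helper; the real method takes no arguments and reads the bundle's own state.
function swapTimeReachedSketch(int $repeat_time, int $repeat_frequency, int $now): bool
{
    // Seconds elapsed since the two bundles last traded roles
    $elapsed = $now - $repeat_time;
    // Assumption: the swap time counts as exceeded only when strictly past the interval
    return $elapsed > $repeat_frequency;
}

$last_swap = 1000;
$hourly = 3600; // matches the constructor's default $repeat_frequency
$too_early = swapTimeReachedSketch($last_swap, $hourly, $last_swap + 1800);
$due = swapTimeReachedSketch($last_swap, $hourly, $last_swap + 3601);
```

Passing the current time in explicitly keeps the sketch deterministic; a caller working with wall-clock time would pass time().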

addPages()

addPages(integer  $generation, string  $offset_field, array  &$pages, integer  $visited_urls_count) 

Adds the array of $pages to the summaries WebArchiveBundle of the active IndexArchiveBundle, storing them in the partition $generation. The resulting offsets are recorded in each page under the field named by $offset_field.

Parameters

integer $generation

field used to select partition

string $offset_field

field used to record offsets after storing

array &$pages

data to store

integer $visited_urls_count

number to add to the count of visited urls (visited urls is a smaller number than the total count of objects stored in the index).

addIndexData()

addIndexData(object  $index_shard) 

Adds the provided mini inverted index data to the active IndexArchiveBundle. Expects initGenerationToAdd to be called beforehand, so that the generation is correct.

Parameters

object $index_shard

a mini inverted index of word_key=>doc data to add to this IndexArchiveBundle

initGenerationToAdd()

initGenerationToAdd(integer  $add_num_docs, object  $callback = null, boolean  $blocking = false) : integer

Determines, based on its size, whether index_shard should be added to the active generation or whether a new generation should be started.

If a new generation is needed, it is started, the old generation is saved, the dictionary of the old shard is copied to the bundle's dictionary, and a log-merge is performed if needed.

Parameters

integer $add_num_docs

number of docs in the shard about to be added

object $callback

object with join function to be called if process is taking too long

boolean $blocking

whether there is an ongoing merge tiers operation occurring; if so, don't do anything and return -1

Returns

integer —

the active generation after the check and possible change has been performed
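The rollover decision can be sketched as a capacity check against $num_docs_per_generation. The following is a simplified model under stated assumptions, not Yioop's code: the class GenerationCounter and its chooseGeneration() method are hypothetical, the $callback and $blocking parameters are left out, and the document count is updated inside the same call only to keep the example short.

```php
<?php
// Hypothetical model of the rollover decision; not Yioop's implementation.
class GenerationCounter
{
    public $generation = 0;          // active generation number
    public $docs_in_generation = 0;  // docs already stored in the active generation
    public $num_docs_per_generation; // capacity before a new generation is started

    public function __construct(int $num_docs_per_generation)
    {
        $this->num_docs_per_generation = $num_docs_per_generation;
    }

    public function chooseGeneration(int $add_num_docs): int
    {
        // Roll over when the incoming shard would overfill the active generation
        if ($this->docs_in_generation + $add_num_docs >
            $this->num_docs_per_generation) {
            // The documented method also saves the old generation and
            // merges the old shard's dictionary at this point
            $this->generation++;
            $this->docs_in_generation = 0;
        }
        // Simplification: count the docs here rather than in a separate add step
        $this->docs_in_generation += $add_num_docs;
        return $this->generation;
    }
}

$counter = new GenerationCounter(100);
$first = $counter->chooseGeneration(60);
$second = $counter->chooseGeneration(60);
```

The second call would push the active generation past its capacity of 100 docs, so it lands in a fresh generation.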

addAdvanceGeneration()

addAdvanceGeneration(object  $callback = null) 

Starts a new generation. The dictionary of the old shard is copied to the bundle's dictionary, and a log-merge is performed if needed. This function may be called by initGenerationToAdd, as well as when resuming a crawl rather than loading the periodic index save of a too-large shard.

Parameters

object $callback

object with join function to be called if process is taking too long

addCurrentShardDictionary()

addCurrentShardDictionary(object  $callback = null) 

Adds the words from this shard to the dictionary

Parameters

object $callback

object with join function to be called if process is taking too long

getCurrentShard()

getCurrentShard(boolean  $force_read = false) : object

Returns the shard which is currently being used to read word-document data from the bundle. If one wants to write data to the bundle use getActiveShard() instead. The point of this method is to allow for lazy reading of the file associated with the shard.

Parameters

boolean $force_read

whether to force no advance generation and merge dictionary side effects

Returns

object —

the shard currently being indexed

setCurrentShard()

setCurrentShard(  $i,   $disk_based = false) 

Sets the current shard to be the $i-th shard in the index bundle.

Parameters

$i

which shard to set the current shard to be

$disk_based

whether to read the whole shard in before using it, or leave it on disk except for the pages needed

getPage()

getPage(integer  $offset, integer  $generation = -1) : array

Gets the page out of the summaries WebArchiveBundle with the given offset and generation

Parameters

integer $offset

byte offset in partition of desired page

integer $generation

which generation WebArchive to look up in; defaults to the same number as the current shard

Returns

array —

desired page

forceSave()

forceSave() 

Forces the current shard to be saved

countWordKeys()

countWordKeys(array  $word_keys) : array

Computes the number of occurrences of each of the supplied list of word_keys

Parameters

array $word_keys

keys to compute counts for

Returns

array —

associative array of key => count values.

setStartSchedule()

setStartSchedule(string  $dir_name, integer  $channel) 

The start schedule is the first schedule a queue server makes when a crawl has just started. To facilitate switching between IndexArchiveBundles during a crawl with a DoubleIndexBundle, this start schedule is stored in the DoubleIndexBundle. When the IndexArchiveBundles' roles (query and crawl) are swapped, the DoubleIndexBundle copy is used to start the crawl from the beginning again. This method copies the start schedule from the schedule folder to the DoubleIndexBundle at the start of a crawl, for later use when swapping.

Parameters

string $dir_name

folder in the bundle where the schedule should be stored

integer $channel

channel that is being used to do the current double index crawl. A typical Yioop instance might have several ongoing crawls, each with a different channel.

getStartSchedule()

getStartSchedule(string  $dir_name, integer  $channel) 

The start schedule is the first schedule a queue server makes when a crawl has just started. To facilitate switching between IndexArchiveBundles during a crawl with a DoubleIndexBundle, this start schedule is stored in the DoubleIndexBundle. When the IndexArchiveBundles' roles (query and crawl) are swapped, this method copies the start schedule from the DoubleIndexBundle back to the schedule folder to restart the crawl.

Parameters

string $dir_name

folder in the bundle where the schedule is stored

integer $channel

channel that is being used to do the current double index crawl. A typical Yioop instance might have several ongoing crawls, each with a different channel.
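The save-then-restore round trip performed by setStartSchedule() and getStartSchedule() can be illustrated with plain file copies. This is only a sketch of the idea: the file name StartSchedule.txt, the folder layout, and the helper functions saveStartSchedule() and restoreStartSchedule() are all hypothetical, and the $channel parameter is ignored.

```php
<?php
// Hypothetical file name and folder layout; only the copy-in/copy-out idea
// comes from the method descriptions.
const START_SCHEDULE_FILE = "StartSchedule.txt";

// Keep a copy of the first schedule inside the bundle folder
function saveStartSchedule(string $schedule_dir, string $bundle_dir): bool
{
    return copy("$schedule_dir/" . START_SCHEDULE_FILE,
        "$bundle_dir/" . START_SCHEDULE_FILE);
}

// Put the stored copy back so the crawl can start over after a swap
function restoreStartSchedule(string $bundle_dir, string $schedule_dir): bool
{
    return copy("$bundle_dir/" . START_SCHEDULE_FILE,
        "$schedule_dir/" . START_SCHEDULE_FILE);
}

$base = sys_get_temp_dir() . "/double_index_demo_" . uniqid();
$schedule_dir = "$base/schedules";
$bundle_dir = "$base/bundle";
mkdir($schedule_dir, 0777, true);
mkdir($bundle_dir, 0777, true);
file_put_contents("$schedule_dir/" . START_SCHEDULE_FILE, "seed urls");

$saved = saveStartSchedule($schedule_dir, $bundle_dir);
// Simulate the crawl consuming its first schedule
unlink("$schedule_dir/" . START_SCHEDULE_FILE);
$restored = restoreStartSchedule($bundle_dir, $schedule_dir);
$contents = file_get_contents("$schedule_dir/" . START_SCHEDULE_FILE);
```

Keeping the copy inside the bundle means the restart after a swap does not depend on the schedule folder still holding the original file.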

getArchiveInfo()

getArchiveInfo(string  $dir_name) : array

Gets information about a DoubleIndexBundle out of its status.txt file

Parameters

string $dir_name

folder name of the DoubleIndexBundle to get info for

Returns

array —

containing the name (description) of the DoubleIndexBundle, the number of items stored in it, and the number of WebArchive file partitions it uses.

setArchiveInfo()

setArchiveInfo(string  $dir_name, array  $info) 

Sets the archive info struct for the index archive and web archive bundles associated with this double index bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl like seed sites, timestamp, etc.), COUNT (num urls seen + pages seen stored for the index archive in use for crawling), VISITED_URLS_COUNT (number of pages seen for the index archive in use for crawling), QUERY_COUNT (num urls seen + pages seen stored for the index archive in use for querying, not crawling), QUERY_VISITED_URLS_COUNT (number of pages seen for the index archive in use for querying, not crawling), NUM_DOCS_PER_PARTITION (how many doc/web archive in bundle).

Parameters

string $dir_name

folder with archive bundle

array $info

struct with above fields
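As an illustration of the shape of $info, the sketch below builds an array with the fields named above and round-trips it through a file. The field names come from the description; every value is made up, the keys inside DESCRIPTION (seed_sites, timestamp) are hypothetical, and persisting with PHP's serialize() to a temporary file standing in for status.txt is an assumption rather than a statement about the library's on-disk format.

```php
<?php
// Illustrative values only; field names come from the setArchiveInfo() description.
$info = [
    // Hypothetical crawl parameters, stored serialized as the description says
    "DESCRIPTION" => serialize([
        "seed_sites" => ["https://www.example.com/"],
        "timestamp" => 1700000000,
    ]),
    "COUNT" => 1500,                    // crawling archive: urls seen + pages seen
    "VISITED_URLS_COUNT" => 1000,       // crawling archive: pages seen
    "QUERY_COUNT" => 1400,              // querying archive: urls seen + pages seen
    "QUERY_VISITED_URLS_COUNT" => 900,  // querying archive: pages seen
    "NUM_DOCS_PER_PARTITION" => 50000,
];

// Assumption: the struct is persisted with serialize(); a temp file stands in for status.txt
$status_file = tempnam(sys_get_temp_dir(), "status");
file_put_contents($status_file, serialize($info));
$restored = unserialize(file_get_contents($status_file));
unlink($status_file);
```

Tracking separate counts for the crawling and querying archives lets search results report the size of the index being queried while the other index is still growing.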

getParamModifiedTime()

getParamModifiedTime(string  $dir_name) 

Returns the last time the archive info of the bundle was modified.

Parameters

string $dir_name

folder with archive bundle