$repeat_frequency
$repeat_frequency : integer
How frequently the live and the ongoing archive should be swapped, in seconds
A DoubleIndexBundle encapsulates and provides methods for two IndexArchiveBundles used to store a repeating crawl. One of these bundles is used to handle current search queries, while the other is used to store an ongoing crawl; once the crawl time has been reached, the roles of the two bundles are swapped
$active_archive : \seekquarry\yioop\library\IndexArchiveBundle
The internal IndexArchiveBundle which is active
$active_archive_num : integer
The number of the internal IndexArchiveBundle which is active
__construct(string $dir_name, boolean $read_only_archive = true, string $description = null, integer $num_docs_per_generation = \seekquarry\yioop\configs\NUM_DOCS_PER_GENERATION, integer $repeat_frequency = 3600)
Makes or initializes a DoubleIndexBundle with the provided parameters
string | $dir_name | folder name to store this bundle |
boolean | $read_only_archive | whether to open archive only for reading or reading and writing |
string | $description | a text name/serialized info about this IndexArchiveBundle |
integer | $num_docs_per_generation | the number of pages to be stored in a single shard |
integer | $repeat_frequency | how often the crawl should be redone in seconds (has no effect if $read_only_archive is true) |
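A minimal construction sketch based on the parameters above; the folder name and the use of the `NUM_DOCS_PER_GENERATION` constant mirror the signature, while the path itself is illustrative, not taken from the source:

```php
<?php
// Sketch: open a DoubleIndexBundle for reading and writing, with the
// query/crawl roles of its two IndexArchiveBundles swapped every hour.
use seekquarry\yioop\library\DoubleIndexBundle;
use seekquarry\yioop\configs as C;

$bundle = new DoubleIndexBundle(
    "/path/to/crawls/DoubleIndexDataExample", // $dir_name (illustrative)
    false,                      // $read_only_archive: allow writing
    "my repeating crawl",       // $description
    C\NUM_DOCS_PER_GENERATION,  // docs stored per shard
    3600                        // $repeat_frequency in seconds
);
```

For query-time use one would typically pass `true` for `$read_only_archive`, in which case `$repeat_frequency` has no effect.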
addPages(integer $generation, string $offset_field, array& $pages, integer $visited_urls_count)
Adds the array $pages to the summaries WebArchiveBundle pages of the active IndexArchiveBundle, storing them in the partition $generation and recording the resulting offsets in the field given by $offset_field.
integer | $generation | field used to select partition |
string | $offset_field | field used to record offsets after storing |
array& | $pages | data to store |
integer | $visited_urls_count | number to add to the count of visited urls (visited urls is a smaller number than the total count of objects stored in the index). |
addIndexData(object $index_shard)
Adds the provided mini inverted index data to the active IndexArchiveBundle. Expects initGenerationToAdd to have been called beforehand, so that the generation is correct
object | $index_shard | a mini inverted index of word_key=>doc data to add to this IndexArchiveBundle |
initGenerationToAdd(integer $add_num_docs, object $callback = null, boolean $blocking = false) : integer
Determines, based on its size, whether the index shard should be added to the active generation or whether a new generation should be started.
If a new generation is needed, it is started, the old generation is saved, and the dictionary of the old shard is copied to the bundle's dictionary, with a log-merge performed if needed
integer | $add_num_docs | number of docs in the shard about to be added |
object | $callback | object with join function to be called if process is taking too long |
boolean | $blocking | whether there is an ongoing merge tiers operation occurring, if so don't do anything and return -1 |
the active generation after the check and possible change has been performed
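The expected call order (initGenerationToAdd before addIndexData) might look like the following sketch; `$index_shard` stands for a mini inverted index built elsewhere, and the treatment of the -1 return value is an assumption based on the $blocking description above:

```php
<?php
// Sketch: check that the shard fits the active generation before adding it.
$generation = $bundle->initGenerationToAdd(
    $index_shard->num_docs, // number of docs about to be added
    $queue_server,          // object with a join() method, used if slow
    false                   // $blocking: no ongoing merge-tiers operation
);
if ($generation != -1) {
    // Generation was advanced and saved if necessary; safe to add.
    $bundle->addIndexData($index_shard);
}
```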
addAdvanceGeneration(object $callback = null)
Starts a new generation: the dictionary of the old shard is copied to the bundle's dictionary and a log-merge is performed if needed. This function may be called by initGenerationToAdd, as well as when resuming a crawl, rather than loading the periodically saved index of a too-large shard.
object | $callback | object with join function to be called if process is taking too long |
getCurrentShard(boolean $force_read = false) : object
Returns the shard which is currently being used to read word-document data from the bundle. If one wants to write data to the bundle use getActiveShard() instead. The point of this method is to allow for lazy reading of the file associated with the shard.
boolean | $force_read | whether to force no advance generation and merge dictionary side effects |
the index shard currently being used for reading
getPage(integer $offset, integer $generation = -1) : array
Gets the page out of the summaries WebArchiveBundle with the given offset and generation
integer | $offset | byte offset in partition of desired page |
integer | $generation | which generation WebArchive to look in; defaults to the same number as the current shard |
desired page
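Reading a stored summary back out might look like the following sketch; the offset would normally come from word-document data in the current shard, so the literal value here is purely illustrative:

```php
<?php
// Sketch: fetch a page summary from the query-side WebArchiveBundle.
$shard = $bundle->getCurrentShard(); // lazily loads the shard file
$offset = 0; // illustrative; normally taken from a posting in $shard
$page = $bundle->getPage($offset); // $generation defaults to current shard's
```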
setStartSchedule(string $dir_name, integer $channel)
The start schedule is the first schedule a queue server makes when a crawl is started. To facilitate switching between IndexArchiveBundles when doing a crawl with a DoubleIndexBundle, this start schedule is stored in the DoubleIndexBundle; when the IndexArchiveBundles' roles (query and crawl) are swapped, the DoubleIndexBundle copy is used to start the crawl from the beginning again. This method copies the start schedule from the schedule folder to the DoubleIndexBundle at the start of a crawl, for later use in this swapping
string | $dir_name | folder in the bundle where the schedule should be stored |
integer | $channel | channel that is being used to do the current double index crawl. A typical Yioop instance might have several ongoing crawls, each with a different channel |
getStartSchedule(string $dir_name, integer $channel)
The start schedule is the first schedule a queue server makes when a crawl is started. To facilitate switching between IndexArchiveBundles when doing a crawl with a DoubleIndexBundle, this start schedule is stored in the DoubleIndexBundle. When the IndexArchiveBundles' roles (query and crawl) are swapped, this method copies the start schedule from the DoubleIndexBundle back to the schedule folder to restart the crawl
string | $dir_name | folder in the bundle where the schedule is stored |
integer | $channel | channel that is being used to do the current double index crawl. A typical Yioop instance might have several ongoing crawls, each with a different channel |
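Together, these two methods let a queue server replay the original start schedule each time the bundles' roles swap. A hypothetical sketch (the folder name and channel number are illustrative, and which folder each argument refers to is an assumption based on the parameter descriptions above):

```php
<?php
// Sketch: when the crawl first starts, stash the start schedule
// inside the DoubleIndexBundle for later reuse.
$bundle->setStartSchedule("schedules", 0); // folder in bundle, channel 0
// ... later, when the query and crawl bundles swap roles, copy the
// stored start schedule back out so the crawl restarts from its seeds.
$bundle->getStartSchedule("schedules", 0);
```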
getArchiveInfo(string $dir_name) : array
Gets information about a DoubleIndexBundle out of its status.txt file
string | $dir_name | folder name of the DoubleIndexBundle to get info for |
containing the name (description) of the DoubleIndexBundle, the number of items stored in it, and the number of WebArchive file partitions it uses.
setArchiveInfo(string $dir_name, array $info)
Sets the archive info struct for the index archive and web archive bundles associated with this double index bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl, like seed sites, timestamp, etc.), COUNT (num urls seen + pages stored for the index archive in use for crawling), VISITED_URLS_COUNT (number of pages seen for the index archive in use for crawling), QUERY_COUNT (num urls seen + pages stored for the index archive in use for querying, not crawling), QUERY_VISITED_URLS_COUNT (number of pages seen for the index archive in use for querying, not crawling), NUM_DOCS_PER_PARTITION (how many docs per web archive in the bundle).
string | $dir_name | folder with archive bundle |
array | $info | struct with above fields |
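A sketch of inspecting and updating this struct; treating getArchiveInfo/setArchiveInfo as static is an assumption suggested by their $dir_name parameter, and the field keys are the ones listed above:

```php
<?php
// Sketch: read the status-file-backed info struct, adjust it, write it back.
use seekquarry\yioop\library\DoubleIndexBundle;

$dir_name = "/path/to/crawls/DoubleIndexDataExample"; // illustrative
$info = DoubleIndexBundle::getArchiveInfo($dir_name);
echo $info["COUNT"] . " items stored\n";   // urls seen + pages stored
$info["DESCRIPTION"] = "renamed crawl";    // illustrative modification
DoubleIndexBundle::setArchiveInfo($dir_name, $info);
```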