NO_LOAD_SIZE
Threshold beyond which we don't load the old index shard when restarting and instead just advance to a new shard
Encapsulates a set of web page summaries and an inverted word-index of terms from these summaries, which allows one to search for summaries containing a particular word.
The basic file structures for an IndexArchiveBundle are:
__construct(string $dir_name, boolean $read_only_archive = true, string $description = null, integer $num_docs_per_generation = \seekquarry\yioop\configs\NUM_DOCS_PER_GENERATION)
Makes or initializes an IndexArchiveBundle with the provided parameters
string | $dir_name | folder name to store this bundle |
boolean | $read_only_archive | whether to open archive only for reading or reading and writing |
string | $description | a text name/serialized info about this IndexArchiveBundle |
integer | $num_docs_per_generation | the number of pages to be stored in a single shard |
addPages(integer $generation, string $offset_field, \seekquarry\yioop\library\array& $pages, integer $visited_urls_count)
Adds the array of $pages to the summaries WebArchiveBundle, storing them in the partition $generation and recording the resulting offsets in the field given by $offset_field.
integer | $generation | field used to select partition |
string | $offset_field | field used to record offsets after storing |
\seekquarry\yioop\library\array& | $pages | data to store |
integer | $visited_urls_count | number to add to the count of visited urls (visited urls is a smaller number than the total count of objects stored in the index). |
addIndexData(object $index_shard)
Adds the provided mini inverted index data to the IndexArchiveBundle. Expects initGenerationToAdd to be called beforehand, so that the generation is correct.
object | $index_shard | a mini inverted index of word_key=>doc data to add to this IndexArchiveBundle |
initGenerationToAdd(integer $add_num_docs, object $callback = null, boolean $blocking = false) : integer
Determines, based on its size, whether index_shard should be added to the active generation or whether a new generation should be started.
If a new generation is needed, it is started, the old generation is saved, and the dictionary of the old shard is copied to the bundle's dictionary, with a log-merge performed if needed.
integer | $add_num_docs | number of docs in the shard about to be added |
object | $callback | object with join function to be called if process is taking too long |
boolean | $blocking | whether there is an ongoing merge tiers operation occurring, if so don't do anything and return -1 |
the active generation after the check and possible change has been performed
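The check described above can be sketched roughly as follows. This is a minimal illustration, not Yioop's actual implementation: the function name chooseGeneration, the parameters $active_generation, $docs_in_active and $docs_per_generation, and the use of a plain document-count comparison are all assumptions made for the example. It also assumes that a true $blocking flag simply means "do nothing and return -1".

```php
<?php
/**
 * Hypothetical stand-in for the generation check; not part of Yioop.
 * Returns the generation that would be active after the check,
 * or -1 if the caller asked us to back off.
 */
function chooseGeneration(int $active_generation, int $docs_in_active,
    int $add_num_docs, int $docs_per_generation,
    bool $blocking = false): int
{
    // Assumption: $blocking signals an ongoing merge, so do nothing.
    if ($blocking) {
        return -1;
    }
    // Assumption: "size" is measured as a document count.
    if ($docs_in_active + $add_num_docs > $docs_per_generation) {
        // The incoming shard would not fit, so a new generation starts.
        return $active_generation + 1;
    }
    // The incoming shard fits in the active generation.
    return $active_generation;
}
```

For instance, with a limit of 1000 documents per generation, adding 200 documents to a generation that already holds 900 would move to the next generation, while adding 50 would stay in the current one.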
addAdvanceGeneration(object $callback = null)
Starts a new generation; the dictionary of the old shard is copied to the bundle's dictionary and a log-merge is performed if needed. This function may be called by initGenerationToAdd, as well as when resuming a crawl, rather than loading the periodically saved index of a shard that is too large.
object | $callback | object with join function to be called if process is taking too long |
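The resume case ties back to the NO_LOAD_SIZE threshold: a saved shard larger than the threshold is not loaded, and a new generation is started instead. A minimal sketch of that decision, assuming the threshold is compared against the saved shard's size in bytes; the constant EXAMPLE_NO_LOAD_SIZE, its value, and the function shouldLoadSavedShard are hypothetical and do not appear in Yioop.

```php
<?php
// Hypothetical value; the real NO_LOAD_SIZE is not given in this document.
const EXAMPLE_NO_LOAD_SIZE = 50000000;

/**
 * Hypothetical helper: decides whether a saved shard should be loaded
 * when resuming, or skipped in favor of advancing to a new generation.
 */
function shouldLoadSavedShard(int $saved_shard_bytes,
    int $no_load_size = EXAMPLE_NO_LOAD_SIZE): bool
{
    // Beyond the threshold the old shard is not loaded.
    return $saved_shard_bytes <= $no_load_size;
}
```

When this returns false, the caller would be expected to advance the generation (the role addAdvanceGeneration plays) rather than read the oversized shard back in.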
getCurrentShard(boolean $force_read = false) : object
Returns the shard which is currently being used to read word-document data from the bundle. If one wants to write data to the bundle use getActiveShard() instead. The point of this method is to allow for lazy reading of the file associated with the shard.
boolean | $force_read | whether to force a read without the advance-generation and merge-dictionary side effects |
the index shard currently being used for reading
getPage(integer $offset, integer $generation = -1) : array
Gets the page out of the summaries WebArchiveBundle with the given offset and generation
integer | $offset | byte offset in partition of desired page |
integer | $generation | which generation WebArchive to look in; defaults to the same number as the current shard |
desired page
getArchiveInfo(string $dir_name) : array
Gets the description, count of summaries, and number of partitions of the summaries store in the supplied directory. If the file arc_description.txt exists, this is viewed as a dummy index archive for the sole purpose of allowing conversions of downloaded data such as arc files into Yioop! format.
string | $dir_name | path to a directory containing a summaries WebArchiveBundle |
summary of the given archive
setArchiveInfo(string $dir_name, array $info)
Sets the archive info struct for the web archive bundle associated with this bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl like seed sites, timestamp, etc.), COUNT (num urls seen + pages seen stored), VISITED_URLS_COUNT (number of pages seen while crawling), NUM_DOCS_PER_PARTITION (how many docs per web archive in the bundle).
string | $dir_name | folder with archive bundle |
array | $info | struct with above fields |
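As an illustration of the shape of the $info struct, the sketch below builds one with the four fields named above. It assumes the fields are string keys of a PHP array and that DESCRIPTION is produced with PHP's serialize(); the inner keys seed_sites and timestamp, all of the values, and the helper missingArchiveInfoFields are hypothetical and only serve the example.

```php
<?php
// Illustrative values only; the field names come from the description above.
$info = [
    // Assumption: crawl parameters are stored via PHP's serialize().
    'DESCRIPTION' => serialize([
        'seed_sites' => ['https://www.example.com/'],
        'timestamp' => 1700000000,
    ]),
    'COUNT' => 1500,                 // urls seen + pages stored
    'VISITED_URLS_COUNT' => 1000,    // pages seen while crawling
    'NUM_DOCS_PER_PARTITION' => 50000,
];

/**
 * Hypothetical helper: lists which of the documented fields are absent
 * from an $info array, so a caller can check it before saving.
 */
function missingArchiveInfoFields(array $info): array
{
    $required = ['DESCRIPTION', 'COUNT', 'VISITED_URLS_COUNT',
        'NUM_DOCS_PER_PARTITION'];
    return array_values(array_diff($required, array_keys($info)));
}
```

An array like this is what would be passed as the second argument of setArchiveInfo, alongside the folder of the archive bundle.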