NO_LOAD_SIZE
Threshold beyond which we don't load the old index shard when restarting and instead just advance to a new shard.
Subclass of IndexArchiveBundle with bloom filters that make it easy to check whether a news feed item is already in the bundle before adding it.
The basic file structures for an IndexArchiveBundle are:
$filter_a : \seekquarry\yioop\library\BloomFilterFile
Used to store unique identifiers of feed items that have been stored in this FeedArchiveBundle. filter_a is the filter consulted when checking whether an item is already in the archive. Once it holds URL_FILTER_SIZE/2 items, new identifiers are added to filter_b as well as to filter_a. When filter_a reaches URL_FILTER_SIZE items, it is deleted, filter_b is renamed to filter_a, and the process repeats.
$filter_b : \seekquarry\yioop\library\BloomFilterFile
Auxiliary BloomFilterFile used in checking if feed items are in this archive or not.
@see $filter_a
__construct(string $dir_name, boolean $read_only_archive = true, string $description = null, integer $num_docs_per_generation = \seekquarry\yioop\configs\NUM_DOCS_PER_GENERATION)
Makes or initializes a FeedArchiveBundle with the provided parameters
string | $dir_name | folder name to store this bundle |
boolean | $read_only_archive | whether to open archive only for reading or reading and writing |
string | $description | a text name/serialized info about this IndexArchiveBundle |
integer | $num_docs_per_generation | the number of pages to be stored in a single shard |
addPages(integer $generation, string $offset_field, \seekquarry\yioop\library\array& $pages, integer $visited_urls_count)
Adds the array of $pages to the summaries WebArchiveBundle, storing them in the partition $generation and recording the resulting offsets in the field given by $offset_field.
integer | $generation | field used to select partition |
string | $offset_field | field used to record offsets after storing |
\seekquarry\yioop\library\array& | $pages | data to store |
integer | $visited_urls_count | number to add to the count of visited urls (visited urls is a smaller number than the total count of objects stored in the index). |
addIndexData(object $index_shard)
Adds the provided mini inverted index data to the IndexArchiveBundle. Expects initGenerationToAdd to have been called beforehand, so that the generation is correct.
object | $index_shard | a mini inverted index of word_key=>doc data to add to this IndexArchiveBundle |
initGenerationToAdd(integer $add_num_docs, object $callback = null, boolean $blocking = false) : integer
Determines, based on its size, whether index_shard should be added to the active generation or whether a new generation should be started.
If a new generation is needed, it is started, the old generation is saved, the dictionary of the old shard is copied to the bundle's dictionary, and a log-merge is performed if needed.
integer | $add_num_docs | number of docs in the shard about to be added |
object | $callback | object with join function to be called if process is taking too long |
boolean | $blocking | whether there is an ongoing merge tiers operation occurring, if so don't do anything and return -1 |
the active generation after the check and possible change has been performed
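The decision described above can be sketched as follows. This is a minimal illustration and not the library's code: the helper name `pickGeneration`, the `$docs_in_active_shard` and `$merge_in_progress` parameters, and the use of `$num_docs_per_generation` as the size limit are all assumptions made for the example.

```php
<?php
/**
 * Hypothetical helper (not part of Yioop): returns the generation that a
 * shard of $add_num_docs documents would go into, or -1 if a merge is in
 * progress and nothing should be done.
 */
function pickGeneration(
    int $active_generation,
    int $docs_in_active_shard,
    int $add_num_docs,
    int $num_docs_per_generation,
    bool $merge_in_progress = false
): int {
    // An ongoing merge means we do nothing and signal that with -1.
    if ($merge_in_progress) {
        return -1;
    }
    // Assumed size test: the new docs would overflow the active shard,
    // so a new generation has to be started.
    if ($docs_in_active_shard + $add_num_docs > $num_docs_per_generation) {
        return $active_generation + 1;
    }
    // Otherwise the active generation still has room.
    return $active_generation;
}

$stay = pickGeneration(3, 900, 50, 1000);
$advance = pickGeneration(3, 900, 200, 1000);
$blocked = pickGeneration(3, 900, 200, 1000, true);
```

In this sketch only the generation number is computed; saving the old generation, copying its dictionary, and the log-merge are left out.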
addAdvanceGeneration(object $callback = null)
Starts a new generation: the dictionary of the old shard is copied to the bundle's dictionary and a log-merge is performed if needed. This function may be called by initGenerationToAdd, and also when resuming a crawl, in place of loading the periodic index save of a shard that is too large.
object | $callback | object with join function to be called if process is taking too long |
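The resume-time check that NO_LOAD_SIZE implies might look roughly like this. It is only a sketch under stated assumptions: the helper `shouldAdvanceOnResume` is hypothetical, the threshold is assumed to be a file size in bytes, and the limits used in the demo are made up, since the actual value of NO_LOAD_SIZE is not given here.

```php
<?php
/**
 * Hypothetical helper (not part of Yioop): decides whether a saved shard
 * file is too large to reload on restart, in which case the caller would
 * advance to a new generation instead.
 */
function shouldAdvanceOnResume(string $shard_file, int $no_load_size): bool
{
    // A missing file means there is nothing to load or to skip.
    if (!file_exists($shard_file)) {
        return false;
    }
    // Assumption: the threshold is compared against the size in bytes.
    return filesize($shard_file) > $no_load_size;
}

// Demo with a temporary 2048-byte file standing in for a saved shard.
$shard_file = tempnam(sys_get_temp_dir(), "shard");
file_put_contents($shard_file, str_repeat("x", 2048));
$advance_small_limit = shouldAdvanceOnResume($shard_file, 1024);
$advance_large_limit = shouldAdvanceOnResume($shard_file, 4096);
unlink($shard_file);
```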
getCurrentShard(boolean $force_read = false) : object
Returns the shard which is currently being used to read word-document data from the bundle. If one wants to write data to the bundle use getActiveShard() instead. The point of this method is to allow for lazy reading of the file associated with the shard.
boolean | $force_read | whether to force a read without the advance-generation and merge-dictionary side effects |
the index shard currently in use
getPage(integer $offset, integer $generation = -1) : array
Gets the page out of the summaries WebArchiveBundle with the given offset and generation
integer | $offset | byte offset in partition of desired page |
integer | $generation | which generation's WebArchive to look in; defaults to the same number as the current shard |
desired page
getArchiveInfo(string $dir_name) : array
Gets the description, count of summaries, and number of partitions of the summaries store in the supplied directory. If the file arc_description.txt exists, this is viewed as a dummy index archive for the sole purpose of allowing conversions of downloaded data such as arc files into Yioop! format.
string | $dir_name | path to a directory containing a summaries WebArchiveBundle |
summary of the given archive
setArchiveInfo(string $dir_name, array $info)
Sets the archive info struct for the web archive bundle associated with this bundle. This struct has fields like: DESCRIPTION (serialized store of global parameters of the crawl like seed sites, timestamp, etc), COUNT (num urls seen + pages seen stored), VISITED_URLS_COUNT (number of pages seen while crawling), NUM_DOCS_PER_PARTITION (how many docs per web archive in the bundle).
string | $dir_name | folder with archive bundle |
array | $info | struct with above fields |
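As an illustration of the struct's shape, an `$info` array could be built along these lines. The field names come from the description above; the string keys, the contents of the crawl parameters, and every number are invented for the example and may not match what the library stores.

```php
<?php
// Hypothetical crawl parameters; the real DESCRIPTION contents may differ.
$crawl_params = [
    "seed_sites" => ["https://www.example.com/feed"],
    "timestamp" => 1700000000,
];

// Illustrative values only. VISITED_URLS_COUNT is kept smaller than COUNT,
// matching the note that visited urls are fewer than stored objects.
$info = [
    "DESCRIPTION" => serialize($crawl_params),
    "COUNT" => 1500,
    "VISITED_URLS_COUNT" => 1200,
    "NUM_DOCS_PER_PARTITION" => 50000,
];

// The description round-trips through PHP's serialize/unserialize.
$restored = unserialize($info["DESCRIPTION"]);
```

An array like this would then be passed as the second argument of setArchiveInfo, alongside the bundle's folder.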
addPagesAndSeenKeys(integer $generation, string $offset_field, string $key_field, \seekquarry\yioop\library\array& $pages, integer $visited_urls_count)
Adds the array of $pages to the summaries WebArchiveBundle, storing them in the partition $generation and recording the resulting offsets in the field given by $offset_field.
integer | $generation | field used to select partition |
string | $offset_field | field used to record offsets after storing |
string | $key_field | field used to store a unique identifier for each page item. |
\seekquarry\yioop\library\array& | $pages | data to store |
integer | $visited_urls_count | number to add to the count of visited urls (visited urls is a smaller number than the total count of objects stored in the index). |
addFilters(string $key)
Adds the key (often a GUID) of a feed item to the bloom filter pair associated with this archive. This always adds to filter a; if filter a is more than half full, it also adds to filter b. If filter a is full, it is deleted and filter b is renamed filter a. The process then continues, with a new filter b being created when the new filter a becomes half full.
string | $key | unique identifier of a feed item |
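The rotation between the two filters can be sketched as below. This is a minimal stand-in and not the library's implementation: the class `RotatingFilterPair` is hypothetical, plain PHP arrays replace the two BloomFilterFile objects (so membership is exact, unlike a real bloom filter), and treating "half full" as "at least half the capacity" is an assumption.

```php
<?php
/**
 * Hypothetical stand-in for the filter_a/filter_b pair. Arrays play the
 * role of the two filters so the rotation logic can run on its own.
 */
class RotatingFilterPair
{
    /** @var array keys currently in the primary filter (filter_a) */
    public $filter_a = [];
    /** @var array keys currently in the auxiliary filter (filter_b) */
    public $filter_b = [];
    /** @var int capacity playing the role of URL_FILTER_SIZE */
    private $filter_size;

    public function __construct(int $filter_size)
    {
        $this->filter_size = $filter_size;
    }

    /** Membership checks only consult filter_a. */
    public function contains(string $key): bool
    {
        return isset($this->filter_a[$key]);
    }

    public function add(string $key): void
    {
        // Always add to filter_a.
        $this->filter_a[$key] = true;
        // Once filter_a is at least half full, mirror the key into filter_b.
        if (count($this->filter_a) >= $this->filter_size / 2) {
            $this->filter_b[$key] = true;
        }
        // When filter_a is full, discard it and promote filter_b.
        if (count($this->filter_a) >= $this->filter_size) {
            $this->filter_a = $this->filter_b;
            $this->filter_b = [];
        }
    }
}

$pair = new RotatingFilterPair(4);
foreach (["g1", "g2", "g3", "g4"] as $guid) {
    $pair->add($guid);
}
```

With a capacity of 4, the fourth insertion triggers the rotation, so the oldest key is forgotten while the more recent ones are still recognized. That trade-off is the point of the pair: the filter stays bounded in size, at the cost of possibly re-adding very old feed items.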